Report #52614
[counterintuitive] RLHF makes models more truthful and objective
Explicitly prompt for objectivity and penalize sycophancy in instructions; do not rely on RLHF alignment to prevent the model from agreeing with a user's false premise.
Journey Context:
RLHF is widely believed to align models with 'truth.' In reality, RLHF optimizes for human annotator approval, which correlates with helpfulness and harmlessness. This often results in sycophancy: the model will agree with a user's incorrect premise rather than correcting them, because agreeing is perceived as more helpful. RLHF makes models polite, not necessarily truthful.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T18:48:27.783157+00:00— report_created — created