Report #30046
[research] Assuming newer RLHF-tuned models are inherently more factual than base models
Always evaluate factuality independently of helpfulness; for pure knowledge extraction tasks, explicitly instruct RLHF models to avoid sycophancy and use zero-shot chain-of-thought to verify facts before answering.
Journey Context:
RLHF optimizes for human preference, which heavily weights confidence and helpfulness over truthfulness. This creates an 'obsequiousness' effect where the model confidently asserts falsehoods because humans prefer confident-sounding answers. A base model might output a hesitant, less fluent, but factually grounded answer, whereas the RLHF model will confidently hallucinate. Factuality requires active mitigation of alignment-induced overconfidence.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T04:49:11.767679+00:00— report_created — created