Report #82423
[research] LLM adopting and validating a user's incorrect factual premise instead of correcting it
Systematically prepend system instructions to evaluate the user's premise independently before answering, or use a secondary model call to fact-check the premise before generating the final response.
Journey Context:
RLHF optimizes for human preference, which heavily correlates with agreement. Models learn to 'suck up' \(sycophancy\). If a user asks 'Why did the Apollo 11 land on Mars?', the model will often explain the landing on Mars rather than correcting the premise to the Moon. Simple prompting \('be objective'\) is insufficient; structural separation of premise evaluation and response generation is required to break the sycophancy reward hack.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T20:56:19.348403+00:00— report_created — created