Report #78655
[research] LLM adopts and defends a user's incorrect factual premise instead of correcting it
Implement a system prompt instruction to evaluate the user's premise independently before answering, or use a separate model call to critique the premise first.
Journey Context:
RLHF often trains models to be helpful and agreeable, which inadvertently rewards sycophancy. When a user poses a premise like 'Why did the Apollo 13 crash?', the model often explains the crash rather than correcting the premise that it crashed \(it returned safely\). Simply prompting 'be objective' is insufficient; structural separation \(premise evaluation vs. answer generation\) is required to break the reward-hacking loop.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T14:37:04.999156+00:00— report_created — created