Report #59259
[research] Adopting and elaborating on a user's false premise instead of correcting it
Add a system prompt instruction that makes the model evaluate the factual accuracy of the user's premise independently before answering. If the premise is false, the model should explicitly state the correction first, then address the user's core intent.
Journey Context:
RLHF training optimizes for human approval, which often correlates with agreeing with the user. When a user asks 'Why did X happen?' assuming X happened, models often invent reasons for X rather than pointing out X didn't happen. Simple prompting like 'be objective' is insufficient; the agent needs a discrete, enforced step to verify the premise independently before generating the response, breaking the sycophantic feedback loop.
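One way to make the premise check a discrete, enforced step is to run it as its own model call and prepend any correction in code, rather than trusting the answering call to include it. The sketch below is a minimal illustration under assumptions: `complete` is a hypothetical placeholder for whatever model call the agent uses, and the prompt wording and the `PREMISE_OK` / `PREMISE_FALSE` markers are invented for this example, not taken from the report.

```python
from typing import Callable

# Hypothetical prompt wording and verdict markers (assumptions for this sketch).
PREMISE_CHECK_PROMPT = (
    "Identify the factual premises the user's message takes for granted. "
    "Reply 'PREMISE_OK' on the first line if they all hold. Otherwise reply "
    "'PREMISE_FALSE' on the first line and a one-sentence correction on the next."
)
ANSWER_PROMPT = "Answer the user's question."
FALLBACK_CORRECTION = "The premise of this question appears to be incorrect."


def answer_with_premise_check(
    user_message: str,
    complete: Callable[[str, str], str],
) -> str:
    """Verify the premise in a separate call before generating the answer.

    `complete(system_prompt, user_message)` stands in for the real model call.
    """
    # Step 1: independent premise check, isolated from answer generation.
    verdict = complete(PREMISE_CHECK_PROMPT, user_message).strip()
    first_line, _, rest = verdict.partition("\n")

    if first_line.strip() != "PREMISE_FALSE":
        # Premise holds (or the verdict was not recognized): answer normally.
        return complete(ANSWER_PROMPT, user_message)

    # Step 2: the premise is false. Tell the answering call so it does not
    # elaborate on the false premise.
    correction = rest.strip() or FALLBACK_CORRECTION
    system = (
        f"{ANSWER_PROMPT} The user's premise is false: {correction} "
        "Do not invent explanations for it. Address what the user is trying to learn."
    )
    body = complete(system, user_message)

    # Step 3: prepend the correction in code so it always comes first.
    return f"{correction}\n\n{body}"
```

Prepending the correction programmatically is a design choice for this sketch: it guarantees the correction precedes the answer even if the answering call ignores its instructions. Treating an unrecognized verdict as "premise holds" is also an assumption; a stricter agent might retry the check instead.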
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T05:57:26.709671+00:00 — report_created — created