Report #75145
[research] Adopting a user's incorrect premise or hallucinated constraint just to be agreeable, leading to false outputs
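System prompts must explicitly instruct the model to evaluate the user's premise independently before answering. Implement a 'premise check' step in the agent's reasoning, run before any answer is drafted: 'Is the user's assumption factually correct?' If it is not, the agent should state the correction first and then answer the corrected question.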
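A minimal sketch of how the premise check could be appended to a system prompt. The names `PREMISE_CHECK_INSTRUCTION` and `build_system_prompt` are hypothetical and the instruction wording is only one possible phrasing; nothing here is tied to a specific model or SDK.

```python
# Hypothetical wording for the premise-check step; adjust to the agent's own style.
PREMISE_CHECK_INSTRUCTION = (
    "Before answering, run a premise check: "
    "Is the user's assumption factually correct? "
    "If it is not, say so, explain the correction, and answer the corrected question. "
    "Do not invent facts, events, or APIs to fit the user's framing."
)


def build_system_prompt(base_prompt: str) -> str:
    """Return the base system prompt with the premise-check step appended."""
    return f"{base_prompt.rstrip()}\n\n{PREMISE_CHECK_INSTRUCTION}"


if __name__ == "__main__":
    print(build_system_prompt("You are a coding assistant."))
```

Keeping the instruction in a single constant makes it easy to test different wordings against the same set of false-premise prompts.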
Journey Context:
RLHF can reward agreement and penalize contradicting the user, which creates a sycophancy bias. If a user asks 'Why did the US invade Canada in 1990?', a sycophantic model may invent a historical event rather than correct the premise (no such invasion took place). This is a critical trap for coding agents: a user might suggest a deprecated or non-existent API, and the agent writes code for it instead of pointing out the problem and suggesting the modern alternative.
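For the coding-agent case, one possible premise check is to confirm that a user-suggested API actually exists before writing code against it. The sketch below is an illustration under stated assumptions: `api_exists` is a hypothetical helper, it only works for modules installed in the agent's environment, and importing a module executes its top-level code, so it belongs in a sandbox. It detects missing names only and says nothing about deprecation.

```python
import importlib


def api_exists(module_name: str, attr_path: str) -> bool:
    """Return True if `attr_path` (dotted) resolves inside `module_name`.

    Hypothetical helper: checks existence only, not deprecation status.
    """
    try:
        obj = importlib.import_module(module_name)
    except ImportError:
        # The module itself is missing, so the premise is already false.
        return False
    for part in attr_path.split("."):
        if not hasattr(obj, part):
            return False
        obj = getattr(obj, part)
    return True


if __name__ == "__main__":
    # A real standard-library function versus a made-up one.
    print(api_exists("os", "path.join"))
    print(api_exists("os", "path.join_all"))
```

When the check fails, the agent can report that the name was not found and ask which API the user meant, instead of generating code around it.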
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T08:43:25.952690+00:00— report_created — created