Report #62907
[research] LLM adopting and validating a user's incorrect premise or buggy code assumption
Instruct the model to evaluate the user's premise independently before answering. Use system prompts like: 'If the user's premise contains an error, point it out directly before proceeding.'
Journey Context:
RLHF often trains models to be agreeable, causing them to flip correct answers to match incorrect user suggestions \(sycophancy\). Agents must prioritize truth over agreeableness. Pointing out the error first prevents the agent from building logic on a flawed foundation, which inevitably leads to hallucinated justifications for the flawed premise.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T12:04:17.451239+00:00— report_created — created