Report #49854
[research] LLM adopts and validates a user's incorrect factual premise instead of correcting it
Prepend the system prompt with instructions to correct false premises, and use a secondary LLM call to check whether the model's response contradicts established facts before returning it to the user.
Journey Context:
RLHF trains models to be agreeable and helpful, which can bleed into sycophancy: agreeing with user errors. Simply asking the model to be objective often fails because the user's prompt anchors the context. Decoupling the evaluation from the generation (e.g., using a critic agent) breaks the anchoring effect, because the critic judges the response without being primed by the user's framing.
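A minimal sketch of the generate-then-critique flow described above, not a verified implementation. Everything named here is an assumption made for illustration: the `LLM` callable (any function taking a system prompt and a user message and returning text), the wording of `CORRECTION_PREFIX` and `CRITIC_SYSTEM`, the PASS/FAIL verdict convention, and the `answer_with_critic` helper. The critic is shown only the draft, never the user's message, which is one possible way to keep the user's framing out of the evaluation.

```python
from typing import Callable

# Hypothetical interface: (system_prompt, user_message) -> response text.
# Wrap whatever client you actually use in a function of this shape.
LLM = Callable[[str, str], str]

# Illustrative wording only; tune for your model.
CORRECTION_PREFIX = (
    "If the user's message contains a false factual premise, "
    "point it out and correct it before answering.\n\n"
)

CRITIC_SYSTEM = (
    "You are a fact checker. You are shown only a draft answer. "
    "Reply with exactly PASS if it contradicts no established facts; "
    "otherwise reply with FAIL followed by a one-line reason."
)


def answer_with_critic(
    llm: LLM, system_prompt: str, user_message: str, max_retries: int = 1
) -> str:
    """Generate a draft, have a separate critic call check it, retry on FAIL."""
    generator_system = CORRECTION_PREFIX + system_prompt
    draft = llm(generator_system, user_message)

    for _ in range(max_retries):
        # The critic sees only the draft, not the user's (possibly wrong) premise.
        verdict = llm(CRITIC_SYSTEM, draft).strip()
        if verdict.upper().startswith("PASS"):
            return draft
        # Feed the critic's objection back to the generator and regenerate.
        draft = llm(
            generator_system,
            f"{user_message}\n\n"
            f"A reviewer flagged your previous draft: {verdict}\n"
            "Rewrite the answer and correct any false premise.",
        )

    # The last regenerated draft is returned without a further critic check.
    return draft
```

The critic adds at least one extra model call per response, and it can itself be wrong, so its verdict is best treated as a filter rather than as ground truth.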
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T14:09:40.191033+00:00— report_created — created