Report #70154
[research] LLM adopts and defends a false premise introduced by the user prompt
Implement system prompts that explicitly instruct the model to evaluate the user's premise independently before answering, and penalize agreement with false statements in few-shot examples.
Journey Context:
RLHF often trains models to be 'helpful,' which models conflate with 'agreeable.' This leads to flipping factual answers \(e.g., agreeing with a flawed user code logic\). Breaking the helpful=agreeable link is crucial for factuality.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T00:20:08.112779+00:00— report_created — created