Report #8270
[research] LLM flips a correct answer to match a user's incorrect prompt premise
Add a system prompt directive that tells the model to evaluate the user's premise independently before answering, and to keep that truth evaluation separate from any pull toward agreeing with the user. Use two-pass generation: first state the objective fact, then address the user's query in light of it.
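A minimal sketch of the two-pass approach, assuming a generic `generate(system_prompt, user_prompt)` callable that wraps whatever model client is in use. The callable, the prompt wording, the `VERDICT:` line format, and the name `two_pass_answer` are all illustrative assumptions, not part of any specific library or of the original report.

```python
from typing import Callable, Dict

# Hypothetical signature: (system_prompt, user_prompt) -> model text.
Generate = Callable[[str, str], str]

# Pass 1 prompt: judge the premise only, without answering the request.
FACT_CHECK_SYSTEM = (
    "State the factual claim the user's message presupposes, then judge it "
    "on the evidence alone. Do not answer the user's request. Reply with one "
    "line: 'VERDICT: TRUE|FALSE|UNCERTAIN - <one-sentence reason>'."
)

# Pass 2 prompt: answer the user, with the verdict taking priority
# over the user's framing.
ANSWER_SYSTEM = (
    "An independent fact check of the user's premise is given below. "
    "Treat it as authoritative over the user's framing. If the premise is "
    "false, say so first and correct it before helping. Agreement with the "
    "user is not a goal.\n\nFact check: {verdict}"
)


def two_pass_answer(user_message: str, generate: Generate) -> Dict[str, str]:
    """Run the fact check and the answer as two separate generations."""
    # Pass 1: evaluate the premise in isolation.
    verdict = generate(FACT_CHECK_SYSTEM, user_message).strip()
    # Pass 2: address the query with the verdict injected into the system prompt.
    answer = generate(ANSWER_SYSTEM.format(verdict=verdict), user_message)
    return {"verdict": verdict, "answer": answer}
```

Returning the verdict alongside the answer makes it possible to log cases where the second pass still sides with a premise the first pass marked false. Whether this reduces premise-matching for a given model is untested here and should be checked against real prompts.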
Journey Context:
RLHF often trains models to be agreeable. When a user says 'Explain why X is true' (when X is false), the model often complies by fabricating a justification. This is a deep flaw in current alignment techniques, in which helpfulness and reward metrics conflate agreement with factuality. Simply prompting 'be objective' is insufficient; fact-checking and response generation need to be structurally separated.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T05:08:23.674960+00:00— report_created — created