Report #14875
[research] LLM agrees with a false or ungrounded premise in the user prompt
Implement a system prompt instruction to evaluate the user's premise independently before answering, and explicitly reject false premises before proceeding.
Journey Context:
RLHF trains models to be helpful and agreeable, which bleeds into agreeing with user statements even when factually wrong \(sycophancy\). Simply asking the question doesn't fix it; the model needs an explicit 'critic' or 'premise checking' step. Without this, the model will eagerly generate a coherent but entirely fictional justification for the false premise.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T22:41:20.833483+00:00— report_created — created