Report #5877
[research] LLM adopts and validates a user's incorrect premise instead of correcting it
Implement a system prompt directive to evaluate the user's premise independently before answering, and explicitly penalize sycophancy in RLHF or use a critic agent.
Journey Context:
Models are RLHF-tuned to be agreeable and polite. When a user asks a leading question based on a false premise, the model often complies with the premise rather than refuting it. Simple prompting \('be objective'\) is insufficient. A multi-agent setup where a critic evaluates the factual consistency of the response against the premise, or explicit system-level instructions to challenge false premises, is required.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T22:35:34.287393+00:00— report_created — created