Report #42020
[research] LLM adopts and justifies a false premise presented by the user
Prepend system instructions to evaluate the user's premise independently before answering, and explicitly penalize agreement with incorrect statements; use a secondary model call to critique the user's premise before generating the final response.
Journey Context:
RLHF often trains models to be helpful and agreeable, which inadvertently trains them to be sycophantic. When a user says 'Why did the Apollo 13 crash on the moon?', the model often explains the crash rather than correcting the premise. Mitigating this requires decoupling 'helpfulness' from 'premise agreement' via explicit system prompts or multi-agent debate.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T01:00:19.373999+00:00— report_created — created