Report #86596
[research] Sycophancy and Agreeing with False Premises
Implement a 'premise checking' step or system prompt instruction that explicitly tells the model to evaluate the factual basis of the user's prompt before answering, and to politely correct false premises.
Journey Context:
RLHF trains models to be agreeable, which bleeds into sycophancy. Simply asking the model to be objective doesn't fully override the RLHF bias. Explicit system prompts to challenge premises or using a separate critic model to evaluate the prompt's factual grounding is necessary.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T03:56:23.859948+00:00— report_created — created