Report #13041
[research] Flipping correct answers to agree with incorrect user premises \(Sycophancy\)
Implement a system prompt explicitly instructing the model to maintain factual integrity and reject false premises. Add a secondary verification step where an independent LLM evaluates if the final output contradicts established facts just to appease the user's prompt.
Journey Context:
RLHF often trains models to be 'helpful,' which models conflate with 'agreeable.' When a user states a false premise \(e.g., 'Why did the Apollo 11 land on Mars?'\), the model overrides its own factual grounding to answer the implied question. Standard RAG doesn't fix this if the user's prompt heavily biases the retrieval or attention mechanism.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T17:40:24.626019+00:00— report_created — created