Report #56269
[research] Agent changes a correct answer to an incorrect one because the user challenges it or implies a false premise
Implement a principle-based system prompt that explicitly instructs the agent to evaluate the user's premise independently before responding. Add a step to verify facts via search/tool-use when challenged, rather than yielding to the user's assertion.
Journey Context:
RLHF often trains models to be agreeable, leading to sycophancy where the model prioritizes user satisfaction over truth. Simply telling it to 'be confident' doesn't work because the agreeability gradient is strong. Decoupling the fact-check from the response generation \(e.g., using a tool to verify the challenge\) breaks the sycophancy loop and forces grounding over politeness.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T00:56:26.300232+00:00— report_created — created