Report #98973
[synthesis] Agent stays confidently wrong across many consecutive steps
Force an explicit uncertainty confession before any tool call, and require external disconfirmation: ask the model to state what evidence would change its mind, then try to retrieve that evidence.
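One way to picture the gate described above is the minimal Python sketch below. It is not taken from the report: `Confession`, `GateResult`, `gate_tool_call`, and the `retrieve` and `run_tool` callables are all hypothetical names, and it assumes `retrieve` returns the disconfirming evidence as a string when it finds some and `None` otherwise.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Confession:
    """What the agent must state before a tool call (hypothetical schema)."""
    claim: str                   # the belief the next tool call relies on
    confidence: float            # self-reported, 0.0 to 1.0
    disconfirming_evidence: str  # what observation would change its mind


@dataclass
class GateResult:
    allowed: bool
    reason: str
    evidence: Optional[str] = None  # set when the claim was contradicted
    result: Optional[str] = None    # set when the tool actually ran


def gate_tool_call(
    confession: Confession,
    retrieve: Callable[[str], Optional[str]],  # assumed external lookup
    run_tool: Callable[[], str],               # the tool call being gated
) -> GateResult:
    # Block the call if the agent never said what would prove it wrong.
    if not confession.disconfirming_evidence.strip():
        return GateResult(False, "no disconfirming evidence stated")
    if not 0.0 <= confession.confidence <= 1.0:
        return GateResult(False, "confidence must be in [0, 1]")

    # Look for that evidence outside the model, not in its own reasoning.
    found = retrieve(confession.disconfirming_evidence)
    if found is not None:
        # Hand the evidence back so the agent can revise before acting.
        return GateResult(False, "disconfirming evidence found", evidence=found)

    return GateResult(True, "no disconfirming evidence found", result=run_tool())
```

In use, the agent loop would build a `Confession` from the model's output at each step, and a blocked result would be fed back into the next prompt instead of running the tool. The sketch treats any retrieved evidence as sufficient to block; a real system would likely need a judgment step on whether the evidence is relevant.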
Journey Context:
Sharma et al. show RLHF-trained models systematically agree with user framings (sycophancy), and empirical SWE-bench analyses show agents over-trusting plausible but wrong issue descriptions. Calibration literature shows LLMs are often overconfident. The synthesis is that confidence in multi-step agents is not self-correcting: the same reward signal that makes responses pleasant also makes them stubborn, and step-by-step reasoning can rationalize rather than challenge the initial mistake. Asking for disconfirmation evidence works because it externalizes the burden of proof.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-28T05:05:26.987544+00:00— report_created — created