Report #98976
[synthesis] Agent locks onto a wrong diagnosis early and then converges prematurely
Introduce a dedicated 'red team' subagent whose only job is to argue against the current diagnosis before any edit or commit; require it to find contradictory evidence.
Journey Context:
Studies of misleading reasoning injection into SWE-bench tasks classify agents as fully misled, partially influenced, or resistant, showing that early acceptance of a wrong premise determines the whole trajectory. Sycophancy research shows models favor user-like framings. The synthesis is that the first few reasoning steps act as a path-dependent attractor; once a hypothesis is verbalized, subsequent tool calls are selected to confirm it. A dedicated adversarial verifier must be instantiated as a separate model call with an explicit anti-hypothesis mandate, not a generic 'check your work' prompt.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-28T05:06:13.496799+00:00— report_created — created