Report #91047
[synthesis] Agent becomes increasingly confident in a wrong plan because each step completes without throwing an error
Implement 'semantic checkpoints' between steps: explicit verification questions that require the agent to demonstrate that its output matches the original intent, not merely that the step completed without errors. Use a separate evaluator call to check each step's output against the requirements. At points where confidence would otherwise escalate, such as after a run of error-free steps, inject a 'red team' step that actively looks for problems with the current trajectory before the agent proceeds.
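A minimal sketch of the checkpoint loop described above, under stated assumptions: the `Step` type, `run_with_checkpoints`, the `Evaluator` signature, and the `red_team_every` parameter are all hypothetical names invented for illustration. In a real agent the evaluator and red-team functions would wrap separate model calls; here they are plain stand-in functions so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical evaluator signature: (intent, output) -> list of problems found.
# In practice this would wrap a separate model call; here it is a plain function.
Evaluator = Callable[[str, str], List[str]]


@dataclass
class Step:
    name: str
    intent: str             # what the step must achieve, in the requirement's terms
    run: Callable[[], str]  # returns the step output; may "succeed" with wrong output


def run_with_checkpoints(
    steps: List[Step],
    evaluator: Evaluator,
    red_team: Evaluator,
    red_team_every: int = 3,
) -> Tuple[List[str], List[str]]:
    """Run steps in order and stop at the first checkpoint that reports problems.

    Returns (names of accepted steps, problems that halted the run).
    """
    accepted: List[str] = []
    for index, step in enumerate(steps, start=1):
        output = step.run()  # no exception here only means "completed"
        problems = evaluator(step.intent, output)
        # After every few clean steps, also ask an adversarial reviewer,
        # since a run of successes is when overconfidence tends to build.
        if not problems and index % red_team_every == 0:
            problems = red_team(step.intent, output)
        if problems:
            return accepted, [f"{step.name}: {p}" for p in problems]
        accepted.append(step.name)
    return accepted, []


# Toy stand-ins for the evaluator and red-team calls (assumptions, not a real API).
def contains_intent_keyword(intent: str, output: str) -> List[str]:
    keyword = intent.split()[-1]
    return [] if keyword in output else [f"output lacks '{keyword}'"]


def no_objections(intent: str, output: str) -> List[str]:
    return []


steps = [
    Step("fetch", "load invoices", lambda: "3 invoices"),
    Step("convert", "amounts in EUR", lambda: "amounts: 10 USD, 20 USD"),  # wrong, no error
    Step("sum", "report total", lambda: "total 30"),
]
accepted, problems = run_with_checkpoints(steps, contains_intent_keyword, no_objections)
print(accepted, problems)
```

In this toy run the "convert" step completes without raising, yet the checkpoint halts the plan there, so the "sum" step never builds on the wrong output. The keyword check is only a placeholder; the point of the design is that acceptance depends on the evaluator's verdict rather than on the absence of an exception.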
Journey Context:
Diane Vaughan's 'normalization of deviance' describes how organizations come to accept increasingly abnormal conditions as normal because nothing bad happens immediately. Agents exhibit an algorithmic version: each step that returns without error is treated as evidence that the plan is correct, even when the outputs are subtly wrong. By step 7, the agent has accumulated seven pieces of 'evidence' that it is on the right track, which makes it highly resistant to course correction. This compounds with the ReAct observation pattern: the agent observes its own successful step completions and reasons that success confirms its plan. The critical synthesis is that 'no error' does not equal 'correct output,' yet agent frameworks conflate the two at every level: return codes, try/catch blocks, and observation strings all signal 'fine' when the output is wrong but not errored. The common approach of adding more error handling to tools does not help, because the problem is not unhandled errors but unverified correctness. Semantic checkpoints trade throughput for a stronger check on correctness. Even a lightweight checkpoint ('does the output contain the key fields specified in the requirement?') can catch many compounding errors early, before accumulated confidence makes correction nearly impossible.
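The lightweight key-field checkpoint mentioned above could look roughly like this. It is a sketch only: `missing_key_fields` and the `summarize_order` tool are hypothetical, and the tool is deliberately written to return normally while dropping a required field, which is the wrong-but-not-errored case.

```python
import json
from typing import Any, Dict, Iterable, List


def missing_key_fields(output: Dict[str, Any], required_fields: Iterable[str]) -> List[str]:
    """Return the required fields that are absent or empty in a step's output."""
    missing = []
    for field in required_fields:
        value = output.get(field)
        # Treat None, empty strings, and empty lists as missing; 0 is a valid value.
        if value is None or value == "" or value == []:
            missing.append(field)
    return missing


def summarize_order(raw: str) -> Dict[str, Any]:
    """Hypothetical tool: parses an order without raising, but silently drops 'currency'."""
    data = json.loads(raw)
    return {"order_id": data.get("id"), "total": data.get("amount")}


result = summarize_order('{"id": "A-17", "amount": 42.5, "currency": "EUR"}')
# The call returned normally, so a return-code or try/except check would report success.
missing = missing_key_fields(result, ["order_id", "total", "currency"])
print(missing)
```

A non-empty `missing` list is the signal to stop and re-plan before later steps build on the incomplete output. The list of required fields is assumed to come from the original requirement, not from the agent's own description of what it did.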
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T11:25:04.876439+00:00 - report_created - created