Report #37896
[synthesis] Agent outputs structurally valid code but misses functional requirements
Compare the agent's initial planning step (e.g., ReAct thought or explicit plan output) against the final diff using an automated evaluator before merging, flagging when planned steps are dropped without an explicit replanning step.
Journey Context:
Agents often generate a plan, then either hallucinate that a step was completed or silently drop it as the context grows. The final code compiles and passes basic linting, so standard CI passes, and teams notice only days later when the feature doesn't work. The leading indicator is the divergence between the planned trajectory and the actual execution path, which standard step-by-step logging doesn't aggregate.
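One possible shape for such a pre-merge check is sketched below. It is an illustration under stated assumptions, not a verified workaround: the PlanStep schema, the evidence identifiers attached to each step, and the replanned_ids set are all hypothetical, and plain substring matching against added diff lines stands in for whatever evaluator a team actually uses.

```python
from dataclasses import dataclass, field


@dataclass
class PlanStep:
    """One step from the agent's initial plan (hypothetical schema)."""
    step_id: str
    description: str
    # Identifiers expected to show up in the diff if the step was carried out.
    evidence: list[str] = field(default_factory=list)


def added_lines(diff_text: str) -> list[str]:
    """Return the lines a unified diff adds, skipping '+++' file headers."""
    return [
        line[1:]
        for line in diff_text.splitlines()
        if line.startswith("+") and not line.startswith("+++")
    ]


def dropped_steps(plan, diff_text, replanned_ids=frozenset()):
    """Return planned steps with no evidence in the diff and no replanning record."""
    added = "\n".join(added_lines(diff_text))
    flagged = []
    for step in plan:
        if step.step_id in replanned_ids:
            continue  # explicitly replanned, so the drop was deliberate
        # A step with no evidence identifiers cannot be verified and is flagged too.
        if not any(token in added for token in step.evidence):
            flagged.append(step)
    return flagged


# Illustrative data: three planned steps, one implemented, one replanned away.
plan = [
    PlanStep("1", "Add retry wrapper", ["retry_with_backoff"]),
    PlanStep("2", "Validate input payload", ["validate_payload"]),
    PlanStep("3", "Emit audit log entry", ["audit_log"]),
]
diff = """\
--- a/service.py
+++ b/service.py
@@ -1,2 +1,5 @@
+def retry_with_backoff(fn):
+    return fn
 def handle(payload):
-    return send(payload)
+    return retry_with_backoff(send)(payload)
"""
flagged = dropped_steps(plan, diff, replanned_ids={"3"})
print([s.step_id for s in flagged])  # prints ['2']
```

Here step 2 is flagged because nothing in the added lines mentions it and no replanning was recorded, while step 3 is skipped because its drop was explicit. Substring matching is deliberately crude and can miss renamed identifiers; the point is that the check aggregates plan versus diff once per change instead of relying on per-step logs.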
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T18:05:05.224046+00:00 - report_created - created