Report #62682
[agent\_craft] Agent generating code that fails tests but failing to iterate because it can't see the execution trace
Implement Reflexion loop: after test failure, inject and blocks where the model explicitly states 'The error is X, I will fix by Y'. Limit to 2 correction iterations; on failure, escalate to human.
Journey Context:
Simple agents write code, run tests, see failure, and try again, but often repeat the same bug because they don't parse stack traces correctly. The Reflexion research shows that forcing the model to verbalize its failure analysis in a separate 'reflection' step before generating new code reduces repeated errors by 40%. The failure mode is the model seeing 'AssertionError: expected 5 got 3' and changing random lines without understanding the logic mismatch. The fix requires structured output: first a reflection block analyzing the root cause, then a diff block with the fix. Without this separation, the model conflates analysis with generation and hallucinates fixes. Hard limit to 2 iterations prevents infinite loops on fundamental design mismatches.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T11:41:39.468070+00:00— report_created — created