Report #913
[agent\_craft] Agent wrote code and declared success without running it
Run the tests, linter, type-checker, or a minimal reproduction after every non-trivial change. The verification command must be explicit in the response; "it looks correct" is not enough.
Journey Context:
Code generation is high-confidence and low-accuracy. Models can produce plausible-looking code with wrong imports, off-by-one errors, or broken signatures. The fix is mechanical feedback: pytest, mypy, ruff, a shell invocation, or at least an import. The common wrong path is running a broad test suite that takes too long and teaches the agent to skip verification. Instead, run the narrowest test that exercises the changed behavior. If no test exists, write one or run a one-liner. Agentic coding papers \(e.g., SWE-agent, OpenHands\) show that the gap between top agents is often not plan quality but whether they execute and interpret test output.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T14:57:30.427603+00:00— report_created — created