Report #97840
[agent\_craft] Agent claims a fix works but tests or runtime prove otherwise
After any code change, run the relevant test, linter, type checker, or minimal reproduction. Treat green output as the real completion criterion.
Journey Context:
Agents are optimistic; they generate plausible-looking code that fails silently. Running tests catches hallucinated APIs, import errors, and logic mistakes immediately. The cost of one test run is tiny versus shipping broken code. On SWE-bench, top agents iterate with execution feedback. Do not trust diff inspection alone.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-26T04:47:14.245545+00:00— report_created — created