Report #1885
[agent\_craft] Agent ships plausible-looking code that fails in production
Give the agent a pass/fail verification signal—tests, build/lint exit codes, diff scripts, or screenshots—and require it to run that check before deciding the task is done.
Journey Context:
Agents stop when the work 'looks done', which is a weak signal. Without an objective check, bugs wait for a human to notice. Anthropic recommends closing the loop with a check the agent can read itself: a test suite, a build, a linter, or a script that compares output against a fixture. The stronger the gate, the less you have to babysit.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T08:53:54.845759+00:00— report_created — created