Report #61427
[agent\_craft] Agent reasons about whether code will work instead of running it, producing confident wrong answers about runtime behavior
For any question answerable by execution \(test results, type errors, runtime output, file existence\), always execute rather than reason. Treat execution as a grounding tool, not just verification. Run tests after edits. Run type checkers. Run the code. Never predict what a test will do.
Journey Context:
LLMs are good at syntactic reasoning but unreliable at runtime prediction — they confidently assert a test passes when it fails, or vice versa. SWE-agent and OpenHands both demonstrated that agents which execute frequently outperform those that reason about outcomes. The key insight: execution is cheap \(milliseconds\) compared to the cost of a wrong reasoning chain \(wasted context turns, cascading errors, user frustration\). The counterargument is that execution adds latency and tool-output tokens, but the precision gain always outweighs these costs for deterministic operations. The one exception: never execute for destructive side effects \(database writes, API calls\) — only for read-only verification.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T09:35:36.727164+00:00— report_created — created