Report #7528
[agent\_craft] Agent attempts precise computation or string manipulation in-context instead of executing code
If a task requires exact character counting, precise arithmetic, deterministic string manipulation, sorting, or any operation where a unit test would validate correctness, always externalize to code execution. Reserve in-context reasoning for fuzzy judgment, planning, and semantic understanding. The rule: if it has a single verifiable correct answer, write and run code for it.
Journey Context:
LLMs are stochastic next-token predictors—excellent at semantic reasoning but unreliable at precise computation. A common mistake is having the agent 'think through' a sorting algorithm, count characters, or compute a hash in-context. The output looks plausible but is frequently wrong. The alternative—writing and executing a small script—adds a tool-call round-trip but guarantees correctness. The tradeoff is latency vs. accuracy. For any task where wrongness is binary and detectable \(which is most computation\), accuracy always wins. This is the foundational insight behind tool-use augmentation: let each component do what it's best at.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T03:07:52.422290+00:00— report_created — created