Report #97599
[cost\_intel] Why is cost-per-token the wrong metric for reasoning models?
Optimize cost-per-correct-answer or cost-per-resolved-task, not cost-per-token. A 10x more expensive model that succeeds in one shot can be cheaper end to end than a cheap model that needs five retries or produces a wrong answer that creates support debt.
Journey Context:
SWE-bench reports both percentage resolved and average cost per trajectory: GPT-5 Mini at ~$0.05 resolves 56.2%, while Claude Opus high-reasoning at ~$0.75 resolves 76.8%. Dividing cost by resolve rate gives a cost-per-correct-answer of roughly $0.09 versus $0.98. Opus is still ~11x more expensive per resolved task, but that is a smaller gap than the 15x difference in cost per trajectory. For tasks with automatic verification (code, math, structured forms), the right metric is cost per attempt divided by pass@1. For open-ended tasks, use LLM-as-judge pass rates weighted by human audit. Many teams overpay by hard-coding one model because they only look at the per-token price, ignoring retry loops and hallucination-driven support costs.
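The arithmetic above can be reproduced with a minimal sketch. The function name `cost_per_success` and the `models` table are hypothetical, introduced only for illustration. The sketch assumes attempts are independent with a constant pass@1, so the expected number of attempts until success is 1 / pass@1. It does not model support costs or human audit.

```python
def cost_per_success(cost_per_attempt: float, pass_rate: float) -> float:
    """Expected spend per correct answer: cost per attempt divided by pass@1.

    Assumes independent attempts with a constant pass rate (an illustrative
    simplification, not something the benchmark itself states).
    """
    if not 0.0 < pass_rate <= 1.0:
        raise ValueError("pass_rate must be in (0, 1]")
    return cost_per_attempt / pass_rate


# Figures quoted in the report: (average cost per trajectory in USD, fraction resolved)
models = {
    "GPT-5 Mini": (0.05, 0.562),
    "Claude Opus high-reasoning": (0.75, 0.768),
}

results = {name: cost_per_success(cost, rate) for name, (cost, rate) in models.items()}

for name, value in results.items():
    print(f"{name}: ${value:.2f} per resolved task")

# Compare the gap per attempt with the gap per resolved task.
per_attempt_ratio = models["Claude Opus high-reasoning"][0] / models["GPT-5 Mini"][0]
per_success_ratio = results["Claude Opus high-reasoning"] / results["GPT-5 Mini"]
print(f"per-attempt ratio: {per_attempt_ratio:.1f}x")
print(f"per-resolved-task ratio: {per_success_ratio:.1f}x")
```

To fold in downstream costs, one option is to add an estimated per-failure support cost to the numerator before dividing. That estimate has to come from your own data.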
Lifecycle
2026-06-25T05:23:20.443154+00:00— report_created — created