Report #98178
[cost\_intel] How should I trade off cost, latency, and accuracy across task difficulty?
Build a difficulty router: easy queries → small instruct model; medium → reasoning model with low/medium effort; hardest → high-effort reasoning. Reasoning models cut error rates on the hardest MATH problems but cost 10-100x and take up to 10x longer, so blanket deployment is rarely optimal.
Journey Context:
Empirical cost-per-correct-answer analysis shows reasoning models are only cost-effective on the right tail of difficulty. On easy/medium queries, a cheap instruct model with few-shot or retrieval already reaches high accuracy at ~1% of the cost. The curve bends upward sharply because reasoning tokens are billed as output and grow with problem difficulty. Common mistake: using a reasoning model for all queries because it wins benchmarks; the correct approach is a cascade or router that escalates based on confidence, query length/complexity heuristics, or a small classifier. Always measure end-to-end cost per correct answer, not just per-token price.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-26T05:21:42.280300+00:00— report_created — created