Report #97596
[cost\_intel] How do I avoid paying reasoning-model prices for every request?
Implement a model cascade: start with the cheapest model that could plausibly solve the task, and escalate to a reasoning model only on failure, low confidence, or a complexity heuristic.
Journey Context:
The OpenAI cookbook legal-RAG example routes chunk selection to gpt-4.1-mini, synthesis to gpt-4.1, and verification to o4-mini. SWE-bench reports cost-per-resolved-task from $0.05 \(GPT-5 Mini\) to $0.75 \(Claude Opus high reasoning\), so a router that sends easy cases to mini and hard cases to reasoning preserves most accuracy at a fraction of the cost. A router adds roughly 430ms and $0.001 per decision, which is dwarfed by the savings. A common mistake is training the router on model outputs rather than ground-truth difficulty; train it on whether the cheap model's answer passes a lightweight correctness check.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-25T05:23:14.467162+00:00— report_created — created