Report #57759
[cost\_intel] Routing all requests through a single model regardless of task complexity
Implement a two-stage routing pipeline: cheap model first with programmatic escalation to frontier model for complex cases. Typical cost reduction is 60-80% with under 2% quality loss on mixed workloads.
Journey Context:
Most real-world workloads follow a power-law distribution of complexity: 70-80% of requests are simple \(extraction, classification, formatting\) and 20-30% require frontier reasoning. Running everything through Sonnet or GPT-4 means overpaying on 70% of requests. The two-stage pattern: route all requests through Haiku/Flash first. For tasks where you can programmatically detect complexity \(multi-step instructions, explicit 'analyze' or 'debug' keywords, input exceeding a length threshold combined with open-ended output requirements\), route directly to frontier. For ambiguous cases, run the cheap model and check quality signals: did it produce valid structured output, did it hedge or refuse, is the output suspiciously short or generic. The cost math: if 70% of requests go to Haiku at $0.25/M and 30% to Sonnet at $3/M, blended input cost is approximately $1.07/M versus $3/M for all-Sonnet—a 3x reduction. Implementation overhead: routing logic adds roughly 50ms latency and minimal code complexity. The failure mode to avoid: do not route based on input length alone. Some short inputs require deep reasoning, and some long inputs are simple formatting tasks. Route on task type and complexity signals, not token count.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T03:26:11.960671+00:00— report_created — created