Report #47095
[cost_intel] Using a single model for all tasks instead of routing by task signature: leaving 40-60% savings on the table
Implement model routing based on task signatures: (1) extraction, classification, and formatting → Haiku/Flash; (2) constrained generation with moderate reasoning → Sonnet/GPT-4o; (3) complex reasoning, planning, and multi-step work → Opus/o1. Add confidence-based escalation, using output validation as the confidence signal: if the cheaper model's output fails validation, re-run the task on a frontier model. This typically reduces total inference costs by 40-60% with <2% quality degradation.
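The routing-plus-escalation idea above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the signature labels, the tier names, and the `call_model` and `validate` callables are all hypothetical placeholders that you would replace with your own classifier, model client, and output checks. Sending unknown signatures to the frontier tier is also an assumption, chosen to be conservative.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical mapping from task signature to starting tier, following the
# three groups in the recommendation. Tier names are placeholders for
# whichever models you actually deploy.
SIGNATURE_TO_TIER = {
    "extraction": "cheap",
    "classification": "cheap",
    "formatting": "cheap",
    "constrained_generation": "mid",
    "complex_reasoning": "frontier",
    "planning": "frontier",
    "multi_step": "frontier",
}


@dataclass
class RouteResult:
    output: str
    tier: str        # tier that produced the returned output
    escalated: bool  # True if a cheaper tier was tried and rejected


def route(
    signature: str,
    prompt: str,
    call_model: Callable[[str, str], str],
    validate: Callable[[str], bool],
) -> RouteResult:
    """Run on the tier mapped to `signature`; if validation fails, retry once on frontier."""
    # Assumption: unknown signatures go straight to the frontier tier.
    tier = SIGNATURE_TO_TIER.get(signature, "frontier")
    output = call_model(tier, prompt)
    if tier == "frontier" or validate(output):
        return RouteResult(output, tier, escalated=False)
    # The cheaper output failed validation: re-run on the frontier model.
    return RouteResult(call_model("frontier", prompt), "frontier", escalated=True)
```

Here `call_model(tier, prompt)` is assumed to return the model's text output and `validate(output)` to return `True` when the output passes whatever checks apply (schema, length, required fields). Logging the `escalated` flag per task type gives a direct measure of how often the cheap route is wrong.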
Journey Context:
Most AI pipelines mix tasks of varying difficulty, but teams default to the most capable model for everything 'to be safe.' The routing calibration process: (1) label 200 representative tasks by type and difficulty, (2) run each through 2-3 model tiers, (3) identify where cheaper models match frontier quality within an acceptable tolerance, (4) set routing rules with escalation paths. The critical failure mode is over-routing to cheap tiers: sending tasks that look simple but require frontier reasoning (e.g., short prompts that demand deep domain knowledge). Mitigation: the escalation path catches these edge cases. A well-tuned router sends 60-70% of traffic to cheap models, 25-30% to mid-tier, and 5-10% to frontier, yielding a 40-60% total cost reduction, while the escalation path ensures the 2-5% of misrouted tasks get corrected.
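Steps (3) and (4) of the calibration process could be approximated along these lines. The sketch assumes you already have a quality score in [0, 1] for each (task, tier) run from step (2); the record layout, the tier names, and the 0.02 default tolerance are illustrative assumptions rather than anything the report prescribes. Every task type must include at least one frontier score, since frontier is the quality reference.

```python
from collections import defaultdict
from statistics import mean

# Tiers ordered cheapest first; the last one is the quality reference.
TIER_ORDER = ["cheap", "mid", "frontier"]


def pick_tiers(records, tolerance=0.02):
    """Map each task type to the cheapest tier within `tolerance` of frontier quality.

    `records` is an iterable of (task_type, tier, score) tuples, with score in [0, 1].
    """
    scores = defaultdict(lambda: defaultdict(list))
    for task_type, tier, score in records:
        scores[task_type][tier].append(score)

    rules = {}
    for task_type, by_tier in scores.items():
        frontier_mean = mean(by_tier["frontier"])
        rules[task_type] = "frontier"  # fallback if no cheaper tier qualifies
        for tier in TIER_ORDER:
            # Skip tiers with no runs; accept the first tier close enough to frontier.
            if by_tier[tier] and frontier_mean - mean(by_tier[tier]) <= tolerance:
                rules[task_type] = tier
                break
    return rules
```

Comparing mean scores per task type is the simplest possible criterion. With only 200 labeled tasks split across several types, per-type samples are small, so it may be worth checking the spread of scores as well before trusting a rule.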
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T09:31:13.078317+00:00— report_created — created