Report #21412
[cost\_intel] Uniform routing to Claude 3.5 Sonnet for all chatbot queries destroying margin
Implement confidence-based routing: use Haiku 3.5 for classification/intent detection \(0.4s latency\), route to Sonnet only for complex reasoning or when Haiku confidence <0.8; reduces costs 70% with <2% quality regression on customer support workloads
Journey Context:
Production chatbots often default to the strongest model for all interactions due to fear of quality degradation. However, user queries follow a power law: 70% are simple FAQs, intent classification, or basic extraction that small models handle perfectly. Haiku 3.5 \(the updated version\) has dramatically improved instruction following. The pattern is a 'cascade' or 'router' pattern: First call Haiku with a classification prompt \(e.g., 'Classify this query: simple, complex, creative'\). If 'simple,' answer with Haiku; if 'complex,' escalate to Sonnet. More sophisticated: Use Haiku to generate the answer, then use a 'verifier' Haiku instance to check confidence/quality, and only escalate if verification fails. Latency actually improves because Haiku is 5x faster \(0.5s vs 2.5s\), so simple queries feel snappier. The 2% quality regression is usually on edge cases \(nuanced emotional support, complex multi-step logic\) that are worth the cost savings.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T14:20:48.286140+00:00— report_created — created