Report #46684
[cost_intel] Using small models for novel production error root cause analysis
Reserve Sonnet 3.5 or GPT-4-turbo for debugging unknown stack traces or distributed system failures. On novel errors, Haiku 3.5 achieves only 45% root-cause accuracy versus 85% for the frontier models. A small-model call costs $0.005 per debug versus $0.05, but each wrong diagnosis wastes $50+ in developer time.
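The trade-off above can be checked with a quick expected-cost calculation. This is a minimal sketch, assuming each wrong diagnosis wastes exactly $50 (the report's lower bound) and that a correct diagnosis costs no extra developer time; the function name `expected_cost_per_debug` is illustrative and not from the report.

```python
def expected_cost_per_debug(call_cost: float, accuracy: float, wasted_dev_cost: float) -> float:
    """Model call cost plus the expected developer cost of a wrong diagnosis."""
    return call_cost + (1.0 - accuracy) * wasted_dev_cost


# Assumption: the report says "$50+" per wrong diagnosis; 50.0 is the lower bound.
WASTED_DEV_COST = 50.0

# Figures taken from the report: call cost per debug and root-cause accuracy.
small = expected_cost_per_debug(call_cost=0.005, accuracy=0.45, wasted_dev_cost=WASTED_DEV_COST)
frontier = expected_cost_per_debug(call_cost=0.05, accuracy=0.85, wasted_dev_cost=WASTED_DEV_COST)

print(f"small model:    ${small:.3f} expected per debug")
print(f"frontier model: ${frontier:.3f} expected per debug")
```

Under these assumptions the small model comes out at about $27.50 per debug against about $7.55 for the frontier model, so the tenfold saving on the call itself is outweighed by the cost of wrong diagnoses.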
Journey Context:
Teams attempt to cut costs by routing all logs through small models for analysis. This fails catastrophically for novel errors not represented in training data (e.g., new library versions, infrastructure-specific race conditions). Small models hallucinate plausible but incorrect fixes, sending developers on wild goose chases. The cost of a senior engineer's time ($100-200/hr) dwarfs the $0.05 cost of a frontier model call. Quality signature to watch: if the model proposes a fix without referencing specific line numbers from your codebase, treat it as a probable hallucination. Use cheap models for error categorization ('database error' vs 'auth error') and frontier models for root cause analysis.
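A minimal sketch of the split described above: a cheap model categorizes, a frontier model does root cause analysis, and the result is flagged when the diagnosis cites no line number. Everything here is an assumption for illustration: the `categorize` and `root_cause` callables stand in for whatever model clients you use, and the `LINE_REF` pattern is a rough heuristic for "references a specific line", not a reliable hallucination detector.

```python
import re
from typing import Callable

# Hypothetical heuristic: matches "line 42", "db/pool.py:118", or "L42".
# Adjust to the stack-trace formats your own codebase produces.
LINE_REF = re.compile(
    r"(\bline\s+\d+\b|\b[\w./-]+\.[A-Za-z]\w*:\d+\b|\bL\d+\b)",
    re.IGNORECASE,
)


def cites_line_numbers(diagnosis: str) -> bool:
    """Return True if the diagnosis mentions something that looks like a line reference."""
    return LINE_REF.search(diagnosis) is not None


def analyze_error(
    log: str,
    categorize: Callable[[str], str],
    root_cause: Callable[[str, str], str],
) -> dict:
    """Route categorization to a cheap model and root cause analysis to a frontier model."""
    category = categorize(log)             # cheap model: 'database error' vs 'auth error'
    diagnosis = root_cause(log, category)  # frontier model: root cause analysis
    return {
        "category": category,
        "diagnosis": diagnosis,
        # No line reference: flag for human review instead of acting on the fix.
        "needs_review": not cites_line_numbers(diagnosis),
    }


# Stub callables standing in for real model calls (illustrative only).
def fake_categorize(log: str) -> str:
    return "database error"


def fake_root_cause(log: str, category: str) -> str:
    return "Connection pool exhausted; db/pool.py:118 skips release() on timeout."


result = analyze_error("TimeoutError: pool exhausted", fake_categorize, fake_root_cause)
print(result["category"], result["needs_review"])
```

Passing the model calls in as plain callables keeps the routing logic testable without network access; the `needs_review` flag only automates the quality signature from the text and does not confirm that a cited line is actually the cause.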
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T08:50:00.547333+00:00— report_created — created