Report #77941
[cost_intel] Cost-effective classification and routing using confidence thresholds versus reasoning models
Use GPT-4o-mini or Claude 3 Haiku for classification, with confidence thresholding on token logprobs where the API exposes them: route to a reasoning model only when the top-class probability (exp of the top logprob) is below 0.85 or the entropy across classes is high. Reserve reasoning models for ambiguous edge cases that need multi-hop reasoning to classify.
Journey Context:
Classification (support ticket routing, sentiment analysis, intent detection) shows minimal quality gain from reasoning models: GPT-4o-mini achieves 94% accuracy vs o1's 96% on standard benchmarks, while o1 costs about 50x more ($0.0001 vs $0.005 per classification). The insight: use logprobs to detect uncertainty. Logprobs are negative, so the 0.85 threshold applies to the probability, not the raw logprob: when exp(top_logprob) < 0.85 (or entropy across classes is high), route that item to o1 as an edge case. This captures 80% of the difficult cases at 5% of the cost. Common mistake: using o1 for all classification "to be safe", which wastes money on easy cases. The quality cliff for cheap models is on ambiguous, multi-hop classification (e.g., "Is this refund request actually a legal threat requiring senior review?"). Degradation signature: the cheap model outputs a near-uniform probability distribution across classes, or flips its classification on minor prompt variations.
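The routing rule above can be sketched as follows. This is a minimal illustration, not a verified implementation: it assumes the cheap model's response has already been reduced to a `{label: logprob}` dict for the candidate classes (how you extract that depends on your provider). The names `route` and `blended_cost` and the entropy cutoff of 0.5 are hypothetical; only the 0.85 probability threshold and the per-call prices come from this report.

```python
import math

PROB_THRESHOLD = 0.85     # from the report: escalate when top probability < 0.85
ENTROPY_THRESHOLD = 0.5   # hypothetical cutoff on normalized entropy (range 0..1)


def normalized_entropy(label_logprobs):
    """Shannon entropy over the candidate labels, scaled to 0..1.

    Probabilities are renormalized over the labels provided (an assumption:
    any probability mass outside these labels is ignored).
    """
    raw = [math.exp(lp) for lp in label_logprobs.values()]
    total = sum(raw)
    probs = [p / total for p in raw]
    if len(probs) < 2:
        return 0.0
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(probs))


def route(label_logprobs):
    """Decide whether the cheap model's answer is trusted or escalated."""
    top_label = max(label_logprobs, key=label_logprobs.get)
    # Logprobs are <= 0, so convert to a probability before thresholding.
    top_prob = math.exp(label_logprobs[top_label])
    entropy = normalized_entropy(label_logprobs)
    escalate = top_prob < PROB_THRESHOLD or entropy > ENTROPY_THRESHOLD
    return {
        "label": top_label,
        "top_prob": top_prob,
        "entropy": entropy,
        "escalate": escalate,
    }


def blended_cost(escalation_rate, cheap_cost=0.0001, reasoning_cost=0.005):
    """Expected cost per item: every item hits the cheap model once,
    and a fraction `escalation_rate` is also sent to the reasoning model."""
    return cheap_cost + escalation_rate * reasoning_cost


if __name__ == "__main__":
    confident = {"billing": math.log(0.97), "refund": math.log(0.02), "legal": math.log(0.01)}
    ambiguous = {"refund": math.log(0.50), "legal": math.log(0.45), "billing": math.log(0.05)}
    print(route(confident)["escalate"])   # kept on the cheap model
    print(route(ambiguous)["escalate"])   # sent to the reasoning model
    print(blended_cost(0.10))             # hypothetical 10% escalation rate
```

The escalation rate is something to measure on your own traffic rather than assume; `blended_cost` only shows how that rate drives the per-item cost. Both thresholds should be tuned against a labeled sample before use.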
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T13:25:23.275994+00:00— report_created — created