Report #66599
[cost_intel] GPT-4o-mini cost savings evaporating on complex multi-hop reasoning tasks
Implement a 'complexity router': before the main call, use the cheap model (GPT-4o-mini) to classify the task's complexity on a 1-5 scale using 200 tokens. For complexity >=4 (multi-hop reasoning, plans with more than 5 steps, ambiguous constraints), route to the full model (GPT-4o). For complexity <=2, use mini. Complexity 3 is covered by neither rule, so pick its route explicitly. The classification call adds 0.2s latency but saves 60-80% on costs while maintaining 95%+ quality on complex tasks. Monitor the 'falloff rate': if the cheap model's output fails validation more than 15% of the time, too many hard tasks are reaching mini, and the score at which you escalate to the full model should be lowered.
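The routing rule and the falloff check can be sketched roughly as below. This is a minimal sketch under stated assumptions, not a definitive implementation: `route`, `choose_model` and `FalloffMonitor` are hypothetical names, the classifier is passed in as a plain callable so that no particular client library is assumed, and sending a score of 3 to the full model is a conservative assumption because only scores >=4 and <=2 are specified.

```python
from dataclasses import dataclass
from typing import Callable

CHEAP_MODEL = "gpt-4o-mini"
FULL_MODEL = "gpt-4o"


def route(complexity: int, middle_model: str = FULL_MODEL) -> str:
    """Map a 1-5 complexity score to a model name."""
    if not 1 <= complexity <= 5:
        raise ValueError(f"complexity must be in 1..5, got {complexity}")
    if complexity >= 4:
        return FULL_MODEL   # multi-hop reasoning, long plans, ambiguous constraints
    if complexity <= 2:
        return CHEAP_MODEL
    # Score 3 is not covered by the rule above; defaulting to the
    # full model is an assumption made for this sketch.
    return middle_model


def choose_model(prompt: str, classify: Callable[[str], int]) -> str:
    """Run the (caller-supplied) cheap classifier, then route on its score."""
    return route(classify(prompt))


@dataclass
class FalloffMonitor:
    """Tracks how often the cheap model's output fails validation."""
    limit: float = 0.15   # 15% falloff limit
    attempts: int = 0
    failures: int = 0

    def record(self, passed: bool) -> None:
        self.attempts += 1
        if not passed:
            self.failures += 1

    @property
    def rate(self) -> float:
        return self.failures / self.attempts if self.attempts else 0.0

    def should_escalate_more(self) -> bool:
        """True when the falloff rate exceeds the limit."""
        return self.rate > self.limit
```

In use, `classify` would wrap the short classification call to mini, and `FalloffMonitor.record` would be called with the validation result of every response that mini produced.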
Journey Context:
GPT-4o-mini is roughly 17x cheaper than GPT-4o ($0.15 vs $2.50 per 1M input tokens). The trap is assuming it works for all tasks and only checking simple accuracy metrics. In production, mini fails catastrophically on specific task types: (1) Multi-hop reasoning (e.g., 'find all users who posted in March AND haven't logged in since'), (2) Constraint satisfaction with >5 variables, (3) Following complex system prompt instructions with multiple conditional branches. The failure isn't gradual degradation - it's cliff-like: accuracy drops from 95% to 40% with no intermediate range. The cost 'savings' evaporate when you have to retry 3 times with the expensive model anyway. The signature is high variance in output quality for seemingly similar prompts, specifically when the prompt involves 'and', 'or', 'except' logic across multiple entities.
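To see where the savings disappear, a rough back-of-the-envelope model helps. The sketch below is an illustration under simplifying assumptions, not a billing calculation: it counts input tokens only, treats every attempt as the same size, and assumes each mini failure triggers a fixed number of retries on the full model. The function names are hypothetical.

```python
MINI_PRICE = 0.15   # USD per 1M input tokens (figure from this report)
FULL_PRICE = 2.50   # USD per 1M input tokens (figure from this report)


def expected_cost(failure_rate: float, retries: int = 3) -> float:
    """Expected USD per 1M input tokens when mini is tried first and
    every failure is retried `retries` times on the full model."""
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    return MINI_PRICE + failure_rate * retries * FULL_PRICE


def break_even_failure_rate(retries: int = 3) -> float:
    """Failure rate above which mini-first costs more than
    sending everything to the full model once."""
    return (FULL_PRICE - MINI_PRICE) / (retries * FULL_PRICE)


if __name__ == "__main__":
    for rate in (0.05, 0.60):
        print(f"failure rate {rate:.0%}: ${expected_cost(rate):.3f} per 1M tokens")
    print(f"break-even failure rate: {break_even_failure_rate():.1%}")
```

Under these assumptions, a 5% failure rate costs about $0.53 per 1M input tokens, well below the $2.50 of calling GPT-4o directly, while a 60% failure rate (the 40% accuracy case) costs about $4.65, nearly double the direct price. With 3 retries the break-even point sits at a failure rate of roughly 31%.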
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T18:15:54.586773+00:00— report_created — created