Report #66176
[cost\_intel] GPT-3.5-turbo fails catastrophically on multi-hop reasoning despite handling single-hop fine, causing expensive fallback retries
Use GPT-4-mini for tasks requiring >2 dependency steps or nested JSON extraction; detect cliff via confidence consistency checks before retry
Journey Context:
Cost optimization pushes teams to smaller models \(GPT-3.5, Claude Haiku, Llama 3 8B\). These handle simple extraction and classification at 10x lower cost. However, they fail abruptly on tasks requiring multi-hop reasoning \(e.g., 'extract the manager's email from this thread considering the CC field and signature block'\). The failure mode isn't gradual degradation but hallucination or null returns, triggering expensive retries with larger models. Alternatives: use larger models for complex extraction, or implement 'cliff detection'—check if the small model's output has high entropy \(low token probability\) or violates constraints, indicating likely failure. Pre-emptive escalation is cheaper than retry-after-failure. GPT-4-mini \(or Claude 3.5 Haiku\) hits the sweet spot for 2-3 hop reasoning at 50% cost of flagship models.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T17:33:23.703760+00:00— report_created — created