Report #84312
[cost_intel] Llama 3.1 8B fails silently on multi-hop reasoning tasks; the resulting retry cascades cost up to 5x more than using 70B directly
Route to 70B+ models when a task has >2 retrieval steps or requires numerical reasoning; use 8B only for single-hop classification or extraction with <500-token outputs
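A minimal sketch of that routing rule, assuming the caller can describe each task before it is sent. `TaskProfile`, `choose_model`, and the model identifier strings are hypothetical names introduced for illustration, not part of any provider's API. Defaulting every other task to 70B is an assumption read from "use 8B only for".

```python
from dataclasses import dataclass

# Placeholder model identifiers (assumption: substitute your provider's names).
MODEL_8B = "llama-3.1-8b"
MODEL_70B = "llama-3.1-70b"

# Task types the report considers safe for 8B.
SIMPLE_TASK_TYPES = {"classification", "extraction"}


@dataclass(frozen=True)
class TaskProfile:
    """Hypothetical description of a task, filled in by the caller."""
    task_type: str                  # e.g. "classification", "extraction", "qa"
    retrieval_steps: int            # number of retrieval hops the task needs
    needs_numerical_reasoning: bool
    expected_output_tokens: int


def choose_model(task: TaskProfile) -> str:
    """Pick a model from task capabilities, before any call is made."""
    # Rule 1: multi-hop or numerical tasks go straight to 70B.
    if task.retrieval_steps > 2 or task.needs_numerical_reasoning:
        return MODEL_70B
    # Rule 2: 8B only for single-hop classification/extraction with short outputs.
    is_simple = (
        task.task_type in SIMPLE_TASK_TYPES
        and task.retrieval_steps <= 1
        and task.expected_output_tokens < 500
    )
    # Anything not explicitly cleared for 8B falls back to 70B (conservative default).
    return MODEL_8B if is_simple else MODEL_70B
```

The decision depends only on the task description, so it does not rely on the 8B model signalling its own failure.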
Journey Context:
Llama 3.1 70B costs roughly 10x as much as Llama 3.1 8B on Groq, Together, and DeepInfra ($1.00 vs $0.10 per 1M tokens). Many production systems therefore use a 'router' pattern: try the cheap 8B model first and fall back to 70B when a quality gate detects a failure. The trap is that 8B models do not fail explicitly on complex tasks; they produce plausible but wrong answers on multi-hop reasoning (e.g., 'compare Q3 revenue across three documents'), so a failure-triggered fallback is rarely invoked. The wrong answers instead cause downstream error-correction loops or human review, which effectively costs 3-5x the 70B price while delivering worse latency. The quality cliff is sharp: 8B handles single-hop RAG (retrieve one chunk, answer) fine but fails once a task needs more than 2 logical steps. The fix is capability-based routing rather than failure-based routing. Check the task type up front: if it involves >2 retrieval steps or arithmetic on >3 numbers, route directly to 70B. Over-provisioning 70B costs less than absorbing the 8B failure modes.
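The cost argument can be sketched with a small expected-cost model. Only the two per-token prices come from the report. The silent failure rate, the number of correction passes, and the assumption that each correction pass is priced like one 70B call over the same tokens are hypothetical inputs chosen for illustration. Human review is not token-priced, so treat this as a rough lower-bound sketch rather than a measurement.

```python
# Prices from the report, in USD per 1M tokens.
PRICE_8B_PER_M = 0.10
PRICE_70B_PER_M = 1.00


def direct_70b_cost(tokens: int) -> float:
    """Cost of sending the task straight to 70B."""
    return tokens / 1_000_000 * PRICE_70B_PER_M


def cascade_cost(tokens: int, silent_failure_rate: float,
                 correction_passes: float) -> float:
    """Expected cost of trying 8B first when wrong answers go undetected.

    Assumption: each silently wrong answer later triggers `correction_passes`
    rework passes, each priced like one 70B call over the same tokens.
    """
    if not 0.0 <= silent_failure_rate <= 1.0:
        raise ValueError("silent_failure_rate must be between 0 and 1")
    first_try = tokens / 1_000_000 * PRICE_8B_PER_M
    rework = silent_failure_rate * correction_passes * direct_70b_cost(tokens)
    return first_try + rework


def cost_ratio(tokens: int, silent_failure_rate: float,
               correction_passes: float) -> float:
    """Cascade cost relative to calling 70B directly (1.0 means break-even)."""
    return (cascade_cost(tokens, silent_failure_rate, correction_passes)
            / direct_70b_cost(tokens))
```

With a hypothetical 60% silent failure rate and five correction passes, `cost_ratio(1_000_000, 0.6, 5)` comes out near 3.1, inside the 3-5x range the report describes. With no silent failures the ratio drops to about 0.1, which is why the cascade looks attractive on single-hop tasks.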
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T00:06:40.964532+00:00— report_created — created