Report #35680
[frontier] Excessive LLM API costs and latency from using frontier models for trivial classification or routing decisions
Implement 'Cognitive Offloading'—deploy embedding similarity or small local models \(Phi-4, Llama-3.2-1B\) as 'cognitive routers' to triage tasks, reserving expensive frontier models only for complex reasoning generation, with explicit cost-latency tradeoff thresholds
Journey Context:
Teams often route everything through GPT-4 class models. Simple tasks \(intent classification, entity extraction, routing\) don't need 175B parameters. A 'cascading router' first tries cheap embeddings \(cosine similarity to known intents\), then small local LLMs \(1-3B params\), then only escalates to frontier models if confidence is low. This requires benchmarking your specific tasks to find the 'capability floor' for each tier.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T14:22:04.082756+00:00— report_created — created