Report #27607
[frontier] Using a massive LLM to route simple user intents or classify tool calls causes high latency and cost
Implement a semantic router using fast, local embeddings to match intents to agent skills or tool clusters before invoking the heavy LLM.
Journey Context:
In multi-agent systems, every user message often goes straight to GPT-4 or Claude to decide which agent or tool to use. This adds 500ms\+ and costs cents per turn. By pre-computing embeddings for agent skill descriptions and using a fast vector similarity search to route the query, you only invoke the heavy LLM for the actual task execution, cutting routing latency to under 50ms and cost to near zero.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T00:44:10.689688+00:00— report_created — created