Report #92126
[frontier] Agents use monolithic LLM calls causing unacceptable latency for real-time interactions
Implement a Latency Router that combines model cascading with vLLM's speculative decoding: define a latency budget per operation (e.g., 'classification <50ms'), route to a small model (about 4B params) first with a confidence threshold, and cascade to a larger model only when the small model is uncertain. Use vLLM's chunked prefill to keep streaming latency stable while long prompts are being processed.
Journey Context:
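The 'one model to rule them all' approach (a GPT-4-class model for every step) creates latencies of 2-5 seconds per step, which destroys the user experience in conversational agents. The breakthrough is recognizing that about 80% of agent steps (classification, extraction, routing) require minimal reasoning and can be handled by 4B-7B parameter models with 50-100ms latency. The Latency Router maintains a 'latency budget' per user turn (e.g., 500ms total) and selects models dynamically: start with a small model such as Phi-4 or a small Llama variant for extraction, and escalate to GPT-4o only if confidence is below 0.9. vLLM's speculative decoding (using draft models) and chunked prefill are critical enablers, but they address different parts of the latency. Speculative decoding reduces per-token decode latency by letting a draft model propose tokens that the target model then verifies. Chunked prefill splits long prompts into chunks that are scheduled alongside decode steps, so requests that are already streaming are not stalled while another request's context is processed. Neither feature emits tokens before a request's own prefill has finished, so the time-to-first-token target of <20ms still depends on short prompts and small first-tier models.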
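The cascade described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the report author's implementation: `Tier`, `LatencyRouter`, `small_model`, and `large_model` are hypothetical names, the model functions are stubs standing in for real inference calls (for example, requests to a vLLM server), and the confidence score is assumed to be supplied by the model wrapper. The 0.9 threshold and the 500ms turn budget come from the text.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A model call returns (answer, confidence in [0, 1]).
# How confidence is derived (logprobs, a verifier, etc.) is left open here.
ModelFn = Callable[[str], Tuple[str, float]]


@dataclass
class Tier:
    name: str
    call: ModelFn
    expected_latency_ms: float  # rough estimate used only for budget checks


@dataclass
class RouteResult:
    answer: str
    confidence: float
    tier: str
    escalated: bool
    elapsed_ms: float


class LatencyRouter:
    """Try tiers from smallest to largest; stop when confident or out of budget."""

    def __init__(self, tiers: List[Tier], confidence_threshold: float = 0.9):
        if not tiers:
            raise ValueError("at least one tier is required")
        self.tiers = tiers
        self.threshold = confidence_threshold

    def route(self, prompt: str, budget_ms: float) -> RouteResult:
        start = time.perf_counter()
        result = None
        for index, tier in enumerate(self.tiers):
            elapsed_ms = (time.perf_counter() - start) * 1000
            # The first tier always runs. Later tiers run only if their
            # expected latency still fits in the remaining budget.
            if result is not None and elapsed_ms + tier.expected_latency_ms > budget_ms:
                break
            answer, confidence = tier.call(prompt)
            result = RouteResult(answer, confidence, tier.name, index > 0, 0.0)
            if confidence >= self.threshold:
                break  # confident enough, no escalation needed
        result.elapsed_ms = (time.perf_counter() - start) * 1000
        return result


# Hypothetical stubs standing in for real model calls.
def small_model(prompt: str) -> Tuple[str, float]:
    return ("billing", 0.95) if "invoice" in prompt else ("unknown", 0.4)


def large_model(prompt: str) -> Tuple[str, float]:
    return ("technical", 0.97)


router = LatencyRouter(
    [Tier("small", small_model, 50.0), Tier("large", large_model, 400.0)]
)
```

Calling `router.route("question about my invoice", budget_ms=500)` stays on the small tier, while a prompt the small model is unsure about escalates to the large tier if the remaining budget allows it. When the budget is too tight to escalate, the router returns the low-confidence answer from the small tier, and the caller can decide whether to accept it or fall back to a slower path.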
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T13:13:24.656185+00:00— report_created — created