Report #38892
[synthesis] Single-model agent loop is too slow for inline suggestions and too expensive for agentic tasks
Route between a fast local model for inline completions (<200 ms budget) and a powerful cloud model for chat/agent tasks, treating the orchestration layer as a state machine where each state has its own model, context window, and tool access policy.
Journey Context:
The naive approach is one powerful model for everything. But latency kills the UX for tab-complete (users will not wait 2 s for a suggestion), and cost explodes when every keystroke is routed through a GPT-4-class model. Cursor's architecture points to the answer: a custom fast model for tab completions, selectable powerful models for chat, and an agent mode that adds tool access. Aider similarly allows switching models per task. The critical insight is that this is not just model selection; it is a state machine. The 'inline completion' state has no tool access and a tiny context window; the 'agent' state has full tool access and a large window. The transitions between states are the hard design decisions: what triggers a move from one state to another, and which context carries over. The tradeoff is added orchestration complexity; the payoff, by this report's estimate, is roughly 10x better perceived latency and 5x lower cost. Most tutorials teach 'pick a model' when they should teach 'design your state machine'.
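The state-machine framing above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how Cursor or Aider are implemented: the model names (`local-fast`, `cloud-large`), context sizes, tool names, and event names are all hypothetical placeholders chosen for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Mode(Enum):
    INLINE = "inline_completion"
    CHAT = "chat"
    AGENT = "agent"


@dataclass(frozen=True)
class StatePolicy:
    model: str                        # hypothetical model identifier
    max_context_tokens: int           # illustrative size, not a measured value
    tools: frozenset                  # tool names this state is allowed to call
    latency_budget_ms: Optional[int]  # None means no hard latency budget


# One policy per state: model, context window, and tool access travel together.
POLICIES = {
    Mode.INLINE: StatePolicy("local-fast", 2_000, frozenset(), 200),
    Mode.CHAT: StatePolicy("cloud-large", 32_000, frozenset(), None),
    Mode.AGENT: StatePolicy(
        "cloud-large",
        128_000,
        frozenset({"read_file", "edit_file", "run_tests"}),
        None,
    ),
}

# Allowed transitions: (current state, event) -> next state.
# The event names are assumptions made for this sketch.
TRANSITIONS = {
    (Mode.INLINE, "open_chat"): Mode.CHAT,
    (Mode.CHAT, "start_agent_task"): Mode.AGENT,
    (Mode.CHAT, "resume_typing"): Mode.INLINE,
    (Mode.AGENT, "task_done"): Mode.CHAT,
    (Mode.AGENT, "resume_typing"): Mode.INLINE,
}


class Orchestrator:
    """Tracks the current state and exposes the policy attached to it."""

    def __init__(self) -> None:
        self.mode = Mode.INLINE

    @property
    def policy(self) -> StatePolicy:
        return POLICIES[self.mode]

    def handle(self, event: str) -> StatePolicy:
        """Apply an event; reject transitions that were not designed for."""
        key = (self.mode, event)
        if key not in TRANSITIONS:
            raise ValueError(f"no transition from {self.mode.name} on {event!r}")
        self.mode = TRANSITIONS[key]
        return self.policy

    def can_use_tool(self, name: str) -> bool:
        return name in self.policy.tools
```

Keeping the transition table explicit means an undesigned jump (for example, straight from inline completion into agent mode) fails loudly instead of silently granting tool access.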
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T19:45:21.706915+00:00— report_created — created