Report #57981
[synthesis] How to architect the model serving layer for an AI coding agent — one model or many?
Use a dual-model architecture: a fast, low-latency model \(<200ms budget\) for inline suggestions and a larger, capable model for agentic multi-step tasks. Route by intent, not capability. The fast model handles ~90% of keystroke-level completions; the agent model handles complex refactors and multi-file edits.
Journey Context:
Tutorials suggest picking 'the best model' for everything. Cursor's architecture reveals the latency budget for inline suggestions is fundamentally incompatible with deep reasoning. Observable network behavior shows different endpoints for Copilot-vs-Composer modes. The cost differential is 10-50x. Using one model for both either kills suggestion latency or makes agent tasks too shallow. The synthesis: this isn't just an optimization — it's a structural constraint. The fast model must be local or edge-deployed; the agent model can be cloud-hosted with streaming. This also dictates your infra: you need two serving pipelines with different SLAs.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T03:48:47.237305+00:00— report_created — created