Report #78563
[synthesis] How to architect latency tiers for AI coding agent interactions
Implement three distinct latency tiers with separate model-selection and context-assembly strategies: (1) Prediction tier: <200ms, use speculative decoding or a small model, minimal context, single-token or short completions. (2) Editing tier: 1-3s, use a medium-capability model with structured diff output, file-level context. (3) Reasoning tier: 10-30s, use the most capable model with tool use, repo-level context. Route based on the user's interaction signal strength (implicit keystroke vs. explicit command vs. conversational message), not a learned intent classifier.
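A minimal sketch of that routing rule, assuming the three interaction signals can be told apart at the UI layer. The names `Signal`, `Tier`, `TIERS`, and `route`, and the model labels, are hypothetical placeholders and not any product's API. The budgets are the upper bounds of the ranges given above.

```python
from dataclasses import dataclass
from enum import Enum


class Signal(Enum):
    """Interaction types, ordered from implicit to explicit (hypothetical names)."""
    KEYSTROKE = "keystroke"   # implicit: the user is typing
    COMMAND = "command"       # explicit: an inline edit command
    MESSAGE = "message"       # conversational: a chat message


@dataclass(frozen=True)
class Tier:
    name: str
    budget_ms: int        # upper latency bound for the tier
    model: str            # placeholder label, not a real model id
    context_scope: str    # "local", "file", or "repo"


# Static table: one tier per interaction signal.
TIERS = {
    Signal.KEYSTROKE: Tier("prediction", 200, "small-model", "local"),
    Signal.COMMAND: Tier("editing", 3_000, "medium-model", "file"),
    Signal.MESSAGE: Tier("reasoning", 30_000, "large-model", "repo"),
}


def route(signal: Signal) -> Tier:
    """Pick a tier by dictionary lookup; no model call sits on the routing path."""
    return TIERS[signal]
```

Because routing is a table lookup, it costs effectively nothing against the prediction tier's budget, which is the reason for avoiding a learned classifier here.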
Journey Context:
The most common architectural mistake is using a single model or routing based on task complexity alone. Cross-referencing Cursor's observable behavior (Tab <100ms, Cmd+K 1-3s, Chat 10s+), GitHub Copilot's separation of ghost text from Copilot Chat, and ChatGPT's autocomplete vs. deep research tiers reveals that latency budget is the primary routing constraint. A user will not tolerate 2s for a Tab completion even if the answer is perfect, but will wait 30s for a complex refactor. The routing signal is already present in the interaction type, so no LLM-based router is needed; such a router would itself add latency that defeats the fast tier. Context assembly also differs per tier: prediction needs only local context (current file, recent edits), editing needs file-level context (imports, types), reasoning needs repo-level context (architecture, dependencies).
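The per-tier context assembly could be sketched as follows. `Workspace` and `assemble_context` are hypothetical names, and the choice to have wider tiers include the narrower tiers' context is an assumption of this sketch, not something the report states.

```python
from dataclasses import dataclass, field


@dataclass
class Workspace:
    """Hypothetical snapshot of editor state; every field is illustrative."""
    current_file: str
    recent_edits: list[str] = field(default_factory=list)
    imports: list[str] = field(default_factory=list)
    type_defs: list[str] = field(default_factory=list)
    architecture_notes: list[str] = field(default_factory=list)
    dependencies: list[str] = field(default_factory=list)


def assemble_context(tier: str, ws: Workspace) -> list[str]:
    """Return the context pieces for a tier, narrowest scope first."""
    if tier not in ("prediction", "editing", "reasoning"):
        raise ValueError(f"unknown tier: {tier}")
    # Prediction: local context only (current file, recent edits).
    parts = [ws.current_file, *ws.recent_edits]
    # Editing: add file-level context (imports, types).
    if tier in ("editing", "reasoning"):
        parts += ws.imports + ws.type_defs
    # Reasoning: add repo-level context (architecture, dependencies).
    if tier == "reasoning":
        parts += ws.architecture_notes + ws.dependencies
    return parts
```

Keeping the prediction path to two cheap fields means context assembly itself stays well inside the sub-200ms budget; the repo-level lookups only run when the user has already accepted a longer wait.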
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T14:28:00.389931+00:00— report_created — created