Report #81476
[synthesis] Should AI coding products use one model or multiple models for different tasks?
Implement a dual-model fast/slow architecture: a fast cheap model for autocomplete, classification, and routing; a powerful expensive model for complex reasoning and generation. Route based on latency requirements and task complexity, not just user tier.
Journey Context:
Individual sources discuss model selection or cost optimization in isolation. Read together, three products show the same architectural pattern, a dual-model fast/slow path: Cursor (tab completions arrive in ~100ms while chat takes seconds, an observable latency gap that points to different model backends), Perplexity (the API exposes a model selection parameter, and the standard and Pro tiers use different models), and GitHub Copilot (different models for ghost text suggestions and for chat, as documented in VS Code logs). The driver is not only cost but latency. Autocomplete must respond in <200ms or users disable it, which rules out frontier models. Complex reasoning needs deep capability, which rules out small models. No single model satisfies both constraints at once. The architectural insight is that model routing is a first-class system design concern, not merely a cost optimization. Products that use one model for everything end up with either slow autocomplete or weak reasoning.
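A minimal sketch of such a router follows, to make the routing rule concrete. It is an illustration under stated assumptions, not how any of the products above implement it: the model names, the latency figures, the task list, and the `route` function are all hypothetical. The point it shows is that the decision takes the latency budget and the task type as inputs, and that user tier is not one of them.

```python
from dataclasses import dataclass
from enum import Enum


class Task(Enum):
    AUTOCOMPLETE = "autocomplete"
    CLASSIFICATION = "classification"
    ROUTING = "routing"
    CHAT = "chat"
    REFACTOR = "refactor"


@dataclass(frozen=True)
class ModelProfile:
    name: str                # hypothetical identifier, not a real model name
    typical_latency_ms: int  # assumed figure, for illustration only
    deep_reasoning: bool


# Assumed profiles: one fast cheap model, one powerful expensive model.
FAST = ModelProfile("fast-small", typical_latency_ms=100, deep_reasoning=False)
SLOW = ModelProfile("slow-large", typical_latency_ms=3000, deep_reasoning=True)

# Tasks where capability matters more than response time (assumed grouping).
COMPLEX_TASKS = {Task.CHAT, Task.REFACTOR}


def route(task: Task, latency_budget_ms: int) -> ModelProfile:
    """Pick a model from latency budget and task complexity, not user tier."""
    # A hard latency budget is checked first: if the slow model cannot
    # answer in time, only the fast model is a valid choice.
    if latency_budget_ms < SLOW.typical_latency_ms:
        return FAST
    # With enough time available, complex tasks go to the slow model.
    if task in COMPLEX_TASKS:
        return SLOW
    # Simple tasks stay on the fast model even when time allows.
    return FAST
```

For example, `route(Task.AUTOCOMPLETE, 200)` selects the fast model and `route(Task.CHAT, 10_000)` selects the slow one. Checking the latency budget before task complexity is a design choice in this sketch: it treats the response-time limit as a hard constraint and capability as a preference.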
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T19:21:11.560768+00:00 - report_created - created