Report #37949
[synthesis] Should an AI product use one model or multiple models for different tasks?
Route to the cheapest model that satisfies the user's implicit quality requirement. Use a fast, cheap model (e.g., Haiku, GPT-4o-mini) for inline completions, suggestions, and autocomplete. Use a powerful model (e.g., Opus, GPT-4) for agentic loops, complex reasoning, and multi-step tasks. Never use the powerful model for tasks the cheap model handles well.
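A minimal sketch of the "cheapest model that satisfies the requirement" rule, assuming each task type can be mapped to a minimum quality tier. The model labels, cost figures, quality tiers, and the `pick_model` helper are all hypothetical placeholders for illustration, not real model IDs, prices, or benchmarks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOption:
    name: str             # hypothetical label, not a real model ID
    relative_cost: float  # illustrative unit cost, not real pricing
    quality: int          # illustrative tier: higher means more capable


# Hypothetical catalogue; the numbers are placeholders, not measurements.
CATALOGUE = [
    ModelOption("fast-cheap", relative_cost=1.0, quality=1),
    ModelOption("powerful", relative_cost=10.0, quality=3),
]

# Assumed minimum quality tier for each task type named in the report.
REQUIRED_QUALITY = {
    "inline_completion": 1,
    "suggestion": 1,
    "autocomplete": 1,
    "agentic_loop": 3,
    "complex_reasoning": 3,
    "multi_step_task": 3,
}


def pick_model(task_type: str, catalogue=CATALOGUE) -> ModelOption:
    """Return the cheapest model whose quality tier meets the task's requirement."""
    required = REQUIRED_QUALITY[task_type]
    eligible = [m for m in catalogue if m.quality >= required]
    if not eligible:
        raise ValueError(f"no model satisfies quality tier {required}")
    # Among the models that are good enough, take the cheapest one.
    return min(eligible, key=lambda m: m.relative_cost)


if __name__ == "__main__":
    print(pick_model("autocomplete").name)
    print(pick_model("agentic_loop").name)
```

Filtering on quality first and minimizing cost second means the powerful model is selected only when no cheaper option is adequate, which is one way to encode the last sentence of the recommendation.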
Journey Context:
Single-model architectures are simpler but economically unviable at scale. Cursor's architecture shows the pattern clearly: it uses a custom fast model for inline completions (sub-100ms latency required) and GPT-4/Claude for Composer (multi-second agentic tasks). Perplexity routes between models based on Pro vs. standard tier and on query complexity. Replit uses different models for code completion and for agent tasks. The insight is that user expectations for latency and quality differ by task type: inline completion must be fast (the user is waiting keystroke by keystroke), while agent tasks can take 10-30 seconds (the user has already delegated and is reading the plan). Routing incorrectly, for example by using the powerful model for completions, burns through token budgets 10x faster with no quality improvement the user notices. The routing boundary is: if the user is watching in real time, optimize for speed; if the user has delegated and is waiting, optimize for quality.
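The routing boundary described above can be sketched as a single branch on interaction mode. This is an illustrative sketch only: the `Interaction` enum, the `route` function, and the tier names are hypothetical, and the latency budgets simply reuse the figures quoted in this report (sub-100ms for inline completion, up to 30 seconds for delegated agent tasks).

```python
from enum import Enum


class Interaction(Enum):
    REAL_TIME = "real_time"  # user is watching keystroke by keystroke
    DELEGATED = "delegated"  # user has handed the task off and is waiting


def route(interaction: Interaction) -> dict:
    """Pick a model tier and latency budget from the interaction mode."""
    if interaction is Interaction.REAL_TIME:
        # The user is watching in real time: optimize for speed.
        return {"tier": "fast", "optimize_for": "speed", "latency_budget_s": 0.1}
    # The user has delegated and is waiting: optimize for quality.
    return {"tier": "powerful", "optimize_for": "quality", "latency_budget_s": 30.0}


if __name__ == "__main__":
    print(route(Interaction.REAL_TIME))
    print(route(Interaction.DELEGATED))
```

In a real product the interaction mode would presumably be derived from the feature that issued the request (an editor completion hook versus an agent panel, say) rather than passed in by hand.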
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T18:10:44.456358+00:00— report_created — created