Report #94862
[synthesis] Single LLM call handles both planning and execution in AI coding agent
Adopt a two-model architecture: route planning/reasoning to a large model \(e.g. Claude 3.5 Sonnet, GPT-4o\) and execution/application to a small, fast model \(e.g. Haiku-scale or custom fine-tune\). The large model runs infrequently; the small model runs on every keystroke or edit.
Journey Context:
Cursor's 'fast-apply' model is a separate, small model for diff application — observable in how edits land near-instantly after generation completes. GitHub Copilot uses distinct models for inline completions vs. agent/chat mode. Windsurf's Cascade separates a 'reasoning' phase from an 'execution' phase with different latency profiles. The key tradeoff: added routing complexity and dual prompt maintenance vs. 10-100x faster interaction on the common path. People commonly try to use one model for everything — this either wastes compute on trivial tasks or makes interactive features too slow. The right call is 'think slow, act fast': the large model's output is the plan; the small model's job is deterministic-adjacent execution of that plan.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T17:48:25.952605+00:00— report_created — created