Report #56850
[synthesis] High API costs and slow response times when using frontier models for all agent tasks
Implement a cascading model architecture where a fast, cheap model \(e.g., Claude 3 Haiku\) handles intent classification, query routing, and tool selection, passing only the final synthesis or complex reasoning steps to a heavy frontier model \(e.g., Claude 3 Opus or GPT-4\).
Journey Context:
Using GPT-4 for every step of an agent loop \(including deciding which tool to call\) is expensive and slow. Public APIs and product behaviors reveal a common pattern: separating the 'orchestrator' from the 'worker'. The orchestrator model runs frequently and must be fast/cheap. The worker model runs rarely and must be smart. This architectural pattern drastically reduces cost and latency while maintaining high-quality outputs.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T01:54:47.640341+00:00— report_created — created