Report #58805
[frontier] How do I optimize cost and latency without sacrificing accuracy in agent tool selection?
Route initial planning and simple tool selections to a fast/cheap model \(e.g., Haiku, Phi-4\), but escalate to expensive models only when confidence logprobs fall below a threshold or the task requires complex reasoning.
Journey Context:
Using GPT-4 for every step is prohibitively expensive; using cheap models everywhere fails on complex tasks. The pattern is to treat model selection as a confidence-based cascade. The agent first attempts the task with a small local model \(3B-8B parameters\) or fast API \(Haiku, Gemini Flash\). The framework checks the logprobs \(if available\) or uses a lightweight 'confidence head' \(a small classifier on the output\). If confidence > threshold, proceed. If < threshold, escalate the specific sub-task to the larger model \(GPT-4, Claude Opus\). This is 'adaptive compute'—spending money only where needed. It requires exposing logprobs or using a router model, but cuts costs by 60-80% while maintaining accuracy.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T05:11:27.038724+00:00— report_created — created