Report #36147
[cost_intel] Using GPT-4-Turbo for all IDE code suggestions: 800 ms latency and $0.02 per suggestion; at 1M suggestions/day (about $20,000/day) this breaks the budget
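Implement speculative-decoding-style routing (in practice a model cascade, not token-level speculative decoding): serve suggestions under 50 tokens from a local StarCoder2-3B, and validate with GPT-4o-mini only if the local model's confidence is below 0.9 or the context involves more than 3 files. Reported effect: cost cut by 90%, latency down to 150 ms.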
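A minimal sketch of the routing rule described above, for illustration only. The three thresholds come from the report; the `LocalSuggestion` container, the `route` function, and the way confidence is represented are hypothetical. The report does not say what happens to suggestions of 50 tokens or more, so this sketch assumes they are escalated to the remote model.

```python
from dataclasses import dataclass

# Thresholds quoted in the report.
MAX_LOCAL_TOKENS = 50    # local model serves suggestions shorter than this
MIN_CONFIDENCE = 0.9     # below this, ask the remote model to validate
MAX_CONTEXT_FILES = 3    # above this, ask the remote model to validate


@dataclass
class LocalSuggestion:
    """Hypothetical container for one local-model completion."""
    text: str
    token_count: int
    confidence: float  # assumed to lie in [0, 1]


def route(suggestion: LocalSuggestion, context_files: int) -> str:
    """Return "local" to serve the suggestion as-is, otherwise "remote"."""
    # Assumption (not stated in the report): long suggestions are escalated.
    if suggestion.token_count >= MAX_LOCAL_TOKENS:
        return "remote"
    if suggestion.confidence < MIN_CONFIDENCE:
        return "remote"
    if context_files > MAX_CONTEXT_FILES:
        return "remote"
    return "local"
```

How the confidence score is derived (for example from token probabilities) is left open here; whichever signal is used should be checked against real acceptance data before the 0.9 cutoff is trusted.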
Journey Context:
Small local models (3B parameters) handle routine completions (variable names, boilerplate) with a reported 95% accuracy at 50 ms. Frontier models are needed only for complex API usage or cross-file context. This routing approach sends 80% of requests to the local model and reserves API spend for the complex cases.
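The cost figures can be sanity-checked with a short calculation. This is a rough sketch: the request volume, the $0.02 baseline price, and the 80% local share come from the report, while the function names are hypothetical and the local model is assumed to cost nothing per request (hardware and energy are ignored).

```python
def daily_cost(requests_per_day: int, local_share: float,
               remote_cost: float, local_cost: float = 0.0) -> float:
    """Blended daily spend when a share of requests is served locally."""
    local_requests = requests_per_day * local_share
    remote_requests = requests_per_day - local_requests
    return local_requests * local_cost + remote_requests * remote_cost


def break_even_remote_cost(baseline_cost: float, target_saving: float,
                           local_share: float) -> float:
    """Per-request remote price that yields the target saving.

    Assumes the local model is free per request.
    """
    return baseline_cost * (1 - target_saving) / (1 - local_share)


if __name__ == "__main__":
    baseline = daily_cost(1_000_000, 0.0, 0.02)   # everything goes remote
    routed = daily_cost(1_000_000, 0.8, 0.02)     # 80% served locally
    print(f"baseline: ${baseline:,.0f}/day, routed: ${routed:,.0f}/day")
    print(f"remote price for a 90% saving: "
          f"${break_even_remote_cost(0.02, 0.9, 0.8):.4f}")
```

Under these assumptions, routing 80% of requests locally saves about 80% if escalated requests still cost $0.02 each. A 90% saving additionally requires the escalated requests to average about $0.01 or less, which is where the cheaper validation model comes in.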
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T15:09:14.215786+00:00— report_created — created