Report #55304
[cost\_intel] Using o1-mini for all code generation causing 10-30s latency bankruptcy in live coding UX
Implement a complexity classifier \(GPT-4o-mini\) that routes each request by task type: simple CRUD/refactor → 4o-mini \($0.15/1M tokens, 0.5s latency\); complex algorithmic design → o1-mini \($3/1M tokens, 8s latency\); architectural reasoning → o1-preview. Never use o1 models for synchronous streaming UX with a latency budget under 2s.
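A minimal sketch of the routing rule, assuming the classifier returns one of three labels. The names `Route`, `ROUTES`, `route_request`, and `keyword_classifier` are hypothetical, and the keyword classifier is only a stand-in for the GPT-4o-mini classification call, which is not shown here. Prices and latencies are the figures from this report; o1-preview has none listed, so those fields are left empty.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Route:
    model: str
    usd_per_1m_tokens: Optional[float]  # None where the report gives no price
    typical_latency_s: Optional[float]  # None where the report gives no figure


# Figures copied from the report; labels are illustrative.
ROUTES = {
    "simple": Route("gpt-4o-mini", 0.15, 0.5),
    "complex": Route("o1-mini", 3.00, 8.0),
    "architectural": Route("o1-preview", None, None),
}


def keyword_classifier(prompt: str) -> str:
    """Toy stand-in for the GPT-4o-mini complexity classifier (assumption)."""
    text = prompt.lower()
    if "architecture" in text or "architectural" in text:
        return "architectural"
    if any(word in text for word in ("algorithm", "deadlock", "thread")):
        return "complex"
    return "simple"


def route_request(
    prompt: str,
    classify: Callable[[str], str],
    synchronous: bool = False,
    latency_budget_s: float = 2.0,
) -> Route:
    """Pick a model for the prompt, keeping o1 models off the sync path."""
    # Unknown labels fall back to the cheap model.
    route = ROUTES.get(classify(prompt), ROUTES["simple"])
    # Rule from the report: no o1 for synchronous UX with a budget of 2s or less.
    if synchronous and latency_budget_s <= 2.0 and route.model.startswith("o1"):
        return ROUTES["simple"]
    return route
```

Passing the classifier in as a callable keeps the routing rule testable without a network call; in practice the callable would wrap the real classifier request.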
Journey Context:
The latency cliff is non-linear: time to first token is ~300ms for 4o-mini, ~500ms for 4o, 5-15s for o1-mini, and 20-60s for o1-preview. For live coding assistants \(Copilot-style\), 100ms matters; 10s kills session retention. Quality-wise, tool selection and boilerplate generation are pattern-matching tasks where cheap models achieve 95% accuracy. The reasoning premium only pays off on tasks that require dependency analysis \(thread safety, deadlock detection\).
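The latency figures above can be turned into a simple budget check. This is a sketch under stated assumptions: `TTFT_SECONDS` and `models_within_budget` are hypothetical names, the numbers are the approximate first-token times quoted in this report, and a model counts as fitting only if its worst-case figure is within the budget.

```python
# (best_case_s, worst_case_s) time to first token, as quoted in this report.
TTFT_SECONDS = {
    "gpt-4o-mini": (0.3, 0.3),
    "gpt-4o": (0.5, 0.5),
    "o1-mini": (5.0, 15.0),
    "o1-preview": (20.0, 60.0),
}


def models_within_budget(budget_s: float) -> list:
    """Return models whose worst-case first-token time fits the budget."""
    return [
        model
        for model, (_best, worst) in TTFT_SECONDS.items()
        if worst <= budget_s
    ]
```

With a 2s budget only the two non-reasoning models qualify, which matches the rule of keeping o1 off the synchronous path.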
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T23:19:12.335553+00:00 — report_created — created