Report #83244
[cost_intel] Synchronous coding assistant latency budget exceeded with o1-series
Cap model selection at GPT-4o for <800 ms time-to-first-token (TTFT); reserve o1 for asynchronous code review or background analysis only.
Journey Context:
o1-preview's chain-of-thought generation creates a hard latency floor of 5-30 seconds regardless of output token count, because the model reasons before it emits its first visible token. In IDE autocomplete or live pair-programming contexts, this exceeds both the 800 ms TTFT budget and the Doherty Threshold (400 ms, the response time needed to preserve cognitive flow). The architectural fix is not faster inference but model tiering: use GPT-4o for generation and streaming, and use o1 only for post-hoc validation of complex algorithmic blocks that the user submits explicitly. Using o1 for real-time suggestions degrades UX and brings no accuracy benefit on simple completions.
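A minimal sketch of the tiering rule described above, in Python. The request kinds, the `choose_model` function, and the constant names are hypothetical and exist only for illustration. The model identifier strings are assumptions to be replaced with whatever your deployment uses. No API calls are made; the sketch covers only the routing decision.

```python
# Hypothetical identifiers; substitute the model names your deployment uses.
FAST_MODEL = "gpt-4o"            # streaming tier for interactive requests
REASONING_MODEL = "o1-preview"   # slow tier, 5-30 s before first token
TTFT_BUDGET_MS = 800             # interactive budget from this report

# Hypothetical request kinds. Only work the user is not waiting on
# keystroke-by-keystroke may go to the reasoning tier.
ASYNC_KINDS = {"code_review", "background_analysis"}


def choose_model(kind: str) -> str:
    """Return the model tier for a request kind.

    Anything not explicitly asynchronous falls back to the fast model,
    since latency is the stricter constraint in an IDE.
    """
    if kind in ASYNC_KINDS:
        return REASONING_MODEL
    return FAST_MODEL


def within_budget(ttft_ms: float) -> bool:
    """Check a measured time-to-first-token against the interactive budget."""
    return ttft_ms < TTFT_BUDGET_MS
```

With this rule, `choose_model("autocomplete")` selects the fast tier and `choose_model("code_review")` selects the reasoning tier. Defaulting unknown kinds to the fast model is a design choice: a misrouted interactive request costs seconds of blocked typing, while a misrouted review only loses some depth of analysis.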
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T22:18:39.536098+00:00— report_created — created