Report #91475
[cost_intel] Reasoning models (o1/o3) introduce 5-30s first-token latency that kills synchronous UX (copilot, chat) despite accuracy gains
Restrict reasoning models to async background tasks (security scans, nightly refactoring); use GPT-4o/Claude 3.5 Sonnet for <500ms streaming responses in live coding assistants
Journey Context:
The 'thinking' phase in reasoning models adds 5-30s of latency before the first visible token is generated, far beyond the 100-500ms interaction budget of a live coding assistant. Teams often mistakenly enable 'thinking' for inline completions, which makes the UI appear frozen while the model reasons. Streaming partial reasoning to fill the wait is not an option for o1/o3, because the API hides the raw reasoning tokens. Therefore, restrict reasoning models to non-blocking operations: security scans, code review comments, or nightly refactoring suggestions.
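The routing rule above can be sketched as a small dispatcher. This is a minimal illustration, not a definitive implementation: the model identifiers, the `Route` type, and `choose_route` are hypothetical names, and the only numbers used are the 5-30s latency range quoted in this report. The sketch is deliberately conservative: a reasoning model is chosen only when nobody is waiting, or when the caller's first-token budget covers the 30s worst case.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical identifiers; substitute the model names your provider exposes.
FAST_MODEL = "fast-streaming-model"
REASONING_MODEL = "reasoning-model"

# Upper end of the first-token latency range cited in this report (30 s).
REASONING_WORST_CASE_MS = 30_000


@dataclass(frozen=True)
class Route:
    model: str      # which model tier handles the request
    stream: bool    # whether tokens are streamed to the UI as they arrive
    blocking: bool  # whether a user is waiting on the result


def choose_route(first_token_budget_ms: Optional[float]) -> Route:
    """Pick a model tier from the caller's first-token latency budget.

    A budget of None means nobody is waiting (security scan, nightly
    refactoring job), so the request can run in the background.
    """
    if first_token_budget_ms is None:
        # Non-blocking work: accuracy matters more than latency.
        return Route(REASONING_MODEL, stream=False, blocking=False)
    if first_token_budget_ms >= REASONING_WORST_CASE_MS:
        # The caller explicitly tolerates the worst-case thinking delay.
        return Route(REASONING_MODEL, stream=False, blocking=True)
    # Interactive path (inline completion, chat): stream from the fast model.
    return Route(FAST_MODEL, stream=True, blocking=True)
```

With this rule, an inline completion calling `choose_route(500)` stays on the fast streaming model, while a nightly job calling `choose_route(None)` is sent to the reasoning model. The threshold is an assumption and should be replaced with first-token latencies measured in your own environment.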
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T12:08:04.642820+00:00— report_created — created