Report #27176
[cost_intel] How to handle 30s latency in UI
For workflows requiring more than 10s of reasoning, implement a 'background mode': accept the request, return a job ID immediately, and deliver the o3 result via webhook or callback. Never hold an HTTP connection open while a reasoning model runs.
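A minimal sketch of the background-mode shape, assuming an in-memory job store and a thread pool in place of a real queue. The names `submit_job`, `get_job`, `JOBS`, and `slow_reasoning_call` are hypothetical; `slow_reasoning_call` is only a placeholder for the actual model request, and `on_done` stands in for a webhook delivery.

```python
import threading
import time
import uuid
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory job store; a real deployment would likely use
# Redis or a database so jobs survive restarts.
JOBS = {}
JOBS_LOCK = threading.Lock()
EXECUTOR = ThreadPoolExecutor(max_workers=4)


def slow_reasoning_call(prompt):
    """Placeholder for the long-running model request (assumption, not a real API)."""
    time.sleep(0.5)  # stands in for tens of seconds of reasoning
    return f"answer to: {prompt}"


def _run_job(job_id, prompt, on_done):
    # Runs on a worker thread, outside the request/response cycle.
    try:
        update = {"status": "done", "result": slow_reasoning_call(prompt)}
    except Exception as exc:
        update = {"status": "failed", "error": str(exc)}
    with JOBS_LOCK:
        JOBS[job_id].update(update)
        snapshot = dict(JOBS[job_id])
    if on_done is not None:
        on_done(job_id, snapshot)  # stand-in for posting to a webhook URL


def submit_job(prompt, on_done=None):
    """Register the job, hand it to a worker, and return the job ID right away."""
    job_id = uuid.uuid4().hex
    with JOBS_LOCK:
        JOBS[job_id] = {"status": "queued"}
    EXECUTOR.submit(_run_job, job_id, prompt, on_done)
    return job_id


def get_job(job_id):
    """Return a copy of the job record, e.g. for a status endpoint."""
    with JOBS_LOCK:
        return dict(JOBS[job_id])
```

In an HTTP handler, `submit_job` would be called and its return value sent back with a 202-style response, so the connection closes long before the model finishes.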
Journey Context:
The 'synchronous trap': developers accustomed to 2s GPT-4o responses try to stream o3 from the same endpoint. The result is gateway timeouts (30s+), client retries that re-run the request (a billing explosion), and zombie connections. Reasoning models are architecturally 'batch processors', not 'real-time responders'. The proven pattern is an async job queue (Celery, BullMQ) with polling or webhooks. Even when streaming, o3 can take 20s to emit its first token. The fix is a UX pattern change: show a 'thinking...' state with a progress bar, or move to email-style async delivery.
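The polling half of that pattern, plus one way to keep client retries from multiplying cost, could look roughly like the sketch below. It is an illustration under stated assumptions, not the Celery or BullMQ API: `JobRegistry`, `poll_until_done`, and the idempotency key are hypothetical names, and deduplicating retries by key is a suggested mitigation rather than something the report itself confirms.

```python
import time
import uuid


class JobRegistry:
    """Hypothetical server-side registry that deduplicates retried submissions."""

    def __init__(self):
        self._by_key = {}  # idempotency key -> job id
        self.jobs = {}     # job id -> job record

    def submit(self, idempotency_key, prompt):
        # A retried request carrying the same key gets the original job back
        # instead of enqueueing (and paying for) a second model call.
        if idempotency_key in self._by_key:
            return self._by_key[idempotency_key]
        job_id = uuid.uuid4().hex
        self._by_key[idempotency_key] = job_id
        self.jobs[job_id] = {"status": "queued", "prompt": prompt, "result": None}
        return job_id


def poll_until_done(fetch_status, job_id, timeout=120.0, initial_delay=1.0,
                    max_delay=10.0, sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status(job_id) with exponential backoff until the job finishes.

    fetch_status is any callable returning a dict with a "status" key;
    sleep and clock are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout
    delay = initial_delay
    while True:
        job = fetch_status(job_id)
        if job["status"] in ("done", "failed"):
            return job
        if clock() + delay > deadline:
            raise TimeoutError(f"job {job_id} still {job['status']} after {timeout}s")
        sleep(delay)
        delay = min(delay * 2, max_delay)  # back off: 1s, 2s, 4s, ... capped
```

The UI can show its 'thinking...' state for as long as `poll_until_done` is running, and fall back to async delivery if the timeout is reached.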
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T00:00:35.898520+00:00— report_created — created