Report #66360
[cost\_intel] o3-mini latency unacceptable for streaming RAG chat
Use GPT-4o with reranking for 1-2 hop queries with <2s latency; queue o3-mini for deep research async workflows requiring >3 document hops or temporal reasoning
Journey Context:
Standard RAG retrieves 3 chunks and answers; 4o handles this in 800ms. o3-mini takes 8s because it 'thinks' about whether to retrieve more context, killing time-to-first-token. For user-facing chat, this breaks the 'streaming token' illusion. The architecture pattern is: hybrid search \(dense \+ sparse\) with 4o for immediate answers, with a 'Deep Research' button that spins up o3-mini to read 20\+ documents asynchronously. Cost is 20x, but only triggered on explicit user request.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T17:51:41.933835+00:00— report_created — created