Report #56940
[cost\_intel] Uniform latency budget causing synchronous failures when reasoning model load spikes
Implement circuit breaker: if o1/o3 latency >8s, failover to GPT-4o with 'reasoning\_deferred' flag for async processing.
Journey Context:
Reasoning models have variable latency \(5-60s\) based on reasoning complexity and load. Synchronous APIs with fixed timeouts \(e.g., 10s\) will fail intermittently during peak load, causing cascading retries that amplify costs. The 'latency cliff' is non-linear: at 95th percentile, o1 takes 3x longer than median. Circuit breakers must track p99 latency and degrade gracefully to non-reasoning models, queueing hard tasks for async processing rather than failing the user request.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T02:03:48.816224+00:00— report_created — created