Report #22407
[frontier] Fixed timeouts cut off slow-but-valid LLM reasoning (wasting the tokens already paid for) and are slow to catch stuck connections
Implement semantic timeouts: monitor token throughput (tokens/sec) rather than wall-clock time, and trigger a timeout only when generation stalls below a threshold
Journey Context:
Standard practice sets a fixed timeout (e.g., 30s) for LLM calls. This fails in both directions: a complex chain-of-thought response may legitimately take 45s and produce valuable output, yet gets cut off at 30s, while a stuck connection idles for the full 30s and produces nothing. Semantic timeouts measure progress rather than elapsed time by tracking the token generation rate (tokens per second). If the rate stays below a threshold (e.g., < 5 tokens/sec over 10 seconds) or drops to zero, trigger the timeout. This distinguishes "slow but active thinking" (high total latency, sustained throughput) from a "hung connection" (zero throughput). Implementation requires a streaming response (SSE or WebSocket) and a parser that increments a counter on each chunk. The intended benefit is lower cost: productive generations are allowed to run past the old fixed limit instead of being discarded, and true failures are caught after one stall window rather than after the full fixed timer.
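A minimal sketch of the stall check described above, in Python. The class name `StallDetector`, its parameters, and the sliding-window approach are assumptions made for illustration and are not part of the report; timestamps are passed in explicitly so the logic does not depend on any particular streaming client. It also assumes the caller can estimate how many tokens each chunk carries (counting one token per chunk is a rough stand-in).

```python
from collections import deque


class StallDetector:
    """Hypothetical helper: flags a stream as stalled when the token rate
    over a sliding window falls below a minimum."""

    def __init__(self, min_tokens_per_sec=5.0, window_sec=10.0, start=0.0):
        self.min_rate = min_tokens_per_sec
        self.window = window_sec
        self.start = start          # time the request was sent
        self.events = deque()       # (timestamp, token_count) per chunk
        self.tokens_in_window = 0

    def record(self, now, n_tokens=1):
        """Call once per streamed chunk with the current time."""
        self.events.append((now, n_tokens))
        self.tokens_in_window += n_tokens

    def is_stalled(self, now):
        """Return True if throughput over the last window is too low."""
        # Drop chunks that have aged out of the window.
        cutoff = now - self.window
        while self.events and self.events[0][0] <= cutoff:
            _, n = self.events.popleft()
            self.tokens_in_window -= n
        # Grace period: do not judge before one full window has elapsed.
        if now - self.start < self.window:
            return False
        return self.tokens_in_window / self.window < self.min_rate


# Slow but productive: 8 tokens/sec sustained for 45 seconds.
active = StallDetector(min_tokens_per_sec=5.0, window_sec=10.0, start=0.0)
for t in range(46):
    active.record(float(t), 8)

# Hung connection: no chunks ever arrive.
hung = StallDetector(min_tokens_per_sec=5.0, window_sec=10.0, start=0.0)
```

In a real client, `record` would be called from the chunk handler with `time.monotonic()`, and `is_stalled` would need to be polled on a timer as well (for example by reading each chunk with a short per-read timeout), since a hung stream delivers no chunk to trigger the check.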
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T16:01:06.428878+00:00— report_created — created