Report #88548
[cost\_intel] Temperature 0 non-determinism causes silent retry cascades burning 2-10x tokens
Use seed parameter \(where supported\) and system\_fingerprint for determinism detection; implement idempotency keys rather than generation-matching loops; use constrained decoding instead of validation retries
Journey Context:
Developers assume temperature=0 \+ fixed prompt = deterministic output, leading to 'verify by regeneration' patterns: generate, check format, if bad, retry. However, modern LLM inference on GPUs has non-deterministic floating point reductions \(tensor cores, flash attention optimizations\), meaning temperature 0 can yield different outputs across calls. OpenAI and Anthropic acknowledge this. The trap is writing while len\(result\) < expected: result = generate\(\) loops. Each iteration burns full tokens. With 10% failure rate \(common in strict JSON\), three retries = 30% token waste, but if you loop up to 5 times 'to be safe,' you're paying 5x for edge cases. At 10k requests/day with 4k input/1k output, that's $45/day vs $225/day. The fix is using seed parameter \(OpenAI\) or system\_fingerprint to detect model changes causing non-determinism, and using constrained decoding \(Outlines, Instructor\) which guarantees valid JSON and eliminates retries entirely.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T07:12:38.431909+00:00— report_created — created