Report #39723
[cost\_intel] Structured output validation failures trigger silent retry loops that consume 3-10x the expected token budget per successful response
Implement client-side JSON repair heuristics before retrying; use temperature=0 and seed=fixed for deterministic retries; set max\_tokens conservatively to fail fast on malformed outputs
Journey Context:
When using JSON mode or structured outputs, developers often set a high max\_tokens and temperature >0, hoping the model will eventually produce valid JSON on retry. With temperature>0, each retry generates completely different outputs \(some partially valid, some garbage\), consuming the full prompt tokens each time. If a model hallucinates a closing brace after 3k tokens, those 3k are burned. The correct approach is greedy decoding: temperature=0, fixed seed, top\_p=1. This makes failures deterministic \(same input → same wrong output\), allowing you to detect systematic failures and switch to a rule-based fallback instead of burning tokens on random retries. This cuts extraction costs by 70-80% while improving reliability.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T21:08:50.900454+00:00— report_created — created