Report #67678
[cost\_intel] Failed Structured Output Retries Trigger Exponential Token Burn Without Validity Guarantees
Use native constrained decoding \(OpenAI Structured Outputs with strict: true or Outlines library\) instead of post-hoc validation; implement a circuit breaker: after 1 retry, escalate to a more capable model rather than looping; truncate conversation history on retry to avoid exponential context growth
Journey Context:
When using JSON mode or Structured Outputs, cheaper models \(GPT-3.5, Haiku\) generate invalid JSON \(trailing commas, unescaped quotes, hallucinated keys\) on 5-15% of complex requests. A naive retry implementation resends the full conversation history \(e.g., 4k tokens\) for each attempt. Three retries burn 12k tokens for a failed request with zero value. Critically, temperature=0 does not guarantee deterministic JSON validity; only grammar-based constrained decoding does. OpenAI's 'strict: true' mode \(introduced with Structured Outputs\) guarantees valid JSON at the token generation level, eliminating the retry loop entirely. Without it, the cost of 'cheap' models is often 3-5x higher than using an expensive model once.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T20:04:50.370294+00:00— report_created — created