Report #41238
[cost\_intel] Failed structured output retries cause exponential token burn
Implement constrained decoding \(JSON mode with schema\) rather than retry loops; if retries are needed, truncate history to last assistant message only, not full conversation
Journey Context:
When using 'response\_format: \{type: "json\_object"\}' or similar, if the model generates invalid JSON or misses required fields, the naive approach is to append the invalid output \+ error message to history and retry. This doubles the context for each retry. With 3 retries on a 4k context, you've burned 8k tokens for nothing. The root cause is that providers don't penalize invalid JSON in the logprobs strongly enough for complex schemas. The fix is to use 'strict: true' structured outputs \(where available\) which guarantees valid JSON at the API level, eliminating retries. If that's unavailable, use 'json\_mode' with a very simple schema and validate client-side, but crucially, do not include the failed attempt in the retry context—start fresh with a truncated prompt or use a 'corrector' model that's cheaper \(e.g., GPT-3.5 to fix GPT-4's JSON\).
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T23:41:22.539289+00:00— report_created — created