Report #70725
[cost_intel] Structured output retries burn tokens silently on validation failures
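Implement 'schema relaxation' with constrained decoding. A validate-and-retry loop around JSON mode resends the full prompt on every validation failure, so the same input tokens are billed repeatedly (an estimated 3-5x cost inflation). Where you control decoding (self-hosted or open-weight models), use Outlines or JSONformer to constrain token sampling at the logits level instead: tokens that would violate the schema are masked out before sampling, so the output is structurally valid without a retry loop. Estimated effect: about 40% lower average cost per successful parse, with the 'failed parse' token burn removed.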
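To make "constrain token sampling at the logits level" concrete, here is a minimal toy sketch in plain Python. It is not the Outlines or JSONformer API: the vocabulary, the `ALLOWED_BY_STEP` table, and `fake_logits` (random scores standing in for a model) are all hypothetical. It only illustrates the masking idea, where disallowed tokens are set to negative infinity before the next token is chosen, so every decode parses.

```python
import json
import math
import random

# Hypothetical toy vocabulary; a real tokenizer has tens of thousands of entries.
VOCAB = ['{', '}', ':', ',', '"name"', '"age"', '"Ada"', '"Bob"', '36', '41', 'hello']

# Tokens the toy schema {"name": <string>, "age": <integer>} permits at each step.
ALLOWED_BY_STEP = [
    {'{'}, {'"name"'}, {':'}, {'"Ada"', '"Bob"'}, {','},
    {'"age"'}, {':'}, {'36', '41'}, {'}'},
]


def fake_logits(rng):
    """Stand-in for a model forward pass: one random score per vocabulary token."""
    return [rng.uniform(-5.0, 5.0) for _ in VOCAB]


def decode(seed, constrained=True):
    """Greedy decode; when constrained, mask schema-violating tokens to -inf first."""
    rng = random.Random(seed)
    pieces = []
    for allowed in ALLOWED_BY_STEP:
        scores = fake_logits(rng)
        if constrained:
            scores = [s if tok in allowed else -math.inf
                      for tok, s in zip(VOCAB, scores)]
        best = max(range(len(VOCAB)), key=scores.__getitem__)
        pieces.append(VOCAB[best])
    return ''.join(pieces)


if __name__ == "__main__":
    text = decode(seed=0)
    print(text, json.loads(text))             # parses on the first attempt
    print(decode(seed=0, constrained=False))  # raw argmax, usually not valid JSON
```

Real libraries build the allowed-token set from a JSON Schema or grammar rather than a hand-written table, but the cost argument is the same: one generation pass, no retry.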
Journey Context:
Developers enable JSON mode expecting it to guarantee usable output. In practice, when the model's output fails validation (common with complex nested schemas), the API either returns an error (charging for input but not output) or the client library retries internally, billing the input tokens again on each attempt. With complex schemas, success rates can fall below 70%, meaning 30% or more of requests burn full input tokens with zero usable output. Common mistake: treating JSON mode as a correctness guarantee. JSON mode guarantees syntax, not schema adherence. Schema-enforced structured-output modes, where a provider offers them, do constrain decoding to the schema, but they still say nothing about whether the values are right. Alternatives: regex constraints (limited expressiveness), or 'two-pass' validation, where you accept the JSON and then make a second call to fix it (too expensive). Of these options, constrained decoding is the only cost-effective fix.
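The retry arithmetic can be sketched as follows. This is a simplified model, assuming each attempt succeeds independently with the same probability and resends an unchanged prompt; the function names and the 2,000-token prompt are illustrative and not taken from the report. Under those assumptions the expected number of attempts per successful parse is 1 / success_rate.

```python
def expected_attempts(success_rate):
    """Mean attempts until the first valid parse (geometric distribution)."""
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    return 1.0 / success_rate


def input_tokens_per_success(prompt_tokens, success_rate):
    """Expected input tokens billed per successful parse under retry-until-valid."""
    return prompt_tokens * expected_attempts(success_rate)


def success_rate_for_multiplier(multiplier):
    """Success rate at which retries inflate input cost by the given factor."""
    if multiplier < 1.0:
        raise ValueError("multiplier must be >= 1")
    return 1.0 / multiplier


if __name__ == "__main__":
    # Illustrative numbers only: a 2,000-token prompt at a 70% success rate.
    print(round(input_tokens_per_success(2000, 0.7)))  # about 2857 tokens
    print(success_rate_for_multiplier(3.0))            # about 0.33
    print(success_rate_for_multiplier(5.0))            # 0.2
```

Under this model a 70% success rate costs about 1.43x the input tokens of a single call. A 3-5x multiplier corresponds to success rates of roughly 20-33%, or to retries that grow the prompt (for example by appending the failed output and the validation error), which this sketch does not model.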
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T01:17:19.205336+00:00— report_created — created