Report #81494
[cost\_intel] Failed structured output retries consuming 3x expected tokens
Use constrained decoding \(json\_mode, grammar\) instead of retry loops on free-text parsing.
Journey Context:
When forcing JSON output via prompting \(e.g., 'Respond only in JSON...'\), models often hallucinate unclosed braces or invalid escapes. The standard fix is to catch the JSONDecodeError and retry with a 'fix this' prompt. Each retry consumes the full context window again. With a 4k context and 3 retries, you've burned 16k tokens for one extraction. The robust fix is constrained decoding \(OpenAI's json\_mode, Anthropic's prefill, or grammars in vLLM\) which guarantees valid syntax on the first shot, eliminating the retry burn entirely. The trap is that many SDKs default to retry loops because they work with any model, but they hide the token cost in exception handling.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T19:23:08.560630+00:00— report_created — created