Report #52209
[cost_intel] Failed structured output retries burn 3-5x expected tokens in validation loops
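Use constrained decoding (OpenAI JSON mode or Structured Outputs, Outlines grammars) so the first generation is syntactically valid; do not retry by appending error context. Note that JSON mode guarantees only parseable JSON, while conformance to a specific schema requires Structured Outputs or a grammar.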
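As a rough sketch of what schema-constrained output can look like on the OpenAI side, the snippet below builds a `response_format` value of type `json_schema` with `strict` enabled. The helper name `build_strict_response_format` and the `invoice` fields are hypothetical, chosen only for illustration; the assumption is that the resulting dict is passed as the `response_format` argument of a Chat Completions request, which should be checked against the current API reference before use.

```python
import json


def build_strict_response_format(name, properties):
    """Build a response_format dict for schema-constrained output.

    Hypothetical helper: strict mode is assumed to require every property
    to be listed in `required` and `additionalProperties` to be False.
    """
    schema = {
        "type": "object",
        "properties": properties,
        "required": list(properties),  # all fields required in strict mode
        "additionalProperties": False,
    }
    return {
        "type": "json_schema",
        "json_schema": {"name": name, "strict": True, "schema": schema},
    }


# Illustrative schema; the field names are assumptions, not from the report.
invoice_format = build_strict_response_format(
    "invoice",
    {"vendor": {"type": "string"}, "total": {"type": "number"}},
)
print(json.dumps(invoice_format, indent=2))
```

Building the schema in one place keeps the constraint next to the fields it describes, so the validation model and the decoding constraint are less likely to drift apart.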
Journey Context:
When extracting structured data, teams often prompt for JSON and then parse and validate the result with Pydantic. On failure, they append the error to the context and retry. This creates a token snowball: attempt 1 uses N tokens, attempt 2 uses N + error_tokens, and attempt 3 uses N + error_tokens + larger_error_tokens, so the cumulative bill grows faster than the number of attempts. With temperature > 0, a retry is not guaranteed to succeed, so you can pay repeatedly for invalid attempts. Constrained decoding (OpenAI's JSON mode, the Outlines library, or llama.cpp grammars) forces the model to emit only tokens that keep the output valid, reducing the failure rate from 5-10% to <0.1% and removing nearly all of the retry burn. Constraints cover syntax and schema shape only; semantic checks (for example, a value out of range) still need their own handling.
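The snowball can be made concrete with a small cost model. This is a sketch under stated assumptions, not a measurement: it assumes each retry re-sends the original prompt plus every earlier failed output and its error message, and that error size stays constant. The function name and the token counts are hypothetical.

```python
def retry_token_cost(prompt_tokens, output_tokens, error_tokens, attempts):
    """Total tokens billed across `attempts` tries of a validate-and-retry loop.

    Assumption: every retry carries forward all earlier failed outputs
    and their error messages in the context.
    """
    total = 0
    context = prompt_tokens
    for _ in range(attempts):
        # Each attempt is billed for its input context plus a fresh output.
        total += context + output_tokens
        # The failed output and its error are appended for the next attempt.
        context += output_tokens + error_tokens
    return total


# Illustrative numbers only: 1000-token prompt, 300-token output, 150-token error.
single = retry_token_cost(1000, 300, 150, attempts=1)  # 1300
triple = retry_token_cost(1000, 300, 150, attempts=3)  # 1300 + 1750 + 2200 = 5250
print(single, triple, round(triple / single, 2))
```

With these illustrative inputs, three attempts cost about four times a single successful call, which falls in the 3-5x range named in the title.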
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T18:07:33.654823+00:00— report_created — created