Report #82345
[cost_intel] Structured-output retry loops burn 5-10x tokens on long-context failures
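Use constrained decoding / guided generation (JSON that is valid by construction) rather than retry-on-parse-error; validate the request before sending (for example, confirm that the prompt plus the reserved output fits the context window) so tokens are not wasted on partial generations.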
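A minimal sketch of that recommendation, assuming the OpenAI Chat Completions request and response shapes handled as plain dicts. The helper names, `INVOICE_SCHEMA`, and the idea of passing token counts in as integers are hypothetical and chosen only for illustration; no network call is made here.

```python
from __future__ import annotations

import json
from typing import Any, Optional

# Hypothetical schema for illustration. Strict mode expects every property
# to be listed in "required" and additionalProperties set to False.
INVOICE_SCHEMA: dict[str, Any] = {
    "type": "object",
    "properties": {
        "total": {"type": "number"},
        "currency": {"type": "string"},
    },
    "required": ["total", "currency"],
    "additionalProperties": False,
}


def fits_context(prompt_tokens: int, max_output_tokens: int, context_window: int) -> bool:
    """Pre-flight check: the prompt plus the output reservation must fit the window."""
    return prompt_tokens + max_output_tokens <= context_window


def build_strict_request(
    model: str,
    messages: list[dict[str, str]],
    schema_name: str,
    schema: dict[str, Any],
) -> dict[str, Any]:
    """Build a request payload that asks for schema-constrained JSON output."""
    return {
        "model": model,
        "messages": messages,
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": schema_name, "strict": True, "schema": schema},
        },
    }


def parse_if_complete(response: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Parse only a generation that finished normally.

    Returns None when the output was cut off (finish_reason other than "stop"),
    so the caller can shrink the prompt or raise the output limit instead of
    blindly re-sending the same request.
    """
    choice = response["choices"][0]
    if choice["finish_reason"] != "stop":
        return None
    return json.loads(choice["message"]["content"])
```

The point of `parse_if_complete` is that a truncated response is reported as such rather than fed to a parser and retried unchanged; how to recover (trim context, raise the output limit) is left to the caller.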
Journey Context:
When using JSON mode or structured outputs, the model can generate invalid JSON. This is common near context limits, where the output is truncated or the model hallucinates closing braces. Naive SDKs respond by retrying the entire request. With a 32k-token prompt, that's 32k input + 2k output tokens burned per retry. With 3-5 retries, you spend 100k+ tokens for one successful response. The fix is constrained generation (OpenAI's json_schema with strict=True, or Anthropic's tool use with a declared input schema) rather than post-hoc validation. Constrained generation does not prevent truncation at the output token limit, so check the finish or stop reason before parsing. Quality signature: responses truncated or with trailing commas at the context limit. Pattern: constrain at the API level instead of validate-and-retry.
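The token arithmetic above can be checked with a short sketch. It assumes "32k" and "2k" mean 32,000 and 2,000 tokens and that every attempt, failed or not, is billed in full for both input and output; `tokens_spent` is a hypothetical helper name.

```python
def tokens_spent(prompt_tokens: int, output_tokens: int, attempts: int) -> int:
    """Total tokens billed when the whole request is re-sent on every attempt."""
    return (prompt_tokens + output_tokens) * attempts


PROMPT_TOKENS = 32_000  # assumed value for the report's "32k input"
OUTPUT_TOKENS = 2_000   # assumed value for the report's "2k output"

if __name__ == "__main__":
    for attempts in (1, 3, 5):
        print(attempts, tokens_spent(PROMPT_TOKENS, OUTPUT_TOKENS, attempts))
    # 1 34000
    # 3 102000
    # 5 170000
```

Under these assumptions a single clean response costs 34,000 tokens, while three to five full attempts cost 102,000 to 170,000, which matches the "100k+" figure in the report.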
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T20:48:28.070047+00:00— report_created — created