Report #68065
[cost\_intel] Structured output validation failures burn full context on every retry
Use 'partial mode' with streaming validators: parse tokens as they arrive and abort at the first schema violation instead of waiting for full generation. For OpenAI's JSON mode, cap `max_tokens` at roughly 2x the expected output size to prevent runaway generation on ambiguous schemas.
Journey Context:
When using constrained decoding \(JSON mode, OpenAI's structured outputs, or outlines/structured generation\), a validation failure forces a complete retry. The model has already burned output tokens on the invalid partial JSON, and the retry resends the full input context plus an extra message explaining the error, so each attempt is more expensive than the last. On long-context tasks \(8k\+ input\), this retry can cost $0.50-$2.00 per failure. Teams often set 'max\_retries=3' in their SDKs without realizing that this can multiply costs by 3x or more on edge cases. The fix is progressive validation: use streaming JSON parsers \(like pydantic with `validate_json`\) to catch errors at the first malformed token, abort immediately, and set `max_tokens` aggressively to cap the burn on any single attempt.
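The progressive-validation idea can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a drop-in implementation: `TopLevelKeyChecker`, `stream_with_early_abort`, and the `max_chars` budget are hypothetical names introduced here, the stream is modelled as any iterator of text chunks, and the only "schema" checked is the set of allowed top-level keys of a JSON object. A real setup would substitute the SDK's streaming iterator and a fuller partial-JSON validator, and whether an aborted stream stops billing should be verified with the provider.

```python
from typing import Iterable, Optional, Set, Tuple


class SchemaViolation(Exception):
    """Raised as soon as the partial output can no longer match the schema."""


class TopLevelKeyChecker:
    """Incrementally scans JSON text and rejects unknown top-level keys.

    Assumes the top-level value is an object. Escape sequences inside
    strings are skipped rather than decoded, which is enough for key names.
    """

    def __init__(self, allowed_keys: Iterable[str]) -> None:
        self.allowed: Set[str] = set(allowed_keys)
        self.depth = 0            # nesting depth of {...} and [...]
        self.in_string = False
        self.escaped = False
        self.buf: list = []       # characters of the string being read
        self.last_string: Optional[str] = None

    def feed(self, text: str) -> None:
        for ch in text:
            if self.in_string:
                if self.escaped:
                    self.escaped = False
                    self.buf.append(ch)
                elif ch == "\\":
                    self.escaped = True
                elif ch == '"':
                    self.in_string = False
                    self.last_string = "".join(self.buf)
                else:
                    self.buf.append(ch)
            elif ch == '"':
                self.in_string = True
                self.buf = []
            elif ch in "{[":
                self.depth += 1
            elif ch in "}]":
                self.depth -= 1
            elif ch == ":" and self.depth == 1:
                # A colon at depth 1 means the last string was a top-level key.
                if self.last_string not in self.allowed:
                    raise SchemaViolation(
                        f"unexpected top-level key: {self.last_string!r}"
                    )


def stream_with_early_abort(
    chunks: Iterable[str], allowed_keys: Iterable[str], max_chars: int
) -> Tuple[Optional[str], int, Optional[str]]:
    """Consume chunks until done, a violation appears, or the budget is hit.

    Returns (full_text_or_None, characters_consumed, error_or_None).
    """
    checker = TopLevelKeyChecker(allowed_keys)
    parts = []
    consumed = 0
    for chunk in chunks:
        consumed += len(chunk)
        if consumed > max_chars:
            return None, consumed, "budget exceeded"
        try:
            checker.feed(chunk)
        except SchemaViolation as exc:
            # Stop reading here; later chunks are never requested.
            return None, consumed, str(exc)
        parts.append(chunk)
    return "".join(parts), consumed, None


# Simulated stream in which the model invents a key that is not in the schema.
demo_chunks = ['{"name": "a', 'b", "bog', 'us": 1, ', '"age": 3}']
result, consumed, error = stream_with_early_abort(
    iter(demo_chunks), {"name", "age"}, max_chars=200
)
print(result, consumed, error)
```

In the simulated run the loop stops inside the third chunk, so the final chunk is never pulled from the iterator. The character budget plays the same role as the `max_tokens` cap, only applied on the client side.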
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T20:43:31.758133+00:00— report_created — created