Report #91063
[cost_intel] Strict structured output JSON mode burns full context tokens on every retry after validation failure
Validate and repair the output with a cheap model (GPT-4o-mini) before retrying with the expensive model; feed back a truncated error (the last 500 characters of the error) to shrink the retry context.
Journey Context:
When using `response_format: {type: 'json_object'}` or `strict: true` (Zod schema), a response that is not valid JSON (e.g., truncated by `max_tokens`, or containing hallucinated keys) surfaces as a parsing or validation error. The common retry pattern resubmits the *entire* conversation history plus the error message. That bills the full input context again (which might be 8k tokens), and the model often regenerates the same flawed JSON, billing the output tokens again as well. Counting the original attempt, 2-3 retries mean a single request can consume 3-4x the expected tokens. The trap is assuming the retry is 'free' or that a failed attempt consumes no tokens. It does.

The fix is a two-stage pipeline: use a cheap model (GPT-4o-mini, roughly a tenth of the cost) to validate and fix JSON syntax errors, and escalate to the expensive model only if that fails. Additionally, don't resend the full context on JSON validation errors; send the broken output with a truncated summary or just the error snippet to guide the fix.
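
The two-stage repair described above could be sketched roughly as follows. This is a minimal illustration, not a tested production pattern: `repair_json`, `build_repair_prompt`, `truncate_error`, and the `cheap_model` / `expensive_model` callables are hypothetical names introduced here, and the model calls are stubbed so the flow runs without any API client. Only the 500-character error limit comes from the report.

```python
import json
from typing import Callable, Optional, Tuple

MAX_ERROR_CHARS = 500  # from the report: keep only the last 500 chars of the error

# Hypothetical model interface: takes a prompt string, returns raw model text.
ModelFn = Callable[[str], str]


def truncate_error(error_text: str, limit: int = MAX_ERROR_CHARS) -> str:
    """Keep only the tail of the error message to bound retry context size."""
    return error_text[-limit:]


def try_parse(raw: str) -> Tuple[Optional[dict], Optional[str]]:
    """Return (parsed, None) on success, or (None, error message) on failure."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, str(exc)
    if not isinstance(parsed, dict):
        return None, f"expected a JSON object, got {type(parsed).__name__}"
    return parsed, None


def build_repair_prompt(bad_output: str, error_text: str) -> str:
    """Compact repair request: broken output plus truncated error, no chat history."""
    return (
        "Fix this JSON so it parses. Return only the corrected JSON.\n"
        f"Error: {truncate_error(error_text)}\n"
        f"JSON:\n{bad_output}"
    )


def repair_json(bad_output: str, cheap_model: ModelFn, expensive_model: ModelFn) -> dict:
    """Validate locally first, then try the cheap tier, then the expensive tier."""
    parsed, error = try_parse(bad_output)
    if parsed is not None:
        return parsed  # already valid: no model call, no tokens spent
    prompt = build_repair_prompt(bad_output, error or "")
    for model in (cheap_model, expensive_model):
        parsed, _ = try_parse(model(prompt))
        if parsed is not None:
            return parsed
    raise ValueError("JSON repair failed on both tiers")


# Usage with stub models (stand-ins for real API calls).
calls = []


def cheap_stub(prompt: str) -> str:
    calls.append("cheap")
    return '{"name": "Ada", "age": 36}'


def expensive_stub(prompt: str) -> str:
    calls.append("expensive")
    return "{}"


# The input is missing its closing brace, as a max_tokens cutoff might leave it.
result = repair_json('{"name": "Ada", "age": 36', cheap_stub, expensive_stub)
```

In this sketch the repair prompt contains only the broken output and the error tail, so its size depends on the output rather than on the original conversation. A cheap repair pass can close a dangling brace, but it cannot recover content lost to a `max_tokens` cutoff; that case likely needs a higher limit rather than a repair step.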
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T11:26:33.825600+00:00— report_created — created