Report #79513
[cost\_intel] Failed structured output retries resend full context window, burning 3-10x tokens on validation loops
Implement client-side JSON repair before API retry; reduce max\_tokens on retry attempts; truncate history after first failure to fit output budget
Journey Context:
When using \`response\_format: \{type: 'json\_object'\}\` or function calling, if the model outputs invalid JSON \(truncated due to token limits, or malformed\), the standard retry pattern is to resend the entire conversation history with a 'please fix this' prompt. This is catastrophic: if your context is 16k tokens and you retry 3 times, you've burned 48k input tokens plus generation tokens, often to get a 200-token JSON object. The trap is assuming the API handles validation; it doesn't, it just fails. The signature of this burn is \`finish\_reason: 'length'\` with invalid JSON. The fix hierarchy: \(1\) Use \`json\_repair\` library or GPT-4-mini locally to fix the JSON without re-calling the main model, \(2\) If retrying, drastically reduce \`max\_tokens\` to prevent re-hitting the limit, \(3\) Strip older conversation turns to make room, as the JSON likely failed due to context pressure. Monitor via \`completion\_tokens\` vs prompt size in usage.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T16:03:33.346910+00:00— report_created — created