Report #43040
[cost\_intel] GPT-4o structured output JSON mode retry loops draining budget
Set max\_tokens conservatively to fail fast on schema hallucinations; validate the schema client-side before the API call to catch impossible constraints; limit retries to one; use 'response\_format: \{type: json\_object\}' with the schema spelled out in the prompt, rather than strict mode, for flexible parsing; short-circuit when logprob uncertainty exceeds 0.1.
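The client-side validation step could look roughly like the sketch below. It is an illustration only: the function name `find_impossible_constraints` and the particular checks (undeclared required keys, inverted bounds, empty enums) are assumptions about what "impossible constraints" covers, not something the report specifies, and it handles only a small subset of JSON Schema.

```python
from typing import Any, Dict, List


def find_impossible_constraints(schema: Dict[str, Any], path: str = "$") -> List[str]:
    """Return descriptions of constraints that no JSON value could satisfy.

    Hypothetical helper: covers a few common contradictions only.
    """
    problems: List[str] = []
    props = schema.get("properties", {})

    # With additionalProperties disabled, a required key that is never
    # declared in properties can never be produced.
    if schema.get("additionalProperties") is False:
        for key in schema.get("required", []):
            if key not in props:
                problems.append(f"{path}: required key '{key}' is not declared in properties")

    # Lower bound above upper bound for numbers, strings, and arrays.
    for low, high in (("minimum", "maximum"), ("minLength", "maxLength"), ("minItems", "maxItems")):
        if low in schema and high in schema and schema[low] > schema[high]:
            problems.append(f"{path}: {low} ({schema[low]}) exceeds {high} ({schema[high]})")

    # An empty enum admits no value at all.
    if "enum" in schema and len(schema["enum"]) == 0:
        problems.append(f"{path}: enum is empty")

    # Recurse into nested object properties and array items.
    for key, sub in props.items():
        if isinstance(sub, dict):
            problems.extend(find_impossible_constraints(sub, f"{path}.{key}"))
    items = schema.get("items")
    if isinstance(items, dict):
        problems.extend(find_impossible_constraints(items, f"{path}[]"))

    return problems
```

Running this before every request costs nothing in tokens; if the returned list is non-empty, the call can be skipped instead of paid for and retried.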
Journey Context:
With OpenAI's Structured Outputs \(strict JSON schema mode\), a generation that ends in invalid JSON \(common with complex nested objects or long outputs\) fails with a 400 error, and the tokens generated up to the failure point are still consumed. Teams often wrap the call in a naive retry loop that resends the entire prompt. With a 128k-token context window, a single retry can burn 50k\+ tokens, so three retries on a failed extraction can cost more than the successful extraction itself. The reported root cause is that strict mode enforces grammar constraints that increase both token count and failure rate. The fix is to use 'response\_format: \{type: json\_object\}' \(non-strict\) with careful prompting, set a low max\_tokens so a bad generation fails fast before burning budget, and add a circuit breaker that stops after one retry to prevent spiraling costs.
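A minimal sketch of the one-retry circuit breaker described above, assuming the API call is wrapped in a caller-supplied function that returns the raw text and the tokens consumed by that call. The names `extract_with_circuit_breaker`, `BudgetExceeded`, and `ExtractionResult`, and the 60,000-token default budget, are hypothetical and chosen for illustration; no OpenAI client code is shown.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical wrapper signature: prompt in, (raw_text, tokens_consumed) out.
ModelCall = Callable[[str], Tuple[str, int]]


class BudgetExceeded(RuntimeError):
    """Raised when the extraction is abandoned instead of retried again."""


@dataclass
class ExtractionResult:
    data: dict
    attempts: int
    tokens_spent: int


def extract_with_circuit_breaker(
    call_model: ModelCall,
    prompt: str,
    max_retries: int = 1,
    token_budget: int = 60_000,
) -> ExtractionResult:
    tokens_spent = 0
    last_error: Optional[Exception] = None

    # One initial attempt plus at most max_retries retries.
    for attempt in range(1, max_retries + 2):
        # Trip the breaker before sending if earlier attempts used the budget.
        if tokens_spent >= token_budget:
            break
        raw, consumed = call_model(prompt)
        tokens_spent += consumed
        try:
            data = json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = exc
            continue
        if isinstance(data, dict):
            return ExtractionResult(data, attempt, tokens_spent)
        last_error = ValueError("top-level JSON value is not an object")

    raise BudgetExceeded(f"gave up after spending {tokens_spent} tokens: {last_error}")
```

The token budget is checked before each call rather than after, so a first attempt that already consumed more than the budget blocks the retry entirely. Callers can catch `BudgetExceeded` and route the input to a fallback path instead of looping.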
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T02:42:54.258328+00:00— report_created — created