Report #68493
[cost_intel] Failed structured output retries cause multiplicative token burn with no value
Implement exponential backoff with a circuit breaker; fall back to a weaker (cheaper) model for repair passes; pre-validate the schema with Pydantic before the API call to catch client-side errors.
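The retry policy summarized above can be sketched in Python with only the standard library. Everything here is hypothetical: `call_model` stands in for whatever client you use (it takes a model name and a prompt and returns raw text), the model names are placeholders, and sending the repair model only the broken output (not the full original context) is an assumption chosen to keep the repair pass cheap. The breaker is scoped to a single request; a production breaker would usually also track failures across requests.

```python
import json
import time
from typing import Any, Callable

# Hypothetical client hook: (model_name, prompt) -> raw text returned by the model.
ModelCall = Callable[[str, str], str]


class StructuredOutputFailed(RuntimeError):
    """Raised when both the primary attempts and the repair pass return invalid JSON."""


class StructuredOutputBreaker:
    """Caps primary-model retries with exponential backoff, then tries a cheap repair model."""

    def __init__(self, max_failures: int = 2, base_delay: float = 0.5,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.max_failures = max_failures  # breaker opens after this many invalid outputs
        self.base_delay = base_delay      # first backoff delay, in seconds
        self.sleep = sleep                # injectable so tests do not actually wait

    def run(self, call_model: ModelCall, primary: str, repair: str, prompt: str) -> Any:
        last_output = ""
        for attempt in range(self.max_failures):
            last_output = call_model(primary, prompt)  # re-bills the full prompt each time
            try:
                return json.loads(last_output)
            except json.JSONDecodeError:
                if attempt + 1 < self.max_failures:
                    # Exponential backoff between primary attempts: base, 2*base, 4*base, ...
                    self.sleep(self.base_delay * (2 ** attempt))
        # Breaker is open: stop resending the full context to the expensive model.
        # Assumption: the repair pass only needs the broken output, not the original prompt.
        repair_prompt = "Return only valid JSON. Fix this output:\n" + last_output
        fixed = call_model(repair, repair_prompt)
        try:
            return json.loads(fixed)
        except json.JSONDecodeError as exc:
            raise StructuredOutputFailed("repair pass also returned invalid JSON") from exc
```

With the default `max_failures=2`, a primary model that keeps returning invalid JSON is called twice, with one backoff sleep in between, before the request is routed to the repair model; if the repair pass also fails, the caller gets a `StructuredOutputFailed` instead of an open-ended loop.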
Journey Context:
When using constrained generation (JSON mode, JSON Schema enforcement), the model sometimes produces invalid JSON, which is common with complex nested schemas or strict regex patterns. The standard recovery pattern is to retry the request. Each retry re-bills the entire input context (which may be 10k+ tokens) plus the new output tokens. If each attempt succeeds independently with probability p, the expected number of attempts per valid response is 1/p: at a 70% success rate you pay about 1.4x the tokens (1/0.7 ≈ 1.43), and at 50% you pay 2x. Worse, some implementations enter a "fix it" loop that feeds the error message back to the model, appending to the context and burning even more tokens on every round.

The correct pattern is to treat structured output failures as infrastructure errors: open a circuit breaker after 2 failures, fall back to a cheaper model for the "repair" attempt (error correction tolerates lower quality), and, critically, validate your schema client-side so you are not requesting impossible JSON (e.g., required fields that contradict each other).
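The report recommends pre-validating with Pydantic. As a dependency-free alternative, here is a minimal Python sketch that lints a JSON Schema dict for a few contradictions no output could ever satisfy. The function name and the specific checks are illustrative assumptions, not an exhaustive satisfiability test, and Python's `re` only approximates the ECMA-262 regex dialect that JSON Schema specifies.

```python
import re
from typing import Any


def find_schema_contradictions(schema: Any, path: str = "$") -> list[str]:
    """Flag JSON Schema fragments that no output could satisfy (a small, non-exhaustive subset)."""
    if not isinstance(schema, dict):
        return []  # boolean schemas and other shapes are out of scope for this sketch
    problems: list[str] = []
    props = schema.get("properties", {})
    required = schema.get("required", [])

    # Required fields that are undeclared while extra keys are forbidden can never be emitted.
    if schema.get("additionalProperties") is False:
        for name in required:
            if name not in props:
                problems.append(f"{path}: '{name}' is required but additionalProperties is false")
    # More required fields than the object is allowed to hold.
    if "maxProperties" in schema and len(required) > schema["maxProperties"]:
        problems.append(f"{path}: {len(required)} required fields exceed maxProperties={schema['maxProperties']}")
    # Lower bound greater than upper bound.
    for low, high in (("minLength", "maxLength"), ("minimum", "maximum"), ("minItems", "maxItems")):
        if low in schema and high in schema and schema[low] > schema[high]:
            problems.append(f"{path}: {low}={schema[low]} exceeds {high}={schema[high]}")
    # Patterns that Python cannot even compile.
    if "pattern" in schema:
        try:
            re.compile(schema["pattern"])
        except re.error as exc:
            problems.append(f"{path}: pattern does not compile ({exc})")
    # An empty enum admits no value at all.
    if "enum" in schema and not schema["enum"]:
        problems.append(f"{path}: enum is empty")

    # Recurse into object properties and array items.
    for name, sub in props.items():
        problems.extend(find_schema_contradictions(sub, f"{path}.{name}"))
    problems.extend(find_schema_contradictions(schema.get("items"), f"{path}[]"))
    return problems
```

One way to use it is to run the check once per schema (at startup or in CI) and refuse to send requests while it returns a non-empty list. A schema generated by Pydantic v2's `BaseModel.model_json_schema()` can be passed in, but this sketch does not follow `$ref`, so nested models that Pydantic places under `$defs` would need to be checked separately.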
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T21:27:07.078074+00:00— report_created — created