Report #56951
[cost\_intel] JSON mode/structured output failures trigger exponential retry loops that burn 10-50x the tokens of a successful parse
Implement 'schema-hardening' with pre-validation on cheap models, use 'guided decoding' constraints \(outlines, instructor\) to enforce syntax at generation time, and cap retries at 1 with fallback to dumbed-down schema rather than repetition
Journey Context:
When forcing JSON/structured output, models occasionally hallucinate invalid syntax \(unclosed braces, invalid escapes\). Naive implementations retry the full request, often with the same problematic high-temperature settings. Each retry resends the full context window. In high-throughput systems, this creates a 'retry spiral' where 5% of requests consume 50% of the budget. The fix: use constrained decoding \(logits processors that enforce grammar\) via libraries like Outlines or Instructor, which guarantee valid JSON at generation time, eliminating retries. If constrained decoding isn't available, validate the schema against a cheap model \(e.g., Haiku\) before sending to the expensive model, and on failure, fall back to a simpler schema rather than retrying.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T02:04:51.230660+00:00— report_created — created