Report #29947
[cost\_intel] Exponential token waste from JSON mode retry cascades on schema validation failures
On failure, run a 'repair mode' with constrained sampling \(logit\_bias or regex\) instead of a full retry; before sending to the API, validate that the output schema is feasible using pre-flight tokenization.
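A minimal sketch of the 'repair mode' idea, assuming the only failure is a JSON syntax error. `build_repair_prompt`, `parse_or_repair`, and the `call_model` callable are hypothetical names introduced for illustration, not part of any API. `call_model` stands in for whatever client sends one short prompt (ideally with constrained sampling enabled) and returns the model's text. Schema validation is out of scope here.

```python
import json
from typing import Any, Callable


def build_repair_prompt(bad_output: str, error: str) -> str:
    """Build a short, self-contained repair request (no conversation history)."""
    return (
        "The following text should be a single valid JSON value but failed to parse.\n"
        f"Parser error: {error}\n"
        "Return only the corrected JSON, with no commentary.\n\n"
        f"{bad_output}"
    )


def parse_or_repair(
    raw: str,
    call_model: Callable[[str], str],  # hypothetical: sends a prompt, returns text
    max_repairs: int = 2,
) -> Any:
    """Parse `raw`; on failure, request a repair at most `max_repairs` times.

    Raises ValueError once the repair budget is exhausted, so the loop is
    always bounded.
    """
    candidate = raw
    for attempt in range(max_repairs + 1):
        try:
            return json.loads(candidate)
        except json.JSONDecodeError as exc:
            if attempt == max_repairs:
                raise ValueError("output still invalid after repairs") from exc
            error = f"{exc.msg} at line {exc.lineno}, column {exc.colno}"
            # Only the failed output and the parser error are sent, not the
            # original conversation.
            candidate = call_model(build_repair_prompt(candidate, error))
```

Each repair call carries only the failed output and the parser error, so its input cost is independent of the length of the original conversation.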
Journey Context:
When using JSON mode or strict structured outputs, models occasionally generate invalid JSON \(hallucinated comments, trailing commas\) or output that violates the schema. The naive fix is a while-loop that retries up to N times with exponential backoff. This is catastrophic for costs: each retry resends the entire conversation history, plus the failed output \(which can be long\), as input tokens. Three retries on a 4k-token context add at least 12k input tokens on top of the original 4k, before counting the appended failed outputs and the new output tokens. Worse, if the schema is too complex for the model, every request burns its full retry budget, and an uncapped loop never terminates. The fix is to avoid blind retries. First, use constrained decoding \(e.g., the outlines library, Jsonformer, or regex or logit\_bias constraints\) to make invalid JSON impossible. Second, if repair is still needed, send a cheaper 'repair prompt' containing only the failed output, the specific error message, and a constrained grammar, rather than resending the whole context. Third, pre-validate that your schema is actually representable in the token space \(e.g., avoid deeply nested anyOf unions that confuse the sampler\).
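The cost difference can be made concrete with a small back-of-the-envelope model. This is a sketch under stated assumptions: blind retries resend the full context plus every earlier failed output, while repair prompts carry only the latest failed output plus a fixed instruction overhead. The function names and the 100-token overhead are illustrative, not measured values.

```python
def naive_retry_input_tokens(
    context_tokens: int, failed_output_tokens: int, retries: int
) -> int:
    """Input tokens billed for the first attempt plus `retries` blind retries.

    Assumes retry number k resends the context and the k failed outputs
    produced so far.
    """
    total = 0
    for attempt in range(retries + 1):
        total += context_tokens + attempt * failed_output_tokens
    return total


def repair_input_tokens(
    context_tokens: int,
    failed_output_tokens: int,
    repairs: int,
    repair_overhead_tokens: int = 100,  # assumed size of the repair instruction
) -> int:
    """Input tokens for one full attempt followed by `repairs` short repair prompts."""
    per_repair = failed_output_tokens + repair_overhead_tokens
    return context_tokens + repairs * per_repair


if __name__ == "__main__":
    # 4k context, 500-token failed outputs, three follow-up attempts.
    print(naive_retry_input_tokens(4000, 500, 3))  # 19000
    print(repair_input_tokens(4000, 500, 3))       # 5800
```

Under these assumptions the blind-retry loop pays for the context on every attempt, while the repair path pays for it once.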
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T04:39:12.066895+00:00— report_created — created