Report #78802
[cost_intel] OpenAI JSON mode / Zod schema retries burning 3-5x tokens on hallucinated schema violations
Switch to constrained decoding (llama.cpp grammar or outlines library) to enforce schema at inference time, eliminating retry loops entirely
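To make "enforce schema at inference time" concrete, here is a toy sketch of token masking, the mechanism behind constrained decoding. It is not llama.cpp or Outlines code: the vocabulary, the hand-written state machine for the schema `{"ok": <boolean>}`, and the random `fake_logits` stand-in for a model are all hypothetical. Real libraries build an equivalent automaton from a grammar or JSON schema over the model's actual tokenizer.

```python
import json
import random

# Toy vocabulary (hypothetical). Real tokenizers have tens of thousands of entries.
VOCAB = ['{', '}', '"ok"', '"note"', ':', 'true', 'false', ',', 'hello', '[']

# Hand-written state machine for the schema {"ok": <boolean>}.
# Maps state -> {allowed token: next state}; "done" ends generation.
GRAMMAR = {
    "start": {'{': "key"},
    "key": {'"ok"': "colon"},
    "colon": {':': "value"},
    "value": {'true': "close", 'false': "close"},
    "close": {'}': "done"},
}


def fake_logits(rng: random.Random) -> list[float]:
    """Stand-in for a model forward pass: one random score per vocab token."""
    return [rng.gauss(0.0, 1.0) for _ in VOCAB]


def constrained_generate(seed: int) -> str:
    """Greedy decoding where tokens the grammar forbids are never candidates."""
    rng = random.Random(seed)
    state = "start"
    pieces: list[str] = []
    while state != "done":
        logits = fake_logits(rng)
        allowed = GRAMMAR[state]
        # Mask step: keep only vocab indices the grammar permits in this state.
        candidates = [i for i, tok in enumerate(VOCAB) if tok in allowed]
        best = max(candidates, key=lambda i: logits[i])
        pieces.append(VOCAB[best])
        state = allowed[VOCAB[best]]
    return "".join(pieces)
```

Even though the scores are random, every call returns a string that `json.loads` accepts and that contains exactly the key `ok`, because invalid tokens are removed before selection rather than rejected afterwards.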
Journey Context:
With JSON mode or strict schema enforcement, the model occasionally emits invalid JSON or omits required keys. When that happens, the OpenAI API returns an error or a partial response, and the client has to retry with the full context. Each retry resends the entire conversation history (8k-32k tokens). With failure rates of 15-20% on complex schemas, these retries multiply cost. Constrained decoding driven by a context-free grammar (CFG) restricts sampling to tokens that keep the output valid at every step, so the output is schema-valid in a single pass. This requires local inference (llama.cpp, or vLLM with Outlines) but removes the retry cost entirely. On generations that fail, the saving is severalfold: 1 pass versus 4-5 retries.
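The figures above can be turned into a rough cost estimate. The sketch below assumes each attempt fails independently with the same probability and that every attempt resends the full context. Both are simplifications, since a schema that fails once may well fail again. The function names are illustrative, not from any library.

```python
def expected_input_tokens(context_tokens: int, failure_rate: float, max_retries: int) -> float:
    """Expected prompt tokens per request when every attempt resends the full context.

    Assumes attempts fail independently with probability `failure_rate` and the
    client stops after `max_retries` retries (max_retries + 1 attempts in total).
    """
    if not 0.0 <= failure_rate < 1.0:
        raise ValueError("failure_rate must be in [0, 1)")
    # Attempt k (0-based) is only made if the previous k attempts all failed.
    expected_attempts = sum(failure_rate ** k for k in range(max_retries + 1))
    return context_tokens * expected_attempts


def worst_case_input_tokens(context_tokens: int, max_retries: int) -> int:
    """Prompt tokens for a request that fails on every attempt except the last."""
    return context_tokens * (max_retries + 1)


# Example: 16k-token context, 20% failure rate, up to 4 retries.
average = expected_input_tokens(16_000, 0.20, 4)
worst = worst_case_input_tokens(16_000, 4)
```

Under these assumptions the average request costs about 1.25x the single-pass input tokens, while a request that exhausts all 4 retries costs 5x. The multi-fold figures therefore describe the failing requests, which a single-pass constrained decoder would hold at 1x.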
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T14:51:59.272471+00:00— report_created — created