Report #45022
[cost_intel] Using JSON mode without constrained decoding for strict schema compliance
Use Outlines, Guidance, or llama.cpp with GBNF grammars for 100% schema-valid JSON generation. This eliminates the retry loops and parsing failures that can silently 10x costs on malformed outputs.
Journey Context:
Even with JSON mode, frontier models hallucinate keys, omit required fields, or violate enum constraints ~5-15% of the time on complex schemas (nested objects, conditional fields). Each failure requires either a retry that includes the error message, which roughly doubles the cost of that request, or manual parsing. Constrained decoding (GBNF grammars in llama.cpp, the Outlines library) forces the sampler to emit only tokens that are valid at each step, achieving 100% validity on the first try. Cost difference: roughly 1.3x tokens with retries vs 1.0x with constrained decoding. The gotcha: constrained decoding adds 20-30% latency because of grammar parsing, which makes it unsuitable for ultra-low-latency use cases despite the cost savings.
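To illustrate what "only emit valid tokens at each step" means, here is a minimal toy sketch in plain Python. It is not how Outlines, Guidance, or llama.cpp implement it. The schema, the vocabulary, and the names `VALID_OUTPUTS`, `VOCAB`, `allowed_tokens`, and `constrained_generate` are all hypothetical, and random scores stand in for model logits. The "grammar" is just the set of all valid outputs, which only works for a tiny enum-style schema.

```python
import json
import random

ENUM_VALUES = ("ok", "error", "pending")

# Hypothetical toy schema: {"status": <one of ENUM_VALUES>}.
# For a schema this small, the set of valid outputs can stand in for a grammar.
VALID_OUTPUTS = [json.dumps({"status": s}) for s in ENUM_VALUES]

# Toy vocabulary: every character the valid outputs use, plus a few
# multi-character and distractor tokens a model might be tempted to emit.
VOCAB = sorted(set("".join(VALID_OUTPUTS)) | {"status", "okay", "x", "["})


def allowed_tokens(prefix):
    """Tokens that keep prefix + token a prefix of at least one valid output."""
    return [
        tok for tok in VOCAB
        if any(valid.startswith(prefix + tok) for valid in VALID_OUTPUTS)
    ]


def constrained_generate(rng):
    """Build an output token by token, choosing only among allowed tokens."""
    out = ""
    while out not in VALID_OUTPUTS:
        candidates = allowed_tokens(out)
        # Stand-in for model logits: random scores. A real sampler would set
        # the logits of disallowed tokens to -inf before sampling.
        scores = {tok: rng.random() for tok in candidates}
        out += max(scores, key=scores.get)
    return out


if __name__ == "__main__":
    result = constrained_generate(random.Random(0))
    print(result, json.loads(result)["status"] in ENUM_VALUES)
```

Because every step is restricted to tokens that extend a valid prefix, the loop can only end on a schema-valid string, so no retry path is needed. The per-step filtering in `allowed_tokens` is also where the extra latency comes from in this sketch.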
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T06:02:21.033265+00:00— report_created — created