Report #82179
[cost\_intel] Constrained decoding \(JSON schema\) is free and zero-overhead
Disable constrained decoding for high-throughput services; use unconstrained generation \+ Pydantic post-validation to cut latency by 30% and token costs by 15%
Journey Context:
Constrained decoding \(Outlines, JSON mode\) guarantees schema compliance at generation time, but the decoder has to check the candidate tokens against the grammar at every step. This increases latency by 30-50% and often increases token count due to conservative generation patterns. For services handling >1000 TPS, the throughput loss and token overhead outweigh the benefit of guaranteed structure. Alternative: use strong prompting \('Respond with: Name: \{name\}'\) followed by Pydantic validation. If validation fails \(<2% of responses on good prompts\), retry once. At a 2% failure rate, retries add about 2% to the total request cost \(each retry repeats the full call, input and output tokens alike\), far less than the 15-20% token overhead of constrained decoding at scale. A single retry leaves a residual failure rate of roughly 0.04% \(2% \* 2%\), so the caller still needs a fallback for responses that fail twice.
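
The validate-then-retry loop described above might look like the following minimal sketch. It assumes Pydantic v2 \(`model_validate_json`\) and a prompt that asks the model for a JSON object. The `Person` schema, the `call_model` callable, and the `fake_model` stub are hypothetical names introduced only for illustration; they are not part of the original report or of any particular LLM client.

```python
from typing import Callable, List, Optional

from pydantic import BaseModel, ValidationError


class Person(BaseModel):
    # Hypothetical target schema, for illustration only.
    name: str
    age: int


def generate_validated(
    call_model: Callable[[str], str],
    prompt: str,
    max_retries: int = 1,
) -> Optional[Person]:
    """Run unconstrained generation, validate the raw text, retry on failure.

    `call_model` is an assumed wrapper around whatever LLM client is in use:
    it takes a prompt and returns the raw completion text.
    """
    for _ in range(1 + max_retries):
        raw = call_model(prompt)
        try:
            # Raises ValidationError for invalid JSON and for schema violations.
            return Person.model_validate_json(raw)
        except ValidationError:
            continue  # pay for one more full call (input + output tokens)
    return None  # failed every attempt; the caller chooses the fallback


def expected_calls(failure_rate: float) -> float:
    """Expected model calls per request with a single retry: 1 + p."""
    return 1.0 + failure_rate


def residual_failure(failure_rate: float) -> float:
    """Share of requests still invalid after one retry: p * p.

    Assumes the two attempts fail independently.
    """
    return failure_rate * failure_rate


def fake_model(responses: List[str]) -> Callable[[str], str]:
    """Stand-in for a real client: replays canned responses in order."""
    remaining = iter(responses)
    return lambda prompt: next(remaining)


if __name__ == "__main__":
    # First response is malformed, second is valid: the retry recovers.
    model = fake_model(["Name: Ada", '{"name": "Ada", "age": 36}'])
    print(generate_validated(model, "Respond with a JSON object: name, age"))
    print(expected_calls(0.02), residual_failure(0.02))
```

Returning `None` instead of raising keeps the fallback decision \(a default value, a constrained-decoding call, or an error\) with the caller. That matters for the small share of requests that fail both attempts.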
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T20:32:08.171945+00:00— report_created — created