Report #98798
[research] How do I enforce structured output on local or self-hosted LLMs?
Use constrained decoding. vLLM's guided_json, Outlines, or XGrammar compile your JSON schema into a grammar and mask invalid tokens during generation. This eliminates parse-and-retry loops and works with Llama, Qwen, Mistral, and others. For production serving, prefer vLLM or XGrammar; for prototyping, Outlines.
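To make "mask invalid tokens during generation" concrete, here is a minimal toy sketch in plain Python. It does not use vLLM, Outlines, or XGrammar: the vocabulary, the hand-written GRAMMAR table, and the fake_logits function are all hypothetical stand-ins chosen for illustration. Real libraries compile the state machine from your JSON schema and apply the mask to the model's actual logits.

```python
import math

# Toy vocabulary; a real tokenizer has tens of thousands of entries.
VOCAB = ['{', '}', '"ok"', ':', 'true', 'false', 'Sure!', '```']

# Hand-written state machine for the single shape {"ok": <bool>}.
# Maps state -> {allowed token: next state}. This table is an assumption
# standing in for a grammar compiled from a JSON schema.
GRAMMAR = {
    'start': {'{': 'key'},
    'key': {'"ok"': 'colon'},
    'colon': {':': 'value'},
    'value': {'true': 'close', 'false': 'close'},
    'close': {'}': 'done'},
}


def fake_logits(step):
    """Hypothetical model that prefers chatty, invalid tokens."""
    scores = {tok: 0.0 for tok in VOCAB}
    scores['Sure!'] = 5.0   # unconstrained, this would be picked first
    scores['```'] = 4.0     # markdown fence, also invalid here
    scores['true'] = 1.0
    return scores


def constrained_decode(logits_fn, max_steps=10):
    """Greedy decoding where disallowed tokens are masked to -inf."""
    state, out = 'start', []
    for step in range(max_steps):
        if state == 'done':
            break
        allowed = GRAMMAR[state]
        logits = logits_fn(step)
        # The mask: any token the grammar disallows in this state gets -inf.
        masked = {t: (s if t in allowed else -math.inf)
                  for t, s in logits.items()}
        token = max(masked, key=masked.get)
        out.append(token)
        state = allowed[token]
    # The second value reports whether the grammar reached its final state.
    return ''.join(out), state == 'done'
```

With the default step budget, constrained_decode(fake_logits) returns ('{"ok":true}', True) even though the fake model scores "Sure!" highest at every step. With max_steps=3 it returns ('{"ok":', False): the mask keeps every emitted token valid, but a generation that runs out of budget is still incomplete.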
Journey Context:
Without constrained decoding, local models often wrap JSON in markdown, omit keys, or invent fields. Post-hoc regex repair is brittle. Constrained decoding enforces the schema at each token step: the sampler can only choose tokens the grammar allows, so structural compliance holds by construction. It does not make the values themselves correct. JSONSchemaBench found major coverage differences across frameworks, so test your actual schema. Cache compiled grammars and set max_tokens conservatively.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-28T04:48:04.172902+00:00— report_created — created