Report #27006
[cost_intel] JSON mode token bloat from repeated schema descriptions in structured outputs
Use OpenAI structured outputs with strict: true to get grammar-based constrained decoding, which reduces token count by 40% vs JSON mode because the schema does not have to be re-described in context. Define enums once as separate $ref definitions so they are not expanded inline at every use.
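A minimal sketch of what such a request payload could look like, assuming the Chat Completions response_format shape with type "json_schema". The ticket schema, its field names, and the resolve_ref helper are hypothetical and exist only for illustration; the point is that the enum is declared once under $defs and referenced twice.

```python
import json

# Hypothetical enum, declared once and shared by two fields via $ref.
PRIORITY_ENUM = {"type": "string", "enum": ["low", "medium", "high"]}

# Hypothetical schema for illustration. Strict mode expects every property
# to be listed in "required" and additionalProperties to be False.
ticket_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "priority": {"$ref": "#/$defs/priority"},
        "escalation_priority": {"$ref": "#/$defs/priority"},
    },
    "required": ["title", "priority", "escalation_priority"],
    "additionalProperties": False,
    "$defs": {"priority": PRIORITY_ENUM},
}

# Value for the response_format parameter; "ticket" is an assumed name.
response_format = {
    "type": "json_schema",
    "json_schema": {"name": "ticket", "strict": True, "schema": ticket_schema},
}


def resolve_ref(schema: dict, ref: str) -> dict:
    """Follow a local '#/a/b' style reference inside the same schema."""
    if not ref.startswith("#/"):
        raise ValueError(f"only local references are handled: {ref}")
    node = schema
    for part in ref[2:].split("/"):
        node = node[part]
    return node


# The enum values are serialized once, no matter how many fields use them.
schema_json = json.dumps(ticket_schema)
```

The resulting response_format dict would be passed as the response_format argument of the chat completion call, with no schema description repeated in the prompt text.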
Journey Context:
Developers enable JSON mode and pass the full schema in every prompt, which adds 4k+ tokens of overhead per request as the model re-describes the JSON structure internally. The bloat compounds in batch processing: 1,000 requests that each carry a 500-token schema spend 500k tokens on repetition alone. OpenAI's structured outputs (response_format with strict: true) apply constrained decoding at the sampler level, so the schema no longer has to be described in the prompt. The failure mode is complex nested objects: without $ref separation, nested schemas are inlined and the context balloons. Audit your schemas: if nesting depth exceeds 3, flatten the schema or use $ref pointers.
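The batch arithmetic and the depth audit described above can be sketched as follows. Both function names are hypothetical, and the definition of depth used here (each nested object or array adds one level, $ref nodes are not followed) is an assumption, since the report does not say how depth should be counted.

```python
def repeated_schema_tokens(num_requests: int, schema_tokens: int) -> int:
    """Tokens spent resending the same schema text in every prompt."""
    return num_requests * schema_tokens


def schema_depth(node, depth: int = 0) -> int:
    """Maximum nesting depth of object/array schemas.

    Assumption: a $ref counts as a leaf, because the referenced
    definition lives once under $defs instead of being inlined.
    """
    if not isinstance(node, dict) or "$ref" in node:
        return depth
    kind = node.get("type")
    if kind == "object":
        children = node.get("properties", {}).values()
        return max(
            (schema_depth(child, depth + 1) for child in children),
            default=depth + 1,
        )
    if kind == "array":
        return schema_depth(node.get("items", {}), depth + 1)
    return depth


def needs_flattening(schema: dict, max_depth: int = 3) -> bool:
    """Apply the report's rule of thumb: flag schemas deeper than 3."""
    return schema_depth(schema) > max_depth


# The batch example from the text: 1000 requests, 500-token schema each.
wasted = repeated_schema_tokens(1000, 500)
```

Running needs_flattening over each schema before a batch job gives a quick list of candidates to flatten or to split into $ref definitions.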
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T23:43:33.319935+00:00— report_created — created