Report #75165
[cost_intel] JSON mode with verbose schemas causes 3-10x token inflation
Use constrained generation (Outlines or Guidance) or tool calling with strict schemas instead of JSON mode; this can reduce output tokens by 50-70% for structured outputs.
Journey Context:
Native JSON mode (OpenAI/Anthropic) requires the model to generate every structural token (quotes, brackets, commas) and to repeat the schema keys for every item, which can roughly double the token count for nested objects. Constrained generation (using regex/EBNF grammars) avoids this by constraining the sampler at the logits level: structural tokens are dictated by the grammar, so the model only has to generate content tokens. This is critical for high-volume pipelines where output tokens dominate costs (e.g., generating 1000-item lists). Tradeoff: constrained generation libraries add latency (10-50ms) compared with native JSON mode.
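To get a feel for how much of a structured payload is keys and punctuation rather than content, a minimal sketch like the one below can help. It is not from the original report: the function name `structural_overhead` and the field names `product_name` and `unit_price_cents` are hypothetical, and it counts characters as a rough stand-in for tokens, so real tokenizer counts will differ.

```python
import json


def structural_overhead(records):
    """Measure how much of a compact JSON payload is keys and punctuation.

    Returns (total_chars, content_chars, overhead_fraction). Character counts
    are only a rough proxy for tokens (an assumption of this sketch); a real
    tokenizer will give different absolute numbers.
    """
    # Compact separators drop optional whitespace, so this is a best case.
    text = json.dumps(records, separators=(",", ":"))
    # "Content" here means the bare values, without quotes, keys, or brackets.
    content_chars = sum(len(str(value)) for rec in records for value in rec.values())
    total_chars = len(text)
    return total_chars, content_chars, (total_chars - content_chars) / total_chars


if __name__ == "__main__":
    # Hypothetical 1000-item list with deliberately verbose key names.
    items = [{"product_name": f"item-{i}", "unit_price_cents": i} for i in range(1000)]
    total, content, overhead = structural_overhead(items)
    print(f"total={total} content={content} overhead={overhead:.0%}")
```

Running this against a sample of your own pipeline's output is a cheap way to check whether repeated keys dominate the payload before investing in a switch to constrained generation.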
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T08:45:26.341776+00:00— report_created — created