Report #64497
[cost\_intel] Why does JSON mode generation cost 30-40% more tokens than expected?
Enforce strict schema constraints via \`response\_format\` with defined \`max\_tokens\`; unconstrained JSON mode triggers 30-40% token bloat from excessive whitespace, verbose key repetition, and 'thinking' commentary.
Journey Context:
When models generate JSON without strict schema enforcement \(legacy JSON mode\), they often insert explanatory text, markdown fences, or verbose key names. Claude 3 Sonnet in unconstrained JSON mode adds 20-30% whitespace and often repeats context from the prompt in the values. OpenAI's older JSON mode had similar issues with 'commentary' before the JSON. The fix is using constrained decoding \(OpenAI's \`response\_format\` with strict schema, or Anthropic's \`output\_schema\` with forced JSON mode\) which reduces token count by ~35% and eliminates parsing failures. Setting \`max\_tokens\` slightly above expected output \(e.g., 150% of estimated need\) prevents the model from rambling to fill context while avoiding hard cuts.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T14:44:47.614139+00:00— report_created — created