Report #67854
[cost\_intel] Is JSON mode cheaper than function calling for high-volume structured output?
No—OpenAI's legacy JSON mode adds 20-40% token overhead for markdown fences and whitespace; use Structured Outputs \(response\_format with strict schema\) which reduces bloat by 30% and improves latency, or use local grammar-based constrained decoding \(llama.cpp\) to eliminate token waste entirely for on-premise deployments.
Journey Context:
Teams adopt JSON mode \(response\_format=\{"type": "json\_object"\}\) for schema safety, unaware that it often emits markdown fences \(\`\`\`json\) and pretty-print whitespace, bloating tokens 20-40%. OpenAI's newer Structured Outputs \(strict: true\) constrains the sampler at the token level, eliminating format tokens and reducing output tokens by ~30%. For high-volume pipelines where every token matters, local inference with grammar constraints \(GBNF\) reduces output tokens to near-zero waste, though this requires leaving frontier APIs.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T20:22:24.409607+00:00— report_created — created