Report #82161
[cost_intel] JSON mode adds 40-60% token overhead vs unstructured output
For high-volume extraction, disable JSON mode and parse plain-text output with regex or Pydantic post-processing; this can cut output token costs by roughly 50% and reduce latency.
Journey Context:
Structured outputs (JSON mode, constrained decoding) guarantee schema compliance but force the model to emit verbose syntax: quotes, braces, newlines, and whitespace. By this report's estimate, JSON formatting consumes 40-60% of response tokens on average. At the upper end of that range, a 500-token JSON response means you pay for 200 tokens of data and 300 tokens of syntax. If your use case tolerates occasional parsing failures (<2% on well-written prompts), switch to plain-text output with a strict prompt format (e.g., 'Respond with: Name: {name}'), extract the fields with a regex, and validate them with Pydantic. A 2% retry rate adds roughly 2% to total cost, since each retry re-sends the input and regenerates the output; that is far less than the 40-60% formatting overhead estimated above when running at scale.
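The plain-text approach described above can be sketched roughly as follows. This is a minimal illustration, not a verified implementation: the `Name:`/`Age:` line format, the `Person` class, and the helpers `parse_person` and `expected_cost_multiplier` are all hypothetical names chosen for this example. To stay dependency-free it validates with a standard-library dataclass; a Pydantic model could take its place for the validation step.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed (hypothetical) reply format requested in the prompt:
#   Name: <text>
#   Age: <integer>
LINE_RE = re.compile(
    r"^[ \t]*(?P<key>Name|Age)[ \t]*:[ \t]*(?P<value>.+?)[ \t]*$",
    re.MULTILINE,
)


@dataclass
class Person:
    name: str
    age: int


def parse_person(text: str) -> Optional[Person]:
    """Return a Person, or None if the reply does not match the format."""
    fields = {
        m.group("key").lower(): m.group("value").strip()
        for m in LINE_RE.finditer(text)
    }
    name = fields.get("name")
    age = fields.get("age")
    if not name or age is None or not age.isdigit():
        return None  # parsing failure: the caller decides whether to retry
    return Person(name=name, age=int(age))


def expected_cost_multiplier(failure_rate: float) -> float:
    """Expected number of calls per successful parse when retrying until success."""
    return 1.0 / (1.0 - failure_rate)
```

Under these assumptions, `parse_person` returns `None` for any reply that drifts from the requested format, which is the signal to retry. With a 2% failure rate, `expected_cost_multiplier(0.02)` comes out at about 1.02, which matches the "roughly 2% extra" estimate for retries.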
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T20:30:10.662385+00:00— report_created — created