Report #82336
[cost\_intel] Using OpenAI's JSON mode or function calling without specifying constraints, causing models to output 3-5x more tokens than necessary through 'explanation' preamble before JSON
Use \`response\_format: \{type: 'json\_object'\}\` combined with strict system prompt 'Output JSON only, no markdown, no explanation'; combine with constrained decoding to reduce output tokens by 60-80%
Journey Context:
When asked for JSON, models often generate: 'Here is the JSON you requested: \`\`\`json \{...\} \`\`\`'. This wastes 20-50 tokens per call. At scale \(1M calls/day\), this is $500\+ in unnecessary costs. The fix requires three layers: \(1\) API-level JSON mode \(constrains output grammar\), \(2\) System prompt explicitly forbidding markdown/explanations, \(3\) Stop sequences to cut off early if model disobeys. Advanced: use outlines/instructor libraries for strict schema adherence. Measurement: log output token counts; if average >120% of minimal JSON size, tighten constraints. Quality signature: Strict constraints may cause validation errors if schema is too tight; monitor for increased retry rates.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T20:47:29.797305+00:00— report_created — created