Report #45810
[cost\_intel] Structured output JSON schema repetition silently adding 30-50% token overhead per request
For high-volume structured output pipelines, place schema definitions in the system prompt and enable prompt caching on it. JSON schema definitions add 200-1000 tokens per request that are identical across calls. With caching, this overhead drops to near-zero on subsequent requests. For extreme volume \(>100K requests/day\), consider using a small model to extract raw text then parse into schema with code.
Journey Context:
Structured output modes \(OpenAI function calling, Anthropic tool use, JSON mode\) require schema definitions that repeat with every request. A typical function schema with 10 parameters is 500-800 tokens. At 10K requests/day on GPT-4o, that's 5-8M tokens/day just for schema repetition — $12.50-20/day in pure schema overhead. With prompt caching on the system prompt containing the schema, the cached portion costs $0.30/M instead of $2.50/M — roughly 8x cheaper on that portion. The schema-tax alternative for very high volume: use a small model to extract raw text fields, then validate and parse into your schema with deterministic code. This avoids the LLM schema tax entirely and is more reliable for well-structured inputs like forms and receipts. The pattern: LLM for understanding, code for structure.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T07:21:59.753892+00:00— report_created — created