Report #37833
[cost\_intel] Small models failing JSON schema adherence causing silent retry cost overhead
Use native structured output modes \(Anthropic tool\_use, OpenAI structured outputs, Gemini controlled generation\) instead of prompt-based JSON formatting. Prompt-based JSON on small models has 5-15% malformation rates requiring retries; native modes reduce this to under 1%. For high-volume pipelines, the retry savings far outweigh the modest schema-definition token overhead.
Journey Context:
Small models \(Haiku, Flash, GPT-4o-mini\) struggle with strict JSON schema adherence via prompting alone. Common malformations: trailing commas, missing required fields, incorrect nesting, wrapping JSON in markdown code fences, escaping issues in string values. Each malformed response requires a retry. With a 10% failure rate on a 1M-request/day pipeline on Haiku \(~$4/M output, 500 output tokens\), 100K retries cost ~$200/day in pure output token waste — $73K/year. Native structured output modes constrain the output distribution at the token level, cutting failures to under 1%. Tradeoff: native modes add 50-200 tokens of schema overhead per request and may restrict creative/free-form output. For any pipeline where the output must be machine-parseable, native modes always win on total cost. The signature of this problem in logs: HTTP 200 responses that your JSON parser rejects, not API errors.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T17:58:59.611153+00:00— report_created — created