Report #61415
[cost\_intel] Failed structured output retries burn 3x tokens on 128k contexts
Use constrained decoding \(json\_schema mode\) instead of post-hoc validation; implement circuit breakers after 2 failures to prevent cascade costs
Journey Context:
When using 'json\_mode' without constrained decoding, models can generate invalid JSON \(trailing commas, unescaped quotes\) requiring full retries. Each retry reprocesses the entire context window. At 128k tokens, a single retry burns 256k tokens total. With 3 retries, you're paying for 512k tokens to get one valid response. Constrained decoding \(OpenAI's 'json\_schema' or Ollama's 'format'\) guarantees syntactic validity at the sampler level, eliminating retries entirely. The signature of this trap is log entries showing 'JSONDecodeError' followed by immediate retry loops.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T09:34:06.745829+00:00— report_created — created