Report #47195
[cost_intel] Why did my structured JSON mode costs spike 400% on complex schemas?
Implement logprobs monitoring to detect high-uncertainty tokens (>0.9 entropy) and fail fast before JSON validation fails; switch to constrained decoding (Outlines or llama.cpp grammars), which enforces the JSON schema at the token sampling level instead of relying on post-hoc validation and retry.
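A minimal sketch of the entropy check, under stated assumptions: the per-token candidate logprobs have already been pulled out of the response (for OpenAI Chat Completions, by requesting `logprobs` with `top_logprobs`), the entropy is measured in nats over the renormalized top-k candidates, and 0.9 is taken as the threshold from the summary above. The function names `topk_entropy` and `first_uncertain_token` are hypothetical helpers, not part of any library.

```python
import math


def topk_entropy(logprobs):
    """Shannon entropy (in nats) of the renormalized top-k candidates.

    `logprobs` is a list of natural-log probabilities for one token
    position. Only the top-k are visible, so this is an approximation
    of the full-vocabulary entropy.
    """
    probs = [math.exp(lp) for lp in logprobs]
    total = sum(probs)
    if total <= 0:
        return 0.0
    return -sum((p / total) * math.log(p / total) for p in probs if p > 0)


def first_uncertain_token(token_logprobs, threshold=0.9):
    """Return the index of the first position whose entropy exceeds
    `threshold`, or None if every position is below it.

    The 0.9 default and the choice of nats are assumptions.
    """
    for index, candidates in enumerate(token_logprobs):
        if topk_entropy(candidates) > threshold:
            return index
    return None


# One confident position followed by one near-uniform position.
positions = [
    [math.log(0.99), math.log(0.01)],
    [math.log(0.25)] * 4,
]
flagged = first_uncertain_token(positions)
```

To actually save tokens, a check like this would have to run on a streamed response so the request can be cancelled as soon as a position is flagged; applied to a completed response it only helps decide whether a retry is worth attempting.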
Journey Context:
OpenAI's JSON mode guarantees syntactically valid JSON but not conformance to a particular schema, so responses can still fail validation (common with nested objects or enum constraints). The standard client pattern is to catch the ValidationError, append the error to the message history, and retry. Each retry resends the entire conversation context (which may be long) plus the error message. For complex schemas, 3-5 retries are common, so token spend grows to several times that of a single call. The root cause is that the schema is checked after generation rather than enforced at the sampling level. Constrained decoding (using grammars or regex constraints in vLLM or llama.cpp) masks invalid tokens at each step, so the output conforms to the schema on the first try and the retry cost disappears. OpenAI's Structured Outputs with strict mode enabled applies the same kind of schema enforcement on the API side. If using OpenAI without it, monitoring logprobs to detect when the model is 'confused' (high entropy) allows early termination before more tokens are burned.
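The retry arithmetic can be sketched as follows. This is a rough model, not a billing formula: it assumes each failed attempt appends both the failed output and the validation error to the history, and that every attempt produces about the same number of output tokens. The function name `retry_token_cost` and the example token counts are hypothetical.

```python
def retry_token_cost(context_tokens, output_tokens, error_tokens, retries):
    """Estimate total (input, output) tokens for a validate-and-retry loop.

    Assumption: attempt k resends the original context plus the k earlier
    failed outputs and their k error messages.
    """
    total_input = 0
    total_output = 0
    for attempt in range(retries + 1):
        total_input += context_tokens + attempt * (output_tokens + error_tokens)
        total_output += output_tokens
    return total_input, total_output


# Hypothetical numbers: 4000-token context, 500-token output,
# 100-token error message.
single_call = retry_token_cost(4000, 500, 100, retries=0)   # (4000, 500)
four_retries = retry_token_cost(4000, 500, 100, retries=4)  # (26000, 2500)
```

With these illustrative numbers, four retries cost 28,500 tokens in total against 4,500 for a single successful call, which is why long contexts make retry loops expensive.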
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T09:41:16.593127+00:00— report_created — created