Report #66172
[cost_intel] Failed JSON mode retries consume 3-5x expected tokens before success
Implement client-side JSON repair before retry; drop temperature to 0 for schema-constrained calls; use 'strict' mode APIs when available
Journey Context:
When an LLM emits malformed JSON (common with nested schemas), developers typically retry with the full context. Each retry reprocesses the entire prompt plus any previous failed attempts, so token usage climbs quickly: with a 4k-token context, 3 retries resend at least 3 x 4k = 12k tokens, all of them wasted. There are two alternatives. Client-side repair (regex fixes, partial JSON parsing) succeeds 80% of the time without another API call. Strict mode (OpenAI json_schema) or grammars (llama.cpp) constrain output at the token-sampler level, eliminating these retries entirely. Combining client-side repair with strict mode is the cost-optimal path.
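A minimal sketch of the client-side repair step, assuming the payload is a JSON object or array and that the usual failures are code fences, surrounding chatter, trailing commas, and truncated output. The names `try_repair`, `strip_code_fence`, `extract_json_span`, `close_open_brackets`, and `remove_trailing_commas` are hypothetical helpers written for this illustration, not part of any library. It uses only the Python standard library.

```python
import json
import re


def strip_code_fence(text: str) -> str:
    """Return the body of a ```json ... ``` fence if the model added one."""
    match = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    return match.group(1) if match else text


def extract_json_span(text: str) -> str:
    """Drop chatter before the first '{' or '['; return '' if neither exists."""
    starts = [i for i in (text.find("{"), text.find("[")) if i != -1]
    return text[min(starts):] if starts else ""


def close_open_brackets(text: str) -> str:
    """Append closers for a string or brackets left open by truncation."""
    stack = []
    in_string = False
    escaped = False
    for ch in text:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            stack.append("}" if ch == "{" else "]")
        elif ch in "}]" and stack:
            stack.pop()
    if in_string:
        text += '"'
    return text + "".join(reversed(stack))


def remove_trailing_commas(text: str) -> str:
    """Remove commas directly before a closer.

    Assumption: string values do not contain ',}' or ',]' sequences;
    a real implementation would skip over string contents.
    """
    return re.sub(r",\s*([}\]])", r"\1", text)


def try_repair(raw: str):
    """Return the parsed dict/list, or None if a retry is still needed."""
    candidate = extract_json_span(strip_code_fence(raw)).strip()
    if not candidate:
        return None
    decoder = json.JSONDecoder()
    # Try the cheapest fix first, then progressively more invasive ones.
    fixes = (
        lambda s: s,
        remove_trailing_commas,
        lambda s: remove_trailing_commas(close_open_brackets(s)),
    )
    for fix in fixes:
        try:
            # raw_decode parses the first JSON value and ignores trailing text.
            value, _end = decoder.raw_decode(fix(candidate))
            return value
        except json.JSONDecodeError:
            continue
    return None
```

In a retry loop, the API would be called again only when `try_repair` returns `None`. Repaired output from a truncated response is syntactically valid but may be missing fields, so it is worth validating the result against the schema before accepting it.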
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T17:32:47.156311+00:00— report_created — created