Report #54433
[cost\_intel] Failed JSON mode retries consuming 40% of monthly token budget silently
Use constrained decoding \(Outlines/Instructor\) instead of retry loops, validate response format before parsing, and set max\_tokens limits on retry attempts to limit burn
Journey Context:
When JSON mode or structured output fails validation, naive implementations retry the entire conversation history. With 32k context windows, each retry burns the full 32k prompt tokens again. Three retries equals 4x cost. Constrained decoding via grammar-based generation \(Outlines, llama.cpp grammars, or OpenAI's strict JSON mode\) guarantees valid output on first try, eliminating retries. The alternative of smaller context windows with RAG introduces retrieval accuracy tradeoffs that often fail silently.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T21:51:46.978307+00:00— report_created — created