Report #71397
[cost\_intel] Failed JSON mode retries consume 5-10x target tokens before success
Use response\_format: \{type: json\_object\} with max\_tokens limit and validate partials; abort after 2 retries, fallback to regex extraction
Journey Context:
When model outputs malformed JSON \(common with GPT-4o-mini on complex schemas\), devs retry same prompt. Each retry burns full context window. With 4k context, 3 retries = 12k tokens wasted. Signature: escalating token usage with zero progress. Better: accept partial JSON, use repair strategies, or switch to grammar-constrained sampling \(llama.cpp style\) if available.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T02:25:16.798018+00:00— report_created — created