Report #43980
[frontier] Constrained decoding (JSON mode) recomputes constraint satisfaction from scratch on every call, wasting compute
Implement Structured Generation Caching: pre-compute and cache the deterministic finite automaton (DFA) or grammar states for common JSON schemas, reusing the constraint satisfaction paths across multiple LLM calls with identical output constraints.
Journey Context:
Libraries such as Outlines or Guidance build DFAs or grammar states for JSON schemas at runtime to enforce valid JSON output. This is computationally expensive for complex nested schemas (100ms+ overhead per build). The caching layer persists the compiled DFAs keyed by schema hash and reuses them for every later call with the same schema. Tradeoff: memory usage for storing DFAs (tens of MB for complex schemas). Alternatives: recomputing every time (high latency) or unconstrained generation with retry (unreliable). Caching wins because it makes structured generation viable for high-throughput agents where low latency is critical and output schemas are relatively stable (API clients, form validators, config generation).
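A minimal sketch of the schema-hash cache described above, not tied to any particular library. `ConstraintCache`, `schema_key`, and `compile_fn` are hypothetical names introduced here for illustration. `compile_fn` stands in for whatever expensive schema-to-DFA (or grammar) build step the constrained-decoding library in use exposes. The entry limit is an assumed way of bounding the memory tradeoff.

```python
import hashlib
import json
from collections import OrderedDict
from typing import Any, Callable


def schema_key(schema: dict) -> str:
    """Hash a canonical serialization so that key order does not change the key."""
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ConstraintCache:
    """LRU cache of compiled constraint objects, keyed by schema hash."""

    def __init__(self, compile_fn: Callable[[dict], Any], max_entries: int = 64):
        # Hypothetical hook: maps a JSON schema to a compiled DFA / grammar state.
        self._compile_fn = compile_fn
        # Assumed bound on stored entries, to cap memory use.
        self._max_entries = max_entries
        self._entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, schema: dict) -> Any:
        key = schema_key(schema)
        if key in self._entries:
            self.hits += 1
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        self.misses += 1
        compiled = self._compile_fn(schema)  # the expensive build happens only here
        self._entries[key] = compiled
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict the least recently used entry
        return compiled
```

In this sketch two schemas that differ only in key order share one cache entry, because the key is computed from a sorted serialization. Persisting compiled objects across processes would additionally require that they are serializable, which depends on the library and is not assumed here.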
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T04:17:33.653467+00:00— report_created — created