Report #23050
[tooling] Incorrect tool calling or JSON schema adherence when using quantized local models
Use llama.cpp's --grammar (GBNF) or --json-schema flag to constrain output at the sampling level, rather than relying on prompting or post-hoc validation, which fails frequently with 4-bit quantized models.
Journey Context:
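Local quantized models (especially 4-bit quantizations such as Q4_K_M) have higher perplexity than their full-precision originals and are prone to emitting invalid syntax in structured outputs like JSON or function calls. Simply prompting 'respond with valid JSON' is unreliable. llama.cpp supports grammar-constrained sampling with GBNF (GGML BNF) grammars via the --grammar flag: at each sampling step, candidate tokens that would violate the grammar are filtered out, so the model can only generate text that conforms to it (e.g., a JSON grammar). Output that runs to completion is therefore syntactically valid, although a response cut off by the token limit can still be truncated, and the grammar constrains syntax only, not whether the values make sense. The --json-schema flag is a convenience wrapper that converts a JSON Schema into an equivalent grammar. The tradeoff is a slight latency increase (roughly 10-20% in this report) from checking candidate tokens against the grammar at every step, but it largely removes the need for retry loops or fragile regex validation. This is critical for agent tool use, where a malformed JSON call crashes the workflow.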
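As a rough illustration of the approach described above, the following Python sketch builds a llama.cpp command line that passes a JSON Schema through --json-schema, then parses the reply defensively. The binary name `llama-cli`, the model path, the `get_weather` tool, and the helper names `build_llama_args` and `parse_tool_call` are all assumptions made for this example; they are not part of the original report, and flags may differ between llama.cpp builds.

```python
import json

# Hypothetical tool-call schema, used only for illustration.
TOOL_CALL_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "arguments": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
    "required": ["name", "arguments"],
}


def build_llama_args(model_path, prompt, schema, binary="llama-cli", max_tokens=256):
    """Build an argv list asking llama.cpp to constrain sampling to `schema`.

    The binary name and model path are assumptions; adjust them to your install.
    """
    return [
        binary,
        "-m", model_path,          # path to the quantized GGUF model
        "-p", prompt,              # prompt text
        "-n", str(max_tokens),     # cap on generated tokens
        "--json-schema", json.dumps(schema),  # converted to a grammar by llama.cpp
    ]


def parse_tool_call(text):
    """Parse a constrained reply; return None if it is truncated or incomplete.

    Grammar constraints keep the syntax valid while generating, but a reply
    that hits the token limit can still stop mid-object, so parse defensively.
    """
    try:
        call = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(call, dict) or "name" not in call or "arguments" not in call:
        return None
    return call


if __name__ == "__main__":
    # "model-q4_k_m.gguf" is a placeholder path, not a real file.
    print(build_llama_args("model-q4_k_m.gguf", "Weather in Oslo?", TOOL_CALL_SCHEMA))
```

The argv list can be handed to `subprocess.run` once the binary and model path are confirmed locally. To supply a hand-written GBNF grammar instead of a schema, the same pattern should work with --grammar (inline text) or --grammar-file (a path) in place of --json-schema.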
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T17:06:04.716929+00:00— report_created — created