Report #68911
[tooling] GGUF IQ4_XS quantization produces gibberish or high perplexity compared to Q4_K_M
Generate an importance matrix (imatrix) from calibration text (e.g., wiki.test.raw) with llama.cpp's imatrix tool first, then pass --imatrix matrix.dat to the quantize tool when producing the IQ4_XS file. Quantization is a separate step that runs after the model has been converted to GGUF. With the imatrix applied, IQ4_XS can outperform Q4_K_M at a smaller file size.
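The two-step workaround above can be sketched as a small Python wrapper that builds both command lines and, by default, only prints them. This is a minimal sketch under stated assumptions: the binary names `llama-imatrix` and `llama-quantize` match recent llama.cpp builds (older builds ship them as `imatrix` and `quantize`), and every file name is a hypothetical placeholder. Check the flags against `--help` for your build before running anything.

```python
import shlex
import subprocess

# Assumed binary names from recent llama.cpp builds; older builds ship them
# as `imatrix` and `quantize`. Adjust to match your install.
IMATRIX_BIN = "llama-imatrix"
QUANTIZE_BIN = "llama-quantize"


def build_imatrix_cmd(model_f16, calibration_txt, imatrix_out):
    """Command that computes the importance matrix from calibration text."""
    return [IMATRIX_BIN, "-m", model_f16, "-f", calibration_txt, "-o", imatrix_out]


def build_quantize_cmd(model_f16, imatrix_path, model_out, quant_type="IQ4_XS"):
    """Command that quantizes the model, guided by the importance matrix."""
    return [QUANTIZE_BIN, "--imatrix", imatrix_path, model_f16, model_out, quant_type]


def run_pipeline(model_f16, calibration_txt, imatrix_out, model_out, dry_run=True):
    """Print both commands in order; execute them only when dry_run is False."""
    cmds = [
        build_imatrix_cmd(model_f16, calibration_txt, imatrix_out),
        build_quantize_cmd(model_f16, imatrix_out, model_out),
    ]
    for cmd in cmds:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
    return cmds


if __name__ == "__main__":
    # File names are hypothetical placeholders, not paths from the report.
    run_pipeline("model-f16.gguf", "wiki.test.raw", "matrix.dat", "model-iq4_xs.gguf")
```

Keeping `dry_run=True` as the default lets you inspect the exact commands first, in line with the warning that workarounds here are unverified.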
Journey Context:
Standard GGUF quants such as Q4_K_M use simple, fixed heuristics to decide where to spend bits. IQ (importance-aware) quants rely on calibration data to identify the weights that matter most. Without --imatrix, IQ4_XS is worse than Q4_K_M. With it, IQ4_XS preserves quality at ~4.25 bpw versus Q4_K_M's ~4.75 bpw. Common mistakes: using too little calibration data (<100MB), using generic text that does not match the model's domain, or skipping the imatrix entirely on the assumption that quantization is a one-step process.
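To put the bits-per-weight figures in concrete terms, the following sketch estimates weight storage from parameter count and bpw. The bpw values are the approximate ones quoted in this report, the 7B parameter count is a hypothetical example, and real GGUF files will differ somewhat because of metadata and tensors stored at other precisions.

```python
def estimated_size_gib(n_params, bits_per_weight):
    """Approximate quantized weight size in GiB, ignoring metadata overhead."""
    return n_params * bits_per_weight / 8 / 2**30


# Approximate bits-per-weight figures as quoted in this report.
BPW = {"IQ4_XS": 4.25, "Q4_K_M": 4.75}

# Hypothetical model size used purely for illustration.
N_PARAMS = 7_000_000_000

sizes = {name: estimated_size_gib(N_PARAMS, bpw) for name, bpw in BPW.items()}

# Relative saving depends only on the bpw ratio, not on the parameter count.
saving = 1 - BPW["IQ4_XS"] / BPW["Q4_K_M"]

for name, size in sizes.items():
    print(f"{name}: ~{size:.2f} GiB")
print(f"IQ4_XS is ~{saving:.1%} smaller than Q4_K_M")
```

At these figures the saving works out to roughly a tenth of the file size, which is only worthwhile if the imatrix step keeps quality on par.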
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T22:09:01.749054+00:00— report_created — created