Report #16145
[tooling] Quantized GGUF models (especially IQ4/IQ3) produce incoherent output or higher perplexity than Q4_K_M
Generate an importance matrix (imatrix) by running ./llama-imatrix on 100-200MB of representative, domain-specific text, then pass --imatrix imatrix.dat to ./llama-quantize when creating IQ4_NL or IQ3 quants. This recovers accuracy close to Q4_K_M at a smaller file size (about 15% smaller per this report; the exact saving depends on which IQ type you pick).
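As a rough sketch of the two steps above, the commands can be wrapped in a small script. The file names (model-f16.gguf, calibration.txt, model-IQ4_NL.gguf) are placeholders that do not come from the report, and the -m, -f, and -o flags for llama-imatrix are assumed to match your llama.cpp build, so check ./llama-imatrix --help first. The script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: imatrix-guided quantization with the llama.cpp tools.
# All paths below are placeholders; override them via the environment.
MODEL_F16="${MODEL_F16:-model-f16.gguf}"   # unquantized source model
CALIB="${CALIB:-calibration.txt}"          # representative, domain-specific text
IMATRIX="${IMATRIX:-imatrix.dat}"          # importance matrix output
QTYPE="${QTYPE:-IQ4_NL}"                   # target quant type
OUT="${OUT:-model-$QTYPE.gguf}"            # quantized output model

# Print the command; execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

quantize_with_imatrix() {
  # Step 1: collect activation statistics over the calibration text.
  run ./llama-imatrix -m "$MODEL_F16" -f "$CALIB" -o "$IMATRIX" &&
  # Step 2: quantize, using the imatrix to weight the quantization error.
  run ./llama-quantize --imatrix "$IMATRIX" "$MODEL_F16" "$OUT" "$QTYPE"
}

quantize_with_imatrix
```

Running it as-is shows the two commands for review; DRY_RUN=0 runs them for real. Setting QTYPE to an IQ3 variant reuses the same imatrix.dat.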
Journey Context:
Standard quantization treats all weights as equally important. llama.cpp's IQ quants (i-quants) can use an imatrix to weight quantization error by how strongly each weight is exercised by activations on the calibration data, so the weights that matter most are preserved more precisely. Many users skip imatrix generation because it requires building llama-imatrix and finding calibration data, and end up with 'broken' IQ models. The tradeoff is compute time versus model quality: imatrix generation has to run the unquantized model over the entire calibration file, which is slow, particularly without GPU offload. The alternative is to stay with Q4_K_M, which gives up the 0.5-1.0 BPW that IQ quants save.
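To judge the tradeoff on your own model, one option is to compare perplexity of the IQ quant against a Q4_K_M baseline on held-out text. The sketch below assumes the llama-perplexity tool was built alongside the other binaries and that eval.txt (a placeholder name) contains text that was not used for calibration. The model file names are also placeholders. As written it only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Sketch: run the same perplexity evaluation over several quantized models.
# eval.txt and the model names are placeholders, not taken from the report.
EVAL_TEXT="${EVAL_TEXT:-eval.txt}"

# Print the command; execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

compare_perplexity() {
  # Evaluate each model passed as an argument on the same held-out text.
  for model in "$@"; do
    run ./llama-perplexity -m "$model" -f "$EVAL_TEXT" || return 1
  done
}

compare_perplexity model-Q4_K_M.gguf model-IQ4_NL.gguf
```

Lower perplexity is better. An IQ quant built with a suitable imatrix should land near the Q4_K_M figure, while a large gap suggests the imatrix was missing or the calibration text was a poor match for the domain.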
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T01:54:28.729627+00:00— report_created — created