Report #59732
[tooling] GGUF Q4\_K\_M quantization degrades model quality significantly compared to Q5\_K\_M
Generate an importance matrix \(imatrix\) with \`llama-imatrix -m unquantized.gguf -f training\_data.txt -o imatrix.dat\` \(older llama.cpp builds name this binary \`imatrix\`\), then quantize the same unquantized model with \`llama-quantize --imatrix imatrix.dat unquantized.gguf unquantized-Q4\_K\_M.gguf Q4\_K\_M\`. This activation-aware quantization preserves quality at the Q4\_K\_M level, rivaling naive Q5.
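The two-step workflow above can be scripted. The following is a minimal Python sketch that only assembles the two command lines, so they can be reviewed before anything runs. The helper name `build_imatrix_commands`, the output naming scheme, and the assumption that the binaries are on PATH as `llama-imatrix` and `llama-quantize` are illustrative and not taken from the report; adjust them to match your llama.cpp build.

```python
import shlex
from pathlib import Path


def build_imatrix_commands(model, calibration, quant_type="Q4_K_M",
                           imatrix_out="imatrix.dat"):
    """Return (imatrix_cmd, quantize_cmd) as argument lists.

    Hypothetical helper: binary names and the output file naming
    are assumptions, not something the report specifies.
    """
    model = Path(model)
    # Assumed naming scheme: <model stem>-<quant type>.gguf next to the input.
    quantized = model.with_name(f"{model.stem}-{quant_type}.gguf")

    # Step 1: collect activation statistics on the calibration text.
    imatrix_cmd = ["llama-imatrix", "-m", str(model),
                   "-f", str(calibration), "-o", str(imatrix_out)]

    # Step 2: quantize the same unquantized model, guided by the imatrix.
    quantize_cmd = ["llama-quantize", "--imatrix", str(imatrix_out),
                    str(model), str(quantized), quant_type]
    return imatrix_cmd, quantize_cmd


if __name__ == "__main__":
    for cmd in build_imatrix_commands("unquantized.gguf", "training_data.txt"):
        # Print instead of executing, so the commands can be checked first.
        print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the printed commands look right for your setup.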
Journey Context:
Without an imatrix, GGUF quantization treats all weights as equally important, but transformer weights vary in how sensitive the output is to rounding error. The imatrix tool records activation statistics for each weight tensor while running inference on representative data \(100-1k lines of domain text\), so the quantizer can spend its limited precision where errors matter most. This lets aggressive Q4 quants rival, and on some perplexity benchmarks outperform, naive Q5. Common mistake: using too little calibration data \(<100 tokens\) or data unrelated to the target domain. Tradeoff: a one-time compute cost \(minutes\) in exchange for a smaller model, which matters most for production 70B deployments where every GB counts.
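To make the idea of activation-aware quantization concrete, here is a toy Python sketch. It is not llama.cpp's actual algorithm: it quantizes one row of random weights to 15 symmetric levels and picks the scale by grid search, once with uniform weighting and once weighted by a made-up per-channel importance vector. All names and numbers here are hypothetical and chosen only for illustration.

```python
import random


def quantize(weights, scale, levels=7):
    # Symmetric round-to-nearest onto integers in [-levels, levels], then dequantize.
    return [max(-levels, min(levels, round(w / scale))) * scale for w in weights]


def weighted_error(weights, dequant, importance):
    # Squared rounding error, with each channel weighted by its importance.
    return sum(i * (w - d) ** 2 for w, d, i in zip(weights, dequant, importance))


def best_scale(weights, importance, levels=7, steps=200):
    # Grid search over candidate scales around max|w| / levels.
    max_abs = max(abs(w) for w in weights)
    candidates = [max_abs / levels * (0.5 + k / steps) for k in range(steps + 1)]
    return min(candidates,
               key=lambda s: weighted_error(weights, quantize(weights, s, levels), importance))


random.seed(0)
weights = [random.gauss(0, 1) for _ in range(32)]
# Hypothetical importance (e.g. mean squared activation per input channel):
# every eighth channel is assumed to matter 100x more than the rest.
importance = [100.0 if k % 8 == 0 else 1.0 for k in range(32)]
uniform = [1.0] * len(weights)

naive_scale = best_scale(weights, uniform)      # ignores activations
aware_scale = best_scale(weights, importance)   # activation-aware

# Judge both choices by the error that matters: the importance-weighted one.
naive_err = weighted_error(weights, quantize(weights, naive_scale), importance)
aware_err = weighted_error(weights, quantize(weights, aware_scale), importance)
print(f"naive scale {naive_scale:.4f}: weighted error {naive_err:.4f}")
print(f"aware scale {aware_scale:.4f}: weighted error {aware_err:.4f}")
```

Because both scales come from the same candidate grid, the importance-aware choice can never have a higher weighted error than the naive one. How much it helps depends on how uneven the importance values are, which is one reason representative calibration data matters.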
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T06:45:07.800506+00:00— report_created — created