Report #10321
[tooling] GGUF Q4_K_M quantization produces poor quality (high perplexity) on domain-specific models (code, math)
Generate an importance matrix (imatrix) with `./llama-imatrix -m unquantized.gguf -f calibration_data.txt -o imatrix.dat --chunks 100`, then pass it to the quantizer: `./llama-quantize --imatrix imatrix.dat unquantized.gguf output_Q4_K_M.gguf Q4_K_M`. This can reduce the perplexity gap vs FP16 from roughly 10% to under 2%.
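The two commands can be wrapped in a small driver so the file names stay consistent between steps. This is a minimal sketch, not part of llama.cpp: the helper names `build_commands` and `run_pipeline` and the `bin_dir` parameter are hypothetical, and it assumes the `llama-imatrix` and `llama-quantize` binaries sit in the current directory. It defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess


def build_commands(fp16_model, calibration, imatrix_out, quant_out,
                   quant_type="Q4_K_M", chunks=100, bin_dir="."):
    """Return the argv lists for the imatrix step and the quantize step.

    bin_dir is an assumption: adjust it to wherever your llama.cpp
    binaries were built.
    """
    imatrix_cmd = [
        f"{bin_dir}/llama-imatrix",
        "-m", fp16_model,        # unquantized FP16/BF16 source model
        "-f", calibration,       # domain-matched calibration text
        "-o", imatrix_out,       # binary importance matrix output
        "--chunks", str(chunks),
    ]
    quantize_cmd = [
        f"{bin_dir}/llama-quantize",
        "--imatrix", imatrix_out,
        fp16_model, quant_out, quant_type,
    ]
    return [imatrix_cmd, quantize_cmd]


def run_pipeline(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops the pipeline if a step fails
            subprocess.run(cmd, check=True)


commands = build_commands(
    "unquantized.gguf", "calibration_data.txt",
    "imatrix.dat", "output_Q4_K_M.gguf",
)
run_pipeline(commands, dry_run=True)
```

Call `run_pipeline(commands, dry_run=False)` only after checking that the printed commands match your paths.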
Journey Context:
Without an imatrix, GGUF quantization (including K-quants such as Q4_K_M) picks the quantized values for each block from the weights alone, so every weight in a tensor is treated as equally important to output quality. In practice some weights matter far more than others, namely those that are multiplied by large activations, and which activations are large depends on the domain (code and math behave differently from general prose). The imatrix tool runs representative calibration data through the unquantized model and records activation statistics for each weight tensor, which yields importance scores for the weights. The calibration data must match the target domain; use ~100-500MB of text similar to your inference data. During quantization these scores weight the rounding error, so the quantizer preserves the weights that most affect the output; they do not change the bit width that Q4_K_M assigns to each tensor. Common mistakes: using generic Wikipedia data for code models, or processing too few chunks (the `--chunks` flag limits how many chunks of the calibration file are processed). The tool writes a binary `.dat` file that `llama-quantize` consumes. Tradeoff: it requires the unquantized (FP16/BF16) source model and a one-time compute pass, but the result is a significantly better Q4_K_M than 'blind' quantization, often beating Q5_K_M quality at the Q4_K_M bitrate.
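The effect of importance weighting can be illustrated with a toy example. This is a simplified sketch under stated assumptions, not llama.cpp's actual algorithm: it quantizes one block of 32 weights to signed 4-bit levels with a single scale, takes the mean squared activation per column as the importance score, and compares a scale chosen with uniform importance ('blind') against one chosen with the activation-derived scores. All names and the synthetic data are hypothetical.

```python
import random


def column_importance(activations):
    """Mean squared activation per input column over all calibration rows."""
    n_rows = len(activations)
    n_cols = len(activations[0])
    return [sum(row[c] ** 2 for row in activations) / n_rows
            for c in range(n_cols)]


def dequantized(weights, scale, qmin=-8, qmax=7):
    """Round each weight to a signed 4-bit level, then map back to float."""
    return [max(qmin, min(qmax, round(w / scale))) * scale for w in weights]


def weighted_error(weights, approx, importance):
    """Importance-weighted squared rounding error for one block."""
    return sum(i * (w - a) ** 2
               for w, a, i in zip(weights, approx, importance))


def best_scale(weights, importance, steps=200):
    """Grid-search the block scale that minimizes the weighted error."""
    max_abs = max(abs(w) for w in weights) or 1.0
    candidates = [(max_abs / 8) * (0.5 + k / steps)
                  for k in range(1, steps + 1)]
    return min(candidates,
               key=lambda s: weighted_error(
                   weights, dequantized(weights, s), importance))


random.seed(0)
n_cols = 32
weights = [random.gauss(0.0, 1.0) for _ in range(n_cols)]

# Synthetic calibration activations: every 8th column is 10x larger,
# standing in for the domain-specific patterns described above.
col_gain = [10.0 if c % 8 == 0 else 1.0 for c in range(n_cols)]
activations = [[random.gauss(0.0, g) for g in col_gain] for _ in range(64)]

importance = column_importance(activations)
uniform = [1.0] * n_cols  # 'blind' quantization: all weights equal

blind_scale = best_scale(weights, uniform)
guided_scale = best_scale(weights, importance)

# Both results are scored with the real importance values, since that is
# what approximates the error seen at the layer output.
err_blind = weighted_error(
    weights, dequantized(weights, blind_scale), importance)
err_guided = weighted_error(
    weights, dequantized(weights, guided_scale), importance)
print(f"blind:  scale={blind_scale:.4f} weighted_error={err_blind:.4f}")
print(f"guided: scale={guided_scale:.4f} weighted_error={err_guided:.4f}")
```

Because both scales come from the same candidate grid, the guided choice can never score worse than the blind one on the weighted error. How much better it does depends on how unevenly the importance is distributed, which is why calibration data from the wrong domain gives little benefit.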
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T10:19:25.371733+00:00— report_created — created