Report #39706
[tooling] Q4_K_M quantization causes quality degradation on reasoning tasks
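Generate an importance matrix (imatrix) by running `llama-imatrix` (named `./imatrix` in older llama.cpp builds) on calibration data (e.g., Wikipedia text or code samples), then pass `--imatrix matrix.dat` to `llama-quantize` when quantizing. `convert.py` only converts the model to GGUF and does not take the matrix. The matrix tells the quantizer which weights matter most to the model's activations, so rounding error is pushed toward the weights that affect the output least while the file stays at Q4_K_M size.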
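A minimal sketch of the two steps, assuming `llama-imatrix` and `llama-quantize` are on `PATH`. The helper names (`build_imatrix_commands`, `run_all`) and the file names are hypothetical and only illustrate the order of operations; check `--help` on your llama.cpp build before running anything.

```python
import shlex
import subprocess


def build_imatrix_commands(model_f16, calibration_txt, imatrix_out, quantized_out,
                           quant_type="Q4_K_M",
                           imatrix_bin="llama-imatrix",
                           quantize_bin="llama-quantize"):
    """Return two command lines: collect the imatrix, then quantize with it."""
    collect = [
        imatrix_bin,
        "-m", str(model_f16),        # unquantized (F16/BF16) GGUF model
        "-f", str(calibration_txt),  # plain-text calibration corpus
        "-o", str(imatrix_out),      # where the importance matrix is written
    ]
    quantize = [
        quantize_bin,
        "--imatrix", str(imatrix_out),  # feed the collected statistics to the quantizer
        str(model_f16),                 # input GGUF
        str(quantized_out),             # output GGUF
        quant_type,
    ]
    return collect, quantize


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical file names; dry_run=True only prints the commands.
    cmds = build_imatrix_commands(
        "model-f16.gguf", "calibration.txt", "matrix.dat", "model-Q4_K_M.gguf"
    )
    run_all(cmds, dry_run=True)
```

With `dry_run=True` the script only prints the two commands, which makes it easy to review them before switching to `dry_run=False`.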
Journey Context:
Without an imatrix, the quantizer minimizes rounding error as if every weight mattered equally, but model quality is disproportionately sensitive to some tensors (token embeddings, attention output projections) and comparatively robust to quantization in the feed-forward networks. Users often jump from Q4_K_M to Q5_K_M and run out of VRAM, or simply accept the quality loss. The imatrix (importance matrix) method runs the model over representative data and records activation statistics that show which weights contribute most to output error. During quantization, those statistics weight the error minimization, so the most important weights are rounded more carefully. The imatrix does not change which quantization type each tensor gets: the chosen preset decides that, and Q4_K_M already stores some tensors at higher precision than 4-bit. The result is a GGUF the same size as plain Q4_K_M, typically with quality closer to Q5_K_M. The common mistake is skipping the calibration step or using too little data (e.g., a single paragraph), which produces a poor matrix. The tool is `llama-imatrix` in the llama.cpp repo, and the matrix is consumed by `llama-quantize`.
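To picture how activation statistics can steer quantization, here is a toy sketch of importance-weighted rounding. It is an illustration under stated assumptions, not llama.cpp's implementation: the block of weights, the importance values, the single-scale 4-bit format, and the grid search are all made up for the example.

```python
def quantize_block(values, scale, qmin=-8, qmax=7):
    """Round each value to the nearest signed 4-bit level, then dequantize."""
    out = []
    for v in values:
        q = max(qmin, min(qmax, round(v / scale)))
        out.append(q * scale)
    return out


def weighted_error(values, approx, importance):
    """Squared error where each weight's error is scaled by its importance."""
    return sum(w * (v - a) ** 2 for v, a, w in zip(values, approx, importance))


def best_scale(values, importance, steps=200):
    """Grid-search the scale that minimizes importance-weighted squared error."""
    max_abs = max(abs(v) for v in values)
    candidates = [max_abs / 8 * (0.5 + i / steps) for i in range(steps + 1)]
    return min(
        candidates,
        key=lambda s: weighted_error(values, quantize_block(values, s), importance),
    )


# Made-up block of weights and made-up importance values (larger = matters more).
weights = [0.91, -0.12, 0.33, -0.57, 0.05, 0.78, -0.44, 0.20]
importance = [0.1, 5.0, 0.1, 0.1, 4.0, 0.1, 0.1, 3.0]
uniform = [1.0] * len(weights)

scale_uniform = best_scale(weights, uniform)      # ignores importance
scale_weighted = best_scale(weights, importance)  # uses importance

# Both results are scored with the importance weights, i.e. by what matters to the output.
err_uniform = weighted_error(weights, quantize_block(weights, scale_uniform), importance)
err_weighted = weighted_error(weights, quantize_block(weights, scale_weighted), importance)

print(f"scale without importance: {scale_uniform:.4f}, weighted error: {err_uniform:.6f}")
print(f"scale with importance:    {scale_weighted:.4f}, weighted error: {err_weighted:.6f}")
```

Both runs spend the same number of bits; the importance-aware search can only match or lower the weighted error because it optimizes the quantity being scored. Poor calibration data corresponds to poor `importance` values here, which is why a one-paragraph corpus gives little benefit.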
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T21:07:19.113351+00:00— report_created — created