Report #73682
[tooling] GGUF Q4\_K\_M quantization causes significant quality degradation on reasoning-heavy models
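Generate an importance matrix \(imatrix\) with \`llama-imatrix -m model.f16.gguf -f calibration.txt -o matrix.dat\` on representative calibration text \(a few hundred KB to a few MB is typically enough\), then quantize with \`llama-quantize --imatrix matrix.dat model.f16.gguf IQ4\_XS\` \(about 4.25 bits per weight\) or \`Q4\_K\_S\` \(roughly 4.6 bits per weight\) to recover much of the quality lost by quantizing to 4 bits without an imatrix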
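The two steps above can be sketched as a small Python driver. This is a minimal sketch, not a tested pipeline: the file names \(\`calibration.txt\`, \`matrix.dat\`, \`model.IQ4\_XS.gguf\`\) are placeholders, both binaries are assumed to be on PATH, and the actual run is left commented out so the script only prints the commands.

```python
import shlex
import subprocess  # only needed if the run line below is uncommented

def imatrix_cmd(model, calibration, out="matrix.dat"):
    # llama-imatrix: -m model, -f calibration text file, -o output matrix file
    return ["llama-imatrix", "-m", model, "-f", calibration, "-o", out]

def quantize_cmd(model, out, qtype, imatrix="matrix.dat"):
    # llama-quantize: --imatrix file, input GGUF, output GGUF, target quant type
    return ["llama-quantize", "--imatrix", imatrix, model, out, qtype]

# Placeholder paths; replace with your own model and calibration file.
steps = [
    imatrix_cmd("model.f16.gguf", "calibration.txt"),
    quantize_cmd("model.f16.gguf", "model.IQ4_XS.gguf", "IQ4_XS"),
]

for argv in steps:
    print(shlex.join(argv))
    # subprocess.run(argv, check=True)  # uncomment to actually execute each step
```

Passing the arguments as a list avoids shell quoting problems if a model path contains spaces.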
Journey Context:
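Quantizing without an importance matrix minimizes plain rounding error and treats every weight as equally important, which can hurt reasoning-heavy and long-context retrieval \('needle-in-a-haystack'\) tasks more than average perplexity suggests. \`llama-imatrix\` runs the calibration data \(ideally domain-matched to your use case\) through the model and accumulates activation statistics for each weight tensor; it does not compute a cross-entropy score per layer. \`llama-quantize\` then uses that matrix to weight the rounding error, so precision goes to the weights that affect the output most. The imatrix is not specific to IQ quants: it also improves K-quants such as Q4\_K\_S and Q4\_K\_M, while the very low-bit IQ types generally need one. IQ types like IQ4\_XS are a separate family of quant formats, not a synonym for imatrix quantization. With an imatrix, a 4-bit quant can close part of the gap to higher-bit quants, but it should not be expected to beat Q6\_K. Most users skip this because it is a two-step process and there is little guidance on choosing calibration data. Note that it does not by itself make a 70B model fit in 24GB of VRAM: at 4.25 bits per weight the weights alone take about 37GB \(70e9 × 4.25 / 8 bytes\), so partial CPU offload or a lower-bit quant is still needed.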
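To make the idea of importance weighting concrete, here is a toy illustration in pure Python. It is not llama.cpp's algorithm: the 4-bit symmetric rounding, the grid search over scales, and the synthetic activations are all assumptions chosen for brevity. The importance of each weight column is taken as the mean squared activation seen on that column, which is in the spirit of what an imatrix stores.

```python
import random

def quantize(row, scale, levels=7):
    # Symmetric round-to-nearest onto integers in [-levels, levels], then dequantize.
    return [scale * max(-levels, min(levels, round(w / scale))) for w in row]

def weighted_error(row, deq, importance):
    # Squared rounding error, weighted per column by its importance.
    return sum(i * (w - q) ** 2 for w, q, i in zip(row, deq, importance))

def best_scale(row, importance, levels=7, steps=200):
    # Grid search over scales from 0.5x to 1.5x the naive max-abs scale.
    base = max(abs(w) for w in row) / levels
    candidates = [base * (0.5 + k / steps) for k in range(steps + 1)]
    return min(candidates,
               key=lambda s: weighted_error(row, quantize(row, s, levels), importance))

random.seed(0)
row = [random.gauss(0.0, 1.0) for _ in range(32)]  # one row of weights

# Hypothetical calibration activations: the first 4 columns see much larger inputs.
acts = [[random.gauss(0.0, 3.0 if j < 4 else 0.3) for j in range(32)]
        for _ in range(64)]
importance = [sum(a[j] ** 2 for a in acts) / len(acts) for j in range(32)]
uniform = [1.0] * 32

s_plain = best_scale(row, uniform)     # scale chosen ignoring importance
s_imat = best_scale(row, importance)   # scale chosen with importance weights

err_plain = weighted_error(row, quantize(row, s_plain), importance)
err_imat = weighted_error(row, quantize(row, s_imat), importance)
print(f"importance-weighted error, plain scale:   {err_plain:.4f}")
print(f"importance-weighted error, imatrix scale: {err_imat:.4f}")
```

Because both searches run over the same candidate scales, the importance-aware choice can never do worse on the importance-weighted error; how much it helps depends on how uneven the activations are, which is why domain-matched calibration data matters.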
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T06:16:24.911870+00:00— report_created — created