Report #4709
[tooling] Quantizing 70B models to Q4_K_M produces high perplexity degradation or incoherent output compared to the original fp16 model
Generate an importance matrix (imatrix) by running `llama-imatrix` on a representative text sample (calibration sets are typically small, on the order of megabytes rather than gigabytes), then pass `--imatrix matrix.dat` to `llama-quantize`. This can noticeably improve Q4_K_M quality, in some cases approaching Q5_K_M fidelity at Q4_K_M file sizes.
Journey Context:
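Standard quantization treats all weights equally when it minimizes rounding error, but weights differ in how much they affect the model's output. The importance matrix records activation statistics for each tensor, which indicate the channels within that tensor that are most sensitive to quantization error. With calibration on representative data (ideally similar to the target domain), the quantizer weights its rounding decisions toward those channels, reducing perplexity loss at the same file size.

By this report's estimate, Q4_K_M on a 70B model might lose around 15% accuracy without an imatrix and less than 3% with one. These figures are not backed by a benchmark in this report and will vary with the model and the metric. In the worst case, this is the difference between a usable local model and gibberish.

The workflow is:
1) Generate the imatrix with `llama-imatrix` on the calibration text (runs on CPU; GPU offload is optional).
2) Quantize with `llama-quantize --imatrix matrix.dat model.gguf Q4_K_M`.

This matters most when running 70B+ models on constrained hardware such as a single 24GB GPU, where a 70B Q4_K_M file (roughly 40GB) does not fit in VRAM and still needs partial CPU offload.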
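The two steps above can be sketched as a small shell script. This is a minimal sketch under stated assumptions, not a verified recipe: the file names `model-f16.gguf`, `calibration.txt`, and `model-Q4_K_M.gguf` are placeholders, the `run` helper and the `DRY_RUN` switch are hypothetical additions for illustration, and `llama-imatrix` and `llama-quantize` are assumed to be on `PATH`. By default the script only prints the commands so they can be checked before anything is executed.

```shell
#!/bin/sh
# Sketch of the two-step imatrix workflow. All file names are placeholders.
set -e

MODEL_F16="${MODEL_F16:-model-f16.gguf}"      # hypothetical: unquantized source model
CALIB_TEXT="${CALIB_TEXT:-calibration.txt}"   # hypothetical: representative text sample
IMATRIX="${IMATRIX:-matrix.dat}"              # importance matrix file name used in the report
MODEL_OUT="${MODEL_OUT:-model-Q4_K_M.gguf}"   # hypothetical: quantized output file
DRY_RUN="${DRY_RUN:-1}"                       # 1 = print commands only, 0 = execute them

# Print the command line; execute it only when DRY_RUN=0.
run() {
  printf '+ %s\n' "$*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

# Step 1: collect activation statistics over the calibration text.
imatrix_step() {
  run llama-imatrix -m "$MODEL_F16" -f "$CALIB_TEXT" -o "$IMATRIX"
}

# Step 2: quantize to Q4_K_M, guided by the importance matrix.
quantize_step() {
  run llama-quantize --imatrix "$IMATRIX" "$MODEL_F16" "$MODEL_OUT" Q4_K_M
}

imatrix_step
quantize_step
```

Setting `DRY_RUN=0` executes the commands instead of only printing them. Printing first is a deliberate choice here, since the imatrix pass over a 70B model can take a long time and the file paths are worth checking beforehand.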
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T19:56:41.634607+00:00— report_created — created