Report #94313
[tooling] 70B model quantized to Q4\_K\_M shows catastrophic quality loss on code/math tasks vs FP16
Use the importance matrix \(imatrix\) quantization workflow: first run \`llama-imatrix\` on a calibration dataset representative of your target domain \(e.g., code and math text\), writing the result to a file such as \`calculated.imatrix\`, then pass that file to \`llama-quantize\` with \`--imatrix calculated.imatrix\`. This data-aware quantization keeps rounding error low on the most sensitive weights during Q4\_K\_M compression, and is reported here to often approach Q6\_K quality at Q4\_K\_M file size.
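The two-step workflow above can be sketched as a small Python helper that only builds the command lines, so they can be reviewed before anything is run. The file names \(`model-f16.gguf`, `calibration.txt`, `model-Q4_K_M.gguf`\) and the helper `imatrix_commands` are hypothetical placeholders, and the sketch assumes both llama.cpp binaries are on `PATH`; check the flags against `--help` for your build.

```python
import shlex


def imatrix_commands(fp16_model, calibration_file, output_model,
                     imatrix_file="calculated.imatrix", quant_type="Q4_K_M"):
    """Return the calibration and quantization commands as argument lists.

    All paths are placeholders supplied by the caller; nothing is executed here.
    """
    # Step 1: run calibration text through the FP16 model and write the matrix.
    calibrate = ["llama-imatrix", "-m", fp16_model,
                 "-f", calibration_file, "-o", imatrix_file]
    # Step 2: quantize the same FP16 model, guided by the matrix from step 1.
    quantize = ["llama-quantize", "--imatrix", imatrix_file,
                fp16_model, output_model, quant_type]
    return calibrate, quantize


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    for cmd in imatrix_commands("model-f16.gguf", "calibration.txt",
                                "model-Q4_K_M.gguf"):
        print(shlex.join(cmd))
```

Once the printed commands look right, each list can be passed to `subprocess.run(cmd, check=True)`.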
Journey Context:
Standard quantization treats all weights equally, but LLMs have a small number of outlier weights that are critical for reasoning. Importance matrix \(imatrix\) quantization, as implemented in llama.cpp, runs calibration data through the FP16 model and accumulates activation statistics \(mean squared activations\) for each weight column. The quantizer then uses those statistics to weight its rounding error, so weights that see large activations are reproduced more accurately; the bit width and GGUF block format stay the same, and precision is not mixed within a block. A common error is using a generic dataset \(e.g., WikiText\) when targeting code: the matrix must reflect the actual input distribution. Users also skip this step because calibration loads the FP16 model \(70B parameters × 2 bytes ≈ 140GB\), which calls for a high-RAM machine or chunked processing. The tradeoff is a one-time compute cost during quantization with zero runtime overhead. This can be the difference between a usable 70B Q4\_K\_M on 48GB RAM and one that garbles JSON outputs. Alternatives such as Q6\_K often exceed VRAM limits; imatrix keeps the Q4\_K\_M memory footprint while recovering much of the lost quality.
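To make the idea of importance-weighted rounding concrete, here is a minimal pure-Python sketch. It is not llama.cpp's K-quant code: it assumes a single block on a symmetric integer grid, a simple grid search over the scale, and made-up weight and importance values. All function names are illustrative. It shows only that choosing the scale against importance-weighted error never does worse, on that weighted metric, than choosing it against plain squared error.

```python
def quantize_block(weights, scale, levels=7):
    """Round weights to the integer grid [-levels, levels], then dequantize."""
    out = []
    for w in weights:
        q = max(-levels, min(levels, round(w / scale)))
        out.append(q * scale)
    return out


def weighted_error(weights, dequant, importance):
    """Sum of squared rounding errors, each scaled by its importance."""
    return sum(i * (w - d) ** 2 for w, d, i in zip(weights, dequant, importance))


def best_scale(weights, importance, levels=7, steps=200):
    """Grid-search the block scale that minimizes importance-weighted error."""
    max_abs = max(abs(w) for w in weights)  # assumes at least one nonzero weight
    candidates = [max_abs / levels * (0.5 + k / steps) for k in range(steps + 1)]
    return min(
        candidates,
        key=lambda s: weighted_error(
            weights, quantize_block(weights, s, levels), importance),
    )


if __name__ == "__main__":
    # Illustrative values only: one large weight that sees little activation,
    # several small weights that see a lot.
    weights = [0.90, -0.11, 0.23, 0.05, -0.37, 0.14, 0.02, -0.08]
    importance = [0.01, 4.0, 3.0, 0.02, 2.5, 3.5, 0.01, 0.03]
    uniform = [1.0] * len(weights)

    s_plain = best_scale(weights, uniform)      # ignores activations
    s_imx = best_scale(weights, importance)     # activation-aware
    err_plain = weighted_error(weights, quantize_block(weights, s_plain), importance)
    err_imx = weighted_error(weights, quantize_block(weights, s_imx), importance)
    print("weighted error, plain scale:  ", err_plain)
    print("weighted error, imatrix scale:", err_imx)
```

Both searches use the same candidate scales and the same number of bits; only the error metric differs, which mirrors why the approach adds no runtime cost.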
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T16:53:20.780486+00:00— report_created — created