Report #16318
[tooling] GGUF Q4\_K\_M quantization causes unacceptable quality loss on complex reasoning tasks
Calibrate an importance matrix \(imatrix\) before quantizing: run ./llama-imatrix -m unquantized.gguf -f calibration\_text.txt --output-file imatrix.dat using 100MB-1GB of representative calibration text \(e.g., C4, Wikitext, or a domain-specific corpus\). Then quantize the same unquantized source with the matrix: ./llama-quantize --imatrix imatrix.dat unquantized.gguf model-Q4\_K\_M.gguf Q4\_K\_M. The quantizer then weights rounding error within each tensor by how strongly the calibration data exercises each weight, so the most influential weights are preserved more accurately. This reportedly reduces perplexity degradation by 15-30% compared to quantizing without an imatrix, at essentially identical file size.
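The two steps above can be wrapped in a small script. This is a minimal sketch, assuming the llama.cpp binaries sit in the current directory; the environment variable names \(LLAMA\_BIN, DRY\_RUN, and so on\), the output file name, and the dry-run default are illustrative choices, not part of the report. By default it only prints the commands so they can be checked before anything runs.

```shell
#!/bin/sh
# Sketch: calibrate an importance matrix, then quantize with it.
# Every path below is a placeholder (assumption); override via environment.
set -eu

LLAMA_BIN="${LLAMA_BIN:-.}"                       # directory with llama.cpp binaries
SRC_MODEL="${SRC_MODEL:-unquantized.gguf}"        # FP16/BF16 source, not a quantized file
CALIB_TEXT="${CALIB_TEXT:-calibration_text.txt}"  # representative calibration text
IMATRIX="${IMATRIX:-imatrix.dat}"
QUANT_TYPE="${QUANT_TYPE:-Q4_K_M}"
OUT_MODEL="${OUT_MODEL:-model-${QUANT_TYPE}.gguf}"
DRY_RUN="${DRY_RUN:-1}"                           # 1 = print only, 0 = actually execute

run() {
  # Print the command; execute it only when DRY_RUN is not 1.
  printf '+ %s\n' "$*"
  if [ "$DRY_RUN" != "1" ]; then
    "$@"
  fi
}

quantize_with_imatrix() {
  # Step 1: collect importance statistics from the unquantized model.
  run "$LLAMA_BIN/llama-imatrix" -m "$SRC_MODEL" -f "$CALIB_TEXT" --output-file "$IMATRIX"
  # Step 2: quantize the same unquantized model using those statistics.
  run "$LLAMA_BIN/llama-quantize" --imatrix "$IMATRIX" "$SRC_MODEL" "$OUT_MODEL" "$QUANT_TYPE"
}

quantize_with_imatrix
```

Run it once as-is to review the printed commands, then again with DRY\_RUN=0 to execute them.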
Journey Context:
Users typically accept a standard quantization preset \(Q4\_K\_M, Q5\_K\_S\) as a fixed quality/size tradeoff, unaware that not all weights are equally sensitive to precision loss. The imatrix records importance scores from calibration data, estimating how strongly each weight tensor's inputs are activated. The quantizer uses these scores to keep rounding error low on the weights that matter most instead of treating every weight equally; the bit width of each tensor is still mainly determined by the chosen preset, which is why file size barely changes. Common failure modes: using calibration data that doesn't match the target domain \(e.g., using Python code to calibrate a medical model\), or computing the imatrix from an already-quantized model \(the source must be FP16/BF16\). The downside is a one-time compute cost for calibration \(minutes to hours depending on dataset size\), but this is amortized over every later use of the quantized model.
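One way to catch a domain mismatch in the calibration data is to compare perplexity on held-out, in-domain text across the variants. The following is a sketch only, assuming llama.cpp's llama-perplexity tool is built alongside the other binaries; every file name is a hypothetical placeholder, and the script prints the commands by default rather than running them.

```shell
#!/bin/sh
# Sketch: evaluate several model files on the same held-out text.
# File names are placeholders (assumptions), not names from the report.
set -eu

LLAMA_BIN="${LLAMA_BIN:-.}"                        # directory with llama.cpp binaries
EVAL_TEXT="${EVAL_TEXT:-heldout_domain_text.txt}"  # should not overlap the calibration text
DRY_RUN="${DRY_RUN:-1}"                            # 1 = print only, 0 = actually execute

run() {
  # Print the command; execute it only when DRY_RUN is not 1.
  printf '+ %s\n' "$*"
  if [ "$DRY_RUN" != "1" ]; then
    "$@"
  fi
}

compare_perplexity() {
  # Run the perplexity tool once per model passed as an argument.
  for model in "$@"; do
    run "$LLAMA_BIN/llama-perplexity" -m "$model" -f "$EVAL_TEXT"
  done
}

# Unquantized baseline, quantized without imatrix, quantized with imatrix.
compare_perplexity unquantized.gguf model-Q4_K_M-plain.gguf model-Q4_K_M-imatrix.gguf
```

If the imatrix variant does not land closer to the unquantized baseline than the plain one on in-domain text, the calibration corpus is a likely suspect.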
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T02:22:24.927327+00:00— report_created — created