Report #8563
[tooling] GGUF quantized models showing high perplexity degradation with Q4\_K\_M compared to original FP16
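Generate an importance matrix \(imatrix\) from calibration data with llama-imatrix before quantizing, then pass --imatrix imatrix.dat to llama-quantize. This can bring Q4\_K\_M quality close to that of Q5\_K\_M, which this report puts at a 20-25% VRAM saving with little measurable accuracy loss relative to Q5\_K\_M. Some degradation relative to FP16 remains, so compare perplexity before and after.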
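A minimal sketch of the two-step workflow described above, written as a Python helper that only assembles the command lines. The file names \(model-f16.gguf, calibration.txt, model-Q4\_K\_M.gguf\) and the helper names are hypothetical placeholders, and the sketch assumes the llama.cpp binaries are on PATH; check the flags against your llama.cpp build before running anything.

```python
import subprocess


def build_imatrix_workflow(fp16_model, calibration_text, imatrix_path,
                           quantized_model, quant_type="Q4_K_M"):
    """Return the two llama.cpp command lines as argv lists (nothing is executed)."""
    # Step 1: run calibration text through the FP16 model and write the imatrix file.
    imatrix_cmd = ["llama-imatrix", "-m", fp16_model,
                   "-f", calibration_text, "-o", imatrix_path]
    # Step 2: quantize the FP16 model, guided by the imatrix from step 1.
    quantize_cmd = ["llama-quantize", "--imatrix", imatrix_path,
                    fp16_model, quantized_model, quant_type]
    return [imatrix_cmd, quantize_cmd]


def run_workflow(commands):
    """Execute each command in order and stop on the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


# Hypothetical file names, for illustration only.
commands = build_imatrix_workflow("model-f16.gguf", "calibration.txt",
                                  "imatrix.dat", "model-Q4_K_M.gguf")
for cmd in commands:
    print(" ".join(cmd))
```

Calling run\_workflow\(commands\) would execute both steps. Measuring perplexity on held-out text for the quantized model with and without the imatrix is a reasonable way to confirm the gain on your own model.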
Journey Context:
Transformer tensors vary in their sensitivity to quantization error; attention layers and certain feed-forward weights are more 'important' to model quality. Standard k-quantization without an imatrix already gives a few tensors a higher-precision type in mixes such as Q4\_K\_M, but within each tensor it treats every weight's rounding error as equally costly. An importance matrix is computed by running calibration data \(e.g., ~100MB of text from the model's domain\) through the FP16 model and recording activation statistics that indicate which weights contribute most to the output. When quantizing with this matrix, the quantizer keeps the same bit budget but chooses quantized values that minimize importance-weighted error, so precision is spent on the sensitive weights rather than spread evenly. Users often skip this step because it requires the llama-imatrix tool and an extra pass over the data, but for production local deployments, getting equal quality at roughly 20% less VRAM is a substantial gain over quantizing without an imatrix.
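The idea of importance-weighted quantization can be illustrated with a toy example. This is a simplified sketch, not llama.cpp's actual algorithm: it assumes importance is the mean squared activation per input column, uses a single scale per block with integer steps in \[-7, 7\], and picks the scale by brute-force search. All names and numbers are made up for illustration.

```python
def column_importance(activations):
    """Mean squared activation per input column (a simplified stand-in for imatrix data)."""
    n = len(activations)
    return [sum(row[j] ** 2 for row in activations) / n
            for j in range(len(activations[0]))]


def dequantize(weights, scale, levels=7):
    """Round each weight to an integer step in [-levels, levels], then map back to floats."""
    return [max(-levels, min(levels, round(w / scale))) * scale for w in weights]


def weighted_error(weights, scale, importance, levels=7):
    """Sum of squared rounding errors, each weighted by its column's importance."""
    restored = dequantize(weights, scale, levels)
    return sum(imp * (w - q) ** 2
               for w, q, imp in zip(weights, restored, importance))


def best_scale(weights, importance, levels=7, steps=100):
    """Brute-force search for the scale that minimizes the weighted error."""
    base = max(abs(w) for w in weights) / levels
    candidates = [base * (0.5 + k / steps) for k in range(steps + 1)]
    return min(candidates,
               key=lambda s: weighted_error(weights, s, importance, levels))


# Toy block of weights and two toy calibration activation rows (made-up numbers).
weights = [0.11, -0.42, 0.95, 0.30, -0.07, 0.58]
activations = [[0.1, 2.0, 0.1, 0.2, 0.1, 1.5],
               [0.2, 1.8, 0.1, 0.1, 0.3, 1.2]]
importance = column_importance(activations)
uniform = [1.0] * len(weights)

plain_scale = best_scale(weights, uniform)        # every weight counts equally
imatrix_scale = best_scale(weights, importance)   # sensitive columns count more

plain_err = weighted_error(weights, plain_scale, importance)
imatrix_err = weighted_error(weights, imatrix_scale, importance)
print("importance-weighted error, plain scale:  ", plain_err)
print("importance-weighted error, imatrix scale:", imatrix_err)
```

Both variants use the same number of levels per weight. The only difference is which errors the scale search is asked to minimize, which is why the importance-aware choice can never score worse on the importance-weighted error in this setup.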
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T05:47:53.031252+00:00— report_created — created