Report #4387
[tooling] Quantized GGUF model poor quality at Q4_K_M vs larger files
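Generate an importance matrix (imatrix) by running ./llama-imatrix on representative calibration text (this report suggests 1-2GB; much smaller samples are commonly used and are far cheaper to process), then quantize with --imatrix imatrix.dat so that the weights with the most influence on the model's activations are rounded more accurately. This can bring Q4_K_M quality closer to that of Q5_K_M without the size penalty.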
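The two steps above can be sketched as a small Python helper that assembles the command lines. This is a minimal sketch, not a verified recipe: the file names are placeholders, the quantizer binary name ./llama-quantize is an assumption (the report only names ./llama-imatrix), and the -m / -f / -o options should be checked against the --help output of your llama.cpp build before running.

```python
import shlex


def build_commands(model_f16, calibration_txt, output_gguf,
                   imatrix_path="imatrix.dat", quant_type="Q4_K_M"):
    """Return (imatrix_cmd, quantize_cmd) as argument lists.

    All paths are hypothetical placeholders. The binary name
    ./llama-quantize is an assumption; the report only names
    ./llama-imatrix.
    """
    # Step 1: run the calibration text through the model and
    # write the collected importance statistics to imatrix_path.
    imatrix_cmd = [
        "./llama-imatrix",
        "-m", model_f16,
        "-f", calibration_txt,
        "-o", imatrix_path,
    ]
    # Step 2: quantize, passing the importance matrix to the quantizer.
    quantize_cmd = [
        "./llama-quantize",
        "--imatrix", imatrix_path,
        model_f16,
        output_gguf,
        quant_type,
    ]
    return imatrix_cmd, quantize_cmd


imatrix_cmd, quantize_cmd = build_commands(
    "model-f16.gguf", "calibration.txt", "model-Q4_K_M.gguf"
)

# Dry run: print the commands for review instead of executing them.
for cmd in (imatrix_cmd, quantize_cmd):
    print(shlex.join(cmd))
```

Printing rather than executing keeps the sketch in line with the warning below: review the commands first, then run them by hand or pass each list to subprocess.run(cmd, check=True).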
Journey Context:
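Quantizing without an imatrix minimizes rounding error uniformly across the weights of each tensor, but not all weights matter equally: some tensors (e.g., initial embedding, final head, certain MLPs) are 'sensitive' and suffer disproportionately from rounding. The imatrix tool runs the calibration text through the model and records activation statistics for each tensor; the quantizer then uses these as importance weights, so rounding error is pushed toward the weights that influence the output least. The bit width of each tensor is still set by the chosen quantization type (Q4_K_M already stores some tensors at higher precision); the imatrix changes how values are rounded within that budget rather than reallocating bits between layers. Users often skip this step because of the compute cost (~10-30 min on CPU), resulting in 'quantization shock' where Q4 performs worse than expected.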
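To illustrate the idea of importance-weighted rounding, here is a toy sketch in plain Python. It is not llama.cpp's actual quantization code: the block of weights, the importance values, the symmetric 4-bit grid, and the brute-force scale search are all assumptions chosen for clarity. It picks a single scale for one block twice, once treating every weight equally and once weighting the squared error by importance, and compares the resulting importance-weighted error.

```python
def quantize_block(weights, scale, qmax=7):
    """Round each weight to the nearest multiple of scale,
    clamping the integer code to [-qmax - 1, qmax] (4-bit signed)."""
    out = []
    for w in weights:
        q = max(-qmax - 1, min(qmax, round(w / scale)))
        out.append(q * scale)
    return out


def weighted_error(weights, dequantized, importance):
    """Sum of squared rounding errors, each scaled by its importance."""
    return sum(i * (w - d) ** 2
               for w, d, i in zip(weights, dequantized, importance))


def best_scale(weights, importance, qmax=7, steps=200):
    """Brute-force search over candidate scales; return the one with
    the lowest importance-weighted error."""
    naive = max(abs(w) for w in weights) / qmax
    best_err, best = None, None
    for k in range(1, steps + 1):
        scale = naive * (0.5 + k / steps)  # roughly 0.5x to 1.5x naive
        err = weighted_error(
            weights, quantize_block(weights, scale, qmax), importance
        )
        if best_err is None or err < best_err:
            best_err, best = err, scale
    return best


# Hypothetical block of weights and per-weight importance values.
weights = [0.11, -0.52, 0.97, 0.03, -0.28, 0.64, -0.05, 0.40]
importance = [9.0, 0.1, 0.1, 0.1, 6.0, 0.1, 0.1, 0.1]
uniform = [1.0] * len(weights)

scale_plain = best_scale(weights, uniform)      # no imatrix
scale_imx = best_scale(weights, importance)     # with importance weights

err_plain = weighted_error(
    weights, quantize_block(weights, scale_plain), importance
)
err_imx = weighted_error(
    weights, quantize_block(weights, scale_imx), importance
)

print("importance-weighted error, uniform scale choice:", err_plain)
print("importance-weighted error, weighted scale choice:", err_imx)
```

Both searches cover the same candidate scales, so the importance-weighted choice can never do worse on the importance-weighted error than the uniform choice. Both runs use the same 4-bit grid, which mirrors the point above: the bit budget is unchanged and only the rounding decisions differ.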
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T19:20:08.933547+00:00— report_created — created