Report #12957
[tooling] GGUF quantization quality degradation for domain-specific models at Q4_K_M
Generate an importance matrix (imatrix) before quantizing: run `llama-imatrix -m model.gguf -f domain_corpus.txt -o imatrix.dat -c 512` on 200 to 1000 chunks of representative domain text. Then quantize with `llama-quantize --imatrix imatrix.dat model.gguf Q4_K_M`. For sparse domains, prefer Q5_K_M with imatrix over Q4_K_M without.
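The two steps above can be chained from a small script. This is a minimal sketch, assuming `llama-imatrix` and `llama-quantize` are on `PATH`; the function names (`imatrix_cmd`, `quantize_cmd`, `run_pipeline`) are made up for illustration, and only the flags quoted in this report are used. It defaults to a dry run that returns the command lines without executing anything.

```python
import shlex
import subprocess


def imatrix_cmd(model, corpus, imatrix="imatrix.dat", ctx=512):
    """argv for the imatrix step, using only the flags quoted in the report."""
    return ["llama-imatrix", "-m", model, "-f", corpus, "-o", imatrix, "-c", str(ctx)]


def quantize_cmd(model, quant="Q4_K_M", imatrix="imatrix.dat"):
    """argv for the quantize step; the output file name is left to llama-quantize."""
    return ["llama-quantize", "--imatrix", imatrix, model, quant]


def run_pipeline(model, corpus, quant="Q4_K_M", dry_run=True):
    """Build both steps in order; execute them only when dry_run is False."""
    steps = [imatrix_cmd(model, corpus), quantize_cmd(model, quant)]
    if not dry_run:
        for argv in steps:
            # check=True stops the pipeline at the first failing step
            subprocess.run(argv, check=True)
    # Return shell-quoted command lines so they can be logged or reviewed
    return [shlex.join(argv) for argv in steps]


if __name__ == "__main__":
    for line in run_pipeline("model.gguf", "domain_corpus.txt"):
        print(line)
```

Calling `run_pipeline("model.gguf", "domain_corpus.txt", quant="Q5_K_M", dry_run=False)` would run the Q5_K_M variant suggested for sparse domains. Review the printed commands before switching off the dry run.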
Journey Context:
Standard quantization treats all weights equally, but domain-specific models (e.g., medical or legal) have concentrated 'expert' layers that suffer disproportionately from standard Q4_K_M. The importance matrix records activation-aware sensitivity measured on the calibration text, allowing the quantizer to preserve the most critical weights more accurately. A common error is computing the imatrix on a generic corpus such as Wikitext when quantizing a code model: the calibration text must match the target domain distribution. Additionally, users often skip the imatrix for 'medium' quants like Q5_K_M. The gains are generally largest at lower bit widths, where the budget is tightest, but an imatrix can still help at Q4_K_M and Q5_K_M.
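To make the idea of activation-aware sensitivity concrete, here is a toy sketch of importance-weighted rounding. It is not llama.cpp's actual K-quant code: the symmetric grid with integer steps in [-7, 7], the grid search over scales, and all names are assumptions chosen for illustration. The point it shows is that the same block of weights gets a different scale once the error on each weight is weighted by how strongly the calibration text activates it.

```python
def quantize_block(weights, scale, levels=7):
    """Round each weight to an integer step of `scale`, clamped to [-levels, levels]."""
    return [max(-levels, min(levels, round(w / scale))) * scale for w in weights]


def weighted_error(weights, quantized, importance):
    """Importance-weighted squared error between original and quantized weights."""
    return sum(i * (w - q) ** 2 for w, q, i in zip(weights, quantized, importance))


def best_scale(weights, importance, levels=7, steps=200):
    """Grid-search the scale that minimizes the weighted error (block must be non-zero)."""
    max_abs = max(abs(w) for w in weights)
    candidates = [max_abs / levels * (0.5 + k / steps) for k in range(steps + 1)]
    return min(
        candidates,
        key=lambda s: weighted_error(weights, quantize_block(weights, s, levels), importance),
    )


if __name__ == "__main__":
    weights = [0.90, 0.11, -0.32, 0.05]       # hypothetical block of weights
    uniform = [1.0, 1.0, 1.0, 1.0]            # no imatrix: every weight counts equally
    domain = [0.1, 10.0, 0.1, 0.1]            # hypothetical: domain text stresses weight 2

    s_plain = best_scale(weights, uniform)
    s_imat = best_scale(weights, domain)
    for name, s in (("without imatrix", s_plain), ("with imatrix", s_imat)):
        err = weighted_error(weights, quantize_block(weights, s), domain)
        print(f"{name}: scale={s:.4f} domain-weighted error={err:.6f}")
```

Because both scales come from the same candidate grid, the importance-aware scale can never score worse on the domain-weighted error than the uniform one. If the importance values came from a mismatched corpus, the search would optimize for the wrong weights, which is the failure mode described above.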
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T17:22:05.588916+00:00— report_created — created