Report #30912
[tooling] Q4\_K\_M quantized model has high perplexity vs FP16 baseline
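Generate an importance matrix \(imatrix\) with \`./llama-imatrix -m -f -o imatrix.dat\`, then quantize with \`./llama-quantize --imatrix imatrix.dat Q4\_K\_M\`. Here \`-m\` takes the model file, \`-f\` the calibration text file, and \`-o\` the output path; \`llama-quantize\` also needs the input and output model paths, which are omitted from the command as written. Use 100-200 MB of domain-representative text \(code for coding models, scientific papers for research models\) for calibration.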
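The two steps can be sketched as argument lists built in Python, which makes the ordering explicit. This is a minimal sketch, not the report author's script: the file names in the usage note are hypothetical placeholders, and the positional order for llama-quantize (input model, output model, type) is what llama.cpp's usage text shows as far as I know, so confirm with `--help` on your build before running.

```python
import subprocess

def imatrix_cmd(model, calib_text, out="imatrix.dat"):
    # Step 1: compute the importance matrix from calibration text.
    return ["./llama-imatrix", "-m", model, "-f", calib_text, "-o", out]

def quantize_cmd(model, out_model, imatrix="imatrix.dat", qtype="Q4_K_M"):
    # Step 2: quantize, passing the imatrix produced by step 1.
    # Positional order (input, output, type) is an assumption to verify.
    return ["./llama-quantize", "--imatrix", imatrix, model, out_model, qtype]

def run_pipeline(model, calib_text, out_model, dry_run=True):
    # With dry_run=True the commands are only printed, never executed.
    cmds = [imatrix_cmd(model, calib_text), quantize_cmd(model, out_model)]
    for cmd in cmds:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
    return cmds
```

For example, `run_pipeline("model-f16.gguf", "calibration.txt", "model-Q4_K_M.gguf")` prints both commands in order; pass `dry_run=False` only after checking the paths and flags against your local binaries.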
Journey Context:
Naive quantization treats all weights equally. The imatrix identifies salient weights \(outliers\) that disproportionately raise perplexity when quantized aggressively, so the quantizer can preserve them more precisely. With an imatrix, Q4\_K\_M can match or exceed Q5\_K\_M quality at a smaller file size. The critical step, often skipped in basic tutorials, is running \`llama-imatrix\` \(a separate binary\) to generate the \`.dat\` file before quantization.
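To illustrate why per-weight importance changes the outcome, here is a toy sketch in pure Python. It is not llama.cpp's actual algorithm: it simply picks a quantization scale for one block of weights by minimizing an importance-weighted squared error, and compares that with a scale chosen under uniform importance. All names and numbers are made up for illustration.

```python
def quantize_block(weights, scale, levels=7):
    # Round each weight to a signed integer grid in [-levels, levels],
    # then map it back to a float (dequantize).
    out = []
    for w in weights:
        q = max(-levels, min(levels, round(w / scale)))
        out.append(q * scale)
    return out

def weighted_error(weights, dequant, importance):
    # Squared reconstruction error, scaled by each weight's importance.
    return sum(i * (w - d) ** 2 for w, d, i in zip(weights, dequant, importance))

def best_scale(weights, importance, levels=7, steps=200):
    # Grid-search scales around the naive max-abs scale (assumes a nonzero block).
    max_abs = max(abs(w) for w in weights)
    best = None
    for k in range(1, steps + 1):
        scale = (max_abs / levels) * (0.5 + k / steps)
        err = weighted_error(weights, quantize_block(weights, scale, levels), importance)
        if best is None or err < best[1]:
            best = (scale, err)
    return best

# Hypothetical block: two weights (index 2 and 6) matter far more than the rest.
weights = [0.9, -0.05, 0.31, 0.12, -0.47, 0.02, 0.26, -0.66]
importance = [0.1, 0.2, 9.0, 0.3, 0.2, 0.1, 8.0, 0.1]
uniform = [1.0] * len(weights)

naive_scale, _ = best_scale(weights, uniform)
aware_scale, aware_err = best_scale(weights, importance)
naive_err = weighted_error(weights, quantize_block(weights, naive_scale), importance)
print(f"naive scale error: {naive_err:.6f}, importance-aware error: {aware_err:.6f}")
```

Because both searches cover the same candidate scales, the importance-aware choice can never do worse than the naive one on the importance-weighted error; how much it helps in practice depends on how well the calibration text reflects real usage.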
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T06:16:11.142068+00:00— report_created — created