Report #92932
[tooling] Q4_K_M quantized models show high perplexity degradation on code/math compared to Q5_K_S
Generate an importance matrix (imatrix) with `./llama-imatrix` on a representative dataset (e.g., Python code from The Stack) before quantizing. Then run `llama-quantize --imatrix imatrix.dat model.gguf Q4_K_M` to keep the Q4_K_M file size while recovering quality closer to Q5_K_M on code tasks.
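The two steps above can be scripted so they always run in the same order. The following is a minimal sketch, not an official wrapper: the helper name `imatrix_commands`, the file names `calibration.txt` and `imatrix.dat`, and the `bin_dir` parameter are hypothetical, and it assumes the `-m`, `-f`, and `-o` options of `llama-imatrix` (model, calibration text, output file) as found in recent llama.cpp builds. Check `--help` on your build before relying on it.

```python
import shlex


def imatrix_commands(model, calibration, imatrix="imatrix.dat",
                     quant="Q4_K_M", bin_dir="."):
    """Build the two command lines as argument lists. Nothing is executed."""
    # Step 1: collect activation statistics over the calibration text.
    # Assumes llama-imatrix accepts -m (model), -f (text file), -o (output).
    generate = [f"{bin_dir}/llama-imatrix",
                "-m", model, "-f", calibration, "-o", imatrix]
    # Step 2: quantize with the importance matrix, as in the report.
    quantize = [f"{bin_dir}/llama-quantize",
                "--imatrix", imatrix, model, quant]
    return generate, quantize


if __name__ == "__main__":
    for cmd in imatrix_commands("model.gguf", "calibration.txt"):
        # Print first so the commands can be reviewed before running them.
        print(shlex.join(cmd))
        # To execute: import subprocess; subprocess.run(cmd, check=True)
```

Printing the commands instead of running them fits the unverified nature of this workaround: review the output, then run it by hand or enable the `subprocess.run` line.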
Journey Context:
Without an importance matrix, the quantizer treats every weight in a tensor as equally important when minimizing rounding error, but sensitivity varies across layers and across individual weights. Code and math depend on high precision in specific feed-forward weights, which default Q4_K_M degrades. The imatrix records activation statistics over the calibration data, so the quantizer can weight its rounding error toward the 'important' weights and preserve them more accurately; the bit width, and therefore the file size, stays the same. Most tutorials skip this step because it requires ~1 hour of preprocessing and a representative dataset, but for production code models it reduces perplexity by 15-20% compared to default quants. The alternatives are using Q5_K_M (larger, slower) or accepting the quality loss. The imatrix file is reusable across different quant levels (Q4_K_M, Q3_K_L) for the same base model.
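To make the idea of importance-weighted quantization concrete, here is a toy sketch. It is not llama.cpp's actual algorithm: the weights, the importance values (standing in for averaged squared activations), the single-scale 4-bit grid, and the brute-force scale search are all simplifying assumptions chosen for illustration.

```python
def quantize(weights, scale, qmin=-8, qmax=7):
    """Round each weight to a signed 4-bit grid and map it back to floats."""
    return [max(qmin, min(qmax, round(w / scale))) * scale for w in weights]


def weighted_error(weights, importance, scale):
    """Sum of squared rounding errors, each multiplied by its importance."""
    approx = quantize(weights, scale)
    return sum(i * (w - a) ** 2 for w, a, i in zip(weights, approx, importance))


def best_scale(weights, importance, steps=200):
    """Brute-force search for the scale with the lowest weighted error."""
    max_abs = max(abs(w) for w in weights)
    # Candidate scales from 0.5x to 1.5x of the naive max_abs / 8 choice.
    candidates = [max_abs / 8 * (0.5 + k / steps) for k in range(steps + 1)]
    return min(candidates, key=lambda s: weighted_error(weights, importance, s))


# Made-up block of weights and made-up importance values.
weights = [0.9, -0.05, 0.11, 0.32, -0.27, 0.04, -0.61, 0.18]
importance = [0.01, 4.0, 3.5, 0.2, 0.1, 5.0, 0.02, 0.3]
uniform = [1.0] * len(weights)

plain_scale = best_scale(weights, uniform)      # no importance information
aware_scale = best_scale(weights, importance)   # importance-weighted

# Score both choices by the error that matters on the calibration data.
err_plain = weighted_error(weights, importance, plain_scale)
err_aware = weighted_error(weights, importance, aware_scale)
print(f"plain scale {plain_scale:.4f} -> weighted error {err_plain:.6f}")
print(f"aware scale {aware_scale:.4f} -> weighted error {err_aware:.6f}")
```

Both variants store the same number of bits per weight. Only the choice of scale changes, which is why an imatrix quant keeps the Q4_K_M file size.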
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T14:34:29.588905+00:00— report_created — created