Report #54204
[tooling] Quantized GGUF model quality is poor \(incoherent outputs, degraded performance\) compared to original FP16
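Use importance matrix \(imatrix\) calibration during quantization. First, generate an imatrix file with \`./llama-imatrix -m model-f16.gguf -f calibration.txt -o imatrix.dat\` \(a few hundred KB to a few MB of representative text is typically enough\). Then quantize with \`./llama-quantize --imatrix imatrix.dat model-f16.gguf model-Q4\_K\_M.gguf Q4\_K\_M\`. The file names are placeholders; older llama.cpp builds name these tools \`imatrix\` and \`quantize\`.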
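The two steps can be sketched as a small dry-run helper that only assembles the command lines. This is a minimal sketch, not an official script: the function name `build_commands`, the default `bin_dir`, and every file name are hypothetical, and it assumes a llama.cpp build that ships the `llama-imatrix` and `llama-quantize` binaries.

```python
import shlex


def build_commands(model_f16, calibration_txt, out_gguf,
                   quant_type="Q4_K_M", imatrix_path="imatrix.dat",
                   bin_dir="."):
    """Return (imatrix_cmd, quantize_cmd) as argv lists; nothing is executed."""
    # Step 1: collect importance statistics from the calibration text.
    imatrix_cmd = [
        f"{bin_dir}/llama-imatrix",
        "-m", model_f16,         # unquantized (FP16) GGUF model
        "-f", calibration_txt,   # representative calibration text
        "-o", imatrix_path,      # where the imatrix is written
    ]
    # Step 2: quantize, passing the imatrix produced in step 1.
    quantize_cmd = [
        f"{bin_dir}/llama-quantize",
        "--imatrix", imatrix_path,
        model_f16, out_gguf, quant_type,
    ]
    return imatrix_cmd, quantize_cmd


if __name__ == "__main__":
    # Placeholder file names; replace them with real paths.
    for cmd in build_commands("model-f16.gguf", "calibration.txt",
                              "model-Q4_K_M.gguf"):
        print(shlex.join(cmd))
```

To actually run the steps, each list could be passed to `subprocess.run(cmd, check=True)` in order, after checking the printed commands against the `--help` output of the installed binaries.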
Journey Context:
Standard quantization \(even K-quants\) treats the rounding error of every weight as equally important, so error can land on the weights that matter most to the model's output. The imatrix \(importance matrix\) is computed by running calibration data \(typically from the model's training domain\) through the model and recording activation statistics that indicate which weights most affect the output. During quantization these importance values weight the error being minimized, so high-importance weights are rounded more accurately; the bit width of the chosen quant type does not change. \`Q4\_K\_M\` \(Medium\) with an imatrix often outperforms \`Q4\_K\_S\` \(Small\) without one, and can come closer to FP16 quality while keeping the size benefit. Common mistakes are calibrating on data that does not match the model's domain \(e.g., code for a medical model\) or on too little data.
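The effect of importance weighting can be illustrated with a toy example. This is a simplified sketch, not llama.cpp's actual algorithm: it quantizes one small block of weights to signed 4-bit integers and searches for the scale that minimizes an importance-weighted squared error. The function names, the scale search range, and the sample numbers are all made up for illustration.

```python
def weighted_error(weights, importance, scale, quants):
    """Importance-weighted squared reconstruction error for one block."""
    return sum(h * (w - scale * q) ** 2
               for h, w, q in zip(importance, weights, quants))


def quantize_block(weights, importance, levels=8, steps=64):
    """Search a grid of scales and keep the one with the lowest weighted error.

    Quantized values are integers in [-levels, levels - 1] (signed 4-bit
    when levels is 8).
    """
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return 0.0, [0] * len(weights)
    best = None
    for k in range(steps):
        # Candidate scales around the naive choice max_abs / levels.
        scale = max_abs / levels * (0.7 + 0.6 * k / (steps - 1))
        quants = [max(-levels, min(levels - 1, round(w / scale)))
                  for w in weights]
        err = weighted_error(weights, importance, scale, quants)
        if best is None or err < best[0]:
            best = (err, scale, quants)
    return best[1], best[2]


# Made-up block: the second and fifth weights are marked as highly important.
weights = [0.90, -0.31, 0.12, 0.05, -0.77, 0.48, 0.02, -0.15]
importance = [0.1, 5.0, 0.2, 0.1, 4.0, 0.3, 0.1, 0.2]
uniform = [1.0] * len(weights)

plain_scale, plain_q = quantize_block(weights, uniform)      # no imatrix
imx_scale, imx_q = quantize_block(weights, importance)       # with imatrix

err_plain = weighted_error(weights, importance, plain_scale, plain_q)
err_imx = weighted_error(weights, importance, imx_scale, imx_q)
print(f"weighted error without importance: {err_plain:.6f}")
print(f"weighted error with importance:    {err_imx:.6f}")
```

Both searches use the same number of bits per weight. The importance-aware search only changes which rounding errors it is willing to accept, which is why a mismatched calibration set can steer the error toward the wrong weights.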
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T21:28:44.033019+00:00— report_created — created