Report #61871
[tooling] GGUF Q4_K_M quantization destroys model quality on domain-specific data
Generate an importance matrix by running llama-imatrix over roughly 100 to 1000 representative text samples from your domain, then pass --imatrix matrix.dat to llama-quantize when creating the GGUF. The quantizer then weights rounding error by how strongly each weight contributes to activations on your data, which usually preserves domain-critical behavior far better than the default quantization heuristics.
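The two steps above can be sketched as a small Python helper that only assembles the command lines. This is a minimal sketch, not a verified recipe: the file names (model-f16.gguf, calibration.txt, model-Q4_K_M.gguf) are hypothetical placeholders, and the -m, -f, and -o flags for llama-imatrix reflect common llama.cpp usage, so check them against --help for your build before running anything.

```python
import shlex


def imatrix_cmd(model, calibration, out="matrix.dat"):
    """Build the llama-imatrix call: unquantized model + calibration text -> matrix file."""
    return ["llama-imatrix", "-m", str(model), "-f", str(calibration), "-o", str(out)]


def quantize_cmd(model, out, imatrix="matrix.dat", qtype="Q4_K_M"):
    """Build the llama-quantize call that consumes the importance matrix."""
    return ["llama-quantize", "--imatrix", str(imatrix), str(model), str(out), qtype]


if __name__ == "__main__":
    # Hypothetical paths; replace with your own model and calibration file.
    steps = [
        imatrix_cmd("model-f16.gguf", "calibration.txt"),
        quantize_cmd("model-f16.gguf", "model-Q4_K_M.gguf"),
    ]
    for cmd in steps:
        # Dry run: print the commands so they can be reviewed first.
        print(shlex.join(cmd))
```

Once the printed commands look right, each list can be passed to subprocess.run(cmd, check=True) in order, since the quantize step depends on the matrix file produced by the first step.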
Journey Context:
Without an importance matrix, GGUF quantization chooses its block-wise scales to minimize weight rounding error alone, which implicitly treats every weight in a block as equally important. For domain-specific models (medical, legal, coding), a small set of weights feeding high-magnitude or outlier activation channels can be critical for performance (e.g., precise syntax in code). The importance matrix (imatrix) calibrates the quantization scales by running forward passes on representative calibration data and recording which weights matter most, similar in spirit to GPTQ's calibration or TensorRT's INT8 calibration. Most users skip this step because it requires generating a .dat file first (via llama-imatrix) and then referencing it during llama-quantize. Without it, Q4_K_M may produce broken output on strict formats (JSON brackets, function calls). With it, you can often drop to Q3_K_M with similar perceived quality, saving VRAM. The tradeoff is the one-time compute cost of processing the calibration data and the need to keep the matrix file as part of your quantization workflow.
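To make the idea of importance-weighted scales concrete, here is a toy illustration in pure Python. It is not llama.cpp's actual algorithm: it assumes a single 8-weight block, signed 4-bit levels, a brute-force scale search, and made-up importance values standing in for something like the mean squared activation per input channel. It only shows why weighting the error can change which scale is chosen.

```python
def quantize_block(weights, scale, qmin=-8, qmax=7):
    """Round each weight to the nearest signed 4-bit level, then dequantize."""
    return [max(qmin, min(qmax, round(w / scale))) * scale for w in weights]


def weighted_error(weights, dequantized, importance):
    """Squared rounding error, weighted per weight by its importance."""
    return sum(i * (w - d) ** 2 for w, d, i in zip(weights, dequantized, importance))


def best_scale(weights, importance, steps=200):
    """Brute-force search for the scale with the lowest weighted error."""
    max_abs = max(abs(w) for w in weights)
    candidates = [max_abs / 8 * (0.5 + k / steps) for k in range(steps + 1)]
    return min(
        candidates,
        key=lambda s: weighted_error(weights, quantize_block(weights, s), importance),
    )


# Toy block: one large outlier weight that (by assumption) rarely matters,
# and several small weights that sit on heavily used channels.
weights = [0.02, -0.03, 0.9, 0.04, -0.05, 0.01, 0.03, -0.02]
uniform = [1.0] * len(weights)                                  # no imatrix
importance = [50.0, 40.0, 0.1, 60.0, 45.0, 30.0, 55.0, 35.0]    # made-up values

scale_plain = best_scale(weights, uniform)
scale_imat = best_scale(weights, importance)

# Score both choices with the importance-weighted error.
err_plain = weighted_error(weights, quantize_block(weights, scale_plain), importance)
err_imat = weighted_error(weights, quantize_block(weights, scale_imat), importance)

print(f"scale without importance: {scale_plain:.4f}, weighted error: {err_plain:.6f}")
print(f"scale with importance:    {scale_imat:.4f}, weighted error: {err_imat:.6f}")
```

Because both searches run over the same candidate scales, the importance-aware choice can never score worse on the weighted error than the plain one. How much it helps in practice depends on how well the calibration data matches the real workload.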
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T10:20:13.061116+00:00— report_created — created