Report #38934
[tooling] Quantized GGUF models perform significantly worse than expected at low bit widths \(Q4\_K\_M, Q3\_K\_L, IQ quants\)
Generate an importance matrix \(imatrix\) by running llama-imatrix on ~1GB of representative calibration data, then pass it to llama-quantize with --imatrix so that quantization error is weighted by importance. This makes very low-bit models viable, such as IQ2\_XXS and others at around 2.5 bpw.
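The two steps can be sketched as a small Python helper that builds the command lines. The file names are hypothetical placeholders, and the flags shown (-m, -f, -o for llama-imatrix and --imatrix for llama-quantize) should be checked against --help for your llama.cpp build before running anything.

```python
import shlex


def build_imatrix_cmd(model_path, calibration_path, imatrix_path):
    """Command that computes an importance matrix from calibration text."""
    return [
        "llama-imatrix",
        "-m", model_path,        # unquantized source model (GGUF)
        "-f", calibration_path,  # representative calibration text
        "-o", imatrix_path,      # where the importance matrix is written
    ]


def build_quantize_cmd(model_path, imatrix_path, output_path, quant_type="IQ2_XXS"):
    """Command that quantizes the model using the importance matrix."""
    return [
        "llama-quantize",
        "--imatrix", imatrix_path,
        model_path,
        output_path,
        quant_type,
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    steps = [
        build_imatrix_cmd("model-f16.gguf", "calibration.txt", "imatrix.dat"),
        build_quantize_cmd("model-f16.gguf", "imatrix.dat", "model-iq2_xxs.gguf"),
    ]
    for cmd in steps:
        # Print instead of executing; use subprocess.run(cmd, check=True) to run.
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to review paths and the target quant type before committing to a long calibration run.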
Journey Context:
Standard quantization treats every weight as equally important, but model weights differ in how much they affect the output. An imatrix is computed by running calibration data \(domain-specific text\) through the unquantized \(for example FP16\) model and accumulating activation statistics for each weight tensor, which yields an importance value per tensor column rather than a single per-layer score. llama-quantize uses these values to weight the rounding error when it quantizes each tensor, so the weights that matter most are preserved most accurately while the less important ones absorb more of the error. This is what makes 2-bit quantization \(IQ2\_XXS\) usable; without the weighting, output at that size tends to be incoherent. The calibration data should match your use case: for example, code for a coding model, or medical text for a biomedical model. Without an imatrix, the very low-bit IQ quants are often useless.
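One way to picture the effect of an importance matrix is as a weighted rounding error. The toy sketch below is not llama.cpp's actual code: it assumes importance is the mean squared activation per input column and that a single scale is chosen per row of weights by grid search. All function names are hypothetical.

```python
def column_importance(activations):
    """Mean squared activation per input column (one row per token)."""
    n_tokens = len(activations)
    n_cols = len(activations[0])
    return [
        sum(row[j] ** 2 for row in activations) / n_tokens
        for j in range(n_cols)
    ]


def quantize(weights, scale, qmax):
    """Round each weight to an integer level in [-qmax, qmax], then dequantize."""
    out = []
    for w in weights:
        level = max(-qmax, min(qmax, round(w / scale)))
        out.append(level * scale)
    return out


def weighted_error(weights, dequantized, importance):
    """Squared rounding error, with each column scaled by its importance."""
    return sum(
        imp * (w - d) ** 2
        for w, d, imp in zip(weights, dequantized, importance)
    )


def best_scale(weights, importance, qmax=1, steps=200):
    """Grid-search the scale that minimizes the importance-weighted error.

    Assumes at least one weight is nonzero.
    """
    max_abs = max(abs(w) for w in weights)
    best_err, best = None, None
    for k in range(1, steps + 1):
        scale = (max_abs / qmax) * k / steps
        err = weighted_error(weights, quantize(weights, scale, qmax), importance)
        if best_err is None or err < best_err:
            best_err, best = err, scale
    return best


if __name__ == "__main__":
    weights = [0.9, -0.1, 0.35, 0.05]
    # Made-up calibration activations: the third column is by far the most active.
    activations = [[0.1, 0.2, 3.0, 0.1], [0.2, 0.1, 2.5, 0.3]]
    importance = column_importance(activations)

    uniform = best_scale(weights, [1.0] * len(weights))
    weighted = best_scale(weights, importance)
    for name, scale in (("uniform", uniform), ("weighted", weighted)):
        err = weighted_error(weights, quantize(weights, scale, 1), importance)
        print(name, "scale:", scale, "importance-weighted error:", err)
```

Because both searches cover the same candidate scales, the importance-weighted choice can never do worse on the weighted error than the uniform choice. How much it helps depends on how uneven the activation statistics are.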
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T19:49:27.951878+00:00— report_created — created