Report #42111
[tooling] GGUF quantization produces garbled output or high perplexity at 2-bit/3-bit (IQ2_XXS/IQ3_XXS)
Generate an importance matrix (imatrix) by running llama-imatrix on ~1GB of representative text, then pass it to llama-quantize with --imatrix imatrix.dat when quantizing. This brings perplexity much closer to that of higher-bit quants than quantizing without an imatrix.
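A minimal sketch of the two-step workflow, written as a Python wrapper that only assembles the command lines. The file names (model-f16.gguf, calibration.txt, model-iq2_xxs.gguf) are placeholders, and the helper names are hypothetical. It assumes llama-imatrix and llama-quantize are on PATH and accept the -m/-f/-o and --imatrix options as in current llama.cpp builds; check each tool's --help before running.

```python
import shlex
import subprocess


def build_imatrix_cmd(model_f16, calibration_txt, imatrix_out="imatrix.dat"):
    """Command that computes the importance matrix from calibration text."""
    return ["llama-imatrix", "-m", str(model_f16),
            "-f", str(calibration_txt), "-o", str(imatrix_out)]


def build_quantize_cmd(model_f16, quant_out, quant_type="IQ2_XXS",
                       imatrix="imatrix.dat"):
    """Command that quantizes the model using the importance matrix."""
    return ["llama-quantize", "--imatrix", str(imatrix),
            str(model_f16), str(quant_out), quant_type]


def run_pipeline(model_f16, calibration_txt, quant_out,
                 quant_type="IQ2_XXS", imatrix="imatrix.dat", dry_run=True):
    """Print both commands in order; execute them only if dry_run is False."""
    cmds = [
        build_imatrix_cmd(model_f16, calibration_txt, imatrix),
        build_quantize_cmd(model_f16, quant_out, quant_type, imatrix),
    ]
    for cmd in cmds:
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops the pipeline if a step fails
            subprocess.run(cmd, check=True)
    return cmds


if __name__ == "__main__":
    # Placeholder file names; dry_run=True only prints the commands.
    run_pipeline("model-f16.gguf", "calibration.txt", "model-iq2_xxs.gguf")
```

With dry_run=True nothing is executed, so the printed commands can be reviewed first. The same imatrix.dat can be reused for other quant types of the same model by changing quant_type.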
Journey Context:
Users often apply uniform quantization blindly, which destroys performance on critical 'sensitive' layers at 2-bit and 3-bit. The imatrix records activation statistics from the calibration text, which lets the quantizer preserve the weights that most affect the output. Alternatives like leaving the output tensor unquantized (--leave-output-tensor) help but do not match imatrix quality. The cost is a one-time ~10-30 minute computation on CPU, but it enables viable 2-bit 70B models on 24GB VRAM.
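As a rough check of the 24GB claim, the sketch below estimates the weight footprint from parameter count and bits per weight. The bits-per-weight figures (about 2.06 for IQ2_XXS, about 3.06 for IQ3_XXS) are assumptions taken from the type listing printed by llama-quantize. The estimate ignores tensors kept at higher precision, the KV cache, and runtime buffers, so real usage will be higher.

```python
def estimate_weights_gb(n_params, bits_per_weight):
    """Approximate size of the quantized weights in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumed nominal bits per weight for the two quant types in this report.
ASSUMED_BPW = {"IQ2_XXS": 2.06, "IQ3_XXS": 3.06}

if __name__ == "__main__":
    for name, bpw in ASSUMED_BPW.items():
        size = estimate_weights_gb(70e9, bpw)
        print(f"70B at {name}: about {size:.1f} GB of weights")
```

Under these assumptions a 70B model at IQ2_XXS comes to roughly 18 GB of weights, leaving a few GB of a 24GB card for context, while IQ3_XXS at roughly 27 GB would not fit without offloading.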
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T01:09:23.347926+00:00— report_created — created