Report #6187
[tooling] IQ2_XXS/IQ3_XXS quantization in llama.cpp produces gibberish or high perplexity without calibration data
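Generate an importance matrix (imatrix) with the dedicated imatrix tool (not `perplexity`): `./llama-imatrix -m model.gguf -f calibration.txt -o imatrix.dat` (older builds name the binary `./imatrix`), run on ~1GB of representative text (much smaller calibration files are also commonly used). Then pass the result to the quantizer: `./llama-quantize --imatrix imatrix.dat model.gguf IQ2_XXS`. Without an imatrix, IQ quants round every weight as if it mattered equally; with one, rounding error is steered away from the weights that matter most for model coherence.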
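The two steps can be sketched as one small script. This is a minimal, unverified sketch: it assumes the imatrix is produced by `llama-imatrix` (older builds name it `imatrix`), that both binaries sit in the current directory, and that the file names match those in this report. The `run_step` helper and the `RUN`, `MODEL`, `CALIB`, `IMATRIX`, `OUT`, and `QTYPE` variables are hypothetical names introduced for illustration. By default the script only prints the commands, so they can be checked before anything runs.

```shell
#!/bin/sh
# Dry-run sketch of the imatrix + quantize workflow.
# All variable names and defaults below are assumptions for illustration.
set -eu

MODEL="${MODEL:-model.gguf}"          # full-precision GGUF model
CALIB="${CALIB:-calibration.txt}"     # representative calibration text
IMATRIX="${IMATRIX:-imatrix.dat}"     # importance matrix output file
OUT="${OUT:-model-IQ2_XXS.gguf}"      # quantized model to write
QTYPE="${QTYPE:-IQ2_XXS}"             # target quantization type
RUN="${RUN:-0}"                       # set RUN=1 to actually execute

# Print a command; execute it only when RUN=1.
run_step() {
  printf '+ %s\n' "$*"
  if [ "$RUN" = "1" ]; then
    "$@"
  fi
}

# Step 1: collect activation statistics into the importance matrix.
run_step ./llama-imatrix -m "$MODEL" -f "$CALIB" -o "$IMATRIX"

# Step 2: quantize, using the importance matrix to weight rounding error.
run_step ./llama-quantize --imatrix "$IMATRIX" "$MODEL" "$OUT" "$QTYPE"
```

Running it with `RUN=1 sh quantize.sh` (script name assumed) executes both steps; step 2 fails if step 1 did not produce the imatrix file, because `set -e` stops the script on the first error.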
Journey Context:
IQ quants (IQ2_XXS, IQ3_XXS) pack weights so tightly that their quality depends on knowing which weights are sensitive. Standard quantization treats the rounding error of every weight as equally important. When IQ2_XXS is applied without an imatrix, the model can emit near-random tokens because sensitive weights, such as those in attention layers, are rounded as coarsely as everything else. The imatrix is computed by running calibration text through the full-precision (e.g. FP16) model and accumulating squared activation magnitudes for each weight tensor, which yields an importance value per input column of that tensor. This differs from conventional calibration, which tunes scales and zero-points: the imatrix weights the rounding error so that important weights are reproduced more accurately within the same bit budget. It is essential for viable sub-3-bit quantization.
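The idea behind importance weighting can be written as a short derivation. This is a simplified sketch and not the exact objective used by the llama.cpp kernels. The symbols are assumptions introduced here: $W$ is a weight matrix, $\hat W$ its quantized version, $\Delta = W - \hat W$, $x$ an input activation vector, and $a_j$ the importance of input column $j$ estimated from $T$ calibration tokens.

```latex
\begin{align}
  \mathbb{E}_x \big\| (W - \hat W)\,x \big\|^2
    &= \sum_i \sum_{j,k} \Delta_{ij}\,\Delta_{ik}\,\mathbb{E}[x_j x_k] \\
    &\approx \sum_i \sum_j \mathbb{E}[x_j^2]\,\Delta_{ij}^2
       \qquad \text{(cross terms } j \neq k \text{ dropped)} \\
  a_j &= \frac{1}{T} \sum_{t=1}^{T} x_{t,j}^2, \qquad
  E(\hat W) = \sum_i \sum_j a_j \,\big(W_{ij} - \hat W_{ij}\big)^2
\end{align}
```

Setting $a_j = 1$ for every $j$ recovers the plain squared error, which corresponds to quantizing without an imatrix. With measured $a_j$, errors on columns that see large activations cost more, so the quantizer spends its limited precision there.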
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T23:19:15.908637+00:00— report_created — created