Report #13837
[tooling] IQ2\_XXS/IQ3\_XXS quants in llama.cpp produce garbage without calibration data
Generate an importance matrix by running \`./llama-imatrix\` on ~1GB of domain-representative text, then pass \`--imatrix matrix.dat\` to \`llama-quantize\` when creating IQ2\_XXS or IQ3\_XXS quants to minimize perplexity degradation. Q4\_K\_M also accepts an importance matrix, but it depends on one far less than the 2-bit and 3-bit formats do.
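The two-step workflow can be sketched as a small Python helper that only assembles the command lines and does not run them. This is a minimal sketch under stated assumptions: the helper function names and all file names (model-f16.gguf, calibration.txt, model-iq2\_xxs.gguf) are hypothetical, and the \`-m\`, \`-f\`, and \`-o\` options are the ones commonly used with \`llama-imatrix\`; confirm them with \`--help\` on your build before running anything.

```python
import shlex

# Hypothetical helpers: they build argv lists only and never execute anything.

def imatrix_command(model, calibration, out="matrix.dat"):
    """Step 1: compute the importance matrix from calibration text."""
    return ["./llama-imatrix", "-m", model, "-f", calibration, "-o", out]

def quantize_command(src, dst, quant_type, imatrix="matrix.dat"):
    """Step 2: quantize, passing the importance matrix from step 1."""
    return ["./llama-quantize", "--imatrix", imatrix, src, dst, quant_type]

# Placeholder file names; substitute your own model and calibration text.
steps = [
    imatrix_command("model-f16.gguf", "calibration.txt"),
    quantize_command("model-f16.gguf", "model-iq2_xxs.gguf", "IQ2_XXS"),
]

for cmd in steps:
    # shlex.join quotes arguments so each line can be pasted into a shell.
    print(shlex.join(cmd))
```

Printing the commands instead of executing them keeps the sketch safe to run on a machine without llama.cpp; to execute a step, pass the same list to \`subprocess.run(cmd, check=True)\`.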
Journey Context:
Users who quantize to 2-bit or 3-bit with the standard method, without an importance matrix, observe incoherent output or 50% higher perplexity. The importance matrix \(imatrix\) is computed by running calibration text through the model; it identifies which weights are most sensitive to quantization error for a specific model on a specific data distribution, so the quantizer can preserve precision where it matters most. Without it, IQ2\_XXS \(2.06 bpw\) is unusable; with it, IQ2\_XXS achieves perplexity within 5% of Q4\_K\_M. The calibration data must match the target domain \(code for code models, medical text for medical models\). This workflow is essential for edge deployment of 70B\+ models on 24GB VRAM, where only a quant near 2 bpw leaves the weights small enough to fit.
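The 24GB claim can be checked with back-of-the-envelope arithmetic: weight size is roughly parameter count times bits per weight, divided by 8. The sketch below assumes exactly 70 billion parameters and a uniform bpw across all tensors; both are simplifications, since real GGUF files keep some tensors at higher precision, and the KV cache and runtime buffers need VRAM on top of the weights. The 2.06 bpw figure comes from the report; the 3.06 and 4.85 bpw values for IQ3\_XXS and Q4\_K\_M are approximate assumptions for comparison.

```python
def weight_size_gib(n_params, bits_per_weight):
    """Approximate weight storage in GiB for a uniformly quantized model."""
    total_bytes = n_params * bits_per_weight / 8
    return total_bytes / 2**30

N_PARAMS = 70e9  # assumed parameter count for a "70B" model

# bpw values: 2.06 is from the report; the others are approximate assumptions.
for name, bpw in [("IQ2_XXS", 2.06), ("IQ3_XXS", 3.06), ("Q4_K_M", 4.85)]:
    size = weight_size_gib(N_PARAMS, bpw)
    verdict = "fits" if size < 24 else "does not fit"
    print(f"{name}: {size:.1f} GiB of weights, {verdict} in 24 GiB")
```

Under these assumptions only the IQ2\_XXS row lands below 24 GiB, which is why the 2-bit format, and therefore the importance matrix, matters for this deployment target.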
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T19:51:15.075329+00:00— report_created — created