Report #31593
[tooling] GGUF quantization causes catastrophic perplexity degradation in code/math models despite using Q4_K_M
Generate an importance matrix (imatrix) using llama.cpp's imatrix example with 100-200MB of representative calibration data, then pass it to llama-quantize via --imatrix to preserve critical outlier weights in FFN up-projection layers.
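The two steps can be sketched as a small Python helper that only builds the command lines. This is a minimal sketch under assumptions: it assumes a recent llama.cpp build where the tools are named llama-imatrix and llama-quantize (older builds used different binary names), and every file name below is a hypothetical placeholder. Check the flags against your build's --help output before running anything.

```python
import shlex

def build_imatrix_cmd(model, calibration, out="imatrix.dat"):
    """Command that computes an importance matrix from calibration text."""
    # -m: source model, -f: calibration text file, -o: output imatrix file
    return ["llama-imatrix", "-m", model, "-f", calibration, "-o", out]

def build_quantize_cmd(imatrix, src, dst, qtype="Q4_K_M"):
    """Command that quantizes src to dst using the importance matrix."""
    return ["llama-quantize", "--imatrix", imatrix, src, dst, qtype]

# Hypothetical file names, for illustration only.
steps = [
    build_imatrix_cmd("model-f16.gguf", "calibration.txt"),
    build_quantize_cmd("imatrix.dat", "model-f16.gguf", "model-Q4_K_M.gguf"),
]

for cmd in steps:
    print(shlex.join(cmd))
    # To execute instead of printing:
    # import subprocess; subprocess.run(cmd, check=True)
```

Printing the commands first makes it easy to review paths and the quantization type before committing to a long run on a large model.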
Journey Context:
Without an importance matrix, the quantizer treats every weight in a tensor as equally important when rounding, so high-magnitude outliers in the 'up' and 'gate' projections of the feed-forward networks absorb rounding error as readily as any other weight. In this report that showed up as 2-3x perplexity spikes on code and math models, making Q4_K_M unusable for 70B reasoning tasks. The imatrix records per-channel sensitivity from activations on calibration data, which lets the quantizer steer rounding error away from the channels that matter most instead of treating them uniformly. The tradeoff is a one-time compute cost for the matrix (~10min here), but it allowed aggressive Q3_K_M quantization with quality equivalent to Q5_K_M without an imatrix. Common failure: using random Wikipedia text for calibration on code models; the data must match the target domain.
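To make the idea of per-channel sensitivity concrete, here is a toy illustration of importance-weighted rounding. It is not llama.cpp's actual algorithm: the importance measure (mean squared activation per channel), the symmetric 15-level grid, and the grid search over scales are all simplifying assumptions, and the numbers are made up.

```python
def channel_importance(activations):
    """Mean squared activation per input channel (one column per channel)."""
    n = len(activations)
    dims = len(activations[0])
    return [sum(row[j] ** 2 for row in activations) / n for j in range(dims)]

def quantize(weights, scale, levels=7):
    """Symmetric round-to-nearest onto integer steps in [-levels, levels]."""
    return [max(-levels, min(levels, round(w / scale))) * scale for w in weights]

def weighted_error(weights, quantized, importance):
    """Squared rounding error, weighted by per-channel importance."""
    return sum(i * (w - q) ** 2 for w, q, i in zip(weights, quantized, importance))

def best_scale(weights, importance, levels=7, steps=200):
    """Grid-search the scale minimizing importance-weighted squared error."""
    max_abs = max(abs(w) for w in weights)
    candidates = [max_abs / levels * (0.5 + k / steps) for k in range(steps + 1)]
    return min(
        candidates,
        key=lambda s: weighted_error(weights, quantize(weights, s, levels), importance),
    )

# Made-up data: channel 1 sees much larger activations than the others.
activations = [[0.1, 4.0, 0.2, 0.1], [0.2, 3.5, 0.1, 0.3], [0.1, 4.2, 0.3, 0.2]]
weights = [0.9, -0.31, 0.12, 0.05]

importance = channel_importance(activations)
uniform = [1.0] * len(weights)

# Choose a scale with and without importance, then score both the same way.
plain_q = quantize(weights, best_scale(weights, uniform))
weighted_q = quantize(weights, best_scale(weights, importance))
err_plain = weighted_error(weights, plain_q, importance)
err_weighted = weighted_error(weights, weighted_q, importance)
print(err_plain, err_weighted)
```

Because both scales come from the same candidate grid, the importance-aware choice can never score worse on the weighted error than the uniform one. How much it helps depends on how unevenly the activations are distributed, which is why the calibration data needs to resemble the target workload.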
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T07:24:45.537568+00:00— report_created — created