Report #90415
[tooling] Quantized model quality degradation with llama.cpp default k-quants
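Use importance matrix (imatrix) quantization: run `./imatrix -m model.gguf -f training_data.txt -o imatrix.dat`, then `./quantize --imatrix imatrix.dat model.gguf output.gguf IQ4_XS`. The `--imatrix` option must come before the model path; recent llama.cpp builds name these binaries `llama-imatrix` and `llama-quantize`. In VRAM-constrained scenarios, prefer IQ4_XS or IQ3_XXS over Q4_K_M.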
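The two steps can be wrapped in a small driver. The following is a minimal Python sketch, not an official llama.cpp script: the helper names (`imatrix_cmd`, `quantize_cmd`, `run_pipeline`) are invented for illustration, the binary paths assume older `./imatrix` and `./quantize` builds, and it assumes the quantize tool takes `--imatrix` before its positional arguments. It defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess


def imatrix_cmd(model, calibration, imatrix_out, binary="./imatrix"):
    """Build the command that computes the importance matrix."""
    return [binary, "-m", model, "-f", calibration, "-o", imatrix_out]


def quantize_cmd(model, output, qtype, imatrix_file, binary="./quantize"):
    """Build the quantize command; options go before the positional arguments."""
    return [binary, "--imatrix", imatrix_file, model, output, qtype]


def run_pipeline(model, calibration, output, qtype="IQ4_XS",
                 imatrix_out="imatrix.dat", dry_run=True):
    """Print both commands and, unless dry_run is set, execute them in order."""
    commands = [
        imatrix_cmd(model, calibration, imatrix_out),
        quantize_cmd(model, output, qtype, imatrix_out),
    ]
    for cmd in commands:
        print("+ " + shlex.join(cmd))
        if not dry_run:
            # check=True stops the pipeline if a step fails.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # File names are placeholders taken from the report text.
    run_pipeline("model.gguf", "training_data.txt", "output.gguf")
```

Pass `dry_run=False` only after checking the printed commands against the usage text of your own llama.cpp build.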
Journey Context:
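Without an importance matrix, default k-quants (e.g., Q4_K_M) treat every weight as equally important when minimizing rounding error. An imatrix records activation statistics from calibration text, and the quantizer uses them to concentrate precision on the weights that most affect the output, reducing perplexity by roughly 10-15% at the same bit-width, though the gain varies with model, quant type, and calibration data. Common mistake: assuming the imatrix only applies to IQ types. `--imatrix` also improves k-quants such as Q4_K_M, and the lowest-bit IQ types (e.g., IQ2_XXS) cannot be produced without one. A second mistake is passing the file as `-i` after the quant type; the quantize tool expects `--imatrix` before the model path. Tradeoff: imatrix quantization needs representative calibration text (a few MB is typically enough) and an extra pass to compute the matrix, and quantization itself is slower. The imatrix adds no inference cost, but IQ types can run slower than k-quants on CPU backends.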
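The idea of importance-weighted rounding can be illustrated with a toy Python sketch. This is not llama.cpp's actual algorithm: it assumes a symmetric integer grid with one scale per block, the importance values are hypothetical stand-ins for imatrix statistics, and the function names are invented for illustration. The scale is chosen twice, once treating all weights equally and once weighting the error by importance, and both results are scored with the importance-weighted error.

```python
def quantize(weights, scale, levels=7):
    """Round each weight to the nearest grid point, clamped to [-levels, levels] steps."""
    return [scale * max(-levels, min(levels, round(w / scale))) for w in weights]


def weighted_error(weights, dequantized, importance):
    """Squared rounding error, with each weight's error scaled by its importance."""
    return sum(a * (w - d) ** 2 for w, d, a in zip(weights, dequantized, importance))


def best_scale(weights, importance, levels=7, steps=200):
    """Grid-search the block scale that minimizes the importance-weighted error."""
    max_abs = max(abs(w) for w in weights)
    candidates = [max_abs / levels * (0.5 + i / steps) for i in range(steps + 1)]
    return min(
        candidates,
        key=lambda s: weighted_error(weights, quantize(weights, s, levels), importance),
    )


weights = [0.9, -0.05, 0.31, -0.62, 0.12, 0.44, -0.27, 0.08]
# Hypothetical importance values: small weights that see large activations.
importance = [0.1, 9.0, 0.2, 0.1, 7.5, 0.3, 0.2, 8.0]
uniform = [1.0] * len(weights)

plain_scale = best_scale(weights, uniform)        # no imatrix: all weights equal
imatrix_scale = best_scale(weights, importance)   # imatrix-style weighting

# Score both choices with the same importance-weighted metric.
plain_err = weighted_error(weights, quantize(weights, plain_scale), importance)
imatrix_err = weighted_error(weights, quantize(weights, imatrix_scale), importance)
print(f"unweighted scale: {plain_scale:.4f}, weighted error {plain_err:.6f}")
print(f"weighted scale:   {imatrix_scale:.4f}, weighted error {imatrix_err:.6f}")
```

Because both scales come from the same candidate set, the importance-weighted choice can never score worse on the weighted metric; how much better it scores depends on how unevenly the importance is distributed.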
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T10:21:20.909438+00:00— report_created — created