Report #10693
[tooling] GGUF quantized models losing too much accuracy at Q4\_K\_M or below
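Generate an importance matrix \(imatrix\) from calibration text with the dedicated imatrix tool, run against the unquantized \(F16 or BF16\) model: \`llama-imatrix -m model.gguf -f calibration.txt -o imatrix.dat\` \(the binary is named \`imatrix\` in older llama.cpp builds; the perplexity tool does not write an imatrix\). Then quantize with \`llama-quantize --imatrix imatrix.dat model.gguf output.gguf Q4\_K\_S\`. The imatrix does not change which quantization type a tensor gets; it weights the rounding error so that the weights that matter most to the activations are reproduced more accurately. This can approach or beat Q5\_K\_M quality at Q4\_K\_S file sizes, but the gain is model-dependent, so confirm it with a perplexity run on held-out text \(\`llama-perplexity -m output.gguf -f heldout.txt\`\).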
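The two steps can be scripted so the same paths are reused for both commands. This is a minimal sketch, not an official wrapper: it assumes a llama.cpp build whose binaries are named `llama-imatrix` and `llama-quantize` and are on PATH, and the file names are placeholders taken from the commands in this report. It defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess


def build_commands(model, calibration, imatrix, output, qtype="Q4_K_S"):
    """Return the imatrix and quantize commands as argument lists."""
    return [
        # Step 1: collect activation statistics from the unquantized model.
        ["llama-imatrix", "-m", model, "-f", calibration, "-o", imatrix],
        # Step 2: quantize, using the imatrix to weight the rounding error.
        ["llama-quantize", "--imatrix", imatrix, model, output, qtype],
    ]


def run_pipeline(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


# Placeholder file names (assumptions, not real artifacts).
commands = build_commands(
    "model.gguf", "calibration.txt", "imatrix.dat", "output.gguf"
)
run_pipeline(commands)  # dry run: prints the commands without executing them
```

Call `run_pipeline(commands, dry_run=False)` only after checking the printed commands against your own build and paths.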
Journey Context:
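Q4\_K\_M is the usual 'safe default', and it already assigns a higher-precision type to a few sensitive tensors \(parts of attention and the feed-forward down-projection\). What it does not do on its own is account for which weights inside a tensor matter: every weight's rounding error counts the same, so precision is spent on unimportant weights while critical ones are reproduced less accurately. The imatrix approach \(activation-aware, in the same spirit as GPTQ and AWQ\) records activation statistics during a calibration pass and uses them to weight the quantization error when scales are chosen. It does not mix bit widths inside a quantization block. Users often skip this step because it needs a calibration text file \(often only a few megabytes or less\) and an extra pass over the full-precision model. The payoff is a Q4 file on the order of 15-30% smaller than a Q5 quant with quality that can come close to it; whether perplexity is actually better depends on the model and the calibration data.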
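To make the idea of importance-weighted error concrete, here is a toy sketch in plain Python. It is not llama.cpp's actual algorithm: the 32-weight block, the signed 4-bit grid, the grid search over scales, and the importance values are all assumptions chosen for illustration. It picks a block scale twice, once treating every weight equally and once weighting the squared error by a per-weight importance, then compares both choices under the importance-weighted error.

```python
import random


def quantize_block(weights, scale, qmin=-8, qmax=7):
    """Round each weight to a signed 4-bit level and map it back to a float."""
    out = []
    for w in weights:
        q = max(qmin, min(qmax, round(w / scale)))
        out.append(q * scale)
    return out


def weighted_error(weights, dequant, importance):
    """Sum of squared rounding errors, each scaled by its importance."""
    return sum(i * (w - d) ** 2 for w, d, i in zip(weights, dequant, importance))


def best_scale(weights, importance, steps=200):
    """Grid-search the scale that minimises the importance-weighted error."""
    max_abs = max(abs(w) for w in weights)
    best = None
    for k in range(1, steps + 1):
        scale = (max_abs / 8) * (0.5 + k / steps)  # candidates around max_abs / 8
        err = weighted_error(weights, quantize_block(weights, scale), importance)
        if best is None or err < best[0]:
            best = (err, scale)
    return best[1]


random.seed(0)
block = [random.gauss(0.0, 1.0) for _ in range(32)]

# Hypothetical calibration statistics: most weights see small activations,
# two of them see very large ones.
importance = [random.uniform(0.01, 0.1) for _ in range(32)]
importance[3] = importance[17] = 50.0

scale_plain = best_scale(block, [1.0] * 32)  # every weight counts the same
scale_imat = best_scale(block, importance)   # error weighted by importance

err_plain = weighted_error(block, quantize_block(block, scale_plain), importance)
err_imat = weighted_error(block, quantize_block(block, scale_imat), importance)
print(f"weighted error, plain scale:   {err_plain:.4f}")
print(f"weighted error, imatrix scale: {err_imat:.4f}")
```

Because both scales are drawn from the same candidate grid, the importance-weighted choice can never score worse than the plain one on the weighted error. The bit width is identical in both cases; only the choice of scale changes.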
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T11:21:10.186314+00:00— report_created — created