Report #44834
[tooling] GGUF Q4_K_M quantization degrades accuracy unacceptably for code/math models versus fp16
Generate an importance matrix first, then pass it to the quantizer:
./llama-imatrix -m model.gguf -f calibration.txt -o imatrix.dat --no-ppl
./llama-quantize --imatrix imatrix.dat model.gguf output.gguf Q4_K_M
Use calibration data that matches your target domain (e.g., Python files for CodeLlama).
Journey Context:
Without an importance matrix, the quantizer minimizes rounding error uniformly across the weights of each tensor, with no information about which weights matter most for the model's outputs. The imatrix (importance matrix) workflow runs the model over representative data and records activation statistics; the quantizer then uses them to weight the rounding error so that the most influential weights are preserved more accurately. As reported here, skipping this step causes 40-60% higher perplexity degradation at Q4_K_M. The calibration step can take 1-2 hours but is essential for code/math models, where standard Q4 fails. A common mistake is to calibrate on generic text (Wikipedia) instead of domain-matched data (GitHub for code), which negates the benefit.
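The domain-matching step can be sketched as a small shell helper. This is a minimal sketch, not a verified recipe: the function names build_calibration and quantize_with_imatrix, the directory layout, and the default py extension are assumptions made for illustration. The llama-imatrix and llama-quantize invocations are the ones given in this report.

```shell
#!/bin/sh
# Hypothetical helper: concatenate every file with a given extension under
# SRC_DIR into one calibration text file, one blank line between files.
# Usage: build_calibration SRC_DIR OUT_FILE [EXT]   (EXT defaults to "py")
build_calibration() {
    src_dir=$1
    out_file=$2
    ext=${3:-py}
    : > "$out_file"    # create or truncate the output file
    find "$src_dir" -type f -name "*.$ext" | sort | while IFS= read -r f; do
        cat "$f" >> "$out_file"
        printf '\n' >> "$out_file"
    done
}

# Hypothetical wrapper around the two commands from this report.
# Assumes ./llama-imatrix and ./llama-quantize are in the current directory.
# Usage: quantize_with_imatrix MODEL.gguf CALIBRATION.txt OUTPUT.gguf
quantize_with_imatrix() {
    model=$1
    calibration=$2
    output=$3
    ./llama-imatrix -m "$model" -f "$calibration" -o imatrix.dat --no-ppl &&
        ./llama-quantize --imatrix imatrix.dat "$model" "$output" Q4_K_M
}
```

For a code model, one might run build_calibration ./my_python_repo calibration.txt py followed by quantize_with_imatrix model.gguf calibration.txt output.gguf, where ./my_python_repo is a placeholder for a directory of representative source files.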
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T05:43:18.588696+00:00— report_created — created