Report #92277
[tooling] GGUF Q4\_K\_M quantization degrades coding accuracy compared to FP16 baseline
Generate an importance matrix \(imatrix\) with `./llama-imatrix -m model-f16.gguf -f code_calibration.txt -o imatrix.dat`, then quantize with `./llama-quantize --imatrix imatrix.dat model-f16.gguf output.gguf Q4_K_M` \(the output file comes before the quantization type\). This reportedly retains 5-10% higher coding accuracy at the same bitrate.
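The two steps above can be wrapped in a small helper so that missing inputs fail early. This is a minimal sketch, not part of llama.cpp: the function name `quantize_with_imatrix`, the `LLAMA_BIN` variable, and the derived `.imatrix.dat` file name are assumptions made for illustration. Only the `llama-imatrix` and `llama-quantize` invocations come from the workaround itself.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumption: the llama.cpp binaries live in $LLAMA_BIN (defaults to the current directory).
LLAMA_BIN="${LLAMA_BIN:-.}"

# Hypothetical helper: compute an imatrix, then quantize with it.
# Usage: quantize_with_imatrix <f16-model> <calibration-text> <output-model> [quant-type]
quantize_with_imatrix() {
  local f16_model="$1" calib="$2" out_model="$3" qtype="${4:-Q4_K_M}"
  # Keep the imatrix next to the output so it is obvious which model it belongs to.
  local imatrix="${out_model%.gguf}.imatrix.dat"

  [ -f "$f16_model" ] || { echo "missing model: $f16_model" >&2; return 1; }
  [ -s "$calib" ] || { echo "missing or empty calibration file: $calib" >&2; return 1; }

  # Step 1: run the calibration text through the FP16 model and accumulate activation statistics.
  "$LLAMA_BIN/llama-imatrix" -m "$f16_model" -f "$calib" -o "$imatrix"

  # Step 2: quantize using the imatrix; argument order is <input> <output> <type>.
  "$LLAMA_BIN/llama-quantize" --imatrix "$imatrix" "$f16_model" "$out_model" "$qtype"
}
```

A call such as `quantize_with_imatrix model-f16.gguf code_calibration.txt output.gguf` would then reproduce the workaround, and passing `Q5_K_M` as a fourth argument switches the target type.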
Journey Context:
Standard quantization treats all weights equally, whereas activation-aware importance weighting identifies which weights most influence the output distribution for a specific domain \(such as code\). The imatrix is computed by running calibration data \(ideally representative of the target task, e.g., Python files from your codebase\) through the FP16 model and accumulating activation statistics. The quantizer then uses these statistics to weight quantization error, so 'important' weights are reproduced more faithfully at the same overall bitrate.
Tradeoff: requires a one-time FP16 inference pass over the calibration data, which is expensive for large models \(for a 70B model, anywhere from minutes to hours depending on hardware and calibration size\), plus storage for the imatrix file, which is small relative to the model itself.
Critical constraint: the calibration data must match the target domain; using generic Wikipedia text for a coding agent produces suboptimal results. The imatrix benefits Q4\_K\_M and Q5\_K\_M significantly but has diminishing returns at Q8\_0.
Common mistake: forgetting that the imatrix is tied to the specific FP16 source model. An imatrix generated for Llama-3-70B cannot be reused on a different model or fine-tune whose layer shapes differ.
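Since the calibration text should come from the target domain, one way to build it is to concatenate source files from the codebase the model will work on. The sketch below assumes Python sources and a plain-text calibration file. The function name `build_calibration` is hypothetical, and how much text is enough is not something this report specifies.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: concatenate every *.py file under a directory into one
# calibration text file suitable for the -f argument of llama-imatrix.
# Usage: build_calibration <source-dir> <output-file>
build_calibration() {
  local src_dir="$1" out_file="$2"

  # Start from an empty file so reruns do not append duplicates.
  : > "$out_file"

  # Sort the paths for a reproducible order; NUL separators keep unusual file names intact.
  find "$src_dir" -type f -name '*.py' -print0 | sort -z |
    while IFS= read -r -d '' f; do
      cat "$f" >> "$out_file"
      printf '\n' >> "$out_file"   # separate files with a newline
    done
}
```

For example, `build_calibration ./src code_calibration.txt` would produce the file named in the workaround. Write the output outside the scanned directory, or at least with a non-`.py` extension, so it is not picked up on a later run.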
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T13:28:45.632816+00:00— report_created — created