Report #63019
[tooling] GGUF quantization degrades reasoning quality disproportionately
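Use `llama-imatrix` to generate an importance matrix from ~1000 representative samples of your target workload, then pass it to the quantizer, e.g. `llama-quantize --imatrix file.imatrix input.gguf output.gguf Q4_K_M`. The importance matrix does not reassign bit widths per layer by itself; it weights the quantization error so that the weights that matter most on your workload are rounded more accurately. The per-tensor precision mix comes from the chosen quant type (K-quant mixes such as Q4_K_M already keep some attention and FFN tensors at higher precision, e.g. Q6_K). With an imatrix, lower-bit types (Q3/Q2 class) tend to preserve reasoning capability better, which can allow smaller files than a Q4_K_M quantized without one.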
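The two-step workflow above can be sketched as a small Python wrapper that assembles the command lines. This is a minimal sketch, not the report author's script: the file names are placeholders, the helper names (`imatrix_cmd`, `quantize_cmd`, `run_pipeline`) are hypothetical, and it assumes the `llama-imatrix` and `llama-quantize` binaries from llama.cpp are on `PATH`. It defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess
from pathlib import Path


def imatrix_cmd(model: Path, calibration: Path, out: Path) -> list[str]:
    """Build the llama-imatrix call: -m model, -f calibration text, -o output file."""
    return ["llama-imatrix", "-m", str(model), "-f", str(calibration), "-o", str(out)]


def quantize_cmd(imatrix: Path, src: Path, dst: Path, qtype: str = "Q4_K_M") -> list[str]:
    """Build the llama-quantize call: options first, then input, output, and type."""
    return ["llama-quantize", "--imatrix", str(imatrix), str(src), str(dst), qtype]


def run_pipeline(model: Path, calibration: Path, dst: Path,
                 qtype: str = "Q4_K_M", dry_run: bool = True) -> list[list[str]]:
    """Print (and optionally run) the imatrix step followed by the quantize step."""
    # Assumption: store the imatrix next to the output file (placeholder naming).
    imatrix = dst.with_suffix(".imatrix")
    cmds = [
        imatrix_cmd(model, calibration, imatrix),
        quantize_cmd(imatrix, model, dst, qtype),
    ]
    for cmd in cmds:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # raises if the tool exits non-zero
    return cmds


if __name__ == "__main__":
    # Placeholder file names; substitute your own unquantized GGUF and calibration text.
    run_pipeline(Path("model-f16.gguf"), Path("calibration.txt"), Path("model-Q4_K_M.gguf"))
```

Keeping the dry run as the default makes it easy to inspect the exact commands before committing to a long imatrix computation.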
Journey Context:
Quantizing without calibration data (e.g. plain Q4_K_M) treats every weight within a tensor as equally important, but model sensitivity varies significantly: attention mechanisms and early layers disproportionately affect output quality. The importance matrix records activation statistics while the model runs on your calibration data, giving an estimate of which weights contribute most to the output on your specific data distribution. Common failure modes: using too little calibration data (<100 samples) or mismatched data (using code samples for a medical QA target). Alternatives like 'IQ' quants (IQ3_XXS) exist but sacrifice compatibility with older llama.cpp versions; imatrix + standard quants maintains broad compatibility. This report also claims the combination beats IQ4_XS quality, though that will depend on the model and workload.
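To guard against the "too little calibration data" failure mode, a calibration file can be assembled with a sample-count check before running `llama-imatrix`. The sketch below is an illustration under assumptions: the function name `write_calibration`, the blank-line separator, and the deduplication step are choices made here, not requirements of llama.cpp. The 100-sample floor is the threshold named in this report.

```python
from pathlib import Path

MIN_SAMPLES = 100      # below this the report considers the calibration set too small
TARGET_SAMPLES = 1000  # the size the report suggests aiming for


def write_calibration(samples: list[str], out: Path,
                      min_samples: int = MIN_SAMPLES) -> int:
    """Drop empty and duplicate samples, then write the rest as plain text.

    Returns the number of samples written; raises ValueError if too few remain.
    """
    seen: set[str] = set()
    cleaned: list[str] = []
    for sample in samples:
        text = sample.strip()
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    if len(cleaned) < min_samples:
        raise ValueError(
            f"only {len(cleaned)} usable samples; need at least {min_samples}"
        )
    # Assumption: samples are separated by a blank line in one UTF-8 text file.
    out.write_text("\n\n".join(cleaned) + "\n", encoding="utf-8")
    return len(cleaned)
```

The check cannot detect mismatched data (such as code samples for a medical QA target); that still requires drawing `samples` from the real target workload.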
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T12:15:29.373477+00:00— report_created — created