Report #6550
[tooling] GGUF quantization of 70B+ models causes significant perplexity degradation or incoherent output
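Generate an importance matrix by running `llama-imatrix` on a calibration dataset, then pass it to `llama-quantize` with `--imatrix file.dat` so that quantization error is minimized on the weights that matter most to the model's activations. The importance matrix does not by itself raise the bit depth of any tensor; to keep a specific tensor such as the output layer at higher precision, set its type explicitly (for example with `--output-tensor-type`).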
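The two-step workflow above can be sketched as a small Python helper that only assembles the command lines and prints them for review, without executing anything. The file names (`model-f16.gguf`, `calibration.txt`, `imatrix.dat`, `model-Q4_K_M.gguf`) are hypothetical placeholders, and the `-m`, `-f`, and `-o` options for `llama-imatrix` are assumptions to confirm against `llama-imatrix --help` for your build.

```python
import shlex


def build_imatrix_commands(model_f16, calibration_txt, imatrix_out,
                           quantized_out, qtype="Q4_K_M"):
    """Return the two command lines as argument lists (nothing is run)."""
    # Step 1: collect importance statistics over the calibration text.
    imatrix_cmd = ["llama-imatrix", "-m", model_f16,
                   "-f", calibration_txt, "-o", imatrix_out]
    # Step 2: quantize, passing the importance matrix to the quantizer.
    quantize_cmd = ["llama-quantize", "--imatrix", imatrix_out,
                    model_f16, quantized_out, qtype]
    return imatrix_cmd, quantize_cmd


# Hypothetical file names, for illustration only.
imatrix_cmd, quantize_cmd = build_imatrix_commands(
    "model-f16.gguf", "calibration.txt", "imatrix.dat", "model-Q4_K_M.gguf")

# Print shell-quoted commands so they can be checked before running.
print(shlex.join(imatrix_cmd))
print(shlex.join(quantize_cmd))
```

Printing rather than executing fits the warning on this report: inspect each command, and only then run it by hand or pass the list to `subprocess.run`.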
Journey Context:
K-quant presets such as Q4_K_M already mix bit depths by tensor role, but within each tensor the default quantizer treats every weight as equally important. In practice some weights matter far more than others because they multiply large activations, and some tensors (the output tensor, attention query/key projections) are more sensitive to precision loss. `llama-imatrix` runs the model over calibration text and records activation statistics for each weight tensor. When `--imatrix` is passed, `llama-quantize` uses those statistics to weight its rounding error, so the weights that most affect the output are reproduced more accurately; the quantization type of each tensor is still set by the chosen preset and any explicit type overrides. At the same file size this typically yields lower perplexity than quantizing without an importance matrix, which can help prevent incoherent output in 70B+ models where, per this report, standard Q4_K_M can fail.
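To illustrate the idea of importance-weighted quantization, here is a toy sketch in plain Python. It is not llama.cpp's actual algorithm: it quantizes a single block of weights to symmetric integers and searches a handful of candidate scales, choosing the one with the lowest importance-weighted squared error. The function names, weight values, and importance values are invented for the example.

```python
def weighted_error(weights, importance, scale, q):
    """Importance-weighted squared reconstruction error."""
    return sum(a * (w - scale * qi) ** 2
               for a, w, qi in zip(importance, weights, q))


def quantize_block(weights, importance, levels=7, candidates=64):
    """Pick the scale minimizing the importance-weighted error.

    levels is the largest integer magnitude (7 ~ symmetric 4-bit).
    Returns (scale, quantized_integers).
    """
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return 0.0, [0] * len(weights)
    best = None
    for k in range(candidates):
        # Try scales from 0.7x to 1.3x of the naive max-based scale.
        scale = (max_abs / levels) * (0.7 + 0.6 * k / (candidates - 1))
        q = [max(-levels, min(levels, round(w / scale))) for w in weights]
        err = weighted_error(weights, importance, scale, q)
        if best is None or err < best[0]:
            best = (err, scale, q)
    return best[1], best[2]


# Invented data: a few weights see much larger activations than others.
weights = [0.9, -0.31, 0.12, 0.05, -0.77, 0.44, 0.02, -0.58]
importance = [0.01, 4.0, 0.02, 3.0, 0.01, 0.05, 2.5, 0.02]
uniform = [1.0] * len(weights)

s_u, q_u = quantize_block(weights, uniform)      # no importance matrix
s_w, q_w = quantize_block(weights, importance)   # with importance matrix

err_u = weighted_error(weights, importance, s_u, q_u)
err_w = weighted_error(weights, importance, s_w, q_w)
print("weighted error without importance:", err_u)
print("weighted error with importance:   ", err_w)
```

Both runs use the same number of bits per weight. The importance-aware run only changes which rounding errors are tolerated, so its error on the heavily used weights can never exceed that of the uniform run over the same candidate scales.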
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T00:20:21.562850+00:00— report_created — created