Report #57349
[tooling] 4-bit quantized 70B model produces gibberish or severe quality degradation compared to FP16
Calculate an importance matrix (imatrix) using representative calibration data on the FP16 base model before quantization, then pass the resulting .dat file to llama-quantize via --imatrix.
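The two steps in this fix can be sketched as a small Python wrapper that builds the command lines. This is a minimal sketch under assumptions, not a verified recipe: the binary names `llama-imatrix` and `llama-quantize` match recent llama.cpp builds (older builds ship them as `imatrix` and `quantize`), and every file path below is a hypothetical placeholder. The helper names `imatrix_command`, `quantize_command`, and `run_pipeline` are illustrative only.

```python
import shutil
import subprocess


def imatrix_command(model_f16, calibration_txt, imatrix_out, binary="llama-imatrix"):
    """Build the command that computes the importance matrix from the FP16 model."""
    # -m: input model, -f: calibration text file, -o: output imatrix file
    return [binary, "-m", model_f16, "-f", calibration_txt, "-o", imatrix_out]


def quantize_command(model_f16, model_out, imatrix_path,
                     quant_type="Q4_K_M", binary="llama-quantize"):
    """Build the command that quantizes the FP16 model using the imatrix."""
    return [binary, "--imatrix", imatrix_path, model_f16, model_out, quant_type]


def run_pipeline(model_f16, calibration_txt, imatrix_out, model_out):
    """Run both steps in order, stopping at the first failure."""
    commands = [
        imatrix_command(model_f16, calibration_txt, imatrix_out),
        quantize_command(model_f16, model_out, imatrix_out),
    ]
    for cmd in commands:
        if shutil.which(cmd[0]) is None:
            raise FileNotFoundError(f"{cmd[0]} not found on PATH")
        subprocess.run(cmd, check=True)


# Hypothetical paths, printed rather than executed so the commands can be reviewed first.
print(" ".join(imatrix_command("model-f16.gguf", "calibration.txt", "imatrix.dat")))
print(" ".join(quantize_command("model-f16.gguf", "model-Q4_K_M.gguf", "imatrix.dat")))
```

Printing the commands before calling `run_pipeline` fits the warning in this report: inspect what will run before running it.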
Journey Context:
Default K-quant methods (such as Q4_K_M) treat all weights as equally important, which can cause critical attention layers to collapse under aggressive quantization. The imatrix estimates per-row importance from actual activation data in your specific domain (for example, code vs chat) and shifts quantization error toward the less important rows. Common mistakes: using too few calibration tokens (<1000), or using a generic dataset when your use case is specialized. Without an imatrix, 3-bit and 4-bit quants often fail on reasoning tasks; with one, Q4_K_M approaches FP16 quality.
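The idea of shifting quantization error toward less important weights can be illustrated with a toy example. This is a simplified sketch and not llama.cpp's actual K-quant algorithm: it quantizes a single row onto a signed 4-bit grid and picks the scale by grid search, once with uniform importance and once with a hypothetical importance vector in which a few weights matter far more. All values and helper names are made up for illustration.

```python
import random


def quantize_row(weights, scale, levels=8):
    """Round each weight to an integer in [-levels, levels - 1], then dequantize."""
    out = []
    for w in weights:
        q = max(-levels, min(levels - 1, round(w / scale)))
        out.append(q * scale)
    return out


def weighted_error(weights, dequantized, importance):
    """Importance-weighted squared reconstruction error."""
    return sum(i * (w - d) ** 2 for w, d, i in zip(weights, dequantized, importance))


def best_scale(weights, importance, candidates):
    """Pick the candidate scale that minimizes the importance-weighted error."""
    return min(
        candidates,
        key=lambda s: weighted_error(weights, quantize_row(weights, s), importance),
    )


random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(32)]
# Hypothetical importance: every 8th weight is assumed to see much larger activations.
importance = [100.0 if k % 8 == 0 else 1.0 for k in range(32)]
uniform = [1.0] * len(weights)

max_abs = max(abs(w) for w in weights)
# Candidate scales from 0.5x to 1.5x of the naive max-abs scale.
candidates = [max_abs / 8 * (0.5 + 0.05 * k) for k in range(21)]

scale_uniform = best_scale(weights, uniform, candidates)
scale_imatrix = best_scale(weights, importance, candidates)

# Both scales are judged by the error that matters: the importance-weighted one.
err_uniform = weighted_error(weights, quantize_row(weights, scale_uniform), importance)
err_imatrix = weighted_error(weights, quantize_row(weights, scale_imatrix), importance)
print(f"weighted error, uniform scale choice:    {err_uniform:.4f}")
print(f"weighted error, importance-aware choice: {err_imatrix:.4f}")
```

Because both searches use the same candidate scales, the importance-aware choice can never have a higher weighted error than the uniform one. How much it helps depends on how uneven the importance values are, which is why calibration data that matches the target domain matters.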
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T02:44:50.732362+00:00— report_created — created