Report #62651
[tooling] Quantized 70B models produce gibberish or catastrophic forgetting with Q4\_0 while Q8\_0 exceeds VRAM limits
Use a K-quant format \(Q4\_K\_M or Q5\_K\_M\) quantized with an imatrix \(importance matrix\) computed from representative calibration data, so that the weights that matter most for the model's activations, notably in the attention layers, are rounded more accurately.
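As a rough check on why the 4-bit and 5-bit formats fit where Q8\_0 does not, the sketch below estimates weight storage for a 70B model from bits per weight. The block sizes in the comments reflect my understanding of the ggml block layouts and should be treated as assumptions. `weight_gigabytes` and `estimate` are hypothetical helper names, and the figures ignore the KV cache and runtime buffers.

```python
# Bits per weight derived from (assumed) ggml block layouts:
#   Q4_0: 32 weights in 18 bytes    Q8_0: 32 weights in 34 bytes
#   Q4_K: 256 weights in 144 bytes  Q5_K: 176 bytes  Q6_K: 210 bytes
BITS_PER_WEIGHT = {
    "Q4_0": 18 * 8 / 32,
    "Q4_K": 144 * 8 / 256,
    "Q5_K": 176 * 8 / 256,
    "Q6_K": 210 * 8 / 256,
    "Q8_0": 34 * 8 / 32,
}


def weight_gigabytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes), weights only."""
    return n_params * bits_per_weight / 8 / 1e9


def estimate(n_params: float = 70e9) -> dict:
    """Map each format name to its estimated size, rounded to 0.1 GB."""
    return {
        name: round(weight_gigabytes(n_params, bpw), 1)
        for name, bpw in BITS_PER_WEIGHT.items()
    }


if __name__ == "__main__":
    for name, gb in estimate().items():
        print(f"{name}: ~{gb} GB")
```

Q4\_K\_M and Q5\_K\_M keep some tensors at a higher-bit type, so their effective size lands somewhat above the plain Q4\_K and Q5\_K figures.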
Journey Context:
Uniform quantization such as Q4\_0 applies the same 4-bit scheme to every weight tensor, which can badly degrade a 70B model because some tensors, notably in the attention layers, are far more sensitive to rounding error than others. K-quants \(the K-suffixed formats in llama.cpp\) quantize in super-blocks, and the M mixes assign different bit widths to different tensors: Q4\_K\_M keeps part of the attention value and FFN down-projection tensors at Q6\_K and uses Q4\_K for the rest. Standard K-quants can still lose quality on calibration-sensitive models. The imatrix \(importance matrix\) is computed by running calibration data through the model and accumulating squared activations for each input column; the quantizer uses it to weight rounding error, so that accuracy is spent where activations are largest. Reported result: Q4\_K\_M with imatrix comes close to Q6\_K quality at Q4 size, fitting the 70B weights into roughly 40GB of VRAM.
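The idea behind importance weighting can be illustrated with a toy example. The sketch below is not llama.cpp's code. It assumes a simplified symmetric 4-bit quantizer with one scale per block, and all function names are hypothetical. It accumulates mean squared activations per column, then picks the block scale that minimizes importance-weighted error instead of plain squared error.

```python
import random


def accumulate_importance(activation_rows):
    """Mean squared activation per input column over the calibration rows."""
    n_cols = len(activation_rows[0])
    sums = [0.0] * n_cols
    for row in activation_rows:
        for j, x in enumerate(row):
            sums[j] += x * x
    return [s / len(activation_rows) for s in sums]


def weighted_error(weights, importance, scale, qmin=-8, qmax=7):
    """Importance-weighted squared rounding error for a given scale."""
    err = 0.0
    for w, imp in zip(weights, importance):
        q = max(qmin, min(qmax, round(w / scale)))  # clamp to signed 4-bit range
        err += imp * (w - q * scale) ** 2
    return err


def best_scale(weights, importance, steps=200):
    """Grid-search a scale between 0.5x and 1.5x of the max-abs baseline."""
    amax = max(abs(w) for w in weights)
    if amax == 0.0:
        return 1.0
    base = amax / 7
    candidates = [base * (0.5 + i / steps) for i in range(steps + 1)]
    return min(candidates, key=lambda s: weighted_error(weights, importance, s))


def demo(seed=0, n_cols=32, n_rows=64):
    rng = random.Random(seed)
    weights = [rng.gauss(0.0, 1.0) for _ in range(n_cols)]
    # A few columns carry much larger activations than the rest.
    col_scale = [rng.uniform(0.0, 1.0) ** 4 for _ in range(n_cols)]
    rows = [[rng.gauss(0.0, 1.0) * c for c in col_scale] for _ in range(n_rows)]
    importance = accumulate_importance(rows)

    s_plain = best_scale(weights, [1.0] * n_cols)  # ignores activations
    s_imat = best_scale(weights, importance)       # uses the importance vector
    err_plain = weighted_error(weights, importance, s_plain)
    err_imat = weighted_error(weights, importance, s_imat)
    return err_plain, err_imat


if __name__ == "__main__":
    err_plain, err_imat = demo()
    print(f"importance-weighted error, plain scale:   {err_plain:.6f}")
    print(f"importance-weighted error, imatrix scale: {err_imat:.6f}")
```

Because both searches use the same candidate grid, the importance-aware scale can never do worse on the weighted error than the plain one; how large the gap is depends on how uneven the activations are. In llama.cpp the corresponding steps are, as far as I know, running `llama-imatrix` on a calibration text file and passing its output to `llama-quantize` with `--imatrix`; check the current tool help before relying on exact flags.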
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T11:38:28.672836+00:00— report_created — created