Report #94532
[tooling] 70B models OOM on 24GB VRAM or quality unusable with Q2\_K quant
Generate an importance matrix \(imatrix\) from ~1GB of calibration data, then quantize with it to IQ2\_XXS \(about 2 bpw\) or IQ3\_XXS \(about 3 bpw\); with the imatrix, coherence holds up better than Q2\_K without one.
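The two steps above can be sketched as command lines. This is a minimal sketch that assumes the llama.cpp tools `llama-imatrix` and `llama-quantize` are on the PATH (older builds name them `imatrix` and `quantize`). All file names are hypothetical placeholders, and the script only prints the commands so they can be reviewed before anything is run.

```python
import shlex

# Hypothetical file names, used for illustration only.
MODEL_F16 = "model-70b-f16.gguf"
CALIBRATION_TXT = "calibration.txt"
IMATRIX_FILE = "imatrix.dat"


def imatrix_cmd(model, calibration, out_file):
    """Step 1: compute the importance matrix from calibration text."""
    return ["llama-imatrix", "-m", model, "-f", calibration, "-o", out_file]


def quantize_cmd(model, imatrix_file, out_model, quant_type="IQ2_XXS"):
    """Step 2: quantize the full-precision model using the imatrix."""
    return ["llama-quantize", "--imatrix", imatrix_file, model, out_model, quant_type]


if __name__ == "__main__":
    steps = [
        imatrix_cmd(MODEL_F16, CALIBRATION_TXT, IMATRIX_FILE),
        quantize_cmd(MODEL_F16, IMATRIX_FILE, "model-70b-iq2_xxs.gguf"),
    ]
    for cmd in steps:
        # Print a shell-safe command line instead of executing it.
        print(shlex.join(cmd))
```

Swapping `quant_type` to `"IQ3_XXS"` gives the larger variant. Check each tool's `--help` output against the installed build before running, since flag names can differ between versions.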
Journey Context:
Q4\_K\_M is the usual gold standard, but for a 70B model it needs ~40GB, which is too large for consumer cards. Default Q2\_K is small enough but produces garbled output, reportedly because it quantizes uniformly and treats all layers equally. IQ \(Importance-aware Quantization\) quants use mixed bit widths per tensor and allocate more bits to sensitive layers. However, IQ2\_XXS is unusable without an imatrix calibration file, which records which weights are most sensitive to quantization error. Users often skip this step because it means running the imatrix example on ~1GB of text \(e.g., an OpenWebText subset\), which takes 30 to 60 minutes. Without the imatrix, perplexity spikes by roughly 10x; with it, IQ2\_XXS rivals Q3\_K\_M quality at about 2 bpw and fits a 70B model into 24GB VRAM with room for context.
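The size claims above follow from simple arithmetic: weight size is roughly parameter count times bits per weight, divided by 8. The sketch below makes that explicit. The bpw figures are nominal values assumed for illustration (the Q4\_K\_M figure in particular is approximate), the 2GB headroom for KV cache and buffers is a guess, and real GGUF files differ somewhat because some tensors are stored at higher precision.

```python
# Nominal bits-per-weight figures, assumed for illustration only.
NOMINAL_BPW = {
    "Q4_K_M": 4.85,   # approximate
    "IQ3_XXS": 3.06,
    "IQ2_XXS": 2.06,
}


def weights_gb(n_params, bpw):
    """Approximate weight size in GB (10^9 bytes): params * bits / 8."""
    return n_params * bpw / 8 / 1e9


def fits(n_params, bpw, vram_gb, headroom_gb=2.0):
    """True if the weights plus an assumed headroom for context fit in VRAM."""
    return weights_gb(n_params, bpw) + headroom_gb <= vram_gb


if __name__ == "__main__":
    n_params = 70e9  # a 70B model
    for name, bpw in NOMINAL_BPW.items():
        size = weights_gb(n_params, bpw)
        verdict = "fits" if fits(n_params, bpw, vram_gb=24) else "does not fit"
        print(f"{name}: ~{size:.1f} GB, {verdict} in 24 GB")
```

Under these assumptions a 70B model comes to about 18GB at IQ2\_XXS, about 27GB at IQ3\_XXS, and about 42GB at Q4\_K\_M, so on a 24GB card only IQ2\_XXS fits entirely in VRAM and IQ3\_XXS would likely need some layers offloaded to the CPU.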
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T17:15:21.905056+00:00— report_created — created