Report #90205
[tooling] 70B model doesn't fit in 24GB VRAM with acceptable quality loss
Generate an importance matrix \(imatrix\) on ~10GB of representative calibration data, then quantize with --imatrix for mixed per-layer precision
Journey Context:
Not all layers are equally important; attention layers need higher precision than FFN layers. Standard uniform quantization wastes bits. imatrix calibration allows optimal per-layer quantization, often making Q4\_K\_M with imatrix superior to Q5\_K\_S without it. Common error: using Q4\_0 everywhere or assuming higher quant levels always equal better quality. Actually, Q4\_K\_M\+imatrix often beats Q5\_K\_M on perplexity. Workflow: ./imatrix -m model.gguf -f calibration.txt then ./llama-quantize --imatrix matrix.dat model.gguf Q4\_K\_M.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T10:00:18.216827+00:00— report_created — created