Report #805
[tooling] Picking the right GGUF quantization for quality vs speed on llama.cpp
Default to Q4\_K\_M for general use. Move up to Q5\_K\_M when you see degradation on code, math, or reasoning. Use IQ4\_XS for maximum compression if a build of it exists for your model, your backend runs it at acceptable speed, and you can tolerate slight quality loss. Avoid Q4\_0/Q4\_1 for reasoning tasks. Prefer quants built with an importance matrix \(imatrix\) when available, especially if the calibration data matches your domain.
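The rule of thumb above can be written down as a small lookup. This is only a sketch of the recommendation in this report: the function name `pick_quant` and its parameters are hypothetical and are not part of llama.cpp.

```python
def pick_quant(task: str = "general",
               degraded: bool = False,
               max_compression: bool = False) -> str:
    """Return a GGUF quant type name following this report's rule of thumb.

    task            -- hypothetical label: "general", "code", "math", "reasoning"
    degraded        -- True if you observed quality loss at Q4_K_M
    max_compression -- True if file size matters more than a slight quality loss
    """
    sensitive_tasks = {"code", "math", "reasoning"}

    # Smallest of the recommended options; check that your backend supports it.
    if max_compression:
        return "IQ4_XS"

    # Step up one level only when a sensitive workload actually degrades.
    if task in sensitive_tasks and degraded:
        return "Q5_K_M"

    # General-purpose default. Legacy Q4_0/Q4_1 are never returned.
    return "Q4_K_M"


if __name__ == "__main__":
    print(pick_quant())                          # general use
    print(pick_quant("math", degraded=True))     # degraded reasoning-style task
    print(pick_quant(max_compression=True))      # size-constrained
```

The function deliberately returns Q5\_K\_M only after degradation is observed, which mirrors the advice to start at Q4\_K\_M and step up rather than defaulting to the larger file.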
Journey Context:
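Not all Q4s are equal. K-quants group weights into super-blocks that are split into smaller sub-blocks, each with its own quantized scale, and the \_M mixes keep some of the more sensitive tensors at higher precision; this gives better quality than legacy Q4\_0/Q4\_1 at a similar size. IQ quants \(I-quants\) are a separate family based on non-linear, lookup-table quantization, and they often beat K-quants at the same bit width, though they can run slower on some backends. The importance matrix \(imatrix\) is a distinct feature: it is computed from calibration data and can guide both K-quants and I-quants, so an imatrix built from representative prompts is a good fit for domain-specific workloads. People download the smallest 'Q4' blindly and get bad results on reasoning. Always check the model card for recommended quants rather than defaulting to the smallest file.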
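To make "not all Q4s are equal" concrete, the sketch below computes bits per weight from the block size of each tensor type. The byte and weight counts are assumptions taken from the ggml block layouts as commonly documented, so verify them against your llama.cpp version. The helper names `bits_per_weight` and `estimate_gib` are hypothetical.

```python
# (bytes per block, weights per block) for a few ggml tensor types.
# Assumed values; check ggml's block definitions for your llama.cpp version.
BLOCK_LAYOUT = {
    "Q4_0":   (18, 32),    # legacy: one fp16 scale + 32 4-bit weights
    "Q4_1":   (20, 32),    # legacy: fp16 scale + fp16 min + 32 4-bit weights
    "IQ4_XS": (136, 256),  # I-quant, 256-weight super-block
    "Q4_K":   (144, 256),  # K-quant, 256-weight super-block
    "Q5_K":   (176, 256),
    "Q6_K":   (210, 256),
}


def bits_per_weight(qtype: str) -> float:
    """Storage cost of one weight, including per-block scales."""
    block_bytes, block_weights = BLOCK_LAYOUT[qtype]
    return block_bytes * 8 / block_weights


def estimate_gib(n_params: float, qtype: str) -> float:
    """Rough tensor size in GiB if every weight used a single type."""
    return n_params * bits_per_weight(qtype) / 8 / 2**30


if __name__ == "__main__":
    for name in BLOCK_LAYOUT:
        print(f"{name:7s} {bits_per_weight(name):.4f} bpw  "
              f"~{estimate_gib(7e9, name):.2f} GiB for 7e9 weights")
```

Under these assumed layouts, Q4\_0 and Q4\_K cost the same number of bits per weight, so the quality gap between them comes from how the bits are spent rather than from file size. Real files such as Q4\_K\_M mix several tensor types, so their actual size will be somewhat larger than the single-type estimate.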
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T13:51:37.181476+00:00— report_created — created