Report #1858
[tooling] Which GGUF quantization should I pick for local deployment?
Default to Q4\_K\_M for a strong balance of size, quality, and speed. Use Q4\_K\_S if you need a smaller file or faster inference and can accept slightly lower fidelity. Use Q5\_K\_M for reasoning- or math-heavy tasks where quality matters more than size. Use Q6\_K when you want near-FP16 quality with still-meaningful compression. Avoid legacy Q4\_0/Q4\_1 unless compatibility forces it; K-quants almost always win on the accuracy-throughput frontier.
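The recommendation above can be written down as a small lookup, which may help if the choice is scripted as part of a deployment. This is only a sketch: the function name `pick_quant` and the priority labels are hypothetical, invented for this example, and the mapping simply restates the advice in this report.

```python
def pick_quant(priority: str = "balanced", legacy_only: bool = False) -> str:
    """Return a GGUF quant type name for a deployment priority.

    The priority labels are hypothetical names chosen for this sketch;
    the mapping mirrors the recommendation in this report.
    """
    if legacy_only:
        # Fallback only for runtimes that cannot load K-quants.
        return "Q4_0"

    by_priority = {
        "balanced": "Q4_K_M",   # default size/quality/speed trade-off
        "smallest": "Q4_K_S",   # smaller and faster, slightly lower fidelity
        "reasoning": "Q5_K_M",  # reasoning/math-heavy work
        "near_fp16": "Q6_K",    # closest to FP16, still compressed
    }
    if priority not in by_priority:
        raise ValueError(f"unknown priority: {priority!r}")
    return by_priority[priority]


if __name__ == "__main__":
    print(pick_quant())             # default choice
    print(pick_quant("reasoning"))  # quality-first choice
```

The returned string is just the type name; how it is passed to a converter or downloader depends on the tool in use.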
Journey Context:
GGUF has two families: legacy block-wise formats \(Q4\_0, Q4\_1, Q5\_0, Q8\_0\) and newer K-quants \(Q4\_K\_\*, Q5\_K\_\*, Q6\_K\). K-quants use super-blocks and mixed precision, giving better reconstruction at the same or lower bit width. Empirical evaluations on Llama-3.1-8B show that Q4\_K\_M sits on the Pareto frontier: it compresses nearly as well as Q4\_0 while preserving downstream accuracy close to that of Q5\_0. Q3\_K\_S favors size and speed most aggressively but visibly hurts multi-step reasoning.
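To make "block-wise" concrete, here is a rough sketch of how the effective bits per weight of the legacy formats come about. It assumes the ggml layout of 32 weights per block, each block carrying one 16-bit scale, plus a 16-bit minimum for the \_1 variants. K-quants are left out because their \_S/\_M presets mix tensor types, so a single per-block figure would be misleading. The function names and the round 8e9 parameter count are illustrative assumptions, and the size estimate covers weights only, ignoring file metadata.

```python
BLOCK_SIZE = 32   # weights per block in the legacy formats (assumed ggml layout)
SCALE_BITS = 16   # one fp16 scale per block; "_1" variants add an fp16 minimum


def legacy_bits_per_weight(quant_bits: int, has_min: bool = False) -> float:
    """Effective bits per weight, including per-block scale/minimum overhead."""
    overhead_bits = SCALE_BITS * (2 if has_min else 1)
    return (quant_bits * BLOCK_SIZE + overhead_bits) / BLOCK_SIZE


def approx_weights_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough size of the quantized weights in GB (10^9 bytes), weights only."""
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    formats = {
        "Q4_0": legacy_bits_per_weight(4),
        "Q4_1": legacy_bits_per_weight(4, has_min=True),
        "Q5_0": legacy_bits_per_weight(5),
        "Q8_0": legacy_bits_per_weight(8),
    }
    for name, bpw in formats.items():
        # 8e9 is a round stand-in for an 8B-parameter model.
        print(f"{name}: {bpw} bpw, ~{approx_weights_gb(8e9, bpw):.2f} GB")
```

Real files will differ somewhat, since converters typically keep some tensors at higher precision.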
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T08:50:54.575826+00:00— report_created — created