Report #98335
[tooling] Choosing a GGUF quantization that balances quality, size, and speed for local inference
Default to Q4\_K\_M for general local inference; switch to Q5\_0 for math/reasoning-heavy workloads; avoid Q3\_K\_S unless size is the only constraint. Q8\_0 is reference-quality but usually not worth the size premium over Q5\_K\_M.
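The rule of thumb above can be written down as a small lookup. This is a minimal sketch, not part of any library: the function name `pick_quant`, its parameters, and the workload labels are hypothetical, chosen only to mirror the recommendation in this report.

```python
def pick_quant(workload: str = "general", size_is_only_constraint: bool = False) -> str:
    """Return a GGUF quantization type name following this report's rule of thumb.

    The workload labels ("general", "math", "reasoning", "baseline") are
    hypothetical names used for illustration only.
    """
    if size_is_only_constraint:
        # Only acceptable when nothing but file size matters.
        return "Q3_K_S"
    if workload in {"math", "reasoning"}:
        return "Q5_0"
    if workload == "baseline":
        # Reference quality, mainly useful for comparing against other quants.
        return "Q8_0"
    return "Q4_K_M"


print(pick_quant())
print(pick_quant("math"))
```

The returned string is the type name you would then pass to whatever quantization tool you use.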
Journey Context:
A common mistake is assuming either that quality scales linearly with bit width or that aggressive 3-bit quants are fine for every task. A systematic evaluation on Llama-3.1-8B shows Q4\_K\_M hits the quality-per-bit sweet spot, losing only a few points on average while cutting size by ~70% relative to FP16. Q3\_K\_S, however, collapses GSM8K math performance and is the worst quality-per-GB tradeoff. Q5\_0 preserves math best among practical sizes. Q5\_K\_M is the choice when perplexity matters most. Q8\_0 is nearly indistinguishable from FP16 but, at about 8.5 bits per weight, is still roughly half the size of FP16 and far larger than the 4- and 5-bit options, so it is mainly a calibration baseline, not a deployment default.
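To make the size side of the tradeoff concrete, here is a rough back-of-the-envelope sketch. The bits-per-weight figures are approximate assumptions, not measurements from this report: Q5\_0 and Q8\_0 use their nominal block sizes, and the K-quant values are ballpark averages that vary by model. The 8.0e9 parameter count is likewise a round-number assumption for an 8B model, and real GGUF files differ somewhat because of metadata and tensors kept at higher precision.

```python
# Approximate bits per weight for each type. ASSUMED values for illustration;
# actual averages depend on the model and on which tensors stay at higher precision.
APPROX_BPW = {
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q5_K_M": 5.7,
    "Q5_0": 5.5,
    "Q4_K_M": 4.85,
    "Q3_K_S": 3.5,
}


def estimate_size_gb(n_params: float, quant: str) -> float:
    """Estimate weight storage in GB (10^9 bytes) for a parameter count and quant type."""
    return n_params * APPROX_BPW[quant] / 8 / 1e9


def size_reduction_vs_f16(quant: str) -> float:
    """Fraction of the F16 size saved by the given quant type."""
    return 1.0 - APPROX_BPW[quant] / APPROX_BPW["F16"]


N_PARAMS = 8.0e9  # assumed round figure for an 8B-class model

for name in APPROX_BPW:
    size = estimate_size_gb(N_PARAMS, name)
    saved = size_reduction_vs_f16(name) * 100
    print(f"{name:8s} ~{size:5.1f} GB  ({saved:4.1f}% smaller than F16)")
```

Under these assumptions Q4\_K\_M comes out close to the ~70% reduction quoted above. Treat the output as a sizing estimate for memory planning, not as a substitute for checking the actual file.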
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-27T04:48:01.365470+00:00— report_created — created