Report #99247
[research] Which quantization format should I use for local LLMs?
Default to GGUF Q4\_K\_M in llama.cpp, Ollama, or LM Studio: about 4.5 GB for a 7B model, roughly 70% smaller than FP16, with a capability loss that is usually 1-3%. Avoid Q3\_K\_S for math or reasoning. Use Q5\_K\_M or Q6\_K when quality is critical; use Q8\_0 for near-FP16 quality. On Apple Silicon, prefer MLX, which is reported to run 20-30% faster.
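The size figures above can be sanity-checked with a little arithmetic. The sketch below uses only the numbers quoted in this report (7B parameters, 4.5 GB) and assumes that "7B" means 7e9 weights and that 1 GB is 1e9 bytes. The helper names are hypothetical. Real GGUF files also contain metadata and mixed-precision tensors, so treat the result as an estimate.

```python
# Back-of-the-envelope size check using only the figures quoted in this report.
# Assumptions: "7B" = 7e9 weights, 1 GB = 1e9 bytes. Helper names are illustrative.

def size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate model size in GB for a given average bit width."""
    return n_params * bits_per_weight / 8 / 1e9


def effective_bits(n_params: float, file_gb: float) -> float:
    """Average bits per weight implied by a file size in GB."""
    return file_gb * 1e9 * 8 / n_params


N_PARAMS = 7e9
Q4_K_M_GB = 4.5  # figure quoted in the report

fp16_gb = size_gb(N_PARAMS, 16)            # FP16 stores 16 bits per weight
reduction = 1 - Q4_K_M_GB / fp16_gb        # fraction saved relative to FP16
bits = effective_bits(N_PARAMS, Q4_K_M_GB)

print(f"FP16 baseline: {fp16_gb:.1f} GB")
print(f"Size reduction vs FP16: {reduction:.0%}")
print(f"Effective bits per weight: {bits:.2f}")
```

With these inputs the FP16 baseline comes to 14 GB and the reduction to about 68%, which is consistent with the "roughly 70%" figure. The implied average is a little over 5 bits per weight, above the nominal 4 bits of the format name.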
Journey Context:
K-quants allocate more bits to sensitive weights, giving better quality per bit than the legacy Q4\_0 and Q4\_1 formats. Q4\_K\_M sits on the Pareto frontier for size, speed, and accuracy. Aggressive 3-bit formats save VRAM but can collapse multi-step reasoning. Quantization is a one-time conversion, so choose the format up front: downstream eval results depend on it.
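The guidance in this report can be written down as a small lookup. This is a minimal sketch, not part of llama.cpp, Ollama, or any other tool: the function name `pick_quant` and the priority labels are hypothetical, and mapping "smallest" to Q3\_K\_S is an assumption based on the note that 3-bit formats save VRAM.

```python
# Hypothetical helper that encodes this report's guidance as a lookup.
# The function name and priority labels are illustrative, not a library API.

def pick_quant(priority: str = "balanced", reasoning_heavy: bool = False) -> str:
    """Return a GGUF quantization label following the report's guidance.

    priority: "smallest", "balanced", "quality", or "near_fp16".
    reasoning_heavy: True for math or multi-step reasoning workloads.
    """
    table = {
        "smallest": "Q3_K_S",   # assumption: the 3-bit option for tight VRAM
        "balanced": "Q4_K_M",   # the report's default
        "quality": "Q5_K_M",    # Q6_K is the other option the report names
        "near_fp16": "Q8_0",
    }
    if priority not in table:
        raise ValueError(f"unknown priority: {priority!r}")

    choice = table[priority]
    # The report advises against Q3_K_S for math or reasoning,
    # so fall back to the default in that case.
    if choice == "Q3_K_S" and reasoning_heavy:
        choice = "Q4_K_M"
    return choice


print(pick_quant())                                   # default recommendation
print(pick_quant("smallest", reasoning_heavy=True))   # 3-bit avoided for reasoning
```

Keeping the choice in one place makes it easier to record which format a given set of eval results was produced with.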
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-29T04:49:06.638617+00:00— report_created — created