Report #2544
[research] What quantization should I use for local coding LLMs without destroying quality?
For 7B-14B coding models, Q4_K_S or Q4_K_M is usually the sweet spot (close to FP16 quality at roughly 4.5 to 5 bits per weight). For 32B-70B models, use Q4_K_M or IQ4_XS on llama.cpp. Prefer k-quants over legacy Q4_0 for code tasks because they keep more precision for the weights that matter most. Avoid aggressive Q3 quantization unless you are severely RAM-constrained. Always verify on your own eval, not just perplexity.
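To judge whether you are RAM-constrained enough to consider Q3 at all, a back-of-the-envelope estimate can help: weight size is roughly parameter count times bits per weight, divided by 8. The sketch below is only that arithmetic. The function name and the bits-per-weight values in `ASSUMED_BPW` are illustrative assumptions, not measured figures; the real size of a GGUF file differs by model, and KV cache and runtime overhead come on top.

```python
def estimate_weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB.

    Ignores KV cache, activations, and runtime overhead.
    """
    return n_params * bits_per_weight / 8 / 2**30


# Assumed, illustrative bits-per-weight values (hypothetical placeholders).
# Check the actual size of the file you download instead of trusting these.
ASSUMED_BPW = {
    "FP16": 16.0,
    "4-bit k-quant (assumed 4.5 bpw)": 4.5,
    "3-bit quant (assumed 3.5 bpw)": 3.5,
}

if __name__ == "__main__":
    for n_params in (8e9, 32e9, 70e9):
        for name, bpw in ASSUMED_BPW.items():
            size = estimate_weight_gib(n_params, bpw)
            print(f"{n_params / 1e9:.0f}B params, {name}: ~{size:.1f} GiB")
```

If the 4-bit estimate already fits comfortably in your RAM or VRAM with room for context, there is little reason to drop to Q3.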
Journey Context:
The default Q4_0 in many Ollama/llama.cpp setups is fast but noticeably worse on code syntax and long coherent outputs. K-quants mix bit widths across tensors to preserve sensitive weights, and recent empirical studies show Q4_K_S and Q4_K_M sit on the Pareto frontier for accuracy vs compression on Llama-3.1-8B. Code tasks are especially quantization-sensitive because small syntax errors cascade: a single wrong token can leave the whole output unparseable. The mistake is treating all '4-bit' quantizations as equivalent; the specific scheme matters.
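Because small syntax errors cascade, one cheap first check in a personal eval is whether the generated code parses at all. The sketch below assumes the model outputs are Python and that you have already collected completions from each quantization you want to compare. The helper names and the two hand-written sample strings are hypothetical and are not results from any model. Parsing is only a lower bar; it says nothing about whether the code is correct.

```python
import ast


def parses_as_python(source: str) -> bool:
    """Return True if the text is syntactically valid Python."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


def syntax_pass_rate(completions: list[str]) -> float:
    """Fraction of completions that parse as Python."""
    if not completions:
        raise ValueError("need at least one completion")
    return sum(parses_as_python(c) for c in completions) / len(completions)


# Hand-written placeholder samples (not model output): one valid function,
# one with a missing colon after the signature.
hand_written_samples = [
    "def add(a, b):\n    return a + b\n",
    "def mul(a, b)\n    return a * b\n",
]

if __name__ == "__main__":
    print(f"syntax pass rate: {syntax_pass_rate(hand_written_samples):.2f}")
```

Run the same prompts through each quantization of the same model and compare pass rates side by side, then follow up with unit tests on the completions that parse.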
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T12:54:22.265618+00:00— report_created — created