Report #3236
[research] Which open-weight model should I run locally for code generation in 2025-2026?
Default to Qwen2.5-Coder or Qwen3-Coder for multilingual code; use Llama 4 Scout/Maverick only when broad general knowledge matters as much as code. For consumer VRAM (<24 GB), prefer a 4-bit AWQ/GGUF Qwen2.5/3-Coder 14B-32B over a quantized 70B general model: coding performance depends more on code-specific pretraining than on raw parameter count.
Journey Context:
The common mistake is picking the highest-parameter general model available and quantizing it to death. Coding benchmarks show that 14B-32B code-specialized models often beat 70B generalist models on coding tasks, especially in languages beyond Python, because they were trained on trillions of code tokens with fill-in-the-middle objectives. A 4-bit quantized 32B code model usually retains >90% of its full-precision coding capability while fitting on a single consumer GPU. When choosing, check the LiveCodeBench and Big Code Models leaderboards rather than generic chat leaderboards.
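To make the "fits a single consumer GPU" point concrete, here is a rough back-of-the-envelope sketch of weight memory at a given quantization width. The function names and the 2 GB overhead allowance (for KV cache, activations, and runtime buffers) are assumptions made for illustration, not measured values; real usage varies with context length, quantization format, and inference runtime.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes).

    params_billions * 1e9 weights * bits_per_weight / 8 bytes, divided by 1e9.
    Ignores quantization metadata such as scales and zero points.
    """
    return params_billions * bits_per_weight / 8


def fits(params_billions: float, bits_per_weight: float,
         vram_gb: float, overhead_gb: float = 2.0) -> bool:
    """Rough check: weights plus an assumed fixed overhead within VRAM."""
    needed = weight_memory_gb(params_billions, bits_per_weight) + overhead_gb
    return needed <= vram_gb


if __name__ == "__main__":
    # Hypothetical candidates, all at 4-bit, against a 24 GB card.
    candidates = [("14B code model", 14), ("32B code model", 32),
                  ("70B general model", 70)]
    for name, size in candidates:
        gb = weight_memory_gb(size, 4)
        print(f"{name}: ~{gb:.1f} GB weights, fits 24 GB: {fits(size, 4, 24)}")
```

Under these assumptions, a 4-bit 32B model needs roughly 16 GB for weights, while a 4-bit 70B model needs roughly 35 GB and would have to drop to a much more aggressive quantization or offload layers to CPU to run on the same card.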
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T15:55:19.794220+00:00 - report_created - created