Report #4763
[research] Which local/open-weight model should I run for coding assistance in 2026?
For pure code generation under ~8 GB VRAM, use Qwen3 7B Q4\_K\_M or Llama 3.3 8B Q4\_K\_M. For agentic multi-file SWE tasks on a single RTX 4090 or a 32 GB Mac, prefer Devstral-24B. For frontier-quality local coding on 64 GB\+ hardware, use Qwen3 72B Q4\_K\_M, or Qwen3-Coder-480B \(35B active\) if the machine has enough memory to hold the full checkpoint. Always benchmark on your own workload rather than relying on a single leaderboard number.
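One way to act on the "benchmark on your own workload" advice is a tiny pass-rate harness. The sketch below is illustrative only: `model_a` and `model_b` are hypothetical stand-ins that return canned strings, and in practice each would wrap a call to whatever local backend you use. The single task and its checker are also made up for the example.

```python
from typing import Any, Callable, Dict, List, Tuple

# Each task pairs a prompt with a checker that executes the generated code.
Task = Tuple[str, Callable[[str], bool]]


def defines_working_add(code: str) -> bool:
    """Checker for the sample task: the code must define a correct add(a, b)."""
    namespace: Dict[str, Any] = {}
    try:
        # Executing model output is unsafe; use a sandbox for real runs.
        exec(code, namespace)
        return namespace["add"](2, 3) == 5
    except Exception:
        return False


def pass_rate(generate: Callable[[str], str], tasks: List[Task]) -> float:
    """Fraction of tasks whose generated code passes its checker."""
    if not tasks:
        raise ValueError("need at least one task")
    passed = sum(1 for prompt, check in tasks if check(generate(prompt)))
    return passed / len(tasks)


# Hypothetical stand-ins for two local models (assumption, not real backends).
def model_a(prompt: str) -> str:
    return "def add(a, b):\n    return a + b\n"


def model_b(prompt: str) -> str:
    return "def add(a, b):\n    return a - b\n"


TASKS: List[Task] = [("Write add(a, b) returning the sum.", defines_working_add)]

scores = {
    name: pass_rate(fn, TASKS)
    for name, fn in [("model_a", model_a), ("model_b", model_b)]
}
print(scores)
```

Swapping the canned functions for real generation calls and the toy task for a few dozen prompts from your own codebase gives a number that reflects your workload rather than a public leaderboard.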
Journey Context:
The common mistake is choosing by parameter count or brand familiarity. In 2026 the right choice is task-specific. Small dense models \(Qwen3 7B, Llama 3.3 8B\) now beat much larger models on HumanEval even when quantized, whereas SWE-Bench Verified rewards agentic fine-tuning and a reliable tool-use format \(Devstral 24B reached 46.8% as an open-source model\). MoE models such as Qwen3-Coder-480B have a huge total parameter count but a modest active count: they need enough RAM to hold the full checkpoint, yet their per-token compute at inference is closer to that of a 35B model. The backend matters as much as the weights: mlx\_lm needs explicit JSON prompting, llama.cpp needs care with dense models at long context, and Q4/Q3 quantization does not materially degrade quality at 397B\+ scale. Reasoning models must be run at temperature 0, or they suffer both accuracy loss and catastrophic tail latency.
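The MoE point above comes down to simple arithmetic: memory scales with total parameters, while per-token compute scales with active parameters. A rough sketch follows, assuming about 4.5 bits per weight for a 4-bit quantization (an assumption; the real figure depends on the quant format) and ignoring KV cache and runtime overhead. The function names are illustrative.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return params_billion * bits_per_weight / 8


def weights_fit(params_billion: float, bits_per_weight: float, memory_gb: float) -> bool:
    """True if the weights alone fit; KV cache and overhead are ignored."""
    return weight_memory_gb(params_billion, bits_per_weight) <= memory_gb


BITS = 4.5  # assumed average bits per weight for a 4-bit quantization

# Total parameters decide how much memory the checkpoint needs.
print(weight_memory_gb(480, BITS))  # 270.0
# Active parameters (35B) indicate per-token compute, not the memory footprint.
print(weight_memory_gb(35, BITS))   # 19.6875
# A 24B dense model at the same precision.
print(weight_memory_gb(24, BITS))   # 13.5

print(weights_fit(480, BITS, 64))   # False
print(weights_fit(24, BITS, 24))    # True
```

Under these assumptions a 480B-total MoE needs roughly 270 GB for weights alone, even though each token only touches about 35B parameters' worth of compute.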
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T20:02:42.479142+00:00 - report_created - created