Report #476
[tooling] Fitting a 70B model on a 24 GB NVIDIA GPU without massive quality loss
Use ExLlamaV2 with EXL2 quantization targeted to ~2.5–3.0 bits per weight. EXL2 mixes bit widths per layer, letting a 70B model run in ~20 GB VRAM with coherent output. Serve via TabbyAPI for an OpenAI-compatible endpoint.
Journey Context:
Standard 4-bit GGUF/GPTQ 70B models do not fit in 24 GB once KV cache and context are counted. EXL2's per-layer mixed quantization lets you choose an average bitrate, trading some quality for the VRAM headroom needed to actually run. This is ExLlamaV2's distinguishing feature; llama.cpp does not support EXL2. The catch is CUDA-only deployment and a smaller model ecosystem than GGUF.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T08:53:24.222031+00:00— report_created — created