Report #1232
[tooling] Llama-3.1-70B barely fits on a 24 GB GPU and context must be tiny
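Run it through ExLlamaV2/TabbyAPI with an EXL2 quant sized for the card and set cache_mode: Q4. Weights take roughly parameters × bpw / 8 bytes, so a 4.0 bpw 70B quant is about 35 GB and will not fit in 24 GB; a quant near 2.4 bpw (about 21 GB) is a more realistic target. The Q4 KV cache uses roughly 1 KiB per token per layer (versus 4 KiB for FP16), shrinking the KV footprint by ~4x and letting a 70B model run at 8k-16k context on a single RTX 3090/4090.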
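Before downloading a quant, it can help to check how much VRAM the weights alone will take at a given bpw. The following is a rough sketch, not a measurement: the parameter count (70.6e9), the 24 GiB card size, and the 1 GiB reserve for activations and runtime overhead are all assumptions, and the helper names are hypothetical.

```python
# Rough VRAM budget for an EXL2 quant. All constants are illustrative assumptions.
GIB = 1024 ** 3
PARAMS = 70.6e9  # assumed approximate parameter count for Llama-3.1-70B


def weight_gib(bpw: float, params: float = PARAMS) -> float:
    """Approximate size of the quantized weights in GiB (params * bpw / 8 bytes)."""
    return params * bpw / 8 / GIB


def headroom_gib(bpw: float, vram_gib: float = 24.0, reserve_gib: float = 1.0) -> float:
    """VRAM left for the KV cache after weights and an assumed fixed reserve.

    A negative result means the weights alone do not fit.
    """
    return vram_gib - reserve_gib - weight_gib(bpw)


if __name__ == "__main__":
    for bpw in (2.25, 2.4, 4.0):
        print(f"{bpw:.2f} bpw: weights ~{weight_gib(bpw):.1f} GiB, "
              f"headroom ~{headroom_gib(bpw):.1f} GiB")
```

Whatever headroom remains is the budget the KV cache has to fit in, which is why the cache mode matters so much on a 24 GB card.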
Journey Context:
EXL2 already gives better quality per bit than uniformly quantized GGUF by allocating bits per layer. The next bottleneck is the KV cache: at 16k context a 70B FP16 cache is ~5 GB (80 layers × 4 KiB per token per layer × 16,384 tokens), which is enough to push a 24 GB card over the edge. A Q4 cache drops that to ~1.25 GB, which often leaves room for a higher-bpw model or a longer context. Perplexity loss is small for many models. The trap is treating the cache as fixed at FP16 and buying a much larger GPU instead.
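The cache figures above can be reproduced with a few lines of arithmetic. This sketch assumes the published Llama-3.1-70B shape (80 layers, 8 KV heads under grouped-query attention, head dimension 128) and ignores the small overhead of quantization scales in Q4 mode; the class and function names are hypothetical.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class KVShape:
    # Assumed values from the Llama-3.1-70B model config.
    layers: int = 80
    kv_heads: int = 8
    head_dim: int = 128


def kv_bytes_per_token_layer(shape: KVShape, bits: int) -> float:
    """Bytes stored per token per layer: keys plus values, at `bits` per element."""
    return 2 * shape.kv_heads * shape.head_dim * bits / 8


def kv_cache_gib(shape: KVShape, context: int, bits: int) -> float:
    """Total KV cache size in GiB for a given context length (scales not counted)."""
    return kv_bytes_per_token_layer(shape, bits) * shape.layers * context / GIB


if __name__ == "__main__":
    shape = KVShape()
    for name, bits in (("FP16", 16), ("Q4", 4)):
        print(f"{name}: {kv_bytes_per_token_layer(shape, bits):.0f} B/token/layer, "
              f"{kv_cache_gib(shape, 16384, bits):.2f} GiB at 16k context")
```

Real Q4 usage will be slightly above the computed figure because of the per-group scale factors, so treat the result as a lower bound when planning context length.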
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T19:53:25.144612+00:00 - report_created - created