Report #100220
[tooling] ExLlamaV2/TabbyAPI OOMs or cannot fit a 70B model with useful context on a 24 GB GPU
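Set cache_mode: Q4 in TabbyAPI's config.yml (or instantiate ExLlamaV2Cache_Q4 in Python) to cut KV-cache memory to roughly 25% of FP16. Pair it with a low-bitrate EXL2 quant: at 4.0 bpw the weights of a 70B model alone come to about 35 GB, so a single 24 GB card needs something closer to 2.4 to 2.5 bpw. That combination is the practical recipe for Llama 3.1 70B at 8K context on a single RTX 3090/4090.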
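The Python route mentioned above can be sketched roughly as follows. This is a minimal sketch, not a verified recipe: the function name load_with_q4_cache and the model_dir argument are hypothetical, it assumes a recent exllamav2 release in which ExLlamaV2Config accepts the model directory directly (older releases may need config.model_dir to be set and config.prepare() to be called), and it has not been run against a specific 70B quant.

```python
def load_with_q4_cache(model_dir: str, max_seq_len: int = 8192):
    """Load an EXL2 model with a Q4-quantized KV cache (illustrative sketch).

    model_dir is a hypothetical path to an EXL2-quantized model directory.
    """
    # Imported here so the sketch can be read or imported without exllamav2 installed.
    from exllamav2 import (
        ExLlamaV2,
        ExLlamaV2Cache_Q4,
        ExLlamaV2Config,
        ExLlamaV2Tokenizer,
    )

    config = ExLlamaV2Config(model_dir)
    config.max_seq_len = max_seq_len  # cap context so the cache is sized for 8K, not the model default

    model = ExLlamaV2(config)

    # lazy=True defers cache allocation until the model is loaded,
    # which is what load_autosplit expects.
    cache = ExLlamaV2Cache_Q4(model, max_seq_len=max_seq_len, lazy=True)
    model.load_autosplit(cache)

    tokenizer = ExLlamaV2Tokenizer(config)
    return model, cache, tokenizer
```

Calling load_with_q4_cache("/path/to/model") would return the model, cache, and tokenizer for use with one of the exllamav2 generators; the path here is a placeholder.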
Journey Context:
EXL2 gives better quality-per-bit than uniform 4-bit quantization, but a 70B model still barely fits. The KV cache is the remaining lever: at 16K context it can exceed 5 GB in FP16. A Q4 cache trades a small perplexity penalty for enough headroom to keep the model fully on the GPU. TabbyAPI exposes this as cache_mode; in the Python API, create ExLlamaV2Cache_Q4 with lazy=True and pass it to load_autosplit.
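The cache figures above can be checked with a quick back-of-the-envelope calculation. The sketch below assumes the published Llama 3.1 70B shape (80 layers, 8 KV heads under grouped-query attention, head dimension 128) and treats Q4 as exactly 0.5 bytes per element, ignoring any scale metadata the quantized cache stores, so the real Q4 number will be somewhat higher. The helper name kv_cache_bytes is made up for illustration.

```python
GIB = 1024 ** 3

def kv_cache_bytes(n_tokens, n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_elem=2.0):
    """Approximate KV-cache size in bytes for a given context length.

    Defaults are the assumed Llama 3.1 70B dimensions; the factor of 2
    accounts for storing both keys and values.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens

fp16_16k = kv_cache_bytes(16384) / GIB                     # FP16 cache at 16K context
fp16_8k = kv_cache_bytes(8192) / GIB                       # FP16 cache at 8K context
q4_8k = kv_cache_bytes(8192, bytes_per_elem=0.5) / GIB     # idealized Q4 cache at 8K context

print(f"FP16 @ 16K: {fp16_16k:.3f} GiB")
print(f"FP16 @  8K: {fp16_8k:.3f} GiB")
print(f"Q4   @  8K: {q4_8k:.3f} GiB (excluding scales)")
```

Under these assumptions the FP16 cache comes to 5 GiB (about 5.4 GB) at 16K and 2.5 GiB at 8K, while the idealized Q4 cache at 8K is about 0.6 GiB, which is where the headroom comes from.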
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-01T04:51:53.907900+00:00— report_created — created