Report #86091
[tooling] oobabooga Text Generation WebUI OOM with ExLlamaV2 during long context streaming (>8k tokens)
In the WebUI's 'ExLlamaV2' loader tab, enable 'cache_8bit' (or 'cache_q4' for extreme cases), which quantizes the ExLlamaV2 KV cache to 8-bit (or 4-bit). This reduces the cache's VRAM usage by ~50% (or ~75%) relative to FP16, preventing OOM during long-context generation. This setting is independent of the EXL2 weight quantization and applies only to the cache.
Journey Context:
Users often conflate EXL2 weight quantization (for example, 4-bit weights) with KV cache memory, but the two are separate allocations. Even with quantized weights, a 70B model running an FP16 cache needs roughly 20 GB or more of VRAM at 8k context, which pushes 24 GB cards into OOM. The WebUI tucks the cache options into the 'ExLlamaV2' loader section, so many users stay on the default 'cache_fp16'. The 8-bit cache has negligible quality loss (<0.1 perplexity) compared with FP16. The alternative is switching to llama.cpp in the WebUI, but EXL2 with cache_8bit is 2-3x faster on NVIDIA GPUs. In practice, this flag is the difference between running a 70B model at 8k context on a 4090 and failing with OOM.
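The cache-only part of that footprint can be estimated with simple arithmetic. The sketch below is a rough back-of-the-envelope calculation, not anything taken from the WebUI or ExLlamaV2 code. The model shape (80 layers, 8 KV heads, head dimension 128, i.e. a Llama-2-70B-style layout with grouped-query attention) is an assumption not stated in this report, and the function name kv_cache_bytes is hypothetical. It counts nominal bytes per element only and ignores any scale or metadata overhead a quantized cache may carry, so real savings can be slightly smaller than the nominal 50% and 75%.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_elem):
    """Nominal KV cache size: one key and one value vector per layer, KV head and token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem


# Assumed Llama-2-70B-style shape (illustrative, not taken from the report).
N_LAYERS, N_KV_HEADS, HEAD_DIM = 80, 8, 128
CONTEXT = 8192  # tokens

GIB = 1024 ** 3

# Nominal bytes per cached element for each cache precision.
precisions = {"fp16": 2.0, "8-bit": 1.0, "4-bit": 0.5}

sizes = {
    name: kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, CONTEXT, width) / GIB
    for name, width in precisions.items()
}

for name, gib in sizes.items():
    saving = 1 - gib / sizes["fp16"]
    print(f"{name:>5}: {gib:.3f} GiB  (saves {saving:.0%} vs fp16)")
```

Under these assumptions the cache alone is a few GiB at 8k context, so the total VRAM figure is dominated by the weights; the cache setting matters because it decides whether the last few GiB fit. Models without grouped-query attention (more KV heads) or longer contexts scale the cache size up proportionally.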
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T03:05:31.954648+00:00— report_created — created