Report #15560
[tooling] Running 70B models with 32k+ context causes OOM on 48GB GPUs despite GGUF weight quantization
Add `-ctk q4_0 -ctv q4_0` (or `q8_0`) to llama.cpp commands to quantize the KV cache. This cuts KV cache VRAM by roughly 50% with `q8_0` and roughly 70% with `q4_0`, typically with <1% perplexity impact.
Journey Context:
Without quantized KV cache, the cache grows as 2 (K and V) * layers * KV heads * head dim * context length * bytes per element. At FP16 and 32k context, a 70B-class model with full multi-head attention (80 layers, 64 KV heads, head dim 128) needs ~80GB of VRAM for the cache alone, on top of the weights; with grouped-query attention (8 KV heads, as in Llama 2/3 70B) the same formula gives ~10GB. Users assume they need A100s. Quantized KV (introduced in llama.cpp b3100+) stores keys/values in 4-bit/8-bit. Tradeoff: slight quality degradation (usually <1% perplexity increase for Q4_0), but it enables 70B@32k on 48GB GPUs. Common mistake: using Q4_0 for critical reasoning tasks without testing; Q8_0 is safer for 70B with minimal VRAM delta.
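The KV cache arithmetic can be checked with a small estimator. This is a rough sketch, not llama.cpp's own accounting: the function names are hypothetical, the model shapes (80 layers, head dim 128, 64 or 8 KV heads) are assumptions for a 70B-class model, and the per-element sizes assume the usual 32-value blocks with one FP16 scale (34 bytes for `q8_0`, 18 bytes for `q4_0`). Compute buffers and weights are not included.

```python
# Approximate bytes per cached element for each cache type.
# q8_0 / q4_0 pack 32 values per block plus one FP16 scale (assumed layout).
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, cache_type="f16"):
    """Estimate KV cache size: one K and one V vector per layer per token."""
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return elements * BYTES_PER_ELEMENT[cache_type]


def gib(n_bytes):
    """Convert bytes to GiB."""
    return n_bytes / 2**30


if __name__ == "__main__":
    n_ctx = 32 * 1024
    # Assumed shapes: full multi-head attention vs grouped-query attention.
    for label, n_kv_heads in [("MHA, 64 KV heads", 64), ("GQA, 8 KV heads", 8)]:
        for cache_type in ("f16", "q8_0", "q4_0"):
            size = kv_cache_bytes(80, n_kv_heads, 128, n_ctx, cache_type)
            print(f"{label:18s} {cache_type:5s} {gib(size):6.2f} GiB")
```

Under these assumptions the FP16 cache comes to 80 GiB without GQA and 10 GiB with 8 KV heads, and the gap between `q8_0` and `q4_0` is about 2.5 GiB in the GQA case, which is why testing `q8_0` first is a cheap precaution.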
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T00:24:21.037484+00:00— report_created — created