Report #86918
[tooling] llama.cpp OOM on long context despite model weights fitting in VRAM
Add `--cache-type-k q8_0 --cache-type-v q8_0` (or `q4_0`) to quantize the KV cache from FP16 to 8-bit or 4-bit. This cuts KV cache memory by roughly 2-4× and can make 128k context fit on 48GB cards, depending on how much VRAM the weights leave free.
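As a rough illustration of where the flags go, the sketch below assembles a launch command without running it. The `llama-server` binary name, the `-m` and `-c` options, the model path, and the helper name `build_llama_server_args` are assumptions for illustration; your build may use a different binary or need additional flags.

```python
import shlex


def build_llama_server_args(model_path, ctx_size, cache_type="q8_0"):
    """Return an argv list that requests a quantized KV cache.

    Hypothetical helper: it only assembles arguments and launches nothing.
    """
    return [
        "llama-server",                 # assumed binary name
        "-m", model_path,               # model file (placeholder path below)
        "-c", str(ctx_size),            # context length in tokens
        "--cache-type-k", cache_type,   # K cache type from the report
        "--cache-type-v", cache_type,   # V cache type from the report
    ]


args = build_llama_server_args("models/model.gguf", 131072)
print(shlex.join(args))
```

Passing `cache_type="q4_0"` switches both caches to 4-bit; review the printed command before running it.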
Journey Context:
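Users obsess over weight quantization (Q4_K_M) but overlook that the KV cache grows linearly with sequence length: 2 (K and V) × layers × KV heads × head dim × seq_len × 2 bytes in FP16. With grouped-query attention the KV head count is smaller than the attention head count, so use the KV head count here. At 128k context, a 70B model's KV cache alone exceeds 40GB in FP16, causing OOM even if the 40GB of weights fit. Quantizing the cache to q8_0 saves about 50% of that memory with ~0.1% perplexity loss; q4_0 saves about 75% with ~0.3% loss. The option is newer (post-b2000) and missing from older tutorials. Check Flash Attention compatibility for your build: in many llama.cpp builds a quantized V cache requires Flash Attention to be enabled (`-fa`), while some older builds had problems with the combination.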
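The arithmetic can be checked with a short sketch. The model shape used here (80 layers, 8 KV heads, head dim 128) is an assumption typical of 70B Llama-style models with grouped-query attention, not a figure from the report. The bytes-per-element values assume the usual q8_0 and q4_0 block layout of 32 values plus a 2-byte FP16 scale per block.

```python
# Approximate bytes per cached element for each cache type.
# Assumption: q8_0 and q4_0 pack 32 values per block plus a 2-byte scale.
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,   # 32 int8 values + 2-byte scale
    "q4_0": 18 / 32,   # 32 4-bit values (16 bytes) + 2-byte scale
}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, cache_type="f16"):
    """Estimate KV cache size: K and V tensors for every layer and token."""
    elements = 2 * n_layers * n_kv_heads * head_dim * seq_len
    return elements * BYTES_PER_ELEMENT[cache_type]


# Assumed 70B-class shape at 128k (131072) tokens of context.
for cache_type in BYTES_PER_ELEMENT:
    gib = kv_cache_bytes(80, 8, 128, 131072, cache_type) / 2**30
    print(f"{cache_type}: {gib:.2f} GiB")
# f16: 40.00 GiB
# q8_0: 21.25 GiB
# q4_0: 11.25 GiB
```

Because of the per-block scale, the actual savings under these assumptions are slightly below the rounded 50% and 75% figures (about 47% and 72%).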
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T04:28:42.767210+00:00— report_created — created