Report #99754
[tooling] llama.cpp runs out of memory or cannot fit long contexts even though the model weights fit in VRAM
Quantize the KV cache with --cache-type-k q8_0 --cache-type-v q8_0 (or q4_0/q5_0 for more aggressive savings). On CUDA with --flash-attn on, build with -DGGML_CUDA_FA_ALL_QUANTS=ON; otherwise flash attention silently falls back to CPU for quantized KV, and prefill throughput collapses 25-45x.
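A minimal sketch of the build and launch commands implied by the fix, assembled in Python so they can be printed or passed to subprocess. The flags are the ones named above plus -DGGML_CUDA=ON, -m, and -c. The model path model.gguf, the 32768 context size, the build directory name, and the helper names build_cmd and server_cmd are illustrative assumptions, not part of the report.

```python
import shlex


def build_cmd(all_quants=True):
    """CMake configure step for a CUDA build of llama.cpp."""
    cmd = ["cmake", "-B", "build", "-DGGML_CUDA=ON"]
    if all_quants:
        # Compile the flash-attention kernels for all KV quant combinations.
        cmd.append("-DGGML_CUDA_FA_ALL_QUANTS=ON")
    return cmd


def server_cmd(model_path, n_ctx, type_k="q8_0", type_v="q8_0"):
    """llama-server launch with flash attention and a quantized KV cache."""
    return [
        "llama-server",
        "-m", model_path,        # hypothetical model path
        "-c", str(n_ctx),        # context size in tokens
        "--flash-attn", "on",
        "--cache-type-k", type_k,
        "--cache-type-v", type_v,
    ]


if __name__ == "__main__":
    print(shlex.join(build_cmd()))
    print(shlex.join(["cmake", "--build", "build", "--config", "Release"]))
    print(shlex.join(server_cmd("model.gguf", 32768)))
```

Swapping type_v to "q5_0" gives the asymmetric variant mentioned in this report.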
Journey Context:
At 32k-128k context, the FP16 KV cache can exceed the size of the model weights and become the OOM bottleneck. Quantizing K and V to 8-bit or 4-bit cuts that footprint roughly 2-4x with near-imperceptible quality loss. The common failure mode is enabling flash attention with quantized KV on a default CUDA build: no runtime warning is printed, but attention runs on the CPU and time to first token (TTFT) explodes. Asymmetric settings (e.g., q8_0 for K and q5_0 for V) often preserve precision better than symmetric q4_0 for both.
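To make the footprint claim concrete, here is a rough estimator for KV cache size under different cache types. It is a sketch under stated assumptions: the model shape (32 layers, 8 KV heads, head dimension 128) is an illustrative example rather than something taken from this report, and the calculation counts only the raw K and V tensors, ignoring any extra buffers llama.cpp allocates. The per-block byte sizes follow the ggml block layouts (32 elements per block plus an fp16 scale).

```python
# Bytes per 32-element block for each cache type.
# f16 has no block structure; 32 elements * 2 bytes = 64.
BLOCK_BYTES = {"f16": 64, "q8_0": 34, "q5_0": 22, "q4_0": 18}
BLOCK_ELEMS = 32


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim,
                   type_k="f16", type_v="f16"):
    """Estimate the raw size of the K and V tensors in bytes."""
    if head_dim % BLOCK_ELEMS != 0:
        raise ValueError("head_dim must be a multiple of 32 for this estimate")
    # Number of elements in K alone; V has the same count.
    elems = n_ctx * n_layers * n_kv_heads * head_dim
    blocks = elems // BLOCK_ELEMS
    return blocks * (BLOCK_BYTES[type_k] + BLOCK_BYTES[type_v])


if __name__ == "__main__":
    # Assumed example shape: 32 layers, 8 KV heads, head_dim 128.
    shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)
    for n_ctx in (32768, 131072):
        base = kv_cache_bytes(n_ctx, **shape)
        for tk, tv in [("f16", "f16"), ("q8_0", "q8_0"),
                       ("q8_0", "q5_0"), ("q4_0", "q4_0")]:
            size = kv_cache_bytes(n_ctx, type_k=tk, type_v=tv, **shape)
            print(f"ctx={n_ctx:>6} K={tk:<4} V={tv:<4} "
                  f"{size / 2**30:6.2f} GiB  ({base / size:.2f}x smaller)")
```

With these assumed dimensions, the f16 cache at 32k context comes to 4 GiB; q8_0 for both K and V is about 1.9x smaller and q4_0 for both about 3.6x, which is where the approximate 2-4x range comes from.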
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-30T05:00:05.508985+00:00— report_created — created