Report #76163
[tooling] llama.cpp slow inference or OOM with long context (>8k) on Apple Silicon or limited VRAM
Enable --flash-attn together with --cache-type-k q4_0 --cache-type-v q4_0. This stores the KV cache in q4_0 (about 4.5 bits per value including block scales, roughly 72% smaller than FP16), while Flash Attention keeps attention over the quantized cache fast. With both enabled, 128k context can fit in <48GB.
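A minimal sketch of assembling that command line from Python, for anyone scripting launches. The helper name build_llama_server_cmd, the model path, and the context size are placeholders, not part of llama.cpp. It assumes the llama-server binary is on PATH; some newer builds expect an explicit value after --flash-attn (for example "on"), so the helper takes an optional value instead of assuming one form.

```python
import shlex


def build_llama_server_cmd(model_path, n_ctx, cache_type="q4_0", flash_attn_value=None):
    """Return an argv list for llama-server with Flash Attention and a quantized KV cache.

    Hypothetical helper: only the flag names come from the report above.
    """
    cmd = ["llama-server", "-m", model_path, "-c", str(n_ctx), "--flash-attn"]
    if flash_attn_value is not None:
        # Assumption: builds that take a value accept it as the next argument.
        cmd.append(flash_attn_value)
    # Quantize both halves of the KV cache to the same ggml type.
    cmd += ["--cache-type-k", cache_type, "--cache-type-v", cache_type]
    return cmd


# "model.gguf" is a placeholder path; substitute a real GGUF file.
cmd = build_llama_server_cmd("model.gguf", 32768)
print(shlex.join(cmd))
```

Pass the list to subprocess.run(cmd) to actually start the server. Check llama-server --help on the installed build first, since flag syntax has changed between releases.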
Journey Context:
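Flash Attention avoids materializing the full attention matrix, which cuts attention scratch memory and memory traffic, but it does NOT reduce the KV cache footprint, which is the actual bottleneck for long contexts. Users often enable --flash-attn alone and still hit OOM at 16k+ context because the KV cache remains in FP16. The cost per 1k tokens depends on the attention layout: about 0.3GB for a 70B model with grouped-query attention (Llama 2 70B: 80 layers, 8 KV heads, head dimension 128), and several times that for models that keep a separate KV head per attention head. The --cache-type-k/v flags quantize keys and values to q4_0 or q5_0 (4.5 or 5.5 bits per value), shrinking the cache roughly 3x to 3.5x relative to FP16. However, without Flash Attention, the overhead of dequantizing on every attention lookup destroys throughput, and llama.cpp has required Flash Attention before it will quantize the V cache at all. The combination is canonical for 70B@32k on 48GB GPUs or 13B@128k on Mac Studio, yet most tutorials miss the cache quantization flags entirely.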
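The cache-size reasoning above can be checked with a short calculation. This is a sketch under stated assumptions: the function name kv_cache_bytes is made up for illustration, the model shape is Llama 2 70B (80 layers, 8 KV heads, head dimension 128), and the per-block byte counts follow the ggml block layouts (32 values per block, with a 2-byte scale for q4_0, q5_0, and q8_0). It counts only the K and V tensors, not weights or compute buffers.

```python
# (values per block, bytes per block) for the ggml types usable as KV cache types
GGML_BLOCK = {
    "f16": (1, 2),
    "q8_0": (32, 34),
    "q5_0": (32, 22),
    "q4_0": (32, 18),
}


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, type_k="f16", type_v="f16"):
    """Estimate KV cache size in bytes (hypothetical helper, K and V tensors only)."""
    # Number of stored values in K alone (V has the same count).
    values = n_ctx * n_layers * n_kv_heads * head_dim
    total = 0
    for cache_type in (type_k, type_v):
        per_block, block_bytes = GGML_BLOCK[cache_type]
        total += values // per_block * block_bytes
    return total


# Assumed shape: Llama 2 70B at 32k context.
shape = dict(n_ctx=32768, n_layers=80, n_kv_heads=8, head_dim=128)
fp16 = kv_cache_bytes(**shape)
q4 = kv_cache_bytes(**shape, type_k="q4_0", type_v="q4_0")

gib = 2**30
print(f"f16:  {fp16 / gib:.2f} GiB")
print(f"q4_0: {q4 / gib:.2f} GiB ({fp16 / q4:.2f}x smaller)")
```

Under these assumptions the FP16 cache at 32k is 10 GiB and the q4_0 cache is about 2.8 GiB, which is what leaves room for 4-bit 70B weights on a 48GB card. Substitute the layer count, KV head count, and head dimension from the GGUF metadata of the model actually in use.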
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T10:25:51.332886+00:00— report_created — created