Report #47407
[tooling] llama.cpp severe slowdown/OOM on Apple Silicon with 70B model and context >4096
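Add `--cache-type-k q8_0 --cache-type-v q8_0` (or `q4_0`) to quantize the KV cache. This shrinks the cache by roughly 50-75% and eases the unified memory pressure that causes macOS swap thrashing. Depending on the llama.cpp build, a quantized V cache may also require flash attention (`-fa`).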
Journey Context:
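On Apple Silicon, the KV cache lives in unified memory alongside the model weights and everything else the system is running. At FP16, the KV cache of a 70B model at 8k context can reach ~20GB with full multi-head attention (80 layers, 64 KV heads, head dimension 128). That is roughly a third of a 64GB Mac Studio's RAM on top of the weights. Models that use grouped-query attention, such as Llama 2 70B with 8 KV heads, need only about 2.5GB at the same context. Once the headroom is gone, macOS starts compressing and swapping memory, causing 10-100x latency spikes. The `--cache-type-k` and `--cache-type-v` flags (available since late 2023) quantize the cache to 8-bit or 4-bit, reportedly with minimal perplexity impact. Of the alternatives, `--mlock` is reportedly ignored on Darwin, and `--no-mmap` helps with the weights but not with KV cache growth. This makes cache quantization essential for long-context RAG on Macs.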
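The sizes above can be checked with a short back-of-the-envelope calculation. This is a minimal sketch, not llama.cpp's own accounting: the function name `kv_cache_bytes` is hypothetical, the model shapes (80 layers, head dimension 128, 64 vs. 8 KV heads) are assumptions for a 70B-class model, and the per-element sizes assume ggml's block layouts (32 values in 34 bytes for `q8_0`, 18 bytes for `q4_0`). Actual usage will differ somewhat because of compute buffers and padding.

```python
# Approximate bytes per cached element. Assumes ggml block layouts:
# q8_0 packs 32 values into 34 bytes, q4_0 packs 32 values into 18 bytes.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}

GIB = 1024 ** 3


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, cache_type="f16"):
    """Estimate KV cache size: one K and one V vector per layer, KV head and token."""
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # 2 = K plus V
    return elements * BYTES_PER_ELEMENT[cache_type]


if __name__ == "__main__":
    # Assumed shapes for a 70B-class model at 8k context.
    for label, n_kv_heads in (("MHA, 64 KV heads", 64), ("GQA, 8 KV heads", 8)):
        for cache_type in BYTES_PER_ELEMENT:
            size = kv_cache_bytes(80, n_kv_heads, 128, 8192, cache_type)
            print(f"{label:>17} {cache_type:>5}: {size / GIB:6.2f} GiB")
```

Under these assumptions the FP16 cache comes to 20 GiB with 64 KV heads and 2.5 GiB with 8, while `q8_0` and `q4_0` cut either figure by about 47% and 72% respectively. That is where the "roughly 50-75%" range comes from.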
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T10:03:38.916665+00:00— report_created — created