Report #28944
[tooling] 70B models on Apple Silicon get exponentially slower as context grows despite sufficient unified memory
Use -ctk q8_0 -ctv q8_0 (or q4_0) to quantize the KV-cache, combined with -ub 1 (--ubatch-size 1) to reduce memory bandwidth pressure on Apple Silicon's shared memory architecture.
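As a minimal sketch of how the flags above fit together, the helper below assembles a command line using the long-form llama.cpp option names. The binary name `llama-cli`, the model path, and the function `build_llama_args` are assumptions for illustration and do not come from the report. Flash attention is left to `extra_args` because its flag syntax differs between llama.cpp builds.

```python
import shlex

# Cache types named in the report, plus the unquantized default.
SUPPORTED_CACHE_TYPES = ("f16", "q8_0", "q4_0")


def build_llama_args(model_path, n_ctx, cache_type="q8_0", ubatch=1,
                     binary="llama-cli", extra_args=()):
    """Return an argv list for a llama.cpp run with a quantized KV-cache.

    `binary` and `model_path` are placeholders; adjust them to your install.
    """
    if cache_type not in SUPPORTED_CACHE_TYPES:
        raise ValueError(f"unsupported cache type: {cache_type}")
    return [
        binary,
        "--model", model_path,
        "--ctx-size", str(n_ctx),
        "--cache-type-k", cache_type,   # long form of -ctk
        "--cache-type-v", cache_type,   # long form of -ctv
        "--ubatch-size", str(ubatch),   # long form of -ub
        *extra_args,                    # e.g. flash-attention flags for your build
    ]


if __name__ == "__main__":
    # Hypothetical model path, printed rather than executed.
    print(shlex.join(build_llama_args("models/70b.gguf", 32768)))
```

Printing the command instead of running it makes it easy to check the flags against the `--help` output of the installed build before launching a long job.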
Journey Context:
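On Apple Silicon, the bottleneck is not memory capacity (unified memory is large) but memory bandwidth. Every generated token has to read the whole KV-cache, so an FP16 cache takes up a growing share of the memory bus as context grows, which produces the slowdown users describe as exponential. Quantizing the cache to q8_0 or q4_0 cuts KV-cache traffic by roughly 47% or 72% respectively (8.5 or 4.5 bits per value instead of 16). In llama.cpp, a quantized V cache has also required flash attention (-fa) to be enabled. Additionally, setting -ub 1 (--ubatch-size, the micro-batch size) is meant to avoid fetching more cache lines than necessary per forward pass, which matters on bandwidth-constrained Metal GPUs; note that a micro-batch of 1 can be expected to slow prompt processing. This combination is reported to enable usable long-context 70B inference on a Mac Studio (M2 Ultra) where it would otherwise crawl.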
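The arithmetic behind the bandwidth argument can be sketched as follows. The model shape (80 layers, 8 KV heads, head dimension 128) is an assumption matching a Llama-2-70B-style architecture with grouped-query attention and is not stated in the report. The per-element sizes follow the ggml block layouts, where q8_0 stores 32 values in 34 bytes and q4_0 stores 32 values in 18 bytes.

```python
# Bytes per cached value. ggml packs 32 values per block:
#   q8_0 = 32 int8 values + one fp16 scale = 34 bytes
#   q4_0 = 16 bytes of packed 4-bit values + one fp16 scale = 18 bytes
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_bytes(n_ctx, cache_type, n_layers=80, n_kv_heads=8, head_dim=128):
    """Total KV-cache size in bytes for a full context window.

    Default shape values are an assumed Llama-2-70B-style layout.
    """
    # Factor 2 accounts for storing both K and V in every layer.
    per_token = 2 * n_layers * n_kv_heads * head_dim * BYTES_PER_ELEMENT[cache_type]
    return int(per_token * n_ctx)


def savings_vs_f16(cache_type):
    """Fraction of KV-cache bytes saved relative to an FP16 cache."""
    return 1 - BYTES_PER_ELEMENT[cache_type] / BYTES_PER_ELEMENT["f16"]


if __name__ == "__main__":
    for cache_type in ("f16", "q8_0", "q4_0"):
        gib = kv_cache_bytes(32768, cache_type) / 2**30
        print(f"{cache_type}: {gib:.2f} GiB, saves {savings_vs_f16(cache_type):.1%}")
```

Under these assumptions a 32k-token FP16 cache comes to 10 GiB, all of which is read for every generated token, while q8_0 and q4_0 bring that down to about 5.3 GiB and 2.8 GiB.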
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T02:58:36.965386+00:00— report_created — created