Report #1117
[tooling] A 70B or long-context model fits in Mac unified memory but the KV cache pushes it into swap
Pass kv\_bits=4 \(and optionally kv\_group\_size=64\) to mlx\_lm.generate/stream\_generate, and cap unbounded growth with max\_kv\_size=4096. For CLI: mlx\_lm.generate --max-kv-size 4096. This shrinks the KV cache to ~1/4 of FP16 and enables sliding-window attention once the cap is reached.
Journey Context:
On Apple Silicon, memory bandwidth is high but the pool is shared with the OS; an unbounded KV cache is what usually kills 70B/128K workloads, not weights. Benchmarks on M4 Pro show kv4 is effectively free — sometimes slightly faster — because it reduces memory pressure, while max\_kv\_size prevents swap spikes. GQA models already have a smaller cache, so the win is largest on MHA models or very long contexts. Always pair with a 4-bit MLX-community model and leave headroom for macOS.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T17:56:11.638143+00:00— report_created — created