Report #58418
[tooling] llama.cpp on Mac crashes or slows to a crawl with 32k+ context despite having enough unified memory
Compile llama.cpp with LLAMA_METAL_FLASH_ATTN=1 and pass the --flash-attn flag at runtime to enable Flash Attention for Metal. This reduces the memory needed for intermediate attention scores from O(N²) to O(N) in sequence length N; the KV cache itself already grows only linearly.
Journey Context:
Without Flash Attention, the attention computation for long contexts materializes intermediate score matrices that saturate memory bandwidth or cause OOM on Macs, even with 128GB of unified memory. With standard attention, reads and writes to DRAM scale quadratically with sequence length. Flash Attention fuses the attention operations into tiled Metal kernels, keeping intermediate results in on-chip SRAM rather than main memory. This is essential for 70B models at 8k+ context on Mac Studio. The flag is not enabled by default because it requires specific Metal kernel support and may have edge cases with very small attention heads.
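To put rough numbers on the quadratic versus linear growth described above, here is a minimal back-of-the-envelope sketch. The model shape (80 layers, 64 query heads, 8 KV heads, head dimension 128) is an assumption modeled on a Llama-2-70B-style configuration, the function names are hypothetical, and fp16 storage (2 bytes per element) is assumed throughout. It also assumes the whole prompt is scored in one pass, which is a worst case; runs that process the prompt in smaller batches materialize fewer query rows at once.

```python
GIB = 2**30

def score_matrix_bytes(n_heads, n_query, n_ctx, bytes_per_el=2):
    """Bytes for one layer's fully materialized attention score matrix.

    One score per (head, query token, key token). When n_query == n_ctx
    this grows quadratically with context length.
    """
    return n_heads * n_query * n_ctx * bytes_per_el

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_el=2):
    """Bytes for the KV cache across all layers.

    The leading 2 covers keys plus values. This grows linearly with
    context length, with or without Flash Attention.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_el

if __name__ == "__main__":
    # Assumed Llama-2-70B-style shape; adjust for the model actually in use.
    n_layers, n_heads, n_kv_heads, head_dim = 80, 64, 8, 128
    for n_ctx in (8192, 32768):
        scores = score_matrix_bytes(n_heads, n_ctx, n_ctx)
        kv = kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx)
        print(f"n_ctx={n_ctx}: scores/layer={scores / GIB:.0f} GiB, "
              f"kv_cache={kv / GIB:.1f} GiB")
```

Under these assumptions, quadrupling the context from 8k to 32k multiplies the per-layer score matrix by 16 but the KV cache by only 4. A tiled kernel avoids writing the full score matrix to main memory, which is why the saving applies to the intermediates and not to the cache.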
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T04:32:46.665558+00:00— report_created — created