Report #8215
[tooling] llama.cpp runs out of memory or becomes extremely slow with 32k+ context on MacBook Pro
Build llama.cpp with LLAMA_FLASH_ATTN=1 (if your version exposes that build option) and pass the `--flash-attn` flag. Flash Attention computes attention tile by tile instead of storing the full N×N attention matrix, so the scores never have to be written out to memory and read back.
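To get a feel for the size of the matrix being avoided, here is a rough back-of-the-envelope sketch. The function name `score_matrix_bytes`, the head count of 64, the 2-byte score size, and the 512-row batch and 128×128 tile sizes are all illustrative assumptions, not values read from llama.cpp.

```python
def score_matrix_bytes(n_q: int, n_k: int, n_heads: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to hold attention scores for n_q queries against n_k keys,
    across all heads of a single layer (hypothetical helper for illustration)."""
    return n_q * n_k * n_heads * bytes_per_elem


N_CTX = 32 * 1024   # 32k context, as in the report
N_HEADS = 64        # assumption: head count of a 70B-class model

# Worst case: every prompt token attends to every other token in one pass.
full = score_matrix_bytes(N_CTX, N_CTX, N_HEADS)

# Assumed batching: 512 query rows at a time against the whole context.
batched = score_matrix_bytes(512, N_CTX, N_HEADS)

# Assumed tile: a 128 x 128 block of scores held at once by a tiled kernel.
tiled = score_matrix_bytes(128, 128, N_HEADS)

for label, size in [("full N x N", full), ("512-row batch", batched), ("128 x 128 tile", tiled)]:
    print(f"{label:>15}: {size / 2**20:,.0f} MiB")
```

Under these assumptions the full matrix works out to 128 GiB for one layer, a 512-row batch to 2 GiB, and a single tile to 2 MiB. Real figures depend on the model, the batch size, and the precision the backend uses for scores.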
Journey Context:
Standard attention materializes the full N×N attention matrix in memory, so memory use grows as O(n²) and at long context it both eats into unified memory capacity and saturates memory bandwidth on Apple Silicon. Flash Attention (Dao et al.) splits the computation into tiles and keeps a running softmax, so the full matrix is never materialized; the recomputation step described in the paper applies to the backward pass in training, not to inference. This reduces memory bandwidth pressure by 5-20× for long sequences. In llama.cpp, this is critical for 70B models at 32k+ context on 128GB Macs; without it, the system thrashes. Tradeoff: ~10% slower prompt processing on short sequences, where the tiling overhead is not offset by the memory savings.
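The tiling idea can be sketched for a single query vector in plain Python. This is a minimal illustration of the running-softmax technique, not llama.cpp's actual kernel; `naive_attention` and `tiled_attention` are hypothetical names, and the tile size of 4 is arbitrary.

```python
import math
import random


def naive_attention(q, keys, values):
    """Reference version: compute every score first, then softmax and sum."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]


def tiled_attention(q, keys, values, tile=4):
    """Same result, but only one tile of scores exists at any time."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = -math.inf   # largest score seen so far
    running_sum = 0.0         # softmax denominator, relative to running_max
    acc = [0.0] * dim         # unnormalized weighted sum of values

    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]

        # Rescale earlier partial results if this tile raises the maximum.
        new_max = max(running_max, max(scores))
        correction = math.exp(running_max - new_max)  # 0.0 on the first tile
        running_sum *= correction
        acc = [a * correction for a in acc]

        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            for j in range(dim):
                acc[j] += w * v[j]
        running_max = new_max

    return [a / running_sum for a in acc]


random.seed(0)
q = [random.uniform(-1, 1) for _ in range(8)]
keys = [[random.uniform(-1, 1) for _ in range(8)] for _ in range(10)]
values = [[random.uniform(-1, 1) for _ in range(8)] for _ in range(10)]

reference = naive_attention(q, keys, values)
tiled = tiled_attention(q, keys, values, tile=4)
max_diff = max(abs(a - b) for a, b in zip(reference, tiled))
print("largest difference between the two versions:", max_diff)
```

Both functions should agree up to floating-point rounding. The difference is that the tiled version never holds more than `tile` scores at once, which is the property that keeps memory flat as context length grows.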
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T04:51:23.936376+00:00— report_created — created