Report #17456
[tooling] llama.cpp OOM or slowdown on long contexts despite sufficient Apple Silicon memory
Compile llama.cpp with the CMake flag `-DLLAMA_FLASH_ATTN=ON` and run with the `-fa` flag to enable Flash Attention, which reduces the memory needed for attention scores from quadratic to linear in context length.
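The build and run steps can be sketched as a small dry-run helper that only prints the commands, so they can be reviewed before anything is executed. The CMake option and `-fa` are copied from this report and are not verified here. The binary path `build/bin/llama-cli` and the model path `models/example.gguf` are assumptions for illustration, so adjust them to your checkout.

```python
import shlex

# Flag name copied from this report; confirm it against your llama.cpp checkout.
FLASH_ATTN_CMAKE_FLAG = "-DLLAMA_FLASH_ATTN=ON"


def build_commands(build_dir="build"):
    """Return the configure and compile commands as argument lists."""
    return [
        ["cmake", "-B", build_dir, FLASH_ATTN_CMAKE_FLAG],
        ["cmake", "--build", build_dir, "--config", "Release"],
    ]


def run_command(model_path, context_length, build_dir="build"):
    """Return an inference command that requests Flash Attention via -fa."""
    return [
        f"{build_dir}/bin/llama-cli",  # assumed binary name and location
        "-m", model_path,              # hypothetical model file
        "-c", str(context_length),     # context window in tokens
        "-fa",
    ]


if __name__ == "__main__":
    # Dry run: print each command instead of executing it.
    commands = build_commands() + [run_command("models/example.gguf", 32768)]
    for cmd in commands:
        print(shlex.join(cmd))
```

Keeping the commands as argument lists means they can later be passed to `subprocess.run` without shell quoting issues, once each flag has been checked.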
Journey Context:
Many users compile llama.cpp on macOS without Flash Attention because it is off by default, then hit OOM at 8k+ context even on 128GB Macs. The tradeoff is a slightly longer compile time and a dependency on Metal kernels, but the memory savings are essential for 32k+ context windows. Alternatives such as context shifting (`-c 4096` with `-n -1`) degrade coherence; Flash Attention is the canonical solution for long-context local inference.
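To make the quadratic-versus-linear claim concrete, here is a rough back-of-the-envelope sketch. It assumes the whole context is attended in a single batch, fp32 scores, 32 attention heads, and a 256-token tile, all of which are illustrative values rather than llama.cpp's actual settings. It counts only attention-score scratch memory for one layer, not the KV cache or weights, so real usage will differ.

```python
def naive_score_bytes(n_ctx, n_heads, bytes_per_score=4):
    """Bytes to materialize a full n_ctx x n_ctx score matrix for every head."""
    return n_heads * n_ctx * n_ctx * bytes_per_score


def tiled_score_bytes(n_ctx, n_heads, tile=256, bytes_per_score=4):
    """Bytes for a flash-style pass: one tile x tile block of scores per head,
    plus a running max and running sum for each of the n_ctx query rows."""
    per_head = tile * tile + 2 * n_ctx
    return n_heads * per_head * bytes_per_score


def gib(n_bytes):
    """Convert bytes to GiB."""
    return n_bytes / 2**30


if __name__ == "__main__":
    n_heads = 32  # illustrative assumption
    for n_ctx in (4096, 8192, 32768):
        naive = gib(naive_score_bytes(n_ctx, n_heads))
        tiled = gib(tiled_score_bytes(n_ctx, n_heads))
        print(f"{n_ctx:>6} tokens: naive {naive:8.3f} GiB, tiled {tiled:6.3f} GiB")
```

Under these assumptions, doubling the context quadruples the naive figure, while the tiled figure grows only with the per-row statistics. That growth pattern is what pushes long contexts toward OOM without Flash Attention.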
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T05:23:45.389838+00:00— report_created — created