Report #25189
[tooling] OOM or extreme slowdown at >4k context on Apple Silicon despite having enough unified memory
Explicitly enable Flash Attention for Metal with \`--flash-attn\` \(or \`-fa\`\) flag; this is not default and is essential for long-context inference on macOS to reduce memory usage from O\(N^2\) to linear scaling and prevent GPU timeouts.
Journey Context:
Many assume Flash Attention is CUDA-only or default-on. It was ported to llama.cpp's Metal backend but must be explicitly enabled. Without it, the attention mechanism allocates full O\(N^2\) attention matrices in VRAM, causing immediate OOM at 8k\+ context on Macs. With \`-fa\`, memory usage stays flat and speed remains consistent. This flag is often missing in Mac-specific tutorials. Must be paired with full GPU offload \(\`-ngl 999\`\).
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T20:40:57.237884+00:00— report_created — created