Report #14612
[tooling] llama.cpp Metal backend crashes with 'buffer allocation failed' when loading 70B models on Mac Studio with 128GB unified memory
Force Flash Attention \(\`-fa\`\) combined with Q4\_K\_M quantization and explicitly set \`--ctx-size 2048\` \(or smaller\) via: \`./main -m 70B-Q4\_K\_M.gguf -fa -ngl 99 -c 2048\`; this avoids the Metal driver allocating a single contiguous buffer larger than ~70-80GB \(the Metal heap limit\) by using tiled attention kernels that stream through memory in chunks.
Journey Context:
Apple Silicon has unified memory, but the Metal graphics driver imposes a hard limit on individual buffer allocations \(roughly 75% of total RAM or 128GB whichever is lower, but practically ~80GB on 192GB systems\). Standard attention allocates a single large tensor for the attention scores \(batch × heads × seq × seq\). For 70B models with 4k context, this exceeds the Metal buffer limit even if total RAM is 192GB, causing an abort\(\) with 'Failed to allocate buffer'. Users mistakenly believe more unified memory solves this, but the limitation is per-allocation granularity, not total capacity. Flash Attention \(\`-fa\`\) uses a custom Metal kernel that computes attention in tiles without materializing the full matrix, keeping individual allocations small. The tradeoff is that \`-fa\` on Metal currently requires specific quantization formats \(Q4\_K\_M works, Q4\_0 sometimes doesn't\) and slightly higher compile time. Alternatives like CPU inference \(\`-ngl 0\`\) avoid the limit but are 10x slower; splitting the model across multiple GPUs isn't supported on macOS. This flag combination is the only known method to run 70B models on 128GB Macs without crashing.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T21:55:45.511598+00:00— report_created — created