Report #15155
[tooling] llama.cpp crashes with 'MPS out of memory' on Mac with 64GB+ unified memory despite sufficient RAM
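With the PyTorch MPS backend, set PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable the allocator's upper limit (PyTorch's own error message warns this may cause system failure). With llama.cpp's native Metal backend, pass -ngl 999 to request that all layers be offloaded to the GPU; this flag controls layer placement and does not by itself lift any memory limit.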
Journey Context:
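PyTorch's MPS (Metal Performance Shaders) allocator enforces a 'high watermark' on total allocations to prevent system slowdown. The limit is a ratio, PYTORCH_MPS_HIGH_WATERMARK_RATIO, applied to Metal's recommended maximum working set size, which macOS sets below total unified memory; PyTorch's documentation gives the default ratio as 1.7, not 67% of available memory. In the case reported here, allocations on a 64GB Mac failed at roughly 42GB, causing OOM when loading 70B models despite 20GB+ free. Setting PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 removes the PyTorch allocator's limit. For llama.cpp's native Metal backend, -ngl 999 requests that all layers be placed on the GPU. Critical distinction: llama.cpp native Metal != PyTorch MPS. The environment variable affects only PyTorch and -ngl affects only llama.cpp, but both backends run into unified memory allocation limits on macOS.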
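As a minimal sketch of the PyTorch-side workaround, the variable can be set for a single child process so the rest of the session keeps the default limit. The helper names `mps_env` and `run_with_mps_limit_disabled` and the script name `load_model.py` are hypothetical, and the sketch assumes the variable has to be in the environment before PyTorch initializes its MPS allocator.

```python
import os
import subprocess
import sys


def mps_env(ratio=0.0, base=None):
    """Copy an environment mapping and set the MPS high-watermark ratio in it."""
    env = dict(os.environ if base is None else base)
    # 0.0 disables the allocator's upper limit; PyTorch warns this may
    # cause system failure if the machine actually runs out of memory.
    env["PYTORCH_MPS_HIGH_WATERMARK_RATIO"] = str(ratio)
    return env


def run_with_mps_limit_disabled(script):
    """Run a (hypothetical) PyTorch script in a child process with the limit off."""
    return subprocess.run([sys.executable, script], env=mps_env(0.0), check=False)


if __name__ == "__main__":
    # Only prints the value that would be passed; no model is loaded here.
    print(mps_env(0.0, base={})["PYTORCH_MPS_HIGH_WATERMARK_RATIO"])
```

Calling `run_with_mps_limit_disabled("load_model.py")` would launch the script with the limit disabled. Scoping the change to one child process, rather than exporting it in a shell profile, keeps the default protection in place for everything else.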
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T23:19:33.945382+00:00— report_created — created