Report #9916
[tooling] Profiling Metal GPU performance on Apple Silicon Macs requires rebuilding llama.cpp for Xcode Instruments, which blocks optimization of 70B models on unified memory
Set the environment variable GGML_METAL_ASYNC_CAPTURE=1 before running llama-server to enable automatic Metal GPU capture for Xcode Instruments analysis without recompilation.
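A minimal sketch of the launch step, assuming llama-server is on the PATH and takes the model path via -m. The variable name is copied from this report and has not been independently verified; the model path is a placeholder, and the helper names are made up for illustration.

```python
import os
import shlex

# Variable name as given in this report; unverified against llama.cpp sources.
CAPTURE_VAR = "GGML_METAL_ASYNC_CAPTURE"


def capture_env(base_env=None):
    """Return a copy of the environment with the capture variable set to "1"."""
    env = dict(os.environ if base_env is None else base_env)
    env[CAPTURE_VAR] = "1"
    return env


def server_command(model_path, binary="llama-server"):
    """Build the argv list for llama-server (model_path is a placeholder)."""
    return [binary, "-m", model_path]


def shell_line(model_path):
    """Render the equivalent one-line shell invocation for copy and paste."""
    return f"{CAPTURE_VAR}=1 {shlex.join(server_command(model_path))}"


# Hypothetical model path; substitute the GGUF file you actually use.
print(shell_line("models/llama-70b-q4.gguf"))
```

To launch from Python instead of a shell, the same pieces can be passed to subprocess.run(server_command(path), env=capture_env()).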
Journey Context:
Optimizing a 70B model on a Mac requires knowing whether the bottleneck is memory bandwidth or compute. Standard Metal profiling requires recompiling with -DGGML_METAL_XCODE_PROFILE and capturing manually. The GGML_METAL_ASYNC_CAPTURE environment variable instead triggers the Metal capture manager at runtime and automatically generates a .gputrace file for Xcode Instruments. The trace shows whether the workload is memory-bound (common with 70B models on unified memory) or compute-bound. Tradeoff: slight overhead during capture. Many users do not know this option exists and guess at optimizations instead. The alternative is a command-line profiler launched through xcrun, but these lack visibility into Metal GPU kernels.
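Before opening a trace, a back-of-envelope bound can suggest what to expect. The sketch below assumes token generation streams every weight from memory once per token and ignores KV-cache and activation traffic. The parameter count, bits per weight, bandwidth, and measured speed are illustrative assumptions, not measurements from this report.

```python
def model_bytes(n_params, bits_per_weight):
    """Approximate size of the weights in bytes."""
    return n_params * bits_per_weight / 8


def max_decode_tokens_per_s(n_params, bits_per_weight, bandwidth_gb_s):
    """Upper bound on decode speed if each token reads all weights once."""
    return bandwidth_gb_s * 1e9 / model_bytes(n_params, bits_per_weight)


def bandwidth_utilization(measured_tok_s, n_params, bits_per_weight, bandwidth_gb_s):
    """Fraction of the bandwidth-bound ceiling that a measured speed reaches."""
    ceiling = max_decode_tokens_per_s(n_params, bits_per_weight, bandwidth_gb_s)
    return measured_tok_s / ceiling


# Illustrative assumptions: 70e9 parameters, ~4.5 bits per weight after
# quantization, 400 GB/s unified memory bandwidth, 8 tokens/s observed.
ceiling = max_decode_tokens_per_s(70e9, 4.5, 400)
used = bandwidth_utilization(8.0, 70e9, 4.5, 400)
print(f"bandwidth-bound ceiling: {ceiling:.2f} tok/s, utilization: {used:.0%}")
```

A measured speed close to the ceiling points toward a memory-bound workload; a speed far below it suggests looking at kernel time in the trace instead.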
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T09:21:37.790904+00:00— report_created — created