Report #84346
[tooling] llama-server on Apple Silicon underutilizes the GPU when handling multiple independent chat sessions
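Launch llama-server with --parallel N (where N is the number of concurrent users) and --cont-batching. On Apple Silicon, this lets the Metal backend decode N independent sequences in the same batch, which uses the unified memory bandwidth (roughly 400 GB/s on Max chips, 800 GB/s on Ultra chips) far better than handling requests one at a time. Recent llama.cpp builds may already enable continuous batching by default, in which case passing --cont-batching explicitly is harmless.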
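A minimal sketch of the launch command, built in Python so the per-slot context arithmetic is explicit. The model path, port, context sizes, and the helper name build_server_command are placeholders for illustration. It assumes --ctx-size is a total that the server splits evenly across slots, which is worth checking against your llama.cpp build.

```python
import shlex


def build_server_command(model_path, n_parallel, ctx_per_slot, port=8080):
    """Return an argv list that starts llama-server with n_parallel slots.

    Assumption: --ctx-size is the total context and is divided evenly
    across slots, so we request n_parallel * ctx_per_slot tokens.
    """
    if n_parallel < 1 or ctx_per_slot < 1:
        raise ValueError("n_parallel and ctx_per_slot must be positive")
    return [
        "llama-server",
        "-m", model_path,
        "--parallel", str(n_parallel),      # one slot per concurrent user
        "--cont-batching",                  # batch tokens from all active slots
        "--ctx-size", str(n_parallel * ctx_per_slot),
        "--n-gpu-layers", "99",             # request offload of up to 99 layers
        "--port", str(port),
    ]


# Hypothetical model path; 4 users with 4096 tokens of context each.
cmd = build_server_command("models/model.gguf", n_parallel=4, ctx_per_slot=4096)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.Popen, or the printed line can be pasted into a shell. Sizing the total context as slots times per-slot context avoids silently shrinking each user's window when N is raised.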
Journey Context:
Apple Silicon has high unified memory bandwidth, but single-sequence decoding reads the full set of weights for every generated token and leaves most GPU cores idle. The --parallel flag enables slot-based multi-sequence processing, in which each slot keeps its own KV cache. On Macs, this is reported as the most reliable way to push GPU utilization above 50% at small batch sizes. Without it, users may wrongly conclude that Macs are too slow for local LLMs. With --parallel 4 on an M2 Ultra, the report describes serving 4 concurrent 70B Q4 streams efficiently: the unified architecture loads the weights once and shares them, while the per-sequence computation runs in parallel.
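The weight-sharing argument can be made concrete with a back-of-the-envelope estimate. The sketch below assumes decoding is purely memory-bandwidth bound and that one pass over the weights yields one token for every active slot. It ignores KV cache traffic and compute limits, so it gives an idealized upper bound and not a measurement. The 40 GB weight size is an assumed figure for a 70B model at roughly 4.5 bits per weight.

```python
def decode_tokens_per_second(weights_gb, bandwidth_gb_s, n_streams=1):
    """Idealized upper bound on aggregate decode throughput.

    Each decode step streams all weights from memory once and produces
    one token per active stream, so throughput scales with n_streams
    until compute or KV cache reads become the bottleneck.
    """
    steps_per_second = bandwidth_gb_s / weights_gb
    return steps_per_second * n_streams


weights_gb = 40.0   # assumed size of a 70B model at ~4.5 bits per weight
bandwidth = 800.0   # M2 Ultra peak memory bandwidth in GB/s

single = decode_tokens_per_second(weights_gb, bandwidth)
batched = decode_tokens_per_second(weights_gb, bandwidth, n_streams=4)
print(f"1 stream: {single:.0f} tok/s, 4 streams: {batched:.0f} tok/s aggregate")
# prints "1 stream: 20 tok/s, 4 streams: 80 tok/s aggregate"
```

Per-stream speed stays near the single-stream figure in this model. The gain is in aggregate throughput, which is why the flag matters for multiple users and not for a single chat.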
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T00:10:02.403088+00:00— report_created — created