Report #9528
[tooling] Running multiple llama.cpp instances for concurrent users causes OOM and context corruption
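Run a single llama.cpp server process (the llama-server binary in recent builds) with -np N / --parallel N, where N is the maximum number of concurrent users, and with -cb / --cont-batching. Note that --slots does not set the slot count; in current builds it only toggles the slot-monitoring endpoint, so check llama-server --help for your version. One model instance then shares its weights across all parallel requests, while each request gets its own KV-cache slot. Because the weights are loaded once instead of N times, the memory saving grows with N (roughly 70% when one server replaces three or four instances; the exact figure depends on model and context size), and batching raises throughput. In builds that split the context evenly across slots, the value passed to -c is the total, so size it as per-user context x N.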
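A minimal sketch of how the launch command for one shared server could be assembled. It assumes the binary is named llama-server and that -m, -c, -np, -cb, --host and --port behave as in recent llama.cpp builds; the model path, port, and per-user context size are hypothetical placeholders, and the even split of -c across slots is an assumption to verify against your build.

```python
import shlex


def build_server_cmd(model_path, n_slots, ctx_per_slot,
                     host="127.0.0.1", port=8080):
    """Build an argv list for ONE llama-server process serving n_slots users.

    Assumption: the total context passed to -c is divided evenly among
    the parallel slots, so we multiply the per-user context by n_slots.
    """
    if n_slots < 1 or ctx_per_slot < 1:
        raise ValueError("n_slots and ctx_per_slot must be positive")
    total_ctx = n_slots * ctx_per_slot
    return [
        "llama-server",            # assumed binary name (older builds: "server")
        "-m", model_path,          # model file, loaded once for all users
        "-c", str(total_ctx),      # total context, shared out across slots
        "-np", str(n_slots),       # number of parallel slots (one per user)
        "-cb",                     # continuous batching (default in recent builds)
        "--host", host,
        "--port", str(port),
    ]


# Hypothetical example: 4 concurrent users with 4096 tokens of context each.
cmd = build_server_cmd("model.gguf", n_slots=4, ctx_per_slot=4096)
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list passed to subprocess.Popen. Building the total context from a per-user figure keeps each user's window constant when N changes.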
Journey Context:
When serving multiple users, developers often spawn a separate llama.cpp process per client, which duplicates the weights in memory (e.g., 40 GB x N) and quickly exhausts RAM. The llama.cpp server instead supports slots: dedicated KV-cache regions, one per client, within a single process. With continuous batching (now the default in recent server builds), the engine decodes multiple sequences in parallel within a single forward pass, which improves hardware utilization. The main alternative, vLLM, offers PagedAttention but is Python-based and primarily targets CUDA GPUs; llama.cpp server is a native C++ option that handles concurrent requests on CPU/APU deployments.
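The memory argument can be checked with simple arithmetic. The sketch below is illustrative only: the 40 GB weight size comes from the example above, while the 2 GB of KV cache per user is a hypothetical figure, since the real value depends on model architecture, context length, and cache quantization.

```python
def separate_processes_gb(weights_gb, kv_per_user_gb, n_users):
    """Memory when every user gets a private process: weights are duplicated."""
    return n_users * (weights_gb + kv_per_user_gb)


def shared_server_gb(weights_gb, kv_per_user_gb, n_users):
    """Memory for one server with n_users slots: weights are loaded once."""
    return weights_gb + n_users * kv_per_user_gb


def saving_fraction(weights_gb, kv_per_user_gb, n_users):
    """Fraction of memory saved by switching to the shared server."""
    separate = separate_processes_gb(weights_gb, kv_per_user_gb, n_users)
    shared = shared_server_gb(weights_gb, kv_per_user_gb, n_users)
    return 1.0 - shared / separate


# 40 GB of weights (from the report), hypothetical 2 GB KV cache per user.
for n in (2, 4, 8):
    print(n,
          separate_processes_gb(40, 2, n),
          shared_server_gb(40, 2, n),
          round(saving_fraction(40, 2, n), 3))
```

Under these assumed numbers, four users need 168 GB as separate processes and 48 GB as one shared server, a saving of about 71%. The saving approaches the weights' share of per-process memory as N grows.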
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T08:22:32.802804+00:00— report_created — created