Report #12959
[tooling] High latency and RAM duplication when serving multiple concurrent requests with llama.cpp
Use a single `llama-server` instance with `-np 4` (parallel sequences) and ensure `--cont-batching` is enabled (default in recent builds) instead of running multiple server processes. This shares model weights and enables continuous batching across sequences, reducing RAM usage by ~70% for 4 concurrent requests compared to 4 separate processes and improving throughput by 2-3x via batching.
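As a minimal sketch of the single-instance setup described above, the following Python snippet assembles the command line for one shared server. The model path `models/model.gguf`, the context size of 16384, and the helper name `build_server_argv` are illustrative assumptions, not values from this report; `-m`, `-c`, `-np`, `--cont-batching`, and `--port` are existing `llama-server` options, but check `llama-server --help` for your build.

```python
import shlex
from typing import List


def build_server_argv(model_path: str, total_ctx: int, n_parallel: int,
                      port: int = 8080) -> List[str]:
    """Build the argv for ONE llama-server process serving several slots.

    Hypothetical helper for illustration; it only assembles flags and
    does not start anything.
    """
    return [
        "llama-server",
        "-m", model_path,          # model file (hypothetical path below)
        "-c", str(total_ctx),      # total context, shared by all slots
        "-np", str(n_parallel),    # number of parallel sequences (slots)
        "--cont-batching",         # explicit, although on by default in recent builds
        "--port", str(port),
    ]


# Assumed example values: 4 slots sharing a 16384-token context.
argv = build_server_argv("models/model.gguf", total_ctx=16384, n_parallel=4)
print(shlex.join(argv))
```

To actually start the server, the list could be passed to `subprocess.Popen(argv)`; clients then send their concurrent requests to the one port instead of to four separate processes.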
Journey Context:
Running multiple `llama-server` instances duplicates the entire model weights in RAM (e.g., 4x 40GB for a 70B Q4). Users often do this because they don't know the `-np` (parallel sequences) flag exists or confuse it with simple multi-threading. Continuous batching lets the server process tokens from different sequences in the same forward pass when they fit in the batch window, which dramatically improves throughput. Critical caveat: parallel sequences share the total context window set by `-c`, so each sequence gets only a fraction of it (roughly `-c` divided by `-np`). For long-context workloads (e.g., 8k+ per user), you may need to reduce `-np` or raise `-c` accordingly, at the cost of a larger KV cache. Note that `--split-mode row` controls how a model is split across multiple GPUs; it does not share weights between separate server processes.
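The arithmetic behind both the RAM figure and the context caveat can be sketched as follows. This assumes the total `-c` is divided evenly among the `-np` slots and that separate processes do not share weight pages; both are assumptions to verify against your build and OS. The function names are hypothetical, and the estimate covers weights only (KV cache and runtime overhead are ignored, which is why the result lands slightly above the ~70% quoted in the report).

```python
def per_slot_context(total_ctx: int, n_parallel: int) -> int:
    """Tokens available to each sequence, assuming an even split of -c."""
    return total_ctx // n_parallel


def weight_ram_gb(weights_gb: float, n_processes: int) -> float:
    """RAM used by model weights when each process loads its own copy."""
    return weights_gb * n_processes


# Example from the report: a 70B Q4 model of about 40 GB, 4 concurrent users.
separate = weight_ram_gb(40.0, 4)   # four independent server processes
shared = weight_ram_gb(40.0, 1)     # one server with -np 4
saving = 1 - shared / separate      # fraction of weight RAM saved

print(f"separate: {separate:.0f} GB, shared: {shared:.0f} GB, saving: {saving:.0%}")
print(f"context per slot with -c 16384 -np 4: {per_slot_context(16384, 4)} tokens")
```

Under these assumptions, giving each of 4 users an 8k context would require `-c 32768`, so the KV cache grows even though the weights are loaded only once.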
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T17:22:05.938851+00:00— report_created — created