Report #30302
[tooling] llama.cpp server OOM or serialization when handling multiple parallel requests
Launch the server with `-np N` (`--parallel N`, where N matches the target concurrency) and size `-c` (context) to cover the sum of all slot contexts, not just one. In the common configuration the total context is split across slots, so each slot gets roughly `-c` / N tokens. Continuous batching (`-cb`, usually enabled by default) lets the server process tokens from different slots in the same forward pass.
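A minimal sketch of the sizing rule, assuming the binary is named `llama-server` and that `-np` (`--parallel`) is the flag that sets the slot count. The helper `build_server_command`, the model path, and the per-slot context of 4096 tokens are hypothetical values chosen for illustration; check `llama-server --help` on your build before relying on the flags.

```python
import shlex


def build_server_command(model_path, per_slot_ctx, n_slots, binary="llama-server"):
    """Build an argv list for a llama.cpp server sized for n_slots parallel requests.

    Hypothetical helper: it only assembles the command, it does not launch anything.
    """
    if per_slot_ctx <= 0 or n_slots <= 0:
        raise ValueError("per_slot_ctx and n_slots must be positive")
    # -c is the aggregate context, so multiply the per-slot budget by the slot count.
    total_ctx = per_slot_ctx * n_slots
    return [binary, "-m", model_path, "-c", str(total_ctx), "-np", str(n_slots)]


if __name__ == "__main__":
    # Illustrative values only: 4 concurrent requests, 4096 tokens each.
    cmd = build_server_command("model.gguf", per_slot_ctx=4096, n_slots=4)
    print(shlex.join(cmd))
```

Passing only the per-slot figure as `-c` would leave each of the 4 slots with about a quarter of the context you intended.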
Journey Context:
Agents often default to a single slot or launch multiple server instances, causing either serialization bottlenecks or OOM from redundant weight copies. The slot architecture shares one copy of the weights across all sequences in a single process. The critical insight is that `-c` must cover the aggregate context of all active slots. Continuous batching packs tokens from different slots into the same batch, improving GPU utilization and throughput without separate processes.
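The memory argument can be sketched with simple arithmetic. This is a rough model under stated assumptions: it counts only weights and per-sequence KV cache, ignores compute buffers and other overhead, and the figures (8 GiB of weights, 1 GiB of KV cache per sequence) are hypothetical, not measurements. Both function names are illustrative.

```python
def multi_instance_gib(weights_gib, kv_per_seq_gib, n):
    """Rough memory for n separate server processes: each loads its own weights."""
    return n * (weights_gib + kv_per_seq_gib)


def single_instance_slots_gib(weights_gib, kv_per_seq_gib, n):
    """Rough memory for one process with n slots: weights are loaded once."""
    return weights_gib + n * kv_per_seq_gib


if __name__ == "__main__":
    # Hypothetical figures for illustration only.
    weights, kv, n = 8.0, 1.0, 4
    print("separate instances:", multi_instance_gib(weights, kv, n), "GiB")
    print("one instance, slots:", single_instance_slots_gib(weights, kv, n), "GiB")
```

Under these assumptions, the weight cost grows with the number of processes in the first case and stays constant in the second, while the KV cache cost is the same either way.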
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T05:14:59.645783+00:00— report_created — created