Report #1152
[tooling] llama-server only handles one request at a time even with --parallel N
Pass both `-np N` (or `--parallel N`) and `--cont-batching` (verify it is not disabled with `-nocb`). Tune `--batch-size` and `--ubatch-size` to the expected combined prompt volume. Without continuous batching the slots are scheduled sequentially; with it the server interleaves decode steps from multiple sequences in one batch.
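As a rough illustration of the flag combination above, the following sketch assembles a `llama-server` command line. The helper name `build_llama_server_cmd`, the model path `model.gguf`, and the numeric values are hypothetical placeholders, and the check that the micro-batch does not exceed the batch is a sanity check chosen for this sketch, not something the report prescribes.

```python
import shlex


def build_llama_server_cmd(model_path, n_parallel, batch_size, ubatch_size):
    """Return an argv list that enables parallel slots and continuous batching.

    Hypothetical helper for illustration; the values passed in are up to you.
    """
    # Sanity check chosen for this sketch: keep the micro-batch within the batch.
    if ubatch_size > batch_size:
        raise ValueError("ubatch_size should not exceed batch_size")
    return [
        "llama-server",
        "-m", model_path,
        "--parallel", str(n_parallel),     # number of slots
        "--cont-batching",                 # continuous batching across slots
        "--batch-size", str(batch_size),
        "--ubatch-size", str(ubatch_size),
    ]


# Example values only; tune them to your hardware and prompt volume.
cmd = build_llama_server_cmd("model.gguf", 8, 2048, 512)
print(shlex.join(cmd))
```

Printing the joined command before launching it makes it easy to confirm that `-nocb` has not crept in through a wrapper script or an environment default.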
Journey Context:
Many operators set `--parallel 8` and expect 8 concurrent streams, but `--parallel` only reserves KV-cache slots. Actual concurrency requires continuous batching (`-cb`), which packs tokens from different sequence IDs into a single `llama_decode` call and relies on the KQ mask so that each sequence attends only to its own tokens. The tradeoff is higher memory pressure from multiple KV caches and higher latency per batch. If throughput matters more than latency, raise `--batch-size`, but keep `--ubatch-size` within hardware limits.
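To make the difference between sequential slots and interleaved decoding concrete, here is a toy model that counts decode calls under each policy. It is not llama.cpp's scheduler: the function names are hypothetical, each request is reduced to a number of tokens left to generate, and prompt processing and batch-size limits are ignored.

```python
def sequential_steps(remaining):
    """Decode calls needed when each sequence runs to completion on its own."""
    return sum(remaining)


def continuous_batching_steps(remaining, n_slots):
    """Decode calls needed when up to n_slots sequences share each call.

    A slot that finishes is refilled from the queue before the next call.
    Assumes every entry in `remaining` is a positive token count.
    """
    queue = list(remaining)
    active = []
    steps = 0
    while queue or active:
        # Fill any free slots with waiting requests.
        while queue and len(active) < n_slots:
            active.append(queue.pop(0))
        # One batched decode call advances every active sequence by one token.
        active = [r - 1 for r in active]
        # Drop sequences that have produced all their tokens.
        active = [r for r in active if r > 0]
        steps += 1
    return steps


requests = [4, 2, 3]  # tokens left to generate per request (example values)
print(sequential_steps(requests))              # 9
print(continuous_batching_steps(requests, 2))  # 5
```

In this toy model the batched policy needs fewer decode calls, but each call carries more tokens, which is the throughput-versus-latency tradeoff described above.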
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T18:54:09.298039+00:00— report_created — created