Report #39161
[tooling] llama.cpp server truncates long prompts when using parallel requests (-np > 1)
Calculate the per-slot context as effective_context = -c / -np. With -c 4096 and -np 4, each slot gets only 1024 tokens. Set -c to at least -np * (max_expected_prompt_length + max_expected_generation_length), since every slot must hold both its prompt and its generated tokens. For 4 parallel 4k prompts with 1k generation each, that is 4 * (4096 + 1024) = 20480, so use -c 20480 or higher, and make sure your system has enough RAM/VRAM for the enlarged KV cache.
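The sizing rule can be sketched as a small helper. This is a minimal sketch, assuming the server splits -c evenly across slots with integer division and that each slot must hold its prompt plus its generated tokens (which is what the 20480 figure implies). The function names are hypothetical and not part of llama.cpp.

```python
def per_slot_context(total_ctx: int, n_parallel: int) -> int:
    """Tokens available to each slot when -c is split evenly across -np slots."""
    if n_parallel < 1:
        raise ValueError("n_parallel must be at least 1")
    return total_ctx // n_parallel


def required_total_context(n_parallel: int, max_prompt: int, max_generation: int) -> int:
    """Smallest -c that leaves every slot room for its prompt plus its generation."""
    if n_parallel < 1:
        raise ValueError("n_parallel must be at least 1")
    return n_parallel * (max_prompt + max_generation)


# -c 4096 with -np 4 leaves 1024 tokens per slot.
print(per_slot_context(4096, 4))

# 4 parallel 4k prompts with 1k generation each need -c 20480.
print(required_total_context(4, 4096, 1024))
```

The result is a lower bound on -c only; whether the enlarged KV cache fits in RAM/VRAM still has to be checked separately.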
Journey Context:
Users enable -np 4 for throughput without realizing that the KV cache is statically partitioned into equal shares, one per slot. The server does not reallocate dynamically: if slot 1 uses 100 tokens and slot 2 needs 4000, slot 2 is still limited to total/np. Prompts that exceed the per-slot limit are silently truncated. A common mistake is assuming that -c 8192 with -np 4 gives every slot 8k; each slot actually gets 2048. Alternative: use -cb (continuous batching), which handles this better, but -cb disables speculative decoding. So for speculative decoding plus parallel slots, you must size -c manually using the formula above.
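To illustrate the static split described above, the following sketch flags prompts that would not fit in their slot. It assumes an even integer split of -c across -np slots, as this report describes. `find_oversized_prompts` is a hypothetical helper, and the token counts are supplied by the caller rather than computed with a real tokenizer.

```python
def find_oversized_prompts(total_ctx, n_parallel, prompt_lengths):
    """Return (index, length) for each prompt longer than the per-slot share."""
    if n_parallel < 1:
        raise ValueError("n_parallel must be at least 1")
    slot_ctx = total_ctx // n_parallel  # fixed share, regardless of other slots' usage
    return [(i, n) for i, n in enumerate(prompt_lengths) if n > slot_ctx]


# -c 8192 with -np 4: each slot holds 2048 tokens, not 8192.
oversized = find_oversized_prompts(8192, 4, [100, 4000, 2048, 2049])
print(oversized)
```

Running a check like this against expected prompt lengths before starting the server can surface truncation that would otherwise happen silently.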
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T20:12:23.637096+00:00— report_created — created