Report #56933
[tooling] llama.cpp server slow for concurrent requests or batch processing on CUDA
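Compile with LLAMA_CUDA_FLASH_ATTN=ON and LLAMA_CUDA_GRAPHS=ON, then run llama-server with -np (parallel slots) greater than 1. Flash Attention avoids materializing the full N×N attention matrix; CUDA graphs eliminate per-kernel CPU launch overhead for fixed-shape batches. Option names have changed across llama.cpp versions (for example, the LLAMA_CUDA prefix became GGML_CUDA), so confirm them against the build docs for your checkout.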
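To check whether the parallel slots actually help, it is useful to keep several requests in flight at once. The following is a minimal sketch of such a client, not part of llama.cpp. It assumes llama-server is listening on 127.0.0.1:8080 and that the /completion endpoint accepts `prompt` and `n_predict` and returns a `content` field. The helper names `post_completion` and `run_concurrently` are hypothetical.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumption: llama-server runs locally on port 8080; adjust to your setup.
SERVER_URL = "http://127.0.0.1:8080/completion"


def post_completion(prompt, n_predict=64, url=SERVER_URL):
    """Send one blocking request to the server and return the generated text."""
    body = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.loads(resp.read())["content"]


def run_concurrently(prompts, n_parallel, send=post_completion):
    """Keep at most n_parallel requests in flight; results keep input order."""
    with ThreadPoolExecutor(max_workers=n_parallel) as pool:
        return list(pool.map(send, prompts))
```

Calling `run_concurrently(prompts, n_parallel=4)` against a server started with -np 4 keeps every slot busy. Timing the same call with `n_parallel=1` gives a rough baseline for comparison. The `send` parameter exists only so the fan-out logic can be exercised without a running server.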
Journey Context:
Without Flash Attention, VRAM bandwidth becomes the bottleneck at context lengths above 4k, even on an A100. Without CUDA graphs, kernel launch latency (10-50 μs per op) dominates at small batch sizes. Many users enable -ngl (GPU layer offload) but miss these compile flags, leaving 2-4x throughput on the table. Alternatives: vLLM (requires more VRAM) or TensorRT-LLM (closed ecosystem).
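A back-of-envelope calculation shows why both effects matter. This sketch estimates the size of a fully materialized attention score matrix and the CPU-side launch cost per token. The head count (32), fp16 scores, and the figure of 1000 kernel launches per token are illustrative assumptions, not measurements from llama.cpp. Only the 10-50 μs per-op latency range comes from the report.

```python
def attention_matrix_bytes(n_ctx, n_heads, bytes_per_elem=2):
    """Bytes needed to hold the full N x N score matrix for every head of one layer."""
    return n_ctx * n_ctx * n_heads * bytes_per_elem


def launch_overhead_ms(n_kernels, latency_us):
    """Total CPU launch latency in milliseconds for n_kernels separate launches."""
    return n_kernels * latency_us / 1000.0


# Assumed: 4096-token context, 32 heads, fp16 (2 bytes per score).
gib = attention_matrix_bytes(4096, 32) / 2**30
print(f"score matrix per layer: {gib:.1f} GiB")  # 1.0 GiB

# Assumed: 1000 kernel launches per generated token (hypothetical figure).
for latency_us in (10, 50):
    ms = launch_overhead_ms(1000, latency_us)
    print(f"{latency_us} us/op -> {ms:.0f} ms of launch overhead per token")
```

Under these assumptions, a naive implementation writes and re-reads about 1 GiB of scores per layer at a 4k context, and launch latency alone costs 10 to 50 ms per token. Those are the two costs that Flash Attention and CUDA graphs are meant to remove.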
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T02:03:00.078824+00:00— report_created — created