Report #69326
[tooling] ExLlamaV2 slower than llama.cpp for single prompt but unclear batching threshold
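Use ExLlamaV2 with paged attention (`-pa` flag) enabled for batch sizes of 4 or more; use llama.cpp for single-stream work or batch sizes below 4. ExLlamaV2's paged KV cache management has a higher fixed overhead but scales linearly with batch size, so its throughput overtakes llama.cpp's somewhere between batch 4 and 8. Treat 4 as the lower bound of that crossover range, not an exact threshold.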
Journey Context:
Benchmarks often compare single-prompt latency, where llama.cpp wins thanks to efficient C++ scheduling and lower runtime overhead than a Python/Torch stack. For offline batch inference (processing thousands of prompts), however, ExLlamaV2's paged attention (`-pa`) manages KV cache memory dynamically, avoiding the memory bloat that plagues static allocation, and the Python/Torch overhead is amortized across the batch. Below batch size 4, the fixed cost of ExLlamaV2's runtime dominates. Past the crossover at batch 4-8, the linear scaling of paged attention yields 20-40% higher throughput than llama.cpp's batch processing, which suffers from KV cache fragmentation at larger batch sizes.
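Because the crossover is reported as a range (batch 4-8) and will likely shift with model, quantization, and hardware, it may be worth measuring it on your own setup. The following is a minimal sketch of such a measurement, not either project's API: `measure_throughput` and `find_crossover` are hypothetical helper names, and the `generate` callable is an assumed wrapper you would write around each backend that takes a list of prompts and returns the number of tokens it generated.

```python
import time
from typing import Callable, Dict, List, Optional, Sequence


def measure_throughput(
    generate: Callable[[List[str]], int],
    prompts: Sequence[str],
    batch_size: int,
) -> float:
    """Return generated tokens per second when prompts are fed in chunks of batch_size.

    `generate` is a hypothetical backend wrapper: it receives one batch of
    prompts and returns the total number of tokens it produced for that batch.
    """
    total_tokens = 0
    start = time.perf_counter()
    for i in range(0, len(prompts), batch_size):
        total_tokens += generate(list(prompts[i:i + batch_size]))
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed if elapsed > 0 else float("inf")


def find_crossover(
    candidate: Dict[int, float],
    baseline: Dict[int, float],
) -> Optional[int]:
    """Return the smallest measured batch size from which `candidate` beats
    `baseline` at that size and at every larger measured size.

    Both arguments map batch size -> tokens per second. Returns None if the
    candidate does not win at the largest batch size measured by both.
    """
    sizes = sorted(set(candidate) & set(baseline))
    crossover = None
    # Walk down from the largest batch size until the candidate stops winning.
    for size in reversed(sizes):
        if candidate[size] > baseline[size]:
            crossover = size
        else:
            break
    return crossover
```

In use, you would call `measure_throughput` once per backend for each batch size of interest (for example 1, 2, 4, 8, 16), collect the results into two dictionaries, and pass them to `find_crossover` with ExLlamaV2 as the candidate and llama.cpp as the baseline. Running a warm-up batch before timing is advisable so that model loading and cache allocation are not counted.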
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T22:50:55.339513+00:00— report_created — created