Report #82157
[tooling] llama.cpp OOM or slow performance with large batch sizes and long context despite -fa flag
When using Flash Attention in llama.cpp, pad the input sequences in a batch to equal length (contiguous batching), as the current implementation requires tensor dimensions to match across the batch. To avoid padding instead, lower the batch size (for example -b 512) or adjust --ubatch-size so that inputs are processed in micro-batches.
Journey Context:
llama.cpp's Flash Attention (-fa) implementation improves speed and reduces VRAM use for long contexts, but it has a constraint: it processes each batch as a tensor in which every sequence must have the same length (no ragged tensors). When a batch mixes inputs of different lengths, llama.cpp either crashes or falls back to slower code paths. There are two fixes: pad every sequence to the longest length in the batch, which wastes some compute on padding tokens, or reduce the batch size (-b) and use micro-batching (--ubatch-size) to stay within the Flash Attention constraints. Many users enable -fa without realizing that their variable-length inputs are negating its benefits.
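To make the padding trade-off concrete, here is a minimal Python sketch of the application side only. It pads a batch of token-ID lists to the longest sequence, reports how much of the padded batch is wasted on padding, and groups similar-length sequences into smaller batches so each one needs less padding. It does not reflect llama.cpp internals: the function names and the pad_id value are hypothetical, and length-based grouping is only loosely analogous to what -b and --ubatch-size control.

```python
from typing import List, Tuple


def pad_batch(seqs: List[List[int]], pad_id: int = 0) -> Tuple[List[List[int]], List[int]]:
    """Right-pad every sequence to the longest length in the batch.

    pad_id is an assumed placeholder; a real model defines its own padding token.
    Returns the padded batch and the original lengths.
    """
    max_len = max((len(s) for s in seqs), default=0)
    padded = [s + [pad_id] * (max_len - len(s)) for s in seqs]
    lengths = [len(s) for s in seqs]
    return padded, lengths


def padding_waste(lengths: List[int]) -> float:
    """Fraction of slots in the padded batch that hold padding, not real tokens."""
    if not lengths or max(lengths) == 0:
        return 0.0
    total_slots = max(lengths) * len(lengths)
    return 1.0 - sum(lengths) / total_slots


def micro_batches(seqs: List[List[int]], size: int) -> List[List[List[int]]]:
    """Sort by length, then cut into groups of `size` so each group pads less."""
    ordered = sorted(seqs, key=len)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


if __name__ == "__main__":
    batch = [[1, 2, 3], [4], [5, 6], [7, 8, 9, 10]]
    _, lengths = pad_batch(batch)
    print("waste when padded as one batch:", padding_waste(lengths))
    for group in micro_batches(batch, 2):
        _, group_lengths = pad_batch(group)
        print("group waste:", padding_waste(group_lengths))
```

Comparing the single-batch waste against the per-group waste gives a quick sense of whether padding or smaller batches is the cheaper option for a given workload.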
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T20:29:28.050578+00:00— report_created — created