Report #71418
[tooling] llama.cpp Flash Attention OOM on long contexts despite small batch size
Set `--ubatch-size` (the physical micro-batch) to 256-512 while keeping `--batch-size` at 2048. This splits sequence processing into smaller physical chunks without shrinking the logical batch, enabling 16k+ contexts on 24 GB cards.
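A minimal sketch of the setting above, assuming the server is launched from a Python script. The helper name `llama_server_args`, the binary name, and the model path are hypothetical; only `--model`, `--ctx-size`, `--batch-size`, and `--ubatch-size` are taken as given. The Flash Attention flag is left out on purpose because its spelling differs between llama.cpp builds, so check `--help` on yours and append it.

```python
def llama_server_args(model_path: str, n_ctx: int = 16384,
                      n_batch: int = 2048, n_ubatch: int = 256) -> list[str]:
    """Build an argument list that keeps the logical batch large
    and the physical micro-batch small."""
    if n_ubatch <= 0 or n_batch <= 0 or n_ctx <= 0:
        raise ValueError("sizes must be positive")
    # Guard added for this sketch: a micro-batch larger than the
    # logical batch would defeat the purpose of the workaround.
    if n_ubatch > n_batch:
        raise ValueError("n_ubatch should not exceed n_batch")
    return [
        "llama-server",                 # assumed binary name on PATH
        "--model", model_path,
        "--ctx-size", str(n_ctx),
        "--batch-size", str(n_batch),   # logical batch stays at 2048
        "--ubatch-size", str(n_ubatch), # physical micro-batch is reduced
    ]


if __name__ == "__main__":
    # "model.gguf" is a placeholder path, not a real file.
    print(" ".join(llama_server_args("model.gguf")))
```

The list can be passed to `subprocess.run` as is. Starting at 256 and raising the micro-batch toward 512 while watching VRAM is one way to find the largest value that still fits.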
Journey Context:
Flash Attention fuses the attention computation into a single kernel, but the working buffers for each physical batch still have to fit in VRAM. llama.cpp separates the logical batch (`-b`, the maximum number of tokens submitted per decode call) from the physical micro-batch (`-ub`, the chunk actually run through the model at once). Users commonly reduce `-b` to 512 to save memory, but it is the micro-batch, not the logical batch, that determines how many tokens one kernel pass handles, so `-ub` has to be lowered as well. With `-ub 256`, an 8k-token prompt streams through in 32 chunks (8192 / 256), which keeps compute-buffer memory roughly constant as the prompt grows; the KV cache still scales with context size. Tradeoff: slightly more kernel launch overhead, usually negligible, in exchange for what can be the difference between OOM and a 32k context on consumer hardware.
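The chunk arithmetic can be checked with a short sketch. It assumes, as a simplification, that each logical batch is split independently into micro-batches; `chunk_plan` is a hypothetical helper, not a llama.cpp function.

```python
import math


def chunk_plan(n_prompt: int, n_batch: int = 2048,
               n_ubatch: int = 256) -> tuple[int, int]:
    """Return (logical batches, physical micro-batches) for a prompt."""
    if n_prompt < 0 or n_batch <= 0 or n_ubatch <= 0:
        raise ValueError("sizes must be positive")
    # Assumption for this sketch: the micro-batch never exceeds the logical batch.
    n_ubatch = min(n_ubatch, n_batch)
    full, rest = divmod(n_prompt, n_batch)
    logical = full + (1 if rest else 0)
    # Full logical batches split evenly; the remainder gets its own chunks.
    micro = full * math.ceil(n_batch / n_ubatch) + math.ceil(rest / n_ubatch)
    return logical, micro


if __name__ == "__main__":
    print(chunk_plan(8192))         # (4, 32): 4 logical batches, 32 micro-batches
    print(chunk_plan(32768))        # (16, 128)
```

The number of chunks grows with the prompt, but the size of each chunk does not, which is why the per-step working set stays bounded.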
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T02:27:20.718044+00:00— report_created — created