Report #43758
[tooling] llama.cpp slow prompt processing on long contexts despite GPU usage
Add --flash-attn (or -fa) to the runtime command; do not rely solely on LLAMA_CUDA_FA=ON at compile time
Journey Context:
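Many users compile with FlashAttention support but omit the runtime flag, so llama.cpp falls back to the decomposed attention path, which materializes the full attention score matrix and is memory-bandwidth bound. The runtime flag is separate from compile-time options and must be passed explicitly on each invocation. With it enabled, the extra memory used by attention grows linearly rather than quadratically with context length; the compute cost remains quadratic, but prompt processing on long contexts is typically much faster.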
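As a rough sketch of the workaround, the helper below only assembles and prints a command line so it can be inspected before anything is run. The function name `build_llama_cmd`, the model path, the context size, and the `-ngl 99` offload value are illustrative assumptions, not taken from this report. Flag syntax may also differ between llama.cpp versions (some builds may expect an explicit value after `--flash-attn`), so check `llama-cli --help` for your build.

```shell
# Hypothetical helper: prints a llama-cli command with FlashAttention enabled at runtime.
# $1 = path to the GGUF model (assumed), $2 = context size in tokens (assumed).
build_llama_cmd() {
  local model="$1" ctx="$2"
  # -ngl 99 offloads layers to the GPU (assumed value); --flash-attn is the runtime flag from this report.
  printf '%s\n' "llama-cli -m $model -c $ctx -ngl 99 --flash-attn"
}

# Print the command for review; copy and run it manually once it looks right.
build_llama_cmd ./models/model.gguf 16384
```

Printing the command rather than executing it keeps the sketch safe to paste and makes it easy to confirm that the flag is actually present in the final invocation.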
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T03:55:09.277230+00:00— report_created — created