Report #40848
[tooling] Nonsensical output or repetition loops when running Gemma 2 or Phi-3 with llama.cpp Flash Attention enabled
Do not use -fa (Flash Attention) for models that use attention softcapping (Gemma 2, Phi-3, some Mistral variants) unless you are on llama.cpp build b3486 or later, where -fa handles softcap. On older versions, disable -fa and use -nkvo (no KV offload, which keeps the KV cache in system RAM instead of VRAM) combined with -ctk q4_0 (K cache quantization) to save memory without the softcap-breaking attention kernel.
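The rule above can be sketched as a small helper that picks the flags for a given build. This is only an illustration: the function name llama_args and its parameters are hypothetical, the b3486 threshold is taken from this report and is unverified, and flag syntax can differ between llama.cpp builds, so check --help on your binary before relying on it.

```python
def llama_args(model_path, build_number, uses_softcap, softcap_fix_build=3486):
    """Return a llama.cpp argument list (hypothetical helper, not part of llama.cpp).

    softcap_fix_build defaults to the build number quoted in this report;
    treat it as an unverified assumption.
    """
    args = ["-m", model_path]
    if not uses_softcap or build_number >= softcap_fix_build:
        # Flash Attention is considered safe: either the model has no
        # softcap, or the build is assumed to handle it.
        args.append("-fa")
    else:
        # Older build with a softcap model: skip -fa, keep the KV cache
        # off the GPU, and quantize the K cache to reduce memory use.
        args += ["-nkvo", "-ctk", "q4_0"]
    return args


old_build = llama_args("gemma-2.gguf", 3400, uses_softcap=True)
new_build = llama_args("gemma-2.gguf", 3486, uses_softcap=True)
```

Here old_build carries the -nkvo and -ctk q4_0 fallback, while new_build enables -fa.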
Journey Context:
Flash Attention kernels initially assumed standard attention scaling (1/sqrt(d_k)) with no further transformation of the scores. Models like Gemma 2 and Phi-3 use softcapping: attention scores are divided by a constant (e.g., 50.0), passed through tanh, and multiplied by the same constant again before softmax. Early Flash Attention implementations ignored this softcap parameter, so attention was computed incorrectly and the model produced garbage output. The fix is either upgrading to softcap-aware Flash Attention or avoiding Flash Attention for these architectures and using KV cache quantization (-ctk) to offset the extra memory cost of running without it.
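To make the effect concrete, here is a minimal pure-Python sketch of attention weights with and without softcapping. It assumes the usual formulation cap * tanh(score / cap) with cap = 50.0. The scores are made-up illustration values, and the function names are hypothetical rather than llama.cpp internals.

```python
import math


def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def softcap(xs, cap):
    # Squash each score smoothly into the open interval (-cap, cap).
    return [cap * math.tanh(x / cap) for x in xs]


def attention_weights(scores, cap=None):
    # cap=None mimics a kernel that ignores the softcap parameter.
    if cap is not None:
        scores = softcap(scores, cap)
    return softmax(scores)


raw_scores = [120.0, 80.0, 5.0, -40.0]  # already scaled by 1/sqrt(d_k)
with_cap = attention_weights(raw_scores, cap=50.0)
without_cap = attention_weights(raw_scores)
```

With the cap applied, the second position keeps a small but non-negligible weight. Without it, the largest score takes essentially all of the weight. A model trained with the cap sees a different attention distribution than the one it learned, which is consistent with the degraded output described in this report.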
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T23:02:04.852452+00:00— report_created — created