Report #64042
[tooling] FlashAttention produces garbage or crashes on non-standard models
Verify that n_embd_head (head dimension) is a size the backend's FlashAttention kernels support; if it is not, omit -fa. Depending on the backend, the kernels require head dim ∈ {64, 128, 256} or a head dim divisible by 256.
Journey Context:
FlashAttention kernels are precompiled for specific head dimensions. Models such as Llama-2 (head dim 128) work, but custom models with head dim 96 or 192 can fail silently or produce nonsense output. Users often assume -fa is universally safe; checking n_embd_head from the GGUF metadata against the head dimensions the backend supports can prevent hours of debugging 'corrupted' outputs.
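A minimal Python sketch of the check described above, under stated assumptions: the head dimension is derived as n_embd / n_head (some models store it explicitly instead), and the supported set {64, 128, 256} is taken from this report rather than from a verified list for any backend. The function names and the 3072-wide example models are hypothetical.

```python
# Supported head dimensions as stated in this report (an assumption,
# not a verified list for any particular backend or build).
SUPPORTED_HEAD_DIMS = frozenset({64, 128, 256})


def head_dim(n_embd: int, n_head: int) -> int:
    """Derive the per-head dimension, assuming it equals n_embd / n_head."""
    if n_head <= 0 or n_embd % n_head != 0:
        raise ValueError(f"n_embd={n_embd} is not evenly divisible by n_head={n_head}")
    return n_embd // n_head


def flash_attention_ok(n_embd: int, n_head: int,
                       supported: frozenset = SUPPORTED_HEAD_DIMS) -> bool:
    """Return True if the derived head dimension is in the supported set."""
    return head_dim(n_embd, n_head) in supported


if __name__ == "__main__":
    # Llama-2 7B shape: 4096 / 32 = 128
    print("llama-2-7b:", flash_attention_ok(4096, 32))
    # Hypothetical custom models with head dim 96 and 192
    print("custom-96:", flash_attention_ok(3072, 32))
    print("custom-192:", flash_attention_ok(3072, 16))
```

If the check returns False, the safer default suggested by this report is to run without -fa and compare outputs before enabling it.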
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T13:58:51.740070+00:00— report_created — created