Report #402
[tooling] llama.cpp server fails to create context when using --cache-type-k/--cache-type-v KV cache quantization
Add `--flash-attn on` (short form `-fa on`) to the llama-server invocation whenever you quantize the KV cache with `--cache-type-k`/`--cache-type-v`. With flash attention disabled, context creation fails because the quantized V-cache path depends on it. Prefer `on` over `auto` when the backend supports it, and verify in the logs that flash attention stays enabled.
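As a rough illustration of the rule above, the sketch below builds the argument list so that a quantized cache type never goes out without explicit flash attention. The helper name `build_llama_args` is hypothetical, and treating `f16`, `bf16`, and `f32` as the unquantized cache types is an assumption to check against your build's `--help` output.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of llama.cpp): print one llama-server
# argument per line, adding "--flash-attn on" whenever either KV cache
# type is quantized.
build_llama_args() {
  local model="$1" ktype="$2" vtype="$3"
  local args=(--model "$model" --cache-type-k "$ktype" --cache-type-v "$vtype")
  local t quantized=0
  for t in "$ktype" "$vtype"; do
    case "$t" in
      f16|bf16|f32) ;;        # assumed to be the unquantized types
      *) quantized=1 ;;       # e.g. q8_0, q4_0
    esac
  done
  if [ "$quantized" -eq 1 ]; then
    args+=(--flash-attn on)   # explicit "on" rather than "auto"
  fi
  printf '%s\n' "${args[@]}"
}
```

One way to use it (bash 4 or newer for `mapfile`; the model path is a placeholder): `mapfile -t ARGS < <(build_llama_args ./model.gguf q4_0 q4_0)` followed by `llama-server "${ARGS[@]}"`.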
Journey Context:
KV-cache quantization is one of the few ways to fit long contexts in limited VRAM, but the implementation is gated by flash-attention kernels. The common mistake is copying a command with `-ctk q4_0 -ctv q4_0` and omitting `-fa`, which produces an opaque 'failed to create context' error. Some backends silently turn flash attention off in `auto` mode, so an explicit `on` is safer. If the log shows flash attention being forced off, do not use KV-cache quantization on that backend.
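The log check described above could be sketched as follows. The helper name `show_flash_attn_lines` and the file name `server.log` are hypothetical, and because the exact wording of the flash-attention log lines differs between llama.cpp builds, the pattern is deliberately broad and the matched lines still need to be read by a person.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every line of a captured server log that
# mentions flash attention. Exits non-zero if nothing matches.
show_flash_attn_lines() {
  local log="$1"
  # Case-insensitive; matches spellings such as "flash_attn",
  # "flash-attn", and "Flash Attention".
  grep -iE 'flash[ _-]?att' "$log"
}
```

To capture a log for it, the server output can be saved with something like `llama-server -m ./model.gguf -ctk q4_0 -ctv q4_0 -fa on 2>&1 | tee server.log` (model path is a placeholder), then `show_flash_attn_lines server.log`. If the printed lines indicate that flash attention was turned off, drop the cache-type flags on that backend.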
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T07:52:38.423005+00:00— report_created — created