Report #50961
[tooling] llama.cpp failing to allocate KV cache for 128k context window on Apple Silicon or CUDA with limited VRAM
Quantize the KV cache to 4 or 5 bits with llama.cpp's --cache-type-k and --cache-type-v flags (q4_0 for both, or q5_0/q5_1 for K with q4_0 for V). q4_0 cuts cache memory by roughly 72% compared to FP16, which can bring 128k+ contexts within reach of 48GB GPUs, or 64k on 24GB, depending on the model's size and attention layout.
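As a minimal sketch of the flags above, the snippet below assembles a launch command with a higher-precision K cache and a q4_0 V cache. The helper name `build_llama_args`, the `llama-server` binary name, and the model path are illustrative assumptions; `-m`, `-c`, `--cache-type-k`, and `--cache-type-v` are llama.cpp options. Some builds refuse a quantized V cache unless flash attention is enabled, and the spelling of that option has varied between versions, so it is left to `extra_args`; check `--help` on your build.

```python
import shlex


def build_llama_args(model_path, n_ctx=131072, type_k="q5_0", type_v="q4_0",
                     binary="llama-server", extra_args=()):
    """Return an argv list that starts llama.cpp with a quantized KV cache.

    `binary` and `model_path` are placeholders; adjust them to your install.
    """
    return [
        binary,
        "-m", model_path,           # GGUF model file
        "-c", str(n_ctx),           # context window in tokens
        "--cache-type-k", type_k,   # K cache type (kept at higher precision)
        "--cache-type-v", type_v,   # V cache type
        *extra_args,                # e.g. the flash-attention option for your build
    ]


args = build_llama_args("models/model.gguf")  # hypothetical path
print(shlex.join(args))
```

The printed line can be pasted into a shell, or the list can be passed to `subprocess.run(args, check=True)` once the binary and model path point at real files.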
Journey Context:
llama.cpp supports quantizing the KV cache with standard GGML quantized types such as Q4_0, Q5_0, and Q8_0. KV cache memory follows the formula 2 * n_layers * n_kv_heads * head_dim * n_ctx * sizeof(dtype), where the factor of 2 covers K and V, and n_kv_heads is the number of key/value heads (smaller than the attention head count on models with grouped-query attention). For FP16 at 128k context, a 70B model with 80 layers, 8 KV heads, and head_dim 128 (the Llama 2/3 70B shape) needs about 40 GiB for the cache alone; the same shape without grouped-query attention (64 KV heads) would need about 320 GiB. Q4_0 stores 4.5 bits per element once the per-block scale is counted, so it reduces the cache by roughly 72%. A crucial implementation detail is that K tensors are significantly more sensitive to quantization than V tensors, because errors in K perturb the attention scores (Q @ K^T) before the softmax. The recommended pattern is therefore to use higher precision for K (e.g., q5_0 or q5_1) and lower precision for V (q4_0), or q8_0 for K if memory permits. Using q4_0 for both works but may degrade recall on long-context retrieval tasks. This feature is distinct from ExLlamaV2's cache quantization (which uses FP8/INT4): it reuses GGML's established quantization schemes. Common pitfalls include enabling cache quantization on models that require specific attention implementations (some MoE models may have restrictions) and forgetting that a quantized cache slightly increases compute overhead during attention, since cached values must be dequantized.
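The memory formula can be checked with a short calculation. This is a rough sketch, not llama.cpp's actual allocator: the function name `kv_cache_bytes` is hypothetical, the per-type sizes are the GGML block layouts (32 elements per block plus scale metadata), and the model shape is the assumed 80-layer, 8-KV-head, head_dim-128 configuration. Real allocations may differ slightly because of padding and other buffers.

```python
# GGML storage cost per type: (bytes per block, elements per block).
GGML_BLOCK = {
    "f16": (2, 1),     # unblocked 16-bit floats
    "q8_0": (34, 32),  # 32 int8 values + fp16 scale
    "q5_1": (24, 32),  # 32 5-bit values + fp16 scale + fp16 min
    "q5_0": (22, 32),  # 32 5-bit values + fp16 scale
    "q4_0": (18, 32),  # 32 4-bit values + fp16 scale
}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx,
                   type_k="f16", type_v="f16"):
    """Estimate KV cache size in bytes for the given K and V cache types."""
    elems = n_layers * n_kv_heads * head_dim * n_ctx  # elements in K (same for V)

    def tensor_bytes(cache_type):
        bytes_per_block, block_elems = GGML_BLOCK[cache_type]
        return elems * bytes_per_block // block_elems

    return tensor_bytes(type_k) + tensor_bytes(type_v)


# Assumed shape: 80 layers, 8 KV heads, head_dim 128, 128k context.
shape = dict(n_layers=80, n_kv_heads=8, head_dim=128, n_ctx=131072)
for type_k, type_v in [("f16", "f16"), ("q8_0", "q4_0"),
                       ("q5_0", "q4_0"), ("q4_0", "q4_0")]:
    size = kv_cache_bytes(**shape, type_k=type_k, type_v=type_v)
    print(f"K={type_k:5} V={type_v:5} {size / 2**30:6.2f} GiB")
```

Under these assumptions the FP16 cache comes to 40 GiB and q4_0 for both tensors to 11.25 GiB, with the mixed q5_0/q4_0 and q8_0/q4_0 settings at 12.5 GiB and 16.25 GiB.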
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T16:01:10.141956+00:00— report_created — created