Report #28742
[tooling] Cannot fit long contexts (32k+) into 24GB VRAM with Q4_K_M 7B model
Quantize the KV cache separately from the model weights by passing `--cache-type-k q8_0` and `--cache-type-v q8_0` (or `q4_0` in extreme cases) to the llama.cpp server or main binary. This shrinks each cached key/value element from FP16 (2 bytes) to roughly 8 or 4 bits, so about 1.9x (q8_0) to 3.6x (q4_0) more context fits in the same cache budget, with minimal perplexity increase.
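The flags above can be assembled programmatically. This is a minimal sketch, not an official launcher: `build_server_args`, `KNOWN_CACHE_TYPES`, and the model filename are hypothetical names introduced for illustration, and the binary name `llama-server` is an assumption (older builds ship the binary as `server` or `main`). Some llama.cpp builds also require flash attention to be enabled before a quantized V cache is accepted; the spelling of that flag varies by version, so it is left to `extra_args` here rather than hard-coded.

```python
from typing import List, Sequence

# Cache types named in this report; llama.cpp may accept others as well.
KNOWN_CACHE_TYPES = ("f16", "q8_0", "q4_0")


def build_server_args(
    model_path: str,
    ctx_size: int,
    cache_type_k: str = "q8_0",
    cache_type_v: str = "q8_0",
    binary: str = "llama-server",  # assumption: adjust to your build's binary name
    extra_args: Sequence[str] = (),
) -> List[str]:
    """Return an argv list that requests a quantized KV cache."""
    for name, value in (("cache_type_k", cache_type_k), ("cache_type_v", cache_type_v)):
        if value not in KNOWN_CACHE_TYPES:
            raise ValueError(f"{name}={value!r} is not one of {KNOWN_CACHE_TYPES}")
    if ctx_size <= 0:
        raise ValueError("ctx_size must be positive")
    return [
        binary,
        "-m", model_path,
        "-c", str(ctx_size),
        "--cache-type-k", cache_type_k,
        "--cache-type-v", cache_type_v,
        *extra_args,  # e.g. a version-specific flash-attention flag
    ]


if __name__ == "__main__":
    # Hypothetical model filename; the list is only printed, never executed.
    print(" ".join(build_server_args("model-q4_k_m.gguf", 32768)))
```

Printing the argument list before running it makes it easy to check the command against the `--help` output of the installed build.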
Journey Context:
Weight quantization (Q4_K_M) shrinks the static weight memory, but the KV cache grows linearly with context length and can dominate memory use at long contexts (e.g., 32k+). By default the KV cache is stored in FP16. Quantizing it to Q8_0 or Q4_0 trades a small amount of quality (usually <1% perplexity increase) for a cache that is about 47% (Q8_0) to 72% (Q4_0) smaller; the savings fall slightly short of 50% and 75% because each block of 32 values also stores an FP16 scale. Common mistake: confusing this with weight quantization, or assuming `--cache-type-k` changes the model itself. It only changes how cached keys are stored at runtime. Important: not all backends support all cache quantization types (CUDA generally does; Metal has limitations). Tradeoff: slight quality degradation in exchange for context windows that would otherwise not fit on the given hardware.
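To see why the cache dominates at long contexts, a rough size estimate helps. The sketch below assumes the cache holds one K and one V tensor per layer, each with `n_kv_heads * head_dim` values per token, and uses the block layouts of the ggml types (32 values plus a 2-byte FP16 scale per block for `q8_0` and `q4_0`). The function name and the example shape (32 layers, 32 KV heads, head dimension 128, similar to a 7B model without grouped-query attention) are illustrative assumptions; read the real values from your model's metadata. Compute buffers and other overheads are ignored.

```python
# Bytes per cached element for each cache type.
# q8_0: 32 int8 values + 2-byte scale = 34 bytes per block of 32.
# q4_0: 32 4-bit values (16 bytes) + 2-byte scale = 18 bytes per block of 32.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    n_ctx: int,
    cache_type_k: str = "f16",
    cache_type_v: str = "f16",
) -> float:
    """Estimate KV cache size in bytes for a full context of n_ctx tokens."""
    elements_per_tensor = n_layers * n_kv_heads * head_dim * n_ctx
    bytes_k = elements_per_tensor * BYTES_PER_ELEMENT[cache_type_k]
    bytes_v = elements_per_tensor * BYTES_PER_ELEMENT[cache_type_v]
    return bytes_k + bytes_v


if __name__ == "__main__":
    # Assumed shape: 32 layers, 32 KV heads, head_dim 128, 32k context.
    shape = dict(n_layers=32, n_kv_heads=32, head_dim=128, n_ctx=32768)
    for cache_type in ("f16", "q8_0", "q4_0"):
        size = kv_cache_bytes(**shape, cache_type_k=cache_type, cache_type_v=cache_type)
        print(f"{cache_type}: {size / 2**30:.1f} GiB")
    # f16: 16.0 GiB
    # q8_0: 8.5 GiB
    # q4_0: 4.5 GiB
```

Under these assumptions an FP16 cache alone would consume 16 GiB of a 24GB card at 32k tokens, before the weights are counted. Models that use grouped-query attention have fewer KV heads, so their cache is proportionally smaller.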
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T02:38:24.964553+00:00— report_created — created