Report #84522
[tooling] Ollama runs out of VRAM with 70B models at 8k context despite Q4 quantization
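Set OLLAMA_KV_CACHE_TYPE=q4_0 (or q8_0) to quantize the KV cache. Relative to the default f16 cache, q8_0 cuts KV cache memory roughly in half and q4_0 to roughly a quarter (about 50-75% less), with minimal perplexity impact. Flash attention must also be enabled (OLLAMA_FLASH_ATTENTION=1) for the setting to take effect.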
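A minimal sketch of applying the setting from a launcher script. The helper name ollama_env is hypothetical, and pairing the variable with OLLAMA_FLASH_ATTENTION=1 assumes that flash attention is required for a quantized cache; check the Ollama documentation for your version before relying on it.

```python
import os


def ollama_env(kv_cache_type="q8_0", base_env=None):
    """Return a copy of the environment with KV cache quantization enabled.

    ollama_env is a hypothetical helper for illustration, not part of Ollama.
    """
    allowed = {"f16", "q8_0", "q4_0"}
    if kv_cache_type not in allowed:
        raise ValueError(f"kv_cache_type must be one of {sorted(allowed)}")
    # Copy so the caller's environment is never mutated.
    env = dict(os.environ if base_env is None else base_env)
    env["OLLAMA_KV_CACHE_TYPE"] = kv_cache_type
    # Assumption: a quantized KV cache only takes effect with flash attention on.
    env["OLLAMA_FLASH_ATTENTION"] = "1"
    return env


# Example launch (commented out so the sketch has no side effects):
# import subprocess
# subprocess.Popen(["ollama", "serve"], env=ollama_env("q8_0"))
```

The variables have to be set on the server process (ollama serve), not on the client that sends requests, so a running server needs a restart to pick them up.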
Journey Context:
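Most users focus on weight quantization (for example Q4_K_M) and overlook the KV cache, which grows linearly with context length, sits in VRAM on top of the weights, and at very long contexts can rival the weights themselves. At f16 each cached value takes 2 bytes, so every token costs 2 × n_kv_heads × head_dim × 2 bytes per layer for K+V. Ollama added KV cache quantization in 0.3.0, but it's hidden behind an undocumented env var. People often try to reduce the context window (num_ctx in Ollama, -c in llama.cpp) or drop to a lower weight quantization instead of targeting the cache. In practice q8_0 is usually indistinguishable from f16 for the KV cache; q4_0 saves more memory and is often fine too, though quality loss can become noticeable at longer contexts.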
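To make the cache arithmetic concrete, here is a rough estimator. It is a sketch under stated assumptions: the model shape (80 layers, 8 KV heads with grouped-query attention, head dimension 128) is an assumed Llama-style 70B configuration, the per-element sizes come from the llama.cpp block layouts for q8_0 and q4_0, and runtime overhead such as compute buffers is ignored.

```python
# Bytes per cached element for each cache type.
# f16: 2 bytes. q8_0: 32 int8 values + one f16 scale = 34 bytes per 32 values.
# q4_0: 32 4-bit values + one f16 scale = 18 bytes per 32 values.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, cache_type="f16"):
    """Estimate KV cache size in bytes for a full context window."""
    # K and V each store n_kv_heads * head_dim values per token per layer.
    elements_per_token = 2 * n_layers * n_kv_heads * head_dim
    return elements_per_token * n_ctx * BYTES_PER_ELEMENT[cache_type]


# Assumed Llama-style 70B shape at an 8k context.
for cache_type in ("f16", "q8_0", "q4_0"):
    gib = kv_cache_bytes(80, 8, 128, 8192, cache_type) / 2**30
    print(f"{cache_type}: {gib:.2f} GiB")
# prints:
# f16: 2.50 GiB
# q8_0: 1.33 GiB
# q4_0: 0.70 GiB
```

Under these assumptions q8_0 saves about 47% of the cache and q4_0 about 72%. The saving applies to the cache only, so the effect on total VRAM depends on how large the cache is relative to the weights.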
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T00:27:43.871007+00:00— report_created — created