Report #1120
[tooling] Ollama exhausts VRAM at longer contexts even when the model weights fit easily
Set OLLAMA_FLASH_ATTENTION=1 and OLLAMA_KV_CACHE_TYPE=q8_0 in the environment of the Ollama server process (for example, in the systemd override). Relative to the default f16 cache, q8_0 roughly halves KV-cache memory with negligible quality loss; q4_0 cuts it to roughly a quarter if you need more headroom. The setting is global and applies to every model loaded by that server.
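If the server is launched from a script rather than systemd, the same two variables can be injected into the child process environment. The following is a minimal sketch, not an official launcher: the helper names `ollama_env` and `start_server` are hypothetical, and it assumes the `ollama` binary is on PATH and that the accepted cache types are f16, q8_0, and q4_0.

```python
import os
import subprocess

# Cache types assumed to be accepted by OLLAMA_KV_CACHE_TYPE.
VALID_CACHE_TYPES = ("f16", "q8_0", "q4_0")


def ollama_env(cache_type="q8_0", base_env=None):
    """Return a copy of the environment with the KV-cache settings applied."""
    if cache_type not in VALID_CACHE_TYPES:
        raise ValueError(f"unsupported cache type: {cache_type!r}")
    env = dict(os.environ if base_env is None else base_env)
    # Flash Attention must be on, otherwise the cache type is ignored.
    env["OLLAMA_FLASH_ATTENTION"] = "1"
    env["OLLAMA_KV_CACHE_TYPE"] = cache_type
    return env


def start_server(cache_type="q8_0"):
    """Start `ollama serve` as a child process (assumes `ollama` is on PATH)."""
    return subprocess.Popen(["ollama", "serve"], env=ollama_env(cache_type))
```

Calling `start_server("q4_0")` would start a server whose models all use the q4_0 cache, since the variables apply to the whole process and not to a single model.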
Journey Context:
Ollama defaults to an FP16 KV cache. Its size grows linearly with context length, so at 32K+ tokens it can take more VRAM than the weights themselves. The Ollama FAQ explicitly notes that KV-cache quantization only takes effect when Flash Attention is enabled, so setting the cache type alone silently does nothing. q8_0 is the safe default; q4_0 is useful for squeezing a 70B model or a very long context onto a 24–32 GB card, but output quality may degrade at extreme context lengths. Unlike weight quantization, which is chosen per model, the cache type is controlled by a server environment variable, not by the Modelfile or the API.
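To make the memory argument concrete, here is a rough estimate of KV-cache size per cache type. It is a sketch under stated assumptions, not Ollama's actual allocator: the function name `kv_cache_gib` is hypothetical, the model shape (32 layers, 8 KV heads, head dimension 128) is an illustrative grouped-query configuration, and the per-element sizes follow the GGML block layouts (32 values plus a 2-byte scale per block). Real usage also includes compute buffers and any extra parallel request slots.

```python
# Bytes per cached element. q8_0 and q4_0 store 32 values per block
# plus a 2-byte scale, i.e. 34 and 18 bytes per block respectively.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_gib(n_layers, n_kv_heads, head_dim, context_len, cache_type="f16"):
    """Estimate KV-cache size in GiB for one sequence of context_len tokens."""
    # Factor 2 covers the separate key and value tensors in every layer.
    elements = 2 * n_layers * n_kv_heads * head_dim * context_len
    return elements * BYTES_PER_ELEMENT[cache_type] / 2**30


# Illustrative shape (an assumption): 32 layers, 8 KV heads, head_dim 128.
for cache_type in BYTES_PER_ELEMENT:
    size = kv_cache_gib(32, 8, 128, 32768, cache_type)
    print(f"{cache_type}: {size:.3f} GiB")
# f16: 4.000 GiB
# q8_0: 2.125 GiB
# q4_0: 1.125 GiB
```

The quantized figures land slightly above one half and one quarter of the f16 size because each block carries its own scale factor.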
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T17:57:10.297594+00:00— report_created — created