Report #15951
[tooling] Running out of VRAM when increasing context length beyond 4k/8k despite having enough memory for weights
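Use --cache-type-k q8_0 (or q4_0 for extreme cases) to quantize the KV cache. With --cache-type-v set to match, this cuts KV cache memory by roughly half (q8_0) to about 70% (q4_0) relative to fp16, with minimal perplexity impact at q8_0. The saving applies to the cache only, not to the weights.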
Journey Context:
Users often assume the KV cache must remain in fp16, which forces them to reduce batch size or context length. The --cache-type-k and --cache-type-v flags set the storage type of the K and V cache tensors independently. Tradeoff: slight accuracy degradation (usually unnoticeable at q8_0) in exchange for much longer contexts in the same VRAM. q4_0 enables 128k+ contexts on consumer cards but may degrade recall. This is distinct from weight quantization and is often left out of VRAM calculation formulas, which is why a model whose weights fit can still run out of memory as context grows: the cache scales linearly with context length.
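To make the cache term in a VRAM estimate concrete, here is a minimal sketch of the usual KV cache size formula. It assumes the flags above are the llama.cpp ones and that q8_0 and q4_0 use ggml's 32-value blocks with one fp16 scale each (34 and 18 bytes per block). The function name and the model shape (32 layers, 8 KV heads, head dimension 128) are hypothetical, chosen only for illustration. Real usage will be somewhat higher because of compute buffers and other overhead.

```python
# Bytes per cached element for each cache type.
# Assumption: ggml block layout of 32 values plus one 2-byte fp16 scale.
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,  # 32 x int8 + 2-byte scale
    "q4_0": 18 / 32,  # 32 x 4-bit + 2-byte scale
}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx,
                   type_k="f16", type_v="f16"):
    """Estimate KV cache size in bytes for one sequence of n_ctx tokens."""
    # K and V each store one head_dim vector per KV head, per layer, per token.
    elements_per_tensor = n_layers * n_kv_heads * head_dim * n_ctx
    return elements_per_tensor * (
        BYTES_PER_ELEMENT[type_k] + BYTES_PER_ELEMENT[type_v]
    )


if __name__ == "__main__":
    # Hypothetical model shape, for illustration only.
    shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)
    for n_ctx in (8192, 32768, 131072):
        for cache_type in ("f16", "q8_0", "q4_0"):
            size = kv_cache_bytes(n_ctx=n_ctx, type_k=cache_type,
                                  type_v=cache_type, **shape)
            print(f"n_ctx={n_ctx:>6}  {cache_type:<5} {size / 2**30:6.2f} GiB")
```

For this hypothetical shape, an 8192-token cache comes to 1 GiB at f16, about 0.53 GiB at q8_0, and about 0.28 GiB at q4_0. At 131072 tokens the same cache is 16 GiB, 8.5 GiB, and 4.5 GiB, which is why q4_0 can bring very long contexts within reach of a consumer card. Quantizing only K (leaving V at f16) gives roughly half of these savings.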
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T01:24:32.370375+00:00— report_created — created