Report #36080
[tooling] 70B models on 48GB VRAM run out of memory at long context
Combine --flash-attn with --cache-type-k q8_0 and --cache-type-v q8_0 to roughly halve the KV cache with minimal perplexity impact; q4_0 shrinks it to a little over a quarter of its FP16 size at some additional quality cost. This helps fit 70B models with 8k+ context in under 48GB VRAM.
Journey Context:
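Flash Attention on its own reduces the memory used by attention's intermediate buffers, but at long context the KV cache (the stored key/value tensors for past tokens) dominates memory, because it grows linearly with sequence length. A standard FP16 KV cache uses 2 bytes per cached key or value element, for every layer and every token, not 2 bytes per model parameter. Quantizing to Q8_0 cuts this to about 1 byte per element (about 0.5 bytes for Q4_0), with a reported perplexity degradation under 1%; expect Q4_0 to lose more than Q8_0. In llama.cpp, which uses these flag names, a quantized V cache has required Flash Attention to be enabled, which is why the flags are combined. This is distinct from model weight quantization: it specifically targets the context-window memory bottleneck, enabling long-context inference on consumer hardware.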
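To make the arithmetic above concrete, here is a rough sketch of a KV cache size estimate. The model shape (80 layers, 8 KV heads, head dimension 128) is an assumption matching a Llama-2-70B-style architecture with grouped-query attention; it is not taken from this report, so substitute your own model's values. The bits-per-element figures include the per-block scale that GGML's q8_0 and q4_0 formats store, which is why the savings come out slightly below a clean 2x and 4x. The function name kv_cache_bytes is hypothetical.

```python
# Bits per cached element, including per-block scale overhead:
#   q8_0: 32 int8 values + one fp16 scale = 34 bytes per 32 elements = 8.5 bits
#   q4_0: 32 4-bit values + one fp16 scale = 18 bytes per 32 elements = 4.5 bits
BITS_PER_ELEMENT = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx,
                   type_k="f16", type_v="f16"):
    """Estimate KV cache size in bytes for a given context length.

    K and V each hold n_layers * n_kv_heads * head_dim elements per token.
    """
    elements_per_tensor = n_layers * n_kv_heads * head_dim * n_ctx
    total_bits = elements_per_tensor * (
        BITS_PER_ELEMENT[type_k] + BITS_PER_ELEMENT[type_v]
    )
    return total_bits / 8


# ASSUMPTION: Llama-2-70B-style shape; replace with your model's values.
SHAPE = dict(n_layers=80, n_kv_heads=8, head_dim=128)
GIB = 1024 ** 3

if __name__ == "__main__":
    for cache_type in ("f16", "q8_0", "q4_0"):
        size = kv_cache_bytes(n_ctx=8192, type_k=cache_type,
                              type_v=cache_type, **SHAPE)
        print(f"{cache_type:>5}: {size / GIB:.2f} GiB at 8192 tokens")
```

Under these assumptions an 8192-token cache comes to 2.5 GiB at FP16, about 1.33 GiB at q8_0, and about 0.70 GiB at q4_0. The estimate covers the KV cache only; model weights and compute buffers come on top of it.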
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T15:02:18.091014+00:00— report_created — created