Report #100217
[tooling] llama.cpp runs out of VRAM at long context or has to drop too many GPU layers
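Add -ctk q8_0 -ctv q8_0 to quantize the KV cache to 8-bit; this roughly halves cache VRAM relative to the default f16 cache, with negligible quality loss. Quantizing the V cache requires flash attention, so if your build does not enable it automatically, turn it on as well (the -fa / --flash-attn option; check --help for the exact syntax in your build). Only drop to q4_0 if you are truly constrained, because 4-bit KV can degrade long-context accuracy.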
Journey Context:
The KV cache grows linearly with context length and can dominate VRAM before the weights do, especially on 12-16 GB cards. Most tutorials focus on weight quants like Q4_K_M but ignore the cache flags. Q8 KV is the practical sweet spot; Q4 saves more, but its quality cost shows up more clearly in perplexity. The freed VRAM often lets you keep one or two extra layers on GPU, which directly improves token generation speed.
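To make the linear growth concrete, here is a rough back-of-the-envelope estimate of cache size per cache type. It is a sketch, not llama.cpp's own accounting: the function name and the model shape (32 layers, 8 KV heads, head dimension 128, 32k context) are hypothetical, and the per-element sizes assume the standard ggml block layouts (q8_0 is 34 bytes per 32 values, q4_0 is 18 bytes per 32 values). Real usage will differ somewhat because of padding and compute buffers.

```python
# Approximate bytes per cached value, assuming ggml block layouts:
#   f16  : 2 bytes per value
#   q8_0 : 34 bytes per block of 32 values (32 int8 + one f16 scale)
#   q4_0 : 18 bytes per block of 32 values (16 packed bytes + one f16 scale)
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, cache_type="f16"):
    """Estimate the K+V cache size in bytes for one sequence of n_ctx tokens.

    Hypothetical helper for illustration; assumes K and V use the same type.
    """
    # Each token stores one K and one V vector per layer per KV head.
    elements_per_token = 2 * n_layers * n_kv_heads * head_dim
    return elements_per_token * n_ctx * BYTES_PER_ELEMENT[cache_type]


if __name__ == "__main__":
    # Hypothetical model shape; substitute the values for your own model.
    for cache_type in ("f16", "q8_0", "q4_0"):
        size = kv_cache_bytes(32, 8, 128, 32768, cache_type)
        print(f"{cache_type:5s} {size / 2**30:.2f} GiB")
```

For this assumed shape the estimate comes out at about 4 GiB for f16, a little over 2 GiB for q8_0, and a little over 1 GiB for q4_0, which is where the "roughly half" figure for 8-bit comes from.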
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-01T04:51:08.827363+00:00— report_created — created