Report #1153
[tooling] Long-context local inference runs out of memory despite quantized model weights
Add `--cache-type-k q8_0 --cache-type-v q8_0` to `llama-cli` or `llama-server`. For more aggressive memory savings use `q4_0`. Enable Flash Attention (`-fa`) alongside it; llama.cpp requires Flash Attention for a quantized V cache. Keys and values can be set independently, and keys are generally the more quantization-sensitive of the two, so a common quality-preserving compromise is `q8_0` for K and `q4_0` for V.
Journey Context:
Quantizing weights to Q4_K_M shrinks the model file, but the KV cache stays FP16 by default and grows linearly with context × layers × KV heads × head dimension. At 32K+ tokens the cache can exceed the size of the quantized weights. KV-cache quantization compresses keys and values independently of the weights: q8_0 roughly halves cache memory with negligible perplexity impact, while q4_0 cuts it to a little over a quarter but can degrade quality on 64K+ contexts or complex reasoning. Flash Attention matters because quantized KV is most efficient when the attention kernels fuse dequantization.
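The scaling described above can be checked with a rough estimate. The sketch below is not taken from the report: the function name `kv_cache_bytes` and the model shape (32 layers, 8 KV heads, head dimension 128) are hypothetical, and the bits-per-element figures assume the usual ggml block layouts (32 values plus one FP16 scale per block). It ignores compute buffers and any allocator padding, so treat the numbers as a lower bound.

```python
# Approximate storage cost per cached element, including the per-block scale.
# Assumed ggml block layouts:
#   q8_0: 32 int8 values  + 2-byte scale = 34 bytes per 32 elements = 8.5 bits
#   q4_0: 32 4-bit values + 2-byte scale = 18 bytes per 32 elements = 4.5 bits
BITS_PER_ELEMENT = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim,
                   type_k="f16", type_v="f16"):
    """Estimate KV-cache size in bytes for the given shape and cache types."""
    # K and V each store one vector of size (n_kv_heads * head_dim)
    # per token per layer.
    elements = n_ctx * n_layers * n_kv_heads * head_dim
    k_bytes = elements * BITS_PER_ELEMENT[type_k] / 8
    v_bytes = elements * BITS_PER_ELEMENT[type_v] / 8
    return int(k_bytes + v_bytes)


if __name__ == "__main__":
    GIB = 1024 ** 3
    # Hypothetical model shape, used only for illustration.
    shape = dict(n_ctx=32768, n_layers=32, n_kv_heads=8, head_dim=128)
    for type_k, type_v in [("f16", "f16"), ("q8_0", "q8_0"),
                           ("q8_0", "q4_0"), ("q4_0", "q4_0")]:
        size = kv_cache_bytes(**shape, type_k=type_k, type_v=type_v)
        print(f"K={type_k:5s} V={type_v:5s} -> {size / GIB:.3f} GiB")
```

Because K and V hold the same number of elements, the total for a mixed configuration is the same whichever of the two gets the smaller type; the choice only affects quality. Models without grouped-query attention have as many KV heads as attention heads, which multiplies every figure accordingly.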
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T18:54:09.373117+00:00— report_created — created