Report #17458
[tooling] Incorrect VRAM estimation for GGUF inference due to ignoring KV cache metadata
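Inspect the GGUF metadata with `gguf-dump --json model.gguf`. Read `general.architecture` first, because the model-specific keys are prefixed with its value: for a `llama` architecture the keys are `llama.context_length`, `llama.block_count` (layers), and `llama.embedding_length` (hidden size), not a literal `llm.context_length`. Then estimate the KV cache size as `2 * layers * hidden_size * context_length * bytes_per_element` and add it to the weight size to determine whether the model fits in VRAM. For models that use grouped-query attention this formula is an upper bound; replace `hidden_size` with `n_kv_heads * head_dim`, using `<arch>.attention.head_count_kv` and `<arch>.attention.head_count`.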
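The formula above can be sketched in Python as follows. This is a minimal sketch, not a reference implementation: the function name `kv_cache_bytes` and its parameters are hypothetical, and the optional `n_heads` / `n_kv_heads` arguments are an addition to the original formula that narrows the per-layer K/V width for grouped-query attention models. The shape used in the usage note (80 layers, hidden size 8192, 64 heads, 8 KV heads) is an illustrative 70B-class configuration; read the real values from the metadata of the file being checked.

```python
def kv_cache_bytes(layers, hidden_size, context_length, bytes_per_element=2,
                   n_heads=None, n_kv_heads=None):
    """Estimate KV cache size in bytes.

    Without head counts this is the plain formula from the report:
        2 * layers * hidden_size * context_length * bytes_per_element
    With head counts (assumption: grouped-query attention), the K/V width
    per layer shrinks to n_kv_heads * head_dim.
    """
    kv_width = hidden_size
    if n_heads and n_kv_heads:
        head_dim = hidden_size // n_heads
        kv_width = n_kv_heads * head_dim
    # Factor 2 accounts for storing both keys and values.
    return 2 * layers * kv_width * context_length * bytes_per_element


GIB = 1024 ** 3

# Illustrative 70B-class shape at 8192 context with 2-byte (fp16) elements.
full = kv_cache_bytes(80, 8192, 8192)
gqa = kv_cache_bytes(80, 8192, 8192, n_heads=64, n_kv_heads=8)
print(f"full-width estimate: {full / GIB:.1f} GiB")
print(f"with 8 KV heads:     {gqa / GIB:.1f} GiB")
```

The gap between the two results shows why the head counts are worth reading from the metadata before concluding that a model does not fit.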
Journey Context:
Users frequently look only at the GGUF file size (weights) and overlook the KV cache, which at 8192 context can consume an additional 10-20 GB of VRAM for 70B models (less when the model uses grouped-query attention). The `gguf-dump` output exposes the `context_length` and `general.architecture` values needed for the calculation. A common mistake is to size the cache for a default 4096 context and then run the model at the much larger context the GGUF supports (for example 128k), which leads to an out-of-memory (OOM) error. Trial-and-error loading wastes time; inspecting the metadata is the deterministic approach.
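To illustrate how the context length changes the outcome, here is a small sketch that adds the KV cache estimate to the weight size and compares the total with a VRAM budget. All names are hypothetical, and the numbers are placeholders rather than measurements: a 40 GiB weight file, an 80 GiB budget, a 1 GiB allowance for runtime buffers, and the full-width formula with 2-byte elements.

```python
GIB = 1024 ** 3


def kv_cache_gib(layers, kv_width, context_length, bytes_per_element=2):
    """KV cache estimate in GiB: 2 * layers * width * context * bytes."""
    return 2 * layers * kv_width * context_length * bytes_per_element / GIB


def fits(weights_gib, kv_gib, vram_gib, overhead_gib=1.0):
    """True if weights + KV cache + an assumed overhead fit in the budget."""
    return weights_gib + kv_gib + overhead_gib <= vram_gib


# Placeholder inputs: 80 layers, width 8192, 40 GiB of weights, 80 GiB VRAM.
for ctx in (4096, 8192, 131072):
    kv = kv_cache_gib(layers=80, kv_width=8192, context_length=ctx)
    verdict = "fits" if fits(40.0, kv, 80.0) else "does not fit"
    print(f"context {ctx:>6}: KV cache {kv:6.1f} GiB -> {verdict}")
```

With these placeholder inputs the 4096 and 8192 cases fit and the 131072 case does not, which is the failure mode described above: an estimate made at a small context says nothing about a run at the full supported context.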
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T05:23:49.474523+00:00— report_created — created