Report #41236
[tooling] Q4\_K\_M quantized model still consumes excessive VRAM, as if running f16
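Use the \`gguf\_dump.py\` script from the \`gguf-py\` package to inspect tensor types: \`python -m gguf.scripts.gguf\_dump model.gguf \| grep -E 'F16\|F32'\` \(or the \`gguf-dump\` command installed by \`pip install gguf\`; flags and output layout vary by version, so check \`--help\`\). Look for large \`F16\` or \`F32\` tensors, most often \`output.weight\` or \`token\_embd.weight\`, in a file labeled Q4\_K\_M. Small F32 norm tensors are expected and harmless. If large ones turn up, re-quantize from the original F16/F32 GGUF with explicit tensor type overrides \(e.g., \`--output-tensor-type q6\_k\`, \`--token-embedding-type q4\_k\`\) to bring VRAM use down toward the expected level.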
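The same check can be scripted. The sketch below is a minimal example, not the package's own tooling: it assumes \`GGUFReader\` from the \`gguf\` package exposes a \`tensors\` list whose entries have \`name\`, \`tensor\_type\`, \`n\_elements\`, and \`n\_bytes\`, which is worth verifying against your installed version. \`load\_tensor\_info\`, \`find\_unquantized\`, and \`min\_elements\` are hypothetical names introduced here.

```python
from typing import Iterable, List, Tuple

# (name, type name, element count, byte size) for one tensor
TensorInfo = Tuple[str, str, int, int]


def load_tensor_info(path: str) -> List[TensorInfo]:
    """Read tensor metadata from a GGUF file (needs `pip install gguf`)."""
    # Imported lazily so find_unquantized() works without the package.
    # Assumption: GGUFReader and these attribute names match your version.
    from gguf import GGUFReader

    reader = GGUFReader(path)
    return [
        (t.name, t.tensor_type.name, int(t.n_elements), int(t.n_bytes))
        for t in reader.tensors
    ]


def find_unquantized(
    tensors: Iterable[TensorInfo], min_elements: int = 1_000_000
) -> List[TensorInfo]:
    """Return F16/BF16/F32 tensors above a size threshold, largest first."""
    full_precision = {"F16", "BF16", "F32"}
    hits = [t for t in tensors if t[1] in full_precision and t[2] >= min_elements]
    return sorted(hits, key=lambda t: t[3], reverse=True)


# Usage (hypothetical file name):
#   for name, dtype, n, nbytes in find_unquantized(load_tensor_info("model.gguf")):
#       print(name, dtype, f"{nbytes / 2**30:.2f} GiB")
```

The \`min\_elements\` threshold filters out the small F32 norm tensors that every quantized file contains, so only tensors large enough to affect VRAM are reported.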
Journey Context:
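A file labeled Q4\_K\_M is a mix of tensor types, not uniformly 4-bit. Stock \`llama-quantize\` keeps small 1-D tensors \(norms\) in F32 and normally stores \`output.weight\` at a higher precision than the bulk of the model, typically Q6\_K, to preserve quality. Files produced with type overrides or by other conversion pipelines can leave \`output.weight\` or \`token\_embd.weight\` in F16 or F32, and the filename does not reveal this. Users see 'Q4\_K\_M' and expect a 4-bit average without realizing that some tensors may be full precision. The \`gguf-py\` package \(in the llama.cpp repo, or \`pip install gguf\`\) ships \`gguf\_dump.py\`, which lists every tensor with its shape and type. The cost of a stray tensor is vocabulary size times hidden size times bytes per weight: with a 128,256-token vocabulary and a hidden size of 8,192 \(the Llama 3 70B shape\), \`output.weight\` has about 1.05 billion elements, roughly 2.1 GB in F16 versus about 0.86 GB in Q6\_K. The relative cost is larger for small models with large vocabularies, where these two tensors make up a bigger share of the weights. The fix is to re-quantize with explicit overrides, \`llama-quantize --output-tensor-type q6\_k\` \(or q5\_k\) and, if needed, \`--token-embedding-type\`, trading a small amount of quality for the memory savings. Re-quantize from the original F16/F32 GGUF where possible; starting from an already quantized file requires \`--allow-requantize\` and loses additional quality. Common mistake: assuming the filename 'Q4\_K\_M' means all tensors are 4-bit.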
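To estimate what a single large tensor costs under different types, multiply its element count by the bytes per weight of the storage format. The sketch below does that for \`output.weight\`, using the GGML block sizes for each type. The 128,256 x 8,192 shape is an assumption for illustration \(the Llama 3 70B vocabulary and hidden size\); substitute the shape that the dump reports for your model. \`BLOCK\_LAYOUT\` and \`tensor\_bytes\` are hypothetical names.

```python
# (bytes per block, weights per block) for common GGML tensor types.
# F16/F32 store one weight per "block"; K-quants pack 256 weights per block.
BLOCK_LAYOUT = {
    "F32": (4, 1),
    "F16": (2, 1),
    "Q8_0": (34, 32),
    "Q6_K": (210, 256),
    "Q5_K": (176, 256),
    "Q4_K": (144, 256),
}


def tensor_bytes(n_elements: int, dtype: str) -> int:
    """Storage size in bytes of a tensor with n_elements weights."""
    block_bytes, block_weights = BLOCK_LAYOUT[dtype]
    if n_elements % block_weights:
        raise ValueError(f"{n_elements} is not a multiple of {block_weights}")
    return n_elements // block_weights * block_bytes


# Assumed shape: 128,256-token vocabulary x 8,192 hidden size.
n = 128_256 * 8_192
for dtype in ("F32", "F16", "Q6_K", "Q4_K"):
    print(f"output.weight as {dtype:5s}: {tensor_bytes(n, dtype) / 1e9:.2f} GB")
```

Comparing the F16 line with the Q6\_K line shows how much an \`--output-tensor-type q6\_k\` override would recover for that one tensor.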
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T23:41:12.727287+00:00— report_created — created