Report #42851
[tooling] Unexpected quality degradation or VRAM usage when loading quantized models due to mixed tensor types
Run `gguf-dump --json model.gguf | jq .` (or omit `jq` for raw JSON output) to inspect the per-tensor types (e.g., Q4_K, Q5_K, Q6_K, F16, F32) and metadata before loading, and confirm that the actual mix of types matches your VRAM budget and quality requirements. `gguf-dump` is the command-line entry point installed by `pip install gguf`.
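If `jq` is not available, the same inspection can be sketched in Python with the `GGUFReader` class from gguf-py. This is a minimal sketch, not the tool's own implementation: the helper names `count_tensor_types` and `read_tensor_rows` are hypothetical, and it assumes each entry in `reader.tensors` exposes `name`, `tensor_type`, and `n_bytes`, which may differ between gguf-py versions.

```python
import sys
from collections import Counter


def count_tensor_types(rows):
    """Tally tensors and bytes per quantization type.

    rows: iterable of (tensor_name, type_name, n_bytes) tuples.
    Returns {type_name: (tensor_count, total_bytes)}.
    """
    counts = Counter()
    sizes = Counter()
    for _name, type_name, n_bytes in rows:
        counts[type_name] += 1
        sizes[type_name] += n_bytes
    return {t: (counts[t], sizes[t]) for t in counts}


def read_tensor_rows(path):
    """Read (name, type, bytes) rows from a GGUF file via gguf-py."""
    # Imported lazily so the tally above also works without gguf installed.
    from gguf import GGUFReader

    reader = GGUFReader(path)
    # Assumption: tensor_type is an enum whose .name is e.g. "Q4_K".
    return [(t.name, t.tensor_type.name, int(t.n_bytes)) for t in reader.tensors]


if __name__ == "__main__" and len(sys.argv) > 1:
    rows = read_tensor_rows(sys.argv[1])
    for type_name, (count, total) in sorted(count_tensor_types(rows).items()):
        print(f"{type_name:8s} {count:5d} tensors {total / 2**20:10.1f} MiB")
    # Print the tensors this report singles out, if the file has them.
    for name, type_name, _ in rows:
        if name in ("output.weight", "token_embd.weight"):
            print(f"{name}: {type_name}")
```

Saved under a name of your choice (for example `tensor_types.py`, a hypothetical filename) and run as `python tensor_types.py model.gguf`, it prints one line per tensor type followed by the types of the output and embedding tensors.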
Journey Context:
A GGUF file usually mixes quantization types: different tensors (attention, feed-forward, embeddings, output) can be stored at different bit depths. A label such as Q4_K_M names a file-level preset, not a per-tensor type. Within such a file most weights are typically Q4_K, selected tensors are promoted to a higher-precision type such as Q6_K, and small tensors such as normalization weights stay in F32. Users often assume 'Q4' means every tensor is 4-bit, but the K-quant presets choose the type per tensor according to its importance, which affects both memory usage and perplexity. Dumping the tensor info lets you verify whether the output.weight tensor (critical for quality) is quantized aggressively or kept at higher precision. Common mistakes: downloading a 'Q4' model and wondering why it uses more VRAM than expected (some tensors are stored as Q6_K, F16, or F32) or why quality is poor (the output tensor was quantized to 4-bit). The gguf-dump tool is part of the gguf-py package in the official llama.cpp repository.
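To see why the type of a single large tensor matters for VRAM, the arithmetic can be sketched as below. The block layouts are the standard GGML ones (for example, Q4_K packs 256 weights into 144 bytes, i.e. 4.5 bits per weight). The 4096 x 32000 shape for output.weight is an assumed example, not taken from any specific model, and the helper names are hypothetical.

```python
# (bytes per block, weights per block) for a few GGML tensor types.
BLOCK_LAYOUT = {
    "F32": (4, 1),
    "F16": (2, 1),
    "Q8_0": (34, 32),
    "Q6_K": (210, 256),
    "Q5_K": (176, 256),
    "Q4_K": (144, 256),
}


def bits_per_weight(type_name):
    """Effective bits per weight, including per-block scales."""
    block_bytes, block_weights = BLOCK_LAYOUT[type_name]
    return 8 * block_bytes / block_weights


def tensor_bytes(n_elements, type_name):
    """Storage size of a tensor with n_elements weights in the given type."""
    block_bytes, block_weights = BLOCK_LAYOUT[type_name]
    if n_elements % block_weights:
        raise ValueError(f"{n_elements} is not a multiple of {block_weights}")
    return n_elements // block_weights * block_bytes


# Assumed example shape: hidden size 4096, vocabulary size 32000.
n = 4096 * 32000
for type_name in ("Q4_K", "Q6_K", "F16"):
    size_mib = tensor_bytes(n, type_name) / 2**20
    print(f"{type_name:5s} {bits_per_weight(type_name):7.4f} bpw {size_mib:7.1f} MiB")
```

For this assumed shape, the one tensor costs 73,728,000 bytes as Q4_K, 107,520,000 bytes as Q6_K, and 262,144,000 bytes as F16, so a file's 'Q4' label alone says little about its real footprint.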
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T02:23:38.624039+00:00— report_created — created