Report #24222
[tooling] Quantized model using unexpected VRAM or showing perplexity degradation despite correct filename
Use `gguf-dump --json model.gguf | jq '.tensors'` (the `gguf-dump` script is installed with the `gguf` Python package) to inspect the actual per-tensor quantization types, then check whether critical tensors (embeddings, norms) fell back to F16 or Q4_0 instead of the types expected for the target Q4_K_M.
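The same check can be scripted instead of read by eye. The sketch below assumes the JSON dump has a top-level `tensors` object mapping each tensor name to a record with a `type` string; verify this against your installed version of the dump tool. The helper names `count_tensor_types` and `unexpected_tensors` are hypothetical, and the set of expected types is something you choose yourself.

```python
from collections import Counter


def count_tensor_types(dump: dict) -> Counter:
    """Count tensors per quantization type in a parsed JSON dump."""
    # Assumed layout: {"tensors": {"<name>": {"type": "Q4_K", ...}, ...}}
    return Counter(info["type"] for info in dump.get("tensors", {}).values())


def unexpected_tensors(dump: dict, expected: set) -> list:
    """Return sorted (name, type) pairs whose type is not in `expected`."""
    return sorted(
        (name, info["type"])
        for name, info in dump.get("tensors", {}).items()
        if info["type"] not in expected
    )


# Hand-written sample for illustration only, not output from a real model.
sample_dump = {
    "tensors": {
        "token_embd.weight": {"type": "Q4_K"},
        "blk.0.attn_q.weight": {"type": "F16"},
        "blk.0.attn_norm.weight": {"type": "F32"},
    }
}

print(count_tensor_types(sample_dump))
print(unexpected_tensors(sample_dump, expected={"Q4_K", "Q6_K", "F32"}))
```

For a real file, replace `sample_dump` with `json.load(sys.stdin)` and pipe the JSON dump into the script. Any tensor the second function reports is a candidate for the fallback described in this report.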
Journey Context:
Converters (llama.cpp convert, AutoGPTQ) sometimes fall back to a different quantization type for specific tensors (output embeddings, layer norms) without clearly logging it on the CLI. Users then assume uniform quantization from the filename (e.g., Q4_K_M), which leads to VRAM miscalculations (unexpected F16 tensors) or quality degradation (sensitive layers quantized too aggressively). The `gguf-py` package provides a CLI tool that dumps the actual tensor metadata. This is essential for debugging 'why does my 70B Q4 use 48 GB instead of 40 GB?' or for diagnosing perplexity spikes.
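To see where the extra gigabytes come from, it can help to sum tensor bytes per quantization type. This is a minimal sketch, assuming the `GGUFReader` class from the `gguf` package and its per-tensor `name`, `tensor_type`, and `n_bytes` fields. The helper names (`bytes_by_type`, `read_tensor_rows`, `format_report`) are hypothetical, and the report covers weight bytes only, not KV cache or runtime buffers.

```python
from collections import defaultdict


def bytes_by_type(rows):
    """Sum bytes per type name from (name, type_name, n_bytes) rows."""
    totals = defaultdict(int)
    for _name, type_name, n_bytes in rows:
        totals[type_name] += n_bytes
    return dict(totals)


def read_tensor_rows(path):
    """Read (name, type_name, n_bytes) rows from a GGUF file."""
    # Imported lazily so the tally helpers work without gguf installed.
    from gguf import GGUFReader

    reader = GGUFReader(path)
    return [(t.name, t.tensor_type.name, int(t.n_bytes)) for t in reader.tensors]


def format_report(totals):
    """Render one line per type, largest first, with its share of the total."""
    grand_total = sum(totals.values())
    if grand_total == 0:
        return ""
    lines = []
    for type_name, n_bytes in sorted(totals.items(), key=lambda kv: -kv[1]):
        share = 100 * n_bytes / grand_total
        lines.append(f"{type_name:>6}  {n_bytes / 2**30:8.2f} GiB  {share:5.1f}%")
    return "\n".join(lines)


# Usage (the path is a placeholder):
# print(format_report(bytes_by_type(read_tensor_rows("model.gguf"))))
```

A large F16 or F32 share in this report would be consistent with the unexpected-VRAM symptom, while an unexpectedly low-precision type on sensitive tensors would point toward the quality symptom.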
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T19:03:38.717161+00:00— report_created — created