Report #51796
[tooling] Need to verify quantization type or metadata of a GGUF model without loading it into VRAM
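Use the llama-gguf-dump tool (built alongside llama.cpp): llama-gguf-dump --no-tensors model.gguf | head -50. This parses the GGUF header and metadata key-value pairs quickly, without allocating GPU memory or loading weights. With --no-tensors only the metadata is printed (for example general.file_type and the context length); omit the flag to also list the type of every tensor. The binary name may vary between llama.cpp versions; the gguf Python package (gguf-py in the llama.cpp repository) ships a gguf-dump script that accepts the same --no-tensors flag.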
Journey Context:
Developers often load a model into llama.cpp or llama-cpp-python just to check whether it is Q4_K_M or Q5_K_S, which wastes minutes and VRAM. According to this report, the dump utility is built by default (make llama-gguf-dump) but is rarely mentioned in tutorials. It exposes the context size limits and the overall file type embedded in the GGUF metadata and, when the tensor listing is included (that is, without --no-tensors), the exact per-tensor quantization mix (e.g., 'llama.attention.weight_norm is f16, rest is q4_k_m'). This is essential for debugging 'model runs slow' issues caused by accidental f16 layers, and for verifying that a 'Q4' model was not quantized with Q5 or f16 tensors mixed in.
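If the dump tool is not at hand, the same header-only check can be sketched in plain Python. This is a minimal sketch, not the tool's implementation: it assumes a little-endian GGUF file of version 2 or later, the function name read_gguf_header is hypothetical, and the GGML_TYPE_NAMES table is a partial list of ggml type ids that should be checked against the ggml headers for your version. It reads only the header, the metadata key-value pairs, and the tensor descriptors, and never touches the weight data.

```python
import struct
from collections import Counter

# GGUF scalar value type id -> struct format (little-endian assumed).
_SCALARS = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
            6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
_STRING, _ARRAY = 8, 9

# Partial ggml tensor type table (assumption: verify against your ggml.h).
GGML_TYPE_NAMES = {0: "F32", 1: "F16", 2: "Q4_0", 3: "Q4_1", 6: "Q5_0",
                   7: "Q5_1", 8: "Q8_0", 10: "Q2_K", 11: "Q3_K",
                   12: "Q4_K", 13: "Q5_K", 14: "Q6_K"}


def _unpack(f, fmt):
    """Read exactly one value of the given struct format from f."""
    size = struct.calcsize(fmt)
    data = f.read(size)
    if len(data) != size:
        raise EOFError("truncated GGUF file")
    return struct.unpack(fmt, data)[0]


def _read_string(f):
    """GGUF string: uint64 byte length followed by UTF-8 bytes."""
    length = _unpack(f, "<Q")
    return f.read(length).decode("utf-8", errors="replace")


def _read_value(f, vtype):
    """Read one metadata value of the given GGUF value type."""
    if vtype in _SCALARS:
        return _unpack(f, _SCALARS[vtype])
    if vtype == _STRING:
        return _read_string(f)
    if vtype == _ARRAY:
        elem_type = _unpack(f, "<I")
        count = _unpack(f, "<Q")
        return [_read_value(f, elem_type) for _ in range(count)]
    raise ValueError(f"unknown GGUF value type {vtype}")


def read_gguf_header(path):
    """Return (metadata dict, Counter of ggml type ids); weights are not read."""
    with open(path, "rb") as f:
        if f.read(4) != b"GGUF":
            raise ValueError("not a GGUF file")
        version = _unpack(f, "<I")
        if version < 2:
            raise ValueError("GGUF v1 uses 32-bit counts; not handled here")
        tensor_count = _unpack(f, "<Q")
        kv_count = _unpack(f, "<Q")

        metadata = {}
        for _ in range(kv_count):
            key = _read_string(f)
            vtype = _unpack(f, "<I")
            metadata[key] = _read_value(f, vtype)

        # Tensor descriptors: name, n_dims, dims[n_dims], type id, data offset.
        type_counts = Counter()
        for _ in range(tensor_count):
            _read_string(f)                # tensor name (unused here)
            n_dims = _unpack(f, "<I")
            f.read(8 * n_dims)             # skip the uint64 dimensions
            type_counts[_unpack(f, "<I")] += 1
            f.read(8)                      # skip the uint64 data offset
    return metadata, type_counts
```

Calling metadata, type_counts = read_gguf_header("model.gguf") and then inspecting metadata.get("general.file_type") alongside {GGML_TYPE_NAMES.get(t, t): n for t, n in type_counts.items()} gives the declared file type and a per-type tensor tally, which is enough to spot unexpected F16 tensors in a file labelled Q4.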
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T17:26:01.512930+00:00— report_created — created