Report #53473
[tooling] Unable to inspect GGUF model metadata (quantization type, architecture, context length) without loading full model weights into memory
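Use the gguf-dump utility from the gguf-py package: 'gguf-dump --json model.gguf' (the console script installed by 'pip install gguf'; from a llama.cpp checkout, run the gguf_dump.py script under gguf-py, whose exact location varies by version). It prints metadata keys such as 'general.architecture', 'llama.context_length', and 'general.file_type' (an integer enum identifying the overall quantization scheme), plus each tensor's name, shape, and type, without loading the weight data into memory. Note that 'general.quantization_version' is the version of the quantization format, not the quantization type.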
Journey Context:
Users often load a model in llama.cpp or Python just to check whether a file is Q4_K_M or Q5_0, wasting VRAM/RAM and time. A GGUF file begins with a fixed header, followed by a metadata key-value store and the tensor descriptors (name, shape, type, offset), and only then the tensor weights. The gguf-py package is the reference Python implementation maintained in the llama.cpp repository, but many users don't know it ships CLI utilities. gguf-dump reads the header, metadata, and tensor descriptors and does not load the weight data; this leading section is small relative to the weights, although an embedded tokenizer vocabulary can make it several megabytes. This is essential for debugging 'llama_model_load: error loading model: unknown model architecture' or for verifying context length limits before attempting inference. Alternatives like 'strings model.gguf | grep quantization' are fragile: numeric values such as the file type and context length are stored as binary integers, not text, so they never appear in the output. Note: This requires 'pip install gguf' or using the source from the llama.cpp repo.
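To illustrate why the metadata can be read without touching the weights, here is a minimal sketch of a header parser using only the Python standard library. It is not the gguf-py implementation. It assumes a little-endian GGUF file of version 2 or later (64-bit counts), and the function name read_gguf_metadata is hypothetical, chosen for this example. It stops after the last key-value pair, so the tensor descriptors and weight data are never read.

```python
import struct

# Scalar value types from the GGUF format: type id -> struct format character.
_SCALARS = {0: "B", 1: "b", 2: "H", 3: "h", 4: "I", 5: "i",
            6: "f", 7: "?", 10: "Q", 11: "q", 12: "d"}
_STRING, _ARRAY = 8, 9


def _unpack(f, fmt):
    """Read one little-endian scalar described by a struct format character."""
    size = struct.calcsize("<" + fmt)
    data = f.read(size)
    if len(data) != size:
        raise ValueError("unexpected end of file inside GGUF metadata")
    return struct.unpack("<" + fmt, data)[0]


def _read_string(f):
    """Read a GGUF string: uint64 byte length followed by UTF-8 bytes."""
    length = _unpack(f, "Q")
    data = f.read(length)
    if len(data) != length:
        raise ValueError("unexpected end of file inside GGUF string")
    return data.decode("utf-8", errors="replace")


def _read_value(f, vtype):
    """Read one metadata value of the given type id."""
    if vtype in _SCALARS:
        return _unpack(f, _SCALARS[vtype])
    if vtype == _STRING:
        return _read_string(f)
    if vtype == _ARRAY:
        item_type = _unpack(f, "I")   # element type id
        count = _unpack(f, "Q")       # number of elements
        return [_read_value(f, item_type) for _ in range(count)]
    raise ValueError(f"unknown GGUF value type {vtype}")


def read_gguf_metadata(path):
    """Return (version, tensor_count, metadata dict) without reading weights.

    Assumption: little-endian file, GGUF version >= 2.
    """
    with open(path, "rb") as f:
        if f.read(4) != b"GGUF":
            raise ValueError("not a GGUF file (bad magic)")
        version = _unpack(f, "I")
        if version < 2:
            raise ValueError("GGUF v1 uses 32-bit counts; not handled here")
        tensor_count = _unpack(f, "Q")
        kv_count = _unpack(f, "Q")
        metadata = {}
        for _ in range(kv_count):
            key = _read_string(f)
            vtype = _unpack(f, "I")
            metadata[key] = _read_value(f, vtype)
    return version, tensor_count, metadata
```

Calling read_gguf_metadata("model.gguf") and looking up 'general.architecture' or 'general.file_type' in the returned dict gives the same values gguf-dump reports. Large tokenizer arrays are decoded element by element here, which is slow; for anything beyond a quick check, gguf-dump remains the better choice.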
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T20:14:56.948600+00:00— report_created — created