Report #15382
[tooling] GGUF model has incorrect context limit (e.g., 4k instead of 128k) and re-quantizing takes hours
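Use the `gguf-py` scripts to edit the metadata directly: run `python -m gguf.scripts.gguf_set_metadata model.gguf llama.context_length 131072` (exposed as the `gguf-set-metadata` command when `gguf-py` is pip-installed). As far as can be told from current llama.cpp, the script takes positional `model key value` arguments rather than `--input`/`--output`/`--key`/`--value` flags, and it overwrites the value in the file you pass, so work on a copy and try `--dry-run` first. The key prefix is the model architecture (`llama.` here; other architectures use their own prefix). The edit changes the context length value in the GGUF header without touching the tensor data, so it takes seconds instead of hours.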
Journey Context:
When converting and quantizing a model, the converter usually takes the context length from config.json. If the config is wrong (e.g., the base model was 4k but the fine-tune supports 128k), the GGUF file inherits the wrong metadata. Users then assume they must re-convert and re-quantize the entire model (hours of compute) to fix the context limit. They do not: the GGUF format stores metadata as a key-value header that precedes, and is separate from, the tensor data. The `gguf-py` package (shipped with llama.cpp) includes `gguf-set-metadata`, which overwrites simple fixed-size values such as `llama.context_length` or `llama.rope.freq_base` in place. A companion script, `gguf-new-metadata`, instead writes a new file with an updated header and the tensor data copied across unchanged, which is the route for edits that change the header's size. The in-place edit is near-instant and avoids wasted compute. Many users don't know these utilities exist and re-quantize unnecessarily.
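To make the header layout above concrete, here is a minimal standard-library sketch of why a scalar edit is cheap: it walks the key-value header, finds the value's byte offset, and overwrites only those bytes. It is not the `gguf-py` implementation. The function names (`find_scalar`, `get_scalar`, `set_scalar`) are hypothetical, and it assumes a little-endian GGUF file of version 2 or later, with the type IDs taken from the published GGUF spec.

```python
import struct

# Fixed-width scalar types from the GGUF spec: type id -> struct format.
# Assumption: little-endian file ("<"); big-endian GGUF files also exist.
SCALAR_FMT = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
              6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
T_STRING, T_ARRAY = 8, 9


def _read(f, fmt):
    """Read and unpack a single value of the given struct format."""
    return struct.unpack(fmt, f.read(struct.calcsize(fmt)))[0]


def _read_string(f):
    """GGUF string: uint64 byte length followed by UTF-8 bytes."""
    return f.read(_read(f, "<Q")).decode("utf-8")


def _skip_value(f, vtype):
    """Advance past one metadata value without decoding it."""
    if vtype in SCALAR_FMT:
        f.seek(struct.calcsize(SCALAR_FMT[vtype]), 1)
    elif vtype == T_STRING:
        f.seek(_read(f, "<Q"), 1)
    elif vtype == T_ARRAY:
        item_type = _read(f, "<I")
        count = _read(f, "<Q")
        if item_type in SCALAR_FMT:
            f.seek(count * struct.calcsize(SCALAR_FMT[item_type]), 1)
        else:
            for _ in range(count):
                _skip_value(f, item_type)
    else:
        raise ValueError(f"unknown GGUF value type {vtype}")


def find_scalar(f, key):
    """Return (byte offset, type id) of the scalar stored under `key`."""
    f.seek(0)
    if f.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    if _read(f, "<I") < 2:
        raise ValueError("this sketch only handles GGUF version 2 or later")
    _read(f, "<Q")             # tensor count, not needed here
    kv_count = _read(f, "<Q")  # number of metadata key-value pairs
    for _ in range(kv_count):
        name = _read_string(f)
        vtype = _read(f, "<I")
        if name == key:
            if vtype not in SCALAR_FMT:
                raise TypeError(f"{key} is not a fixed-width scalar")
            return f.tell(), vtype
        _skip_value(f, vtype)
    raise KeyError(key)


def get_scalar(path, key):
    """Read the current value of a scalar metadata key."""
    with open(path, "rb") as f:
        _, vtype = find_scalar(f, key)  # leaves f positioned at the value
        return _read(f, SCALAR_FMT[vtype])


def set_scalar(path, key, value):
    """Overwrite a scalar metadata value in place; file size is unchanged."""
    with open(path, "r+b") as f:
        offset, vtype = find_scalar(f, key)
        f.seek(offset)
        f.write(struct.pack(SCALAR_FMT[vtype], value))
```

On a backup copy, `set_scalar("model-copy.gguf", "llama.context_length", 131072)` followed by `get_scalar(...)` shows the change. Because the new value occupies the same number of bytes as the old one, nothing after it, including the tensor data, has to move. That is also why this approach does not extend to strings or arrays, whose length can change.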
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T23:53:58.688800+00:00— report_created — created