Report #79722
[tooling] Need to extend context window or change RoPE scaling of existing GGUF model without re-converting from Safetensors
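Use `gguf-set-metadata` from the `gguf-py` package to modify `llama.context_length` and `llama.rope.freq_base` (or the linear-scaling key `llama.rope.scaling.factor`, `llama.rope.scale_linear` in older files) directly in the GGUF file: `python -m gguf.scripts.gguf_set_metadata model.gguf llama.context_length 32768`. The tool overwrites an existing scalar value in place, so the key must already be present in the file; back up the model first. The edit takes seconds and leaves tensor data untouched, avoiding hours of re-conversion and re-quantization.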
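Because the edit happens in place, it may help to wrap the command in a small script that previews the change and keeps a backup. The following is a minimal sketch, not part of `gguf-py`: the helper names `build_set_metadata_cmd` and `patch_with_backup` are hypothetical, and it assumes the installed `gguf` package exposes the script at `gguf.scripts.gguf_set_metadata` with a `--dry-run` flag.

```python
import shutil
import subprocess
import sys
from pathlib import Path


def build_set_metadata_cmd(model_path, key, value, dry_run=True):
    """Build the argv list for the gguf-py set-metadata script.

    Assumption: the script is importable as gguf.scripts.gguf_set_metadata
    and accepts --dry-run; check your installed gguf version.
    """
    cmd = [sys.executable, "-m", "gguf.scripts.gguf_set_metadata"]
    if dry_run:
        cmd.append("--dry-run")
    cmd += [str(model_path), key, str(value)]
    return cmd


def patch_with_backup(model_path, key, value, dry_run=True):
    """Copy the model to <name>.bak (real runs only), then invoke the script."""
    model_path = Path(model_path)
    if not dry_run:
        backup = model_path.with_name(model_path.name + ".bak")
        shutil.copy2(model_path, backup)  # full copy; needs free disk space
    # stdin is inherited, so any confirmation prompt from the script still works
    return subprocess.run(
        build_set_metadata_cmd(model_path, key, value, dry_run), check=True
    )
```

A typical sequence would be `patch_with_backup("model.gguf", "llama.context_length", 32768)` to preview, then the same call with `dry_run=False` once the reported change looks right.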
Journey Context:
Users frequently need to extend context windows (e.g., 4096 -> 32768) or adjust the RoPE base frequency for NTK-aware scaling. The naive approach is to re-run `convert-hf-to-gguf.py` (named `convert_hf_to_gguf.py` in newer llama.cpp checkouts) and re-quantize, which takes hours for 70B models and can produce slightly different quantization errors than the original file. The GGUF format stores metadata as key-value pairs in the file header, ahead of the tensor data; use `gguf-dump` to inspect them and `gguf-set-metadata` to modify them in place. Changing `llama.context_length` only updates the context size the file reports. The model's actual long-context behavior depends on the RoPE settings, so adjusting `llama.rope.freq_base` (e.g., from 10000 to roughly 40000 for a 4x extension) applies NTK-aware scaling without retraining or reconverting. This is a common way to patch existing GGUFs.
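The `freq_base` figure quoted above can be derived rather than guessed. A minimal sketch of the commonly cited NTK-aware rule, `base * scale ** (d / (d - 2))`, where `d` is the per-head dimension; the function name `ntk_scaled_freq_base` is hypothetical, and `head_dim=128` is an assumption that matches many Llama-family models but should be checked against the model's own metadata.

```python
def ntk_scaled_freq_base(base: float, scale: float, head_dim: int) -> float:
    """Return an NTK-aware RoPE base for a given context-extension factor.

    base     -- original rope freq_base (10000 for many Llama models)
    scale    -- desired context multiplier (e.g. 4 for 4096 -> 16384)
    head_dim -- per-head dimension (assumed; read it from the model config)
    """
    if head_dim <= 2:
        raise ValueError("head_dim must be greater than 2")
    if scale <= 0:
        raise ValueError("scale must be positive")
    return base * scale ** (head_dim / (head_dim - 2))


# For a 4x extension with head_dim=128 the result is a little above
# 4 * 10000, which is why 40000 is a reasonable round starting point.
new_base = ntk_scaled_freq_base(10000.0, 4.0, 128)
```

The computed value is only a starting point; output quality at the extended length still needs to be checked on the actual model.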
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T16:24:39.581560+00:00— report_created — created