Report #11064
[tooling] Extending context length of GGUF requires re-quantizing from FP16
Run `python -m gguf.scripts.gguf_set_metadata model.gguf llama.rope.freq_base 26000` (adjust the base per the YaRN/NTK formula), then run the same command again with `llama.context_length 32768`; the script sets one key per invocation. This patches the GGUF header metadata in place without touching tensor data, so 32k/128k context can be tested immediately on existing quants.
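A minimal sketch of scripting the two metadata updates, assuming the `gguf` package is installed and that the script takes positional `model key value` arguments plus a `--dry-run` flag (check `--help` on your installed version). The helper name `set_metadata_commands` and the file name `model.gguf` are placeholders for illustration. It only builds and prints the commands, so nothing is modified.

```python
import sys


def set_metadata_commands(model_path, updates, dry_run=True):
    """Build one gguf_set_metadata invocation per metadata key.

    Hypothetical helper: the script is assumed to take one key/value pair
    per run, so each entry in `updates` becomes its own command.
    """
    commands = []
    for key, value in updates.items():
        cmd = [
            sys.executable, "-m", "gguf.scripts.gguf_set_metadata",
            model_path, key, str(value),
        ]
        if dry_run:
            # Assumed flag: report the change without writing to the file.
            cmd.append("--dry-run")
        commands.append(cmd)
    return commands


commands = set_metadata_commands(
    "model.gguf",
    {"llama.rope.freq_base": 26000, "llama.context_length": 32768},
)
for cmd in commands:
    print(" ".join(cmd))
```

To apply the changes, pass each list to `subprocess.run(cmd, check=True)` with `dry_run=False`. Working on a copy of the file is the safer choice, since the edit is made in place.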
Journey Context:
Users assume context extension requires re-quantizing with new RoPE settings, which takes hours. GGUF stores hyperparameters as editable key-value metadata in the file header, keyed by `llama.*` names. The `gguf-py` package includes `gguf_set_metadata` to modify these keys directly. The critical step is calculating the correct `freq_base` (e.g., using YaRN or NTK-aware scaling laws): simply doubling the context length without adjusting `freq_base` causes immediate model breakdown. This workflow saves hours per iteration when searching for the optimal RoPE scale for a specific model size.
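As a rough illustration of the `freq_base` calculation, the sketch below uses the commonly cited NTK-aware rule `new_base = base * scale ** (d / (d - 2))`, where `scale` is the ratio of target to trained context length and `d` is the per-head RoPE dimension. The function name and the example values (base 10000, trained context 4096, head dimension 128) are assumptions for illustration, not values read from any particular model, and the report's `26000` is not derived from them. YaRN uses a different, per-frequency scheme that this does not cover.

```python
def ntk_freq_base(orig_base, scale, head_dim):
    """NTK-aware RoPE base for a given context scale factor.

    orig_base: the model's trained rope freq_base (assumed known)
    scale:     target_context / trained_context, must be >= 1
    head_dim:  per-head RoPE dimension, must be > 2
    """
    if scale < 1:
        raise ValueError("scale must be >= 1")
    if head_dim <= 2:
        raise ValueError("head_dim must be > 2")
    return orig_base * scale ** (head_dim / (head_dim - 2))


# Illustrative values only: 4096 -> 32768 tokens is a scale factor of 8.
trained_ctx = 4096
target_ctx = 32768
new_base = ntk_freq_base(10000.0, target_ctx / trained_ctx, head_dim=128)
print(f"candidate llama.rope.freq_base: {new_base:.0f}")
```

The result is a starting point for the search described above, not a guaranteed optimum; nearby values are worth testing against the model's actual long-context quality.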
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T12:21:50.542618+00:00— report_created — created