Report #11246
[tooling] Model fails to use full context length (stuck at 4k/8k) or shows incorrect RoPE scaling (YaRN/NTK) behavior in GGUF
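Use `gguf-set-metadata` (from the `gguf-py` package) to edit the `llama.context_length` and `llama.rope.scale_linear` metadata values directly in the GGUF file. Two caveats: the key prefix follows the model's `general.architecture` value (`llama.` here), and `llama.rope.scale_linear` is the legacy linear-scaling key, so newer conversions may carry `llama.rope.scaling.*` keys such as `llama.rope.scaling.factor` instead. The tool only overwrites existing simple (numeric or boolean) values in place; it cannot add a missing key or change a string value such as `llama.rope.scaling.type`. Where the keys exist, this fixes context limits or RoPE scaling in seconds without re-quantizing, avoiding hours of re-conversion from HF weights.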
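A minimal sketch of the invocation, assuming the `gguf-set-metadata` entry point is on `PATH` and takes the positional arguments `model key value` plus an optional `--dry-run`. The file name `model.gguf`, the target context of 32768, and the helper names are hypothetical; the scaling key shown is the newer `llama.rope.scaling.factor`, and it must already exist in the file for the edit to succeed.

```python
import shlex


def build_set_metadata_cmd(model_path, key, value, dry_run=True):
    """Return the argv list for one gguf-set-metadata call (nothing is executed)."""
    cmd = ["gguf-set-metadata", str(model_path), key, str(value)]
    if dry_run:
        # --dry-run reports the change without writing to the file.
        cmd.append("--dry-run")
    return cmd


def linear_rope_factor(target_ctx, trained_ctx):
    """Linear RoPE scaling factor: how many times longer the target context is."""
    return target_ctx / trained_ctx


if __name__ == "__main__":
    model = "model.gguf"  # hypothetical path
    edits = [
        ("llama.context_length", 32768),
        ("llama.rope.scaling.factor", linear_rope_factor(32768, 4096)),
    ]
    for key, value in edits:
        cmd = build_set_metadata_cmd(model, key, value)
        # Print the command for review; to run it, pass cmd to
        # subprocess.run(cmd, check=True) after dropping dry_run.
        print(shlex.join(cmd))
```

Since the tool edits the file in place, it is worth keeping a backup copy of the GGUF file and reviewing the dry-run output before applying the change.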
Journey Context:
When a base model is converted to GGUF with default settings (e.g., 4096 context), or when the RoPE scaling metadata is missing or incorrect, users often resort to re-running `convert_hf_to_gguf.py`, which requires the original HF weights (often 100GB+) and then hours of CPU time for quantization. That is usually unnecessary, because the GGUF format stores these parameters as key-value metadata in the file header, separate from the tensor data. The `gguf-py` toolkit (included in the llama.cpp repository) provides `gguf-set-metadata` to overwrite those values in place. This matters for YaRN/NTK scaling on pre-converted models because llama.cpp reads the trained context length and RoPE scaling parameters from this metadata and uses them as defaults when no overriding option is given. Common mistake: editing the metadata without checking the runtime options. Depending on the llama.cpp version, the larger context may still need to be requested with `--ctx-size`, and any `--rope-scaling`, `--rope-scale`, or `--yarn-orig-ctx` values passed on the command line override the metadata, so they should agree with it for correct inference.
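To check what a file actually contains before and after an edit, the header metadata can be read with the standard library alone. The sketch below assumes a little-endian GGUF v2 or v3 file and follows the published header layout (magic, version, tensor count, key-value count, then typed key-value pairs); the function names are illustrative and not part of `gguf-py`, whose `gguf-dump` tool serves the same purpose.

```python
import struct

# GGUF scalar value-type ids mapped to little-endian struct formats.
_SCALARS = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
            6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
_STRING, _ARRAY = 8, 9


def _unpack(f, fmt):
    """Read one value of the given struct format from the stream."""
    return struct.unpack(fmt, f.read(struct.calcsize(fmt)))[0]


def _read_string(f):
    """GGUF strings are a uint64 byte length followed by UTF-8 bytes."""
    length = _unpack(f, "<Q")
    return f.read(length).decode("utf-8", errors="replace")


def _read_value(f, vtype):
    if vtype in _SCALARS:
        return _unpack(f, _SCALARS[vtype])
    if vtype == _STRING:
        return _read_string(f)
    if vtype == _ARRAY:
        # Arrays store the element type, then the count, then the elements.
        elem_type = _unpack(f, "<I")
        count = _unpack(f, "<Q")
        return [_read_value(f, elem_type) for _ in range(count)]
    raise ValueError(f"unknown GGUF value type {vtype}")


def read_gguf_metadata(f):
    """Return {key: value} for the metadata block of a GGUF v2/v3 stream."""
    if f.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    version = _unpack(f, "<I")
    if version < 2:
        raise ValueError("GGUF v1 uses 32-bit counts; not handled in this sketch")
    _tensor_count = _unpack(f, "<Q")  # read only to advance past it
    kv_count = _unpack(f, "<Q")
    meta = {}
    for _ in range(kv_count):
        key = _read_string(f)
        vtype = _unpack(f, "<I")
        meta[key] = _read_value(f, vtype)
    return meta


if __name__ == "__main__":
    import sys
    with open(sys.argv[1], "rb") as fh:
        for key, value in read_gguf_metadata(fh).items():
            # Show only the context-length and RoPE-related keys.
            if "context_length" in key or ".rope." in key:
                print(f"{key} = {value}")
```

Run it with the GGUF path as the only argument. If a RoPE key does not appear in the output, an in-place edit has nothing to overwrite, and the scaling has to be supplied through runtime options instead.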
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T12:50:17.596684+00:00— report_created — created