Report #54205
[tooling] Need to extend context window beyond 4096/8192 on existing GGUF without reconverting from FP16 or modifying model files
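Use llama.cpp's runtime flags to extend context and adjust RoPE scaling without reconverting. Request the longer window with `-c 16384` (or the desired length); if the stored metadata value itself must be overridden, add `--override-kv llama.context_length=int:16384` (the flag takes `KEY=TYPE:VALUE`, so the `int:` prefix is required). Then adjust RoPE so that positions past the trained length stay usable: either `--rope-scale 2.0` for linear scaling (a factor of 2 doubles the trained context) or a raised `--rope-freq-base` for NTK-aware scaling. Note that 10000 is the default base for the original LLaMA models, so passing `--rope-freq-base 10000` to such a model changes nothing; the base has to go above the trained value. Expect some perplexity loss compared with native-length inference; the goal is to limit degradation at longer contexts, not to eliminate it.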
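As a rough illustration of the flag combination above, here is a minimal Python sketch that derives the linear scale factor from the trained and target context lengths and assembles a command line. The helper name `build_llama_args`, the binary name `llama-cli`, and the file `model.gguf` are assumptions for illustration, not part of the report; check the flags against `--help` on your own build before running.

```python
import shlex


def build_llama_args(model_path, trained_ctx, target_ctx, binary="llama-cli"):
    """Assemble an argv list for running a GGUF at a longer context.

    Hypothetical helper: the binary name and flag spelling are assumed to
    match a recent llama.cpp build.
    """
    if target_ctx < trained_ctx:
        raise ValueError("target_ctx should not be smaller than trained_ctx")

    # Linear RoPE scaling: factor = target length / trained length.
    scale = target_ctx / trained_ctx

    args = [binary, "-m", model_path, "-c", str(target_ctx)]
    if scale > 1:
        # Only add RoPE flags when the context is actually being extended.
        args += ["--rope-scaling", "linear", "--rope-scale", f"{scale:g}"]
    return args


cmd = build_llama_args("model.gguf", trained_ctx=4096, target_ctx=8192)
print(shlex.join(cmd))
```

Building the list in code rather than by hand keeps `-c` and `--rope-scale` consistent with each other, which is the easiest thing to get wrong when experimenting with several target lengths.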
Journey Context:
Users often believe that extending context requires modifying the GGUF file metadata using `gguf-py` scripts or reconverting the original model with a new `--ctx` parameter, which takes hours for large models. llama.cpp can instead override key-value metadata at runtime using `--override-kv`. The critical insight is that simply increasing `context_length` without adjusting RoPE (Rotary Position Embedding) scaling causes catastrophic perplexity degradation at longer contexts, because the model never saw those position values during training. Either `--rope-scale` (linear scaling, which compresses positions back into the trained range) or `--rope-freq-base` (NTK-aware scaling, which raises the frequency base) must be adjusted to match the new context length, e.g. a scale of 2 when doubling the context. This workflow saves hours of reconversion time.
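To make the two scaling options concrete, the sketch below works through the arithmetic for doubling a 4096-token context. The function names are hypothetical, and the defaults (head dimension 128, trained base 10000) are assumptions typical of the original LLaMA models rather than values read from any particular GGUF. The NTK-aware formula used here, base multiplied by scale^(d / (d - 2)), is the commonly cited community heuristic, not something llama.cpp computes for you.

```python
def linear_rope_scale(trained_ctx, target_ctx):
    """Scale factor for linear scaling: target length over trained length."""
    return target_ctx / trained_ctx


def scaled_position(position, scale):
    """Linear scaling divides each position by the scale factor."""
    return position / scale


def ntk_freq_base(scale, head_dim=128, trained_base=10000.0):
    """Commonly cited NTK-aware heuristic for a raised RoPE frequency base.

    head_dim and trained_base are assumed values; read the real ones from
    the model's metadata before relying on the result.
    """
    return trained_base * scale ** (head_dim / (head_dim - 2))


scale = linear_rope_scale(4096, 8192)
last_pos = scaled_position(8191, scale)   # last token of the extended window
new_base = ntk_freq_base(scale)

print(f"linear scale factor: {scale:g}")
print(f"position 8191 after linear scaling: {last_pos:g}")
print(f"suggested --rope-freq-base for NTK-aware scaling: {new_base:.0f}")
```

With linear scaling, the last position of the extended window lands back inside the trained range of 0 to 4095, which is why the factor has to track the context ratio. The NTK-aware value is only a starting point; measuring perplexity at the target length is the reliable way to choose between the two.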
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T21:28:46.342884+00:00— report_created — created