Report #50545
[tooling] Extending context window of existing GGUF requires hours of requantization
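Use llama.cpp's `--override-kv` flag, which takes `KEY=TYPE:VALUE` pairs, to extend the context at load time without regenerating the file: `--override-kv llama.context_length=int:N --override-kv llama.rope.freq_base=float:<computed_base>`. To make the change permanent, edit the metadata in place with `gguf-set-metadata` from gguf-py.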
Journey Context:
Users believe the context length is baked into the tensor data and spend hours requantizing from FP16 when they need 32K or 128K context. In reality, GGUF stores hyperparameters such as `context_length` and the RoPE frequency base as key-value metadata, separate from the weights, so they can be overridden at runtime. The critical detail is calculating the new RoPE frequency base with the NTK-aware formula `new_base = original_base * scaling_factor^(dim / (dim - 2))`, where `scaling_factor = new_length / original_train_length` and `dim` is the per-head rotary dimension. If you override only `context_length` without adjusting RoPE, output quality degrades as soon as the prompt runs past the original training length. This approach replaces hours of requantization with a few minutes of calculation.
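The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not a verified recipe. The helper names `ntk_rope_base` and `override_args` are hypothetical, and the example values (base 10000, 8192 trained tokens, head dimension 128) are assumptions, so read the real ones from your model's metadata. The `int:` and `float:` type prefixes follow the `KEY=TYPE:VALUE` form of `--override-kv`; check them against `--help` for your llama.cpp build.

```python
def ntk_rope_base(original_base: float, train_len: int,
                  target_len: int, head_dim: int) -> float:
    """Return the RoPE frequency base scaled for a longer context.

    Implements new_base = original_base * s^(dim / (dim - 2)),
    with s = target_len / train_len.
    """
    if head_dim <= 2:
        raise ValueError("head_dim must be greater than 2")
    scale = target_len / train_len
    if scale <= 1.0:
        # No extension requested, so the original base still applies.
        return original_base
    return original_base * scale ** (head_dim / (head_dim - 2))


def override_args(target_len: int, new_base: float) -> list[str]:
    """Build the metadata override arguments as a list of argv tokens."""
    return [
        "--override-kv", f"llama.context_length=int:{target_len}",
        "--override-kv", f"llama.rope.freq_base=float:{new_base:.1f}",
    ]


if __name__ == "__main__":
    # Assumed example values; replace with the ones from your GGUF file.
    base = ntk_rope_base(10000.0, train_len=8192,
                         target_len=32768, head_dim=128)
    print(" ".join(override_args(32768, base)))
```

Append the printed arguments to your usual llama.cpp command line, then compare output quality on a prompt longer than the original training length before relying on the result.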
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T15:19:35.107180+00:00— report_created — created