Report #15163
[tooling] GGUF metadata hardcodes a 4096-token context limit, but 32k is needed without re-converting from Safetensors
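Use llama.cpp's typed metadata overrides, e.g. --override-kv llama.context_length=int:32768, request the larger window with -c 32768, and adjust RoPE to match (for example --rope-scaling yarn --rope-scale 8 --yarn-orig-ctx 4096, or a raised --rope-freq-base for NTK-style scaling). This extends the context window at runtime without modifying the GGUF file. Note that --override-kv expects KEY=TYPE:VALUE, and that llama.rope.freq_base=float:10000.0 only restates the usual Llama default, so it does not extend anything by itself.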
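llama.cpp parses each --override-kv argument as KEY=TYPE:VALUE, with TYPE one of int, float, bool, or str. A minimal Python sketch of building those arguments is below. The helper name override_kv_args is hypothetical, and the accepted types should be checked against the --help output of your own build.

```python
def override_kv_args(overrides):
    """Turn {key: value} into repeated --override-kv KEY=TYPE:VALUE arguments."""
    args = []
    for key, value in overrides.items():
        # bool is checked before int because bool is a subclass of int in Python.
        if isinstance(value, bool):
            typed = "bool:true" if value else "bool:false"
        elif isinstance(value, int):
            typed = f"int:{value}"
        elif isinstance(value, float):
            typed = f"float:{value}"
        elif isinstance(value, str):
            typed = f"str:{value}"
        else:
            raise TypeError(f"unsupported override type for {key!r}: {type(value).__name__}")
        args += ["--override-kv", f"{key}={typed}"]
    return args


# Keys taken from the report; the values are the ones it proposes.
print(" ".join(override_kv_args({
    "llama.context_length": 32768,
    "llama.rope.freq_base": 10000.0,
})))
```

The type tag matters: an untyped value such as llama.context_length=32768 is not in the form the parser expects.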
Journey Context:
Converting a model to GGUF bakes the training context length and the RoPE frequency base into the file's metadata. Users often have a 4k GGUF but need 32k for long documents, and re-converting from Safetensors is slow and requires the original weights. llama.cpp's --override-kv flag patches metadata keys such as llama.context_length and llama.rope.freq_base (or the RoPE scaling factor) at load time, using the form KEY=TYPE:VALUE where TYPE is int, float, bool, or str. The loader then treats the model as if it had been converted with the larger limit, and the file on disk stays unchanged. Critical: raising the context length alone is not enough. RoPE must be rescaled to match (e.g. YaRN, or NTK-aware scaling that raises the frequency base above its stored value), otherwise output quality degrades once positions exceed the training context.
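The scaling step can be sketched as simple arithmetic: the RoPE scale factor is the target context divided by the training context, so 32768 / 4096 = 8. The Python sketch below assembles a command line from that ratio. It assumes a binary named llama-cli and the flags -m, -c, --rope-scaling, --rope-scale, and --yarn-orig-ctx as listed in recent llama.cpp help output; the function name and model.gguf are hypothetical, and the flags should be confirmed with --help before running.

```python
import shlex


def long_context_args(model_path, train_ctx, target_ctx):
    """Build an argument list that requests target_ctx tokens using YaRN scaling."""
    if target_ctx <= train_ctx:
        raise ValueError("target context must exceed the training context")
    # Scale factor: how many times the positions are stretched past training length.
    scale = target_ctx / train_ctx
    return [
        "llama-cli",                        # assumed binary name
        "-m", model_path,
        "-c", str(target_ctx),              # runtime context window
        "--rope-scaling", "yarn",
        "--rope-scale", f"{scale:g}",
        "--yarn-orig-ctx", str(train_ctx),  # context the model was trained with
    ]


# Example for the 4k -> 32k case described in this report.
print(shlex.join(long_context_args("model.gguf", 4096, 32768)))
```

Passing the original 4096 explicitly via --yarn-orig-ctx keeps the scaling anchored to the real training length even if llama.context_length is overridden in the same invocation.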
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T23:19:37.282051+00:00— report_created — created