Report #24562
[tooling] GGUF model has hardcoded 4096 context limit in metadata but supports 128k natively
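Run llama.cpp with --override-kv llama.context_length=int:128000 (the flag takes KEY=TYPE:VALUE) to override the context length at load time without requantizing, and pass a matching -c 128000 so the allocated window is the same size. To make the change permanent, use 'gguf-set-metadata' from gguf-py to edit the llama.context_length key in the GGUF file itself.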
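The runtime override above can be sketched as a small Python helper that assembles the command line. This is a minimal sketch under assumptions: the binary name llama-cli, the file name model.gguf, and the helper name build_llama_command are illustrative and do not come from the report, and the KEY=TYPE:VALUE form of --override-kv should be checked against the --help output of your build.

```python
import shlex
from typing import List


def build_llama_command(model_path: str, ctx_len: int,
                        binary: str = "llama-cli") -> List[str]:
    """Assemble an argv list that overrides llama.context_length at load time.

    The binary name and model path are placeholders (assumptions), not values
    taken from the report.
    """
    if ctx_len <= 0:
        raise ValueError("ctx_len must be a positive integer")
    return [
        binary,
        "-m", model_path,
        # Request a context window of the same size as the overridden key.
        "-c", str(ctx_len),
        # Patch the metadata key in memory; the GGUF file is not modified.
        "--override-kv", f"llama.context_length=int:{ctx_len}",
    ]


if __name__ == "__main__":
    cmd = build_llama_command("model.gguf", 128000)
    # Print a shell-quoted command for review before running it by hand.
    print(shlex.join(cmd))
```

Printing the command instead of executing it keeps the sketch safe to run on a machine without llama.cpp installed; pass the list to subprocess.run once the flags have been confirmed.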
Journey Context:
Many GGUF files on HuggingFace carry a conservative context length in their metadata (e.g., 4096, the Llama 2 default) baked in during conversion, even though the base model (like Llama 3.1 8B) natively supports 128k through its RoPE configuration. Users often waste hours requantizing from safetensors with corrected rope scaling just to change one integer in the metadata. The --override-kv flag patches GGUF key-value pairs at runtime using KEY=TYPE:VALUE syntax; the relevant keys here are llama.context_length (an int) and llama.rope.freq_base (a float). This works because llama.cpp reads this metadata only at load time, when it allocates the KV cache; the actual RoPE computation uses runtime parameters. Critical caveat: the override only helps if the model itself supports the target length, either natively or through an extension such as NTK-aware scaling applied to Llama 2. Forcing 128k on a model trained on 4k without proper scaling causes immediate degradation. The alternative, requantizing, is warranted only if you need to change rope scaling factors permanently or the model requires NTK-aware embedding modifications.
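For the NTK-aware case mentioned above, the adjusted RoPE base is often computed as base' = base * s^(d / (d - 2)), where s is the context scale factor and d is the per-head dimension. The sketch below applies that commonly cited formula; the values 10000 for the original base and 128 for the head dimension are assumptions for illustration, not values read from any particular GGUF file, and the function name is hypothetical.

```python
def ntk_scaled_freq_base(base: float, scale: float, head_dim: int) -> float:
    """Return an NTK-aware RoPE frequency base for a context scale factor.

    Uses the commonly cited form base * scale ** (d / (d - 2)); whether this
    suits a given model is an assumption that has to be validated empirically.
    """
    if scale < 1:
        raise ValueError("scale must be >= 1")
    if head_dim <= 2:
        raise ValueError("head_dim must be greater than 2")
    return base * scale ** (head_dim / (head_dim - 2))


if __name__ == "__main__":
    trained_ctx = 4096    # context length the model was trained on (assumed)
    target_ctx = 128000   # context length requested at runtime
    scale = target_ctx / trained_ctx
    new_base = ntk_scaled_freq_base(10000.0, scale, 128)
    # Emit the value in the float form expected by --override-kv.
    print(f"--override-kv llama.rope.freq_base=float:{new_base:.1f}")
```

A computed base is only a starting point: output quality at the extended length still needs to be checked, since a model that was never trained or fine-tuned for long contexts may degrade regardless of the value chosen.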
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T19:38:26.840643+00:00— report_created — created