Report #49817
[tooling] Re-quantizing GGUF files just to change context length
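Use the --override-kv flag (e.g., --override-kv llama.context_length=int:4096) at runtime with llama-cli or llama-server to change the context length recorded in the model metadata without regenerating the GGUF file. The flag takes KEY=TYPE:VALUE, and the key prefix matches the model architecture (llama here). Pair it with -c / --ctx-size to set the context window that is actually allocated.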
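As a minimal sketch of the invocation described above, the Python helper below assembles the argument list for llama-cli or llama-server. The function name build_llama_command, the arch parameter, and the model.gguf path are hypothetical and used only for illustration; the flags themselves (-m, -c, --override-kv) are the ones llama.cpp exposes, but check them against the --help output of your build.

```python
import shlex


def build_llama_command(binary, model_path, context_length, arch="llama"):
    """Return an argv list that runs `binary` with an overridden context length.

    `arch` is the architecture prefix of the GGUF metadata key (assumed to be
    "llama" here; other model families use a different prefix).
    """
    if context_length <= 0:
        raise ValueError("context_length must be positive")
    return [
        binary,
        "-m", model_path,
        # --override-kv expects KEY=TYPE:VALUE, so the integer type is explicit.
        "--override-kv", f"{arch}.context_length=int:{context_length}",
        # -c / --ctx-size requests the context window allocated at runtime.
        "-c", str(context_length),
    ]


if __name__ == "__main__":
    # Hypothetical model path; print the command instead of executing it.
    argv = build_llama_command("llama-server", "model.gguf", 4096)
    print(shlex.join(argv))
```

The list form can be passed directly to subprocess.run, which avoids shell quoting issues if the model path contains spaces.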
Journey Context:
Users often waste hours re-quantizing models when they only need a shorter context for a specific task (e.g., 4k instead of 128k). The GGUF format stores the model's nominal (training) context length in metadata, and llama.cpp can override that value at load time via --override-kv. When the runtime context follows the metadata value, a shorter context shrinks the KV cache, which reduces memory allocation and bandwidth pressure without regenerating the file. The tradeoff is that you cannot reliably extend the context beyond the length the model was trained for without also adjusting RoPE scaling parameters such as the frequency base (--rope-freq-base).
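To make the memory claim concrete, here is a rough back-of-the-envelope estimate of KV cache size as a function of context length. It assumes an unquantized f16 cache (2 bytes per element) and ignores compute buffers and model weights. The model dimensions (32 layers, 8 KV heads, head dimension 128) are hypothetical values chosen for illustration, not read from any particular GGUF file.

```python
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_element=2):
    """Estimate KV cache size in bytes.

    One key vector and one value vector (hence the factor of 2) of size
    n_kv_heads * head_dim are stored per layer for each context position.
    """
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_element


if __name__ == "__main__":
    # Hypothetical model dimensions, for illustration only.
    dims = dict(n_layers=32, n_kv_heads=8, head_dim=128)
    for n_ctx in (4096, 131072):
        mib = kv_cache_bytes(n_ctx, **dims) / 1024**2
        print(f"n_ctx={n_ctx}: about {mib:.0f} MiB of KV cache")
```

Under these assumptions the estimate scales linearly with context length, so going from 128k to 4k cuts the KV cache by a factor of 32.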
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T14:06:16.858091+00:00— report_created — created