Report #21339
[tooling] GGUF model has hardcoded 2048 context limit blocking long conversations
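Use `gguf-set-metadata` (from gguf-py) to override the `llama.context_length` metadata key before loading: `gguf-set-metadata model.gguf llama.context_length 32768`. The key is prefixed with the model's architecture name, so non-Llama models use a different prefix. This replaces the training context length recorded in the quantized file without re-quantizing. It does not change what the model was trained on, so output quality beyond the original length still depends on the architecture and its RoPE scaling.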
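Before editing, it can help to confirm which key and value the file actually carries. The following is a minimal standard-library sketch, not part of gguf-py; the helper names (`read_context_length` and the `_read*` functions) are hypothetical, and it assumes a little-endian GGUF file of version 2 or 3.

```python
import struct

# GGUF scalar value types mapped to struct format codes.
_SCALARS = {0: "B", 1: "b", 2: "H", 3: "h", 4: "I", 5: "i",
            6: "f", 7: "?", 10: "Q", 11: "q", 12: "d"}
_STRING, _ARRAY = 8, 9


def _read(f, fmt):
    """Read one little-endian value described by a struct format code."""
    size = struct.calcsize("<" + fmt)
    data = f.read(size)
    if len(data) != size:
        raise EOFError("unexpected end of file while reading metadata")
    return struct.unpack("<" + fmt, data)[0]


def _read_string(f):
    """GGUF strings are a uint64 byte length followed by UTF-8 bytes."""
    length = _read(f, "Q")
    return f.read(length).decode("utf-8", errors="replace")


def _read_value(f, vtype):
    """Read (and thereby skip past) one metadata value of the given type."""
    if vtype in _SCALARS:
        return _read(f, _SCALARS[vtype])
    if vtype == _STRING:
        return _read_string(f)
    if vtype == _ARRAY:
        item_type = _read(f, "I")
        count = _read(f, "Q")
        return [_read_value(f, item_type) for _ in range(count)]
    raise ValueError(f"unknown GGUF value type {vtype}")


def read_context_length(path):
    """Return (key, value) for the first '<arch>.context_length' key, or None."""
    with open(path, "rb") as f:
        if f.read(4) != b"GGUF":
            raise ValueError("not a GGUF file")
        version = _read(f, "I")
        if version < 2:
            raise ValueError("this sketch assumes GGUF version 2 or later")
        _read(f, "Q")  # tensor count, not needed here
        kv_count = _read(f, "Q")
        for _ in range(kv_count):
            key = _read_string(f)
            value = _read_value(f, _read(f, "I"))
            if key.endswith(".context_length"):
                return key, value
    return None
```

Calling `read_context_length("model.gguf")` before and after the `gguf-set-metadata` command should show whether the stored value actually changed. Large tokenizer arrays that precede the key are parsed element by element, so this is slow on big vocabularies but needs no dependencies.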
Journey Context:
Many GGUFs inherit low context limits (2048/4096) from their base models, even when the architecture supports 32k+ and local hardware has sufficient VRAM. Editing the metadata can be enough because llama.cpp and tools built on it read this value as the model's context length when sizing the KV cache, while rotary position embeddings are computed at runtime for each position rather than stored in the file. The edit is also persistent, whereas `--override-kv` has to be passed again on every launch.
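Whether VRAM is actually sufficient can be estimated from the KV cache size, which grows linearly with context length. The sketch below is a back-of-the-envelope calculation, not llama.cpp's own accounting. The model shape (32 layers, 32 KV heads, head dimension 128) is an illustrative assumption resembling a 7B Llama-style model, and it assumes an unquantized f16 cache with full attention.

```python
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Estimate KV cache size: K and V each hold n_ctx * n_kv_heads * head_dim
    elements per layer. bytes_per_elem=2 corresponds to an f16 cache."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem


# Assumed, illustrative model shape; read the real values from your model.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 32, 32, 128

for n_ctx in (2048, 32768):
    gib = kv_cache_bytes(n_ctx, N_LAYERS, N_KV_HEADS, HEAD_DIM) / 2**30
    print(f"{n_ctx} tokens: {gib:.1f} GiB")
# 2048 tokens: 1.0 GiB
# 32768 tokens: 16.0 GiB
```

Under these assumptions, raising the limit from 2048 to 32768 multiplies the cache by 16, on top of the memory the weights already use. Models with grouped-query attention have fewer KV heads and need proportionally less.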
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T14:13:42.960995+00:00— report_created — created