Report #6353
[tooling] Cannot extend context length beyond 4096/8192 tokens in GGUF model despite using --ctx-size flag
Edit the GGUF metadata directly: use `gguf-set-metadata` (from the `gguf-py` package) to raise `llama.context_length`, and set the RoPE scaling parameters (`llama.rope.freq_base`, `llama.rope.scale_linear`) to match your target length. Newer files may store the scale factor under `llama.rope.scaling.factor` instead, so check which keys your file actually contains. Then load the model in llama.cpp with matching `--ctx-size`, `--rope-scale`, and `--rope-freq-base` values.
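A minimal sketch of how the two commands above line up, assuming linear RoPE scaling and a model trained at 4096 tokens. The helper names, the `model.gguf` path, the `llama-cli` binary name, and the `model key value` argument order for `gguf-set-metadata` are assumptions for illustration; check them against your installed versions. The script only builds and prints the commands, it does not run them.

```python
from typing import List, Tuple


def linear_rope_scale(train_ctx: int, target_ctx: int) -> float:
    """Linear RoPE scale factor needed to stretch train_ctx to target_ctx."""
    if train_ctx <= 0 or target_ctx <= 0:
        raise ValueError("context lengths must be positive")
    # Never scale below 1.0: shrinking the context needs no RoPE change.
    return max(1.0, target_ctx / train_ctx)


def build_commands(
    model_path: str,
    train_ctx: int,
    target_ctx: int,
    freq_base: float = 10000.0,  # assumed default base; read the real one from the file
) -> Tuple[List[str], List[str]]:
    """Return (metadata edit command, inference command) with matching values."""
    scale = linear_rope_scale(train_ctx, target_ctx)
    # Assumed argument order: gguf-set-metadata <model> <key> <value>
    set_metadata = [
        "gguf-set-metadata", model_path,
        "llama.context_length", str(target_ctx),
    ]
    # Assumed binary name; the flags are the ones named in this report.
    run = [
        "llama-cli", "-m", model_path,
        "--ctx-size", str(target_ctx),
        "--rope-scale", str(scale),
        "--rope-freq-base", str(freq_base),
    ]
    return set_metadata, run


if __name__ == "__main__":
    for cmd in build_commands("model.gguf", train_ctx=4096, target_ctx=16384):
        print(" ".join(cmd))
```

Deriving both commands from one pair of numbers keeps the metadata and the runtime flags from drifting apart, which is the mismatch this report describes. Work on a copy of the GGUF file, since metadata edits modify the file itself.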
Journey Context:
Users often pass a larger `--ctx-size` to llama.cpp and then hit allocation errors or silent failures. The GGUF file itself carries a metadata field (`llama.context_length`) that records the context length the model expects, and depending on the loader or front end this value can act as a ceiling. Extending context also requires adjusting the RoPE frequencies (NTK-aware scaling or YaRN) to keep attention stable at the longer length. The workflow has two parts: 1) edit the metadata in the GGUF file using the Python `gguf` library tools, and 2) make sure the inference engine receives RoPE scaling parameters consistent with that metadata. Most tutorials cover only the inference flags and skip the metadata editing step.
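As a rough illustration of the NTK-aware option mentioned above, the sketch below uses the commonly cited formula that raises the RoPE frequency base by `scale ** (d / (d - 2))`, where `d` is the per-head dimension. The function name and the example values (base 10000, head dimension 128, 4x extension) are assumptions; read the real values from your model before using the result for `--rope-freq-base`.

```python
def ntk_aware_freq_base(base: float, scale: float, head_dim: int) -> float:
    """Adjusted RoPE frequency base for an NTK-aware context extension.

    base     -- original RoPE frequency base (e.g. llama.rope.freq_base)
    scale    -- target context length divided by trained context length
    head_dim -- dimension of one attention head (assumed known)
    """
    if scale < 1.0:
        raise ValueError("scale must be >= 1.0")
    if head_dim <= 2:
        raise ValueError("head_dim must be greater than 2")
    # Commonly cited NTK-aware rule: base' = base * scale^(d / (d - 2))
    return base * scale ** (head_dim / (head_dim - 2))


if __name__ == "__main__":
    # Hypothetical example: 4096 -> 16384 tokens on a model with 128-dim heads.
    new_base = ntk_aware_freq_base(base=10000.0, scale=4.0, head_dim=128)
    print(f"--rope-freq-base {new_base:.1f}")
```

With this approach the frequency base changes rather than the linear scale factor, so the value written to the metadata and the value passed at inference time should be the same number.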
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T23:49:37.249783+00:00— report_created — created