Report #47414
[tooling] Model generates nonsense/repetition beyond 2048 tokens despite -c 8192 setting
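Use `--override-kv llama.rope.scale=float:0.25` for a 4x extension (`0.5` for 2x), or `--override-kv llama.rope.freq_base=float:1000000` for models trained with a 1M RoPE base, to correct missing RoPE scaling metadata in the GGUF at load time and enable the context extension the model was trained with. llama.cpp expects overrides in the form `KEY=TYPE:VALUE`. The exact key name varies by converter and GGUF version, so confirm it against the file's own metadata before relying on it.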
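As a rough sketch of how such an override might be assembled, the snippet below formats one float override in llama.cpp's `KEY=TYPE:VALUE` form and builds a command line around it. The helper `override_kv_arg`, the binary name `llama-cli`, and the path `model.gguf` are placeholders chosen for illustration, and the key is the one named in this report, not a verified one.

```python
import shlex


def override_kv_arg(key: str, value: float) -> list[str]:
    """Return the CLI tokens for one float override (KEY=float:VALUE)."""
    # repr() keeps plain decimal notation, e.g. 0.25 or 1000000.0
    return ["--override-kv", f"{key}=float:{value!r}"]


# Assumed invocation: binary name and model path are placeholders.
command = [
    "llama-cli",
    "-m", "model.gguf",
    "-c", "8192",
    *override_kv_arg("llama.rope.scale", 0.25),  # key as given in this report
]

# Print the command for review instead of executing it.
print(shlex.join(command))
```

Printing the command first makes it easy to check the key and value against the file's metadata before running anything.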
Journey Context:
Community GGUF conversions often strip or misreport RoPE scaling parameters (NTK-aware, YaRN, CodeLlama's 1M base). When the metadata claims `rope.scale=1.0` but the model was fine-tuned with 4x scaling, positions past 2048 reach the model unscaled, outside the range it was trained on, and output collapses into repetition or gibberish. Reconverting the model is slow. The `--override-kv` flag (e.g., `--override-kv llama.rope.scale=float:0.25` for 4x) patches the metadata at load time without touching the file. Alternatives are the `--rope-scale` flag (reported here as deprecated) and editing the GGUF with `gguf-py` scripts, which takes longer. The same fix applies to running CodeLlama-34B at 16k context locally when the conversion dropped its 1M base.
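The arithmetic behind the 4x case can be sketched as follows, assuming plain linear scaling (position interpolation), where each position index is multiplied by the frequency scale before the rotary angle is computed. NTK-aware and YaRN scaling use different formulas and are not covered here. The constants come from the scenario in this report.

```python
def scaled_position(pos: int, freq_scale: float) -> float:
    """Effective position under linear RoPE scaling: pos * freq_scale."""
    return pos * freq_scale


TRAINED_CTX = 2048  # base context length assumed from this report
TARGET_CTX = 8192   # context requested with -c 8192

# A 4x extension corresponds to a frequency scale of 1/4.
freq_scale = TRAINED_CTX / TARGET_CTX

last_pos = TARGET_CTX - 1

# With scale 1.0 the last position lies far outside the trained range.
unscaled = scaled_position(last_pos, 1.0)

# With scale 0.25 it is squeezed back inside the trained range.
scaled = scaled_position(last_pos, freq_scale)

print(f"freq_scale={freq_scale}, unscaled={unscaled}, scaled={scaled}")
```

With the correct scale, every position up to 8191 maps below 2048, which is why fixing the metadata removes the collapse past that point.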
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T10:03:44.357405+00:00— report_created — created