Report #15960
[tooling] Model producing gibberish or degraded performance when extending context length beyond training limits (e.g., 4096 -> 8192)
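Apply RoPE scaling whenever the context size is raised past the training length. Either raise the Rotary Position Embedding base frequency with --rope-freq-base (26000 is the value reported here for 2x scaling on Llama 2), or apply linear scaling with --rope-scale 2.0. Both keep the position encodings at longer contexts within the range the model saw during training.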
Journey Context:
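RoPE (Rotary Position Embeddings) encodes position by rotating query and key vectors at a set of frequencies derived from a base (default 10000). A model trained on a 4k context fails at 8k because the rotation angles for distant tokens fall outside the range seen during training. Linear scaling (--rope-scale) addresses this by dividing the position indices by the scale factor, which compresses 8k positions into the trained 4k range and leaves the frequencies untouched. NTK-aware scaling instead raises the base frequency (--rope-freq-base), which stretches the low-frequency wavelengths while leaving the highest frequencies almost unchanged; a dynamic variant adapts the scale to the current sequence length. The usual NTK-aware formula is base * scale^(dim/(dim-2)), where dim is the per-head dimension (128 for Llama 2). For a scale of 2 it gives roughly 20200, so the 26000 used in this report to double Llama 2 context to 8k is higher than the formula predicts and appears to be an empirically chosen value rather than a derived one.
Common errors: setting only --ctx-size without adjusting RoPE, which degrades attention scores; or using linear --rope-scale without realizing that it compresses position indices rather than stretching frequencies.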
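The difference between the two approaches can be checked numerically. The following is a minimal sketch, not llama.cpp's implementation: it assumes a per-head dimension of 128 for Llama 2 and the commonly cited NTK-aware formula base * scale^(dim/(dim-2)). The function names `inv_freqs` and `ntk_aware_base` are made up for illustration. It compares the per-token rotation step of each frequency pair under linear scaling and under a raised base.

```python
def inv_freqs(base: float, head_dim: int) -> list[float]:
    """Per-token rotation step (radians) for each RoPE frequency pair."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def ntk_aware_base(base: float, scale: float, head_dim: int) -> float:
    """NTK-aware base: base * scale ** (head_dim / (head_dim - 2))."""
    return base * scale ** (head_dim / (head_dim - 2))


HEAD_DIM = 128   # assumption: Llama 2 per-head dimension
BASE = 10000.0   # default RoPE base
SCALE = 2.0      # 4096 -> 8192

original = inv_freqs(BASE, HEAD_DIM)

# Linear scaling: positions are divided by SCALE, which is equivalent to
# dividing every rotation step by SCALE.
linear = [f / SCALE for f in original]

# NTK-aware scaling: positions are unchanged, only the base is raised.
ntk_base = ntk_aware_base(BASE, SCALE, HEAD_DIM)
ntk = inv_freqs(ntk_base, HEAD_DIM)

print(f"NTK-aware base for {SCALE:g}x: {ntk_base:.0f}")
for name, freqs in (("linear", linear), ("ntk-aware", ntk)):
    fastest = freqs[0] / original[0]     # highest-frequency pair
    slowest = freqs[-1] / original[-1]   # lowest-frequency pair
    print(f"{name:>9}: fastest pair x{fastest:.3f}, slowest pair x{slowest:.3f}")
```

Under these assumptions, linear scaling slows every pair by the same factor of 0.5, while the raised base leaves the fastest pair at 1.0 and slows only the slowest pair to 0.5. The computed base comes out near 20200, not 26000, so the figure in this report should be validated on the actual model (for example by checking perplexity at the longer context) rather than taken from the formula.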
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T01:25:32.353420+00:00— report_created — created