Report #10850
[tooling] Need to extend the context window beyond the model's native limit (e.g., 4096 -> 32k) without reconverting the model
Override the RoPE scaling parameters at runtime in llama.cpp. For linear scaling, use --rope-scale N or its reciprocal form --rope-freq-scale 1/N; for NTK-aware scaling, raise --rope-freq-base; for YaRN, select --rope-scaling yarn. For example, to extend 4k to 16k with linear scaling, pass --rope-scale 4.0 or, equivalently, --rope-freq-scale 0.25, together with --ctx-size 16384. This changes how positional encodings are computed without changing the model weights.
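As a rough sketch of the arithmetic behind the linear-scaling values, the hypothetical helper below (not part of llama.cpp) derives them from the native and target context lengths. It assumes --rope-freq-scale is simply the reciprocal of --rope-scale, as the 4.0 / 0.25 example suggests, and that only one of the two needs to be passed.

```python
def linear_rope_flags(native_ctx: int, target_ctx: int) -> dict:
    """Derive linear RoPE scaling values for a target context length.

    Hypothetical helper for illustration only; it just does the arithmetic.
    """
    if native_ctx <= 0 or target_ctx < native_ctx:
        raise ValueError("target_ctx must be >= native_ctx > 0")
    factor = target_ctx / native_ctx
    return {
        "--ctx-size": target_ctx,            # tokens of context to allocate
        "--rope-scale": factor,              # expands context by this factor
        "--rope-freq-scale": 1.0 / factor,   # reciprocal form of the same setting
    }


flags = linear_rope_flags(4096, 16384)
print(flags)
```

The title's 4096 -> 32k case works the same way: the factor becomes 8.0 and the reciprocal 0.125.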
Journey Context:
Users often believe that extending context requires retraining the model or reconverting the GGUF with a larger context length. In fact, RoPE (Rotary Position Embedding) scaling can be adjusted at inference time, either by linear scaling of positions or by raising the base frequency (NTK-aware). Both approaches compress long sequences into the range of rotation angles the model saw during training. Tradeoff: linear scaling can degrade output quality at very long contexts; NTK-aware scaling and YaRN usually hold up better but require knowing the model's original base frequency (and, for YaRN, its original context length). Common mistake: confusing --ctx-size (how many tokens of context are allocated) with --rope-scale (how positions are encoded). Extending context needs both set consistently.
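To make the position-encoding effect concrete, here is a minimal sketch of the standard RoPE angle formula, not llama.cpp's actual implementation. The function name, the head dimension of 128, and the base of 10000.0 are assumptions chosen for illustration; a real model's values come from its metadata.

```python
import math


def rope_angles(position, head_dim=128, freq_base=10000.0, freq_scale=1.0):
    """Rotation angle for each pair of dimensions at a given token position."""
    return [
        position * freq_scale * freq_base ** (-2.0 * i / head_dim)
        for i in range(head_dim // 2)
    ]


# Linear scaling: position 16000 at freq_scale 0.25 is rotated like position 4000.
scaled = rope_angles(16000, freq_scale=0.25)
native = rope_angles(4000)
same = all(math.isclose(a, b) for a, b in zip(scaled, native))

# Raising the base leaves the fastest pair (i = 0) untouched and slows the rest.
raised = rope_angles(4000, freq_base=40000.0)
```

Under these assumptions, linear scaling squeezes every position uniformly, while a larger base mostly stretches the slow, long-range dimensions, which is one intuition for why the two approaches degrade differently.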
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T11:48:36.860307+00:00— report_created — created