Report #64435
[tooling] Processing long documents with llama.cpp server causes 'out of context' errors despite sufficient --ctx-size
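Give each slot enough context: run the server with `-np 1` (single slot) so the whole `--ctx-size` goes to one sequence, or set `--ctx-size 0` to take the context length from the model's metadata. If generation can still run past that limit, enable context shifting (`--context-shift` in recent builds; older builds shift by default unless `--no-context-shift` is passed) so the server shifts the KV cache on overflow instead of failing. Check `llama-server --help` for the flags your build supports.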
Journey Context:
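The llama.cpp server splits the context given by `--ctx-size N` evenly across its parallel slots, so each slot gets N / parallel tokens. Common confusion: `--ctx-size 131072` with `-np 4` allocates 32,768 tokens (32k) per slot, not 128k each. `--ctx-size 0` does not select an on-demand allocation mode; it takes the context length from the model's metadata, and that value is still shared among the slots. What happens on overflow is governed separately by context shifting: when it is enabled, the server discards the oldest tokens from the slot's KV cache and keeps generating over a sliding window instead of returning an error. Whether shifting is on by default depends on the build: recent versions need `--context-shift`, while older ones shift unless `--no-context-shift` is passed. For long-running processing (RAG on documents), use `-np 1` (serial processing) so the single slot receives the full context, and enable context shifting for generation that may exceed it. This avoids context limit errors, and because the KV cache is sized from `--ctx-size`, memory use stays bounded by that setting. Tradeoff: you lose the earliest tokens in the conversation (rolloff), but for document summarization this may be acceptable.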
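The slot arithmetic and the rolloff tradeoff can be sketched in a few lines of Python. This is an illustrative model only, not llama.cpp's implementation: the function names `per_slot_context` and `shift_context` are hypothetical, the `n_keep` parameter is an assumption about how many leading tokens to preserve, and the real server works on the KV cache and may discard tokens in larger chunks rather than one at a time.

```python
def per_slot_context(ctx_size: int, n_parallel: int) -> int:
    """Tokens available to each slot when the total context is split evenly."""
    if ctx_size <= 0 or n_parallel <= 0:
        raise ValueError("ctx_size and n_parallel must be positive")
    return ctx_size // n_parallel


def shift_context(tokens: list[int], window: int, n_keep: int = 0) -> list[int]:
    """Simplified sliding window: keep the first n_keep tokens, drop the
    oldest tokens after them until the sequence fits in `window`."""
    if n_keep < 0 or window <= n_keep:
        raise ValueError("window must be larger than n_keep, and n_keep >= 0")
    overflow = len(tokens) - window
    if overflow <= 0:
        return list(tokens)  # nothing to drop
    return tokens[:n_keep] + tokens[n_keep + overflow:]


if __name__ == "__main__":
    # --ctx-size 131072 with -np 4 versus -np 1
    print(per_slot_context(131072, 4))  # 32768
    print(per_slot_context(131072, 1))  # 131072
```

With `-np 4` each slot sees only a quarter of the configured context, which is why a long document can fail even though `--ctx-size` looks large enough. The `shift_context` sketch shows what is lost on rolloff: everything between the kept prefix and the most recent tokens.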
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T14:38:40.075453+00:00— report_created — created