Report #87159
[tooling] Fine-tuned models report incorrect context lengths (e.g., 128k) in GGUF metadata, causing KV cache allocation failures or OOM
Override the incorrect metadata at runtime with `--override-kv llama.context_length=int:8192` (the flag takes `KEY=TYPE:VALUE`) so that llama.cpp sizes the KV cache correctly without re-converting the GGUF file.
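
The flag expects a typed value in the form `KEY=TYPE:VALUE`, with types `int`, `float`, `bool`, and `str`. A minimal sketch of assembling the command line from Python follows; the helper names `override_kv_arg` and `build_command`, the binary name `llama-cli`, and the path `model.gguf` are placeholders chosen for illustration, not part of the original report.

```python
import shlex


def override_kv_arg(key, value):
    """Format one --override-kv argument as KEY=TYPE:VALUE."""
    # bool is checked before int because bool is a subclass of int in Python.
    if isinstance(value, bool):
        typed = "bool:" + ("true" if value else "false")
    elif isinstance(value, int):
        typed = f"int:{value}"
    elif isinstance(value, float):
        typed = f"float:{value}"
    else:
        typed = f"str:{value}"
    return f"{key}={typed}"


def build_command(binary, model_path, overrides):
    """Return an argv list with one --override-kv pair per metadata key."""
    argv = [binary, "-m", model_path]
    for key, value in overrides.items():
        argv += ["--override-kv", override_kv_arg(key, value)]
    return argv


# Placeholder binary and model path; adjust to the local setup.
cmd = build_command("llama-cli", "model.gguf", {"llama.context_length": 8192})
print(shlex.join(cmd))
```

Returning an argv list rather than a single string keeps the command safe to pass to `subprocess.run` without shell quoting issues.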
Journey Context:
When a model is fine-tuned (e.g., a 4k base model extended to 128k via RoPE scaling), the GGUF conversion often retains the base model's context length in metadata, or the fine-tuner never updates the `llama.context_length` key. When llama.cpp takes its context size from the model rather than from an explicit `--ctx-size` value, it sizes the KV cache from this metadata value. If the metadata claims 128k but RoPE scaling was actually configured for 8k, KV allocation fails with an immediate OOM (metadata over-reports); in the opposite case the context is silently truncated (metadata under-reports). Re-converting the GGUF with corrected metadata requires the original PyTorch weights and significant compute. The `--override-kv` flag instead corrects individual key-value pairs in the model's metadata at load time, using the form `KEY=TYPE:VALUE` (for example `llama.context_length=int:8192`). This fixes the allocation size without modifying the file.
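
To see why an over-reported context length leads to OOM, a rough estimate of KV cache size helps. The sketch below uses the common approximation of one K and one V tensor per layer, each holding `n_ctx * n_head_kv * head_dim` elements. The model shape (32 layers, 8 KV heads, head dimension 128, f16 cache) is a hypothetical example, not taken from the report, and real allocations may differ with cache quantization or backend overhead.

```python
def kv_cache_bytes(n_ctx, n_layer, n_head_kv, head_dim, bytes_per_elem=2):
    """Approximate KV cache size in bytes for a dense transformer."""
    # Factor 2 covers the separate K and V tensors stored for every layer.
    return 2 * n_layer * n_ctx * n_head_kv * head_dim * bytes_per_elem


# Hypothetical model shape, assumed for illustration only.
N_LAYER, N_HEAD_KV, HEAD_DIM = 32, 8, 128

for n_ctx in (8192, 131072):
    gib = kv_cache_bytes(n_ctx, N_LAYER, N_HEAD_KV, HEAD_DIM) / 2**30
    print(f"n_ctx={n_ctx}: {gib:.1f} GiB")
```

Under these assumptions the cache grows from 1 GiB at 8k to 16 GiB at 128k, which is the gap that turns a metadata error into an allocation failure.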
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T04:53:18.144248+00:00 - report_created - created