Report #42311
[tooling] Quantized GGUF model has a native context limit (e.g., 4096) but 32k is needed; re-quantizing from FP16 is too expensive
Use the `gguf` Python library (in the llama.cpp repo) to edit metadata keys such as `llama.context_length` and to add RoPE scaling parameters (`llama.rope.scaling.type`, `llama.rope.scaling.factor`) directly in the GGUF file, without re-quantization.
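As a rough sketch of what such a patch would contain, the snippet below computes the values for a 4096 to 32768 extension (taking "32k" to mean 32768). The function name `plan_context_patch` is hypothetical, and the dotted key names, including `llama.rope.scaling.original_context_length`, are my reading of the GGUF key constants in llama.cpp, so check them against your checkout. It only builds the key/value plan and does not touch any file.

```python
def plan_context_patch(arch, native_ctx, target_ctx, scaling_type="yarn"):
    """Return the metadata key/value pairs a context-extension patch would set.

    Hypothetical helper: key names are assumed to follow the
    "<arch>.rope.scaling.*" convention used by llama.cpp's GGUF files.
    """
    if target_ctx <= native_ctx:
        raise ValueError("target context must exceed the native context")

    # The scaling factor is the ratio of the new window to the trained one.
    factor = target_ctx / native_ctx

    return {
        f"{arch}.context_length": target_ctx,
        f"{arch}.rope.scaling.type": scaling_type,
        f"{arch}.rope.scaling.factor": factor,
        f"{arch}.rope.scaling.original_context_length": native_ctx,
    }


patch = plan_context_patch("llama", native_ctx=4096, target_ctx=32768)
for key, value in patch.items():
    print(f"{key} = {value!r}")
```

Printing the plan before writing anything makes it easy to compare against the metadata the file already has.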
Journey Context:
GGUF files store metadata as key-value pairs in the file header. The context length and RoPE scaling settings are header fields, not something baked into the tensor data. If you have a 70B model quantized to Q4_K_M, you can patch the metadata to declare a 32k context and add YaRN or NTK scaling parameters, which avoids the hours-long process of re-quantizing from FP16. Overwriting an existing fixed-size value such as the context length can be done in place, but adding new keys grows the header, so the file has to be rewritten with the tensor data copied over unchanged; this is still a copy, not a re-quantization. The tradeoff is that the model was trained at the original context length, so RoPE scaling (YaRN is preferred for >2x extension) is needed to keep perplexity from degrading at longer contexts. The `gguf` library can read and write these headers. This is distinct from fine-tuning for length: it only enables the context window the architecture supports but the metadata restricted.
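To illustrate that the context length sits in the header rather than in the tensors, here is a stdlib-only sketch that parses a GGUF header and its key-value section. It is not the `gguf` library's API: the function names are hypothetical, and it assumes a little-endian file of GGUF version 2 or later (64-bit counts). The demo input is a hand-built in-memory header with a single key, not a real model.

```python
import io
import struct

GGUF_MAGIC = b"GGUF"

# GGUF value type ids mapped to struct formats for fixed-size scalars.
_SCALARS = {
    0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
    6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d",
}
_STRING, _ARRAY = 8, 9


def _read(stream, fmt):
    """Read one fixed-size value described by a struct format."""
    return struct.unpack(fmt, stream.read(struct.calcsize(fmt)))[0]


def _read_string(stream):
    """Read a GGUF string: uint64 byte length followed by UTF-8 bytes."""
    length = _read(stream, "<Q")
    return stream.read(length).decode("utf-8")


def _read_value(stream, vtype):
    """Read a metadata value of the given GGUF type id."""
    if vtype in _SCALARS:
        return _read(stream, _SCALARS[vtype])
    if vtype == _STRING:
        return _read_string(stream)
    if vtype == _ARRAY:
        item_type = _read(stream, "<I")
        count = _read(stream, "<Q")
        return [_read_value(stream, item_type) for _ in range(count)]
    raise ValueError(f"unknown GGUF value type {vtype}")


def read_gguf_metadata(stream):
    """Return (version, tensor_count, metadata dict) from a GGUF stream."""
    if stream.read(4) != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    version = _read(stream, "<I")
    if version < 2:
        raise ValueError("this sketch assumes GGUF v2+ (64-bit counts)")
    tensor_count = _read(stream, "<Q")
    kv_count = _read(stream, "<Q")

    metadata = {}
    for _ in range(kv_count):
        key = _read_string(stream)
        vtype = _read(stream, "<I")
        metadata[key] = _read_value(stream, vtype)
    # Tensor descriptors and tensor data follow; they are not read here.
    return version, tensor_count, metadata


def _kv_u32(key, value):
    """Encode one uint32 key-value pair (type id 4) for the demo header."""
    raw = key.encode("utf-8")
    return struct.pack("<Q", len(raw)) + raw + struct.pack("<II", 4, value)


# Hand-built header: version 3, zero tensors, one metadata entry.
demo = GGUF_MAGIC + struct.pack("<IQQ", 3, 0, 1) + _kv_u32("llama.context_length", 4096)
version, tensor_count, meta = read_gguf_metadata(io.BytesIO(demo))
print(version, tensor_count, meta)
```

On a real file you would pass `open(path, "rb")` instead of the `BytesIO` demo. The reader stops before the tensor section, which is the part a metadata patch leaves unchanged.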
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T01:29:27.740399+00:00— report_created — created