Report #9906
[tooling] 70B model quantized for 8k context runs out of memory when extended to 32k via RoPE scaling, requiring full requantization
Use gguf-set-metadata from gguf-py to edit the GGUF metadata keys 'llama.context_length' and 'llama.rope.freq_base' in place, without requantizing the 40 GB file
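What such an in-place edit amounts to can be sketched with the Python standard library alone. This is an illustration, not the gguf-py implementation: the helper name `set_scalar` is hypothetical, and the sketch assumes a little-endian GGUF file of version 2 or later and a fixed-width scalar value. Because the new value occupies exactly the same bytes as the old one, nothing after it in the file moves.

```python
import struct

# GGUF value type ids mapped to struct formats for fixed-width scalars.
# Little-endian layout is assumed in this sketch.
SCALARS = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
           6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
T_STRING, T_ARRAY = 8, 9


def _read(f, fmt):
    """Read one value of the given struct format from the current position."""
    (value,) = struct.unpack(fmt, f.read(struct.calcsize(fmt)))
    return value


def _skip_value(f, vtype):
    """Advance the file position past one metadata value."""
    if vtype in SCALARS:
        f.seek(struct.calcsize(SCALARS[vtype]), 1)
    elif vtype == T_STRING:
        f.seek(_read(f, "<Q"), 1)          # uint64 length, then raw bytes
    elif vtype == T_ARRAY:
        item_type = _read(f, "<I")
        count = _read(f, "<Q")
        for _ in range(count):
            _skip_value(f, item_type)
    else:
        raise ValueError(f"unknown GGUF value type {vtype}")


def set_scalar(path, key, new_value):
    """Overwrite a fixed-width scalar metadata value in place; return the old one.

    Hypothetical helper for illustration only.
    """
    with open(path, "r+b") as f:
        if f.read(4) != b"GGUF":
            raise ValueError("not a GGUF file")
        if _read(f, "<I") < 2:
            raise ValueError("GGUF v1 uses a different layout")
        _tensor_count = _read(f, "<Q")
        kv_count = _read(f, "<Q")
        for _ in range(kv_count):
            name = f.read(_read(f, "<Q")).decode("utf-8")
            vtype = _read(f, "<I")
            if name != key:
                _skip_value(f, vtype)
                continue
            if vtype not in SCALARS:
                raise TypeError(f"{key!r} is not a fixed-width scalar")
            fmt = SCALARS[vtype]
            offset = f.tell()
            old = _read(f, fmt)
            f.seek(offset)                  # rewind to the start of the value
            f.write(struct.pack(fmt, new_value))  # same width, so no offsets shift
            return old
    raise KeyError(key)
```

With this sketch, `set_scalar("model.gguf", "llama.context_length", 32768)` would rewrite four bytes in the header and leave the rest of the file alone (the file name is a placeholder). For real models, prefer the gguf-py tool itself and keep a backup of the file.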
Journey Context:
Quantizing a 70B model takes hours, and users often realize they need a longer context (e.g., coding agents that need 32k) only after quantization is done. The GGUF format stores hyperparameters as key-value metadata in the file header, which can be edited in place. gguf-set-metadata allows surgical edits to simple scalar values such as the RoPE frequency base (e.g., changing it from 10000 to 50000 for 32k context) without touching tensor data. Tradeoff: the model must actually support the longer context via RoPE scaling; changing the number alone does not add capacity the model was not trained for. The alternative is reconverting and requantizing from the original weights with the desired context settings, which is correct but slow.
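The report does not say how the new frequency base was chosen. As a rough sanity check, one commonly cited heuristic is NTK-aware scaling, which multiplies the base by the context scale factor raised to `d / (d - 2)`, where `d` is the per-head dimension. The sketch below assumes that heuristic and a head dimension of 128; both are assumptions, not details from the report, and the function name is hypothetical.

```python
def ntk_scaled_base(base: float, scale: float, head_dim: int) -> float:
    """RoPE frequency base after NTK-aware scaling (assumed heuristic)."""
    return base * scale ** (head_dim / (head_dim - 2))


# 8k -> 32k is a 4x context extension, as in the report.
scale = 32768 / 8192
# head_dim=128 is an assumption; check the model's own metadata.
new_base = ntk_scaled_base(10000.0, scale, head_dim=128)
print(f"suggested rope freq base: {new_base:.0f}")
```

Under these assumptions the heuristic suggests a base of roughly 41,000, the same order of magnitude as the 50000 used in the example above. Whichever value is written into the metadata, output quality at long context still needs to be checked empirically.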
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T09:20:37.734647+00:00— report_created — created