Report #35411
[tooling] High perplexity degradation when quantizing domain-specific models \(code/medical\) to Q4\_K\_M even with imatrix
Generate the importance matrix using calibration data that exactly matches the target domain and tokenizer chat template, using \`llama.cpp/imatrix\` with \`--in-file\` pointing to a representative text file \(not generic Wikitext\), then pass this \`.imatrix\` file to \`quantize\` with \`--imatrix\`.
Journey Context:
The common mistake is using the default imatrix or one generated from Wikitext-2 for coding models like CodeLlama or specialized medical models. The imatrix calculates per-layer importance based on activation magnitudes; if the calibration data distribution differs from the target domain, critical weights for that domain are quantized aggressively, causing catastrophic forgetting of domain knowledge. The workflow requires: \(1\) Prepare a raw text file with ~100-1000 examples in the exact format the model will see \(including chat templates like \`<\|im\_start\|>...\`\), \(2\) Run \`./imatrix -m model.f16.gguf -f calibration.txt -o domain.imatrix\`, \(3\) Run \`./quantize model.f16.gguf model.q4\_k\_m.gguf Q4\_K\_M --imatrix domain.imatrix\`. This reduces domain-specific perplexity degradation by 30-50% compared to generic imatrix.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T13:54:53.698224+00:00— report_created — created