Report #47722
[tooling] GGUF-quantized model has high perplexity or degraded performance on a specific domain (code/medical)
Generate an importance matrix (imatrix) with `llama-imatrix`, using calibration data from your target domain, then pass it to `llama-quantize` via `--imatrix` so the quantizer preserves precision in the weights that matter most for that domain.
Journey Context:
Without an importance matrix, GGUF quantization has no information about which weights matter most to the model's outputs, even though tensors differ in their sensitivity to quantization error. The imatrix (importance matrix) records per-tensor activation statistics gathered by running the model over calibration data. With an imatrix built from domain-representative text (e.g., Python code for coding models, clinical notes for medical models), the quantizer can keep rounding error low on the weights that matter most at the same quantization type, often bringing Q4_K_M quality close to Q5_K_M on perplexity benchmarks. Most tutorials skip this two-step workflow, which leads to suboptimal quants.
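The two-step workflow described above might look like the sketch below. It assumes the llama.cpp binaries `llama-imatrix`, `llama-quantize`, and `llama-perplexity` are on your PATH. Every file name is a placeholder, and `LLAMA_RUN` is a hypothetical variable introduced here so the script only prints the commands by default.

```shell
#!/usr/bin/env bash
# Sketch of the imatrix workflow. File names are placeholders, not real artifacts.
set -euo pipefail

MODEL_F16="model-f16.gguf"          # assumed: unquantized (F16) source model
CALIB="calibration-code.txt"        # assumed: text from the target domain
HELDOUT="heldout-code.txt"          # assumed: domain text NOT used for calibration
IMATRIX="imatrix-code.dat"
OUT="model-Q4_K_M-imat.gguf"

# Dry run by default: commands are printed, not executed.
# Run with LLAMA_RUN= (empty) to execute them for real.
LLAMA_RUN="${LLAMA_RUN-echo}"

# Step 1: collect activation statistics on the calibration text.
step_imatrix() {
  $LLAMA_RUN llama-imatrix -m "$MODEL_F16" -f "$CALIB" -o "$IMATRIX"
}

# Step 2: quantize, passing the importance matrix via --imatrix.
step_quantize() {
  $LLAMA_RUN llama-quantize --imatrix "$IMATRIX" "$MODEL_F16" "$OUT" Q4_K_M
}

# Optional check: measure perplexity of the result on held-out domain text.
step_perplexity() {
  $LLAMA_RUN llama-perplexity -m "$OUT" -f "$HELDOUT"
}

step_imatrix
step_quantize
step_perplexity
```

To judge whether the imatrix helped, quantize the same source model once without `--imatrix` and compare the perplexity of both outputs on the same held-out file.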
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T10:34:51.681056+00:00— report_created — created