Report #12767
[tooling] GGUF quantization degrades model quality on specific domains (code, math) more than general perplexity benchmarks suggest
Generate an importance matrix (imatrix) by running `llama-imatrix` on a representative sample of your actual input data (e.g., 100 MB of your specific code corpus), then pass it to `llama-quantize` with the `--imatrix` flag when creating Q4_K_M or Q5_K_M GGUFs.
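The two steps above can be sketched as a small Python helper that only builds the command lines; it does not run anything. The file names (`model-f16.gguf`, `code-sample.txt`, `imatrix.dat`) and the helper name `imatrix_commands` are placeholders invented for illustration. The `-m`, `-f`, and `-o` options are assumed to match your llama.cpp build, so check `llama-imatrix --help` before relying on them.

```python
import shlex


def imatrix_commands(f16_model, calibration_txt, quant_type="Q4_K_M",
                     imatrix_path="imatrix.dat"):
    """Return two argv lists: imatrix generation, then quantization.

    All paths are hypothetical placeholders; nothing is executed here.
    """
    # Derive an output name such as model-f16-Q4_K_M.gguf from the input name.
    out_model = f"{f16_model.rsplit('.gguf', 1)[0]}-{quant_type}.gguf"

    # Step 1: collect activation statistics on the calibration text.
    gen = ["llama-imatrix", "-m", f16_model, "-f", calibration_txt,
           "-o", imatrix_path]

    # Step 2: quantize, letting the imatrix guide the rounding.
    quant = ["llama-quantize", "--imatrix", imatrix_path,
             f16_model, out_model, quant_type]
    return gen, quant


if __name__ == "__main__":
    for cmd in imatrix_commands("model-f16.gguf", "code-sample.txt"):
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to review them, in line with the warning below, before passing each list to `subprocess.run(cmd, check=True)`.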
Journey Context:
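Default quantization treats all weights as equally important, but code has different statistical patterns than Wikipedia-style text, and general perplexity benchmarks mostly measure the latter. A standard Q4_K_M can therefore noticeably degrade coding ability while chat quality looks intact. The imatrix, built from activation statistics on your calibration data, tells the quantizer which weights are sensitive so it can preserve them more accurately. The alternative is the low-bit IQ quants (e.g., IQ2_XXS), but those require an imatrix to be usable at all. On domain tasks, this is the difference between 'works' and 'works well'.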
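To illustrate why importance weighting changes the result, here is a toy sketch in plain Python. It is not llama.cpp's actual algorithm: the block values, the `importance` numbers, the 4-bit-style grid, and the brute-force scale search are all assumptions made up for this example. It only shows that a scale chosen to minimize importance-weighted error cannot do worse on the sensitive weights than a scale chosen with uniform weighting.

```python
def quantize(block, scale, levels=7):
    """Round each weight to the nearest multiple of scale, clamped to +/- levels."""
    return [scale * max(-levels, min(levels, round(x / scale))) for x in block]


def weighted_error(block, scale, importance):
    """Sum of squared rounding errors, each multiplied by its importance."""
    quantized = quantize(block, scale)
    return sum(w * (x - q) ** 2 for x, q, w in zip(block, quantized, importance))


def best_scale(block, importance, steps=200):
    """Brute-force search over candidate scales around max|x| / 7."""
    top = max(abs(x) for x in block)
    candidates = [top / 7 * (0.5 + i / steps) for i in range(steps + 1)]
    return min(candidates, key=lambda s: weighted_error(block, s, importance))


# Hypothetical block of weights and made-up importance values.
block = [0.91, -0.12, 0.40, 0.05, -0.77, 0.33, 0.02, -0.58]
uniform = [1.0] * len(block)
importance = [0.1, 9.0, 0.1, 6.0, 0.1, 0.1, 8.0, 0.1]

s_uniform = best_scale(block, uniform)       # default: every weight counts equally
s_weighted = best_scale(block, importance)   # imatrix-style: sensitive weights count more

print("uniform scale, weighted error: ", weighted_error(block, s_uniform, importance))
print("weighted scale, weighted error:", weighted_error(block, s_weighted, importance))
```

In this toy setup the importance values play the role the imatrix plays in practice: they come from the data you care about, so a calibration set drawn from your own code corpus shifts precision toward the weights your workload exercises.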
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T16:52:04.745563+00:00— report_created — created