Report #60662
[tooling] Q4\_K\_M quantized models degrade on my specific domain \(code/legal\) despite working well for general chat
Generate an importance matrix \(imatrix\) by running llama.cpp's imatrix tool on a calibration sample of your target corpus, then quantize with llama-quantize --imatrix imatrix.dat so that the Q4\_K\_M quantization preserves the weights that matter most for your domain.
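The two steps above can be sketched as a small Python wrapper. It is a minimal sketch, not a verified recipe: it assumes the binaries are named `llama-imatrix` and `llama-quantize` and are on your PATH (older llama.cpp builds used different names), and every file name below is a hypothetical placeholder. It runs in dry-run mode by default and only prints the commands.

```python
import shlex
import subprocess


def build_imatrix_commands(f16_model, calibration_text, imatrix_out,
                           quant_out, quant_type="Q4_K_M"):
    """Return two argv lists: imatrix generation, then quantization."""
    # Step 1: run the calibration text through the full-precision model.
    imatrix_cmd = ["llama-imatrix", "-m", f16_model,
                   "-f", calibration_text, "-o", imatrix_out]
    # Step 2: quantize, passing the imatrix produced in step 1.
    quantize_cmd = ["llama-quantize", "--imatrix", imatrix_out,
                    f16_model, quant_out, quant_type]
    return imatrix_cmd, quantize_cmd


def run_pipeline(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


# Hypothetical file names, chosen only for illustration.
imatrix_cmd, quantize_cmd = build_imatrix_commands(
    f16_model="model-f16.gguf",
    calibration_text="domain-calibration.txt",
    imatrix_out="imatrix.dat",
    quant_out="model-Q4_K_M-imat.gguf",
)
run_pipeline([imatrix_cmd, quantize_cmd], dry_run=True)
```

Check the printed commands against the `--help` output of your own build before switching `dry_run` to `False`.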
Journey Context:
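Standard K-quants pick tensor types from fixed heuristics and treat all weights within a tensor as equally important when rounding, which can hurt the weights that matter most for domain-specific tasks \(e.g., code syntax or legal citations\). An imatrix is computed by running calibration text through the FP16 model and recording activation statistics for each weight tensor; these indicate which weights contribute most to the model's output on that data. The quantizer then uses these statistics to weight the rounding error, so the most important weights are reproduced more accurately. It does not switch individual rows to Q5/Q6: the Q4\_K\_M tensor mix and the file size stay essentially the same. This is especially useful for 7B/13B models where you need to stay at Q4\_K\_M size but want less quality loss on your data. Compare perplexity or task accuracy on held-out domain text, with and without the imatrix, to confirm the gain. The cost is a one-time inference run over the calibration data to generate the matrix.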
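To make the idea of importance-guided quantization concrete, here is a toy sketch in pure Python. It is not llama.cpp's implementation: it assumes a single block of eight weights, a symmetric 4-bit grid, and a brute-force search over one scale factor, and the weight and importance values are made up. It only illustrates how weighting the rounding error by importance can change which scale is chosen.

```python
def quantize_block(weights, scale, qmin=-8, qmax=7):
    """Round each weight to the nearest 4-bit level, then dequantize."""
    return [scale * max(qmin, min(qmax, round(w / scale))) for w in weights]


def weighted_error(weights, dequantized, importance):
    """Importance-weighted sum of squared rounding errors."""
    return sum(i * (w - d) ** 2
               for w, d, i in zip(weights, dequantized, importance))


def best_scale(weights, importance, steps=200):
    """Grid-search the scale that minimizes the weighted error."""
    max_abs = max(abs(w) for w in weights)
    # Candidates span 0.5x to 1.5x of the naive scale max_abs / 8.
    candidates = [max_abs / 8 * (0.5 + k / steps) for k in range(steps + 1)]
    return min(candidates,
               key=lambda s: weighted_error(
                   weights, quantize_block(weights, s), importance))


# Made-up block: one large outlier weight that the calibration data rarely
# exercises, plus small weights that it exercises heavily.
weights = [0.02, -0.05, 0.11, -0.90, 0.07, 0.31, -0.13, 0.04]
importance = [9.0, 7.5, 8.0, 0.1, 6.0, 0.2, 5.0, 9.5]
uniform = [1.0] * len(weights)

scale_plain = best_scale(weights, uniform)      # ignores importance
scale_imat = best_scale(weights, importance)    # uses importance

# Score both choices with the importance-weighted error.
err_plain = weighted_error(weights, quantize_block(weights, scale_plain), importance)
err_imat = weighted_error(weights, quantize_block(weights, scale_imat), importance)
print(f"scale without importance: {scale_plain:.4f}, weighted error: {err_plain:.5f}")
print(f"scale with importance:    {scale_imat:.4f}, weighted error: {err_imat:.5f}")
```

Because both searches use the same candidate scales, the importance-aware choice can never score worse on the weighted error than the uniform one; how much better it scores depends entirely on the data.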
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T08:18:36.535630+00:00— report_created — created