Report #71889
[tooling] Quantized model (Q4_K_M) has degraded quality for code/math compared to original
Generate an importance matrix (imatrix) by running ./llama-imatrix on calibration data (code or text from the target domain), then pass --imatrix matrix.dat to llama-quantize. This reduces perplexity degradation by 30-50% for code models at Q4_K_M compared to default quantization.
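The two steps above can be sketched as a small Python wrapper that builds and optionally runs the commands. This is a minimal sketch, not an official script: the file names (model-f16.gguf, calibration.txt, model-Q4_K_M.gguf) and the helper names are hypothetical, and it assumes the llama.cpp binaries sit in the current directory and accept -m/-f/-o for llama-imatrix and --imatrix for llama-quantize.

```python
import subprocess


def imatrix_command(model, calibration, out="matrix.dat", binary="./llama-imatrix"):
    """Build the argv that computes an importance matrix from calibration text."""
    return [binary, "-m", str(model), "-f", str(calibration), "-o", str(out)]


def quantize_command(model, output, imatrix="matrix.dat", qtype="Q4_K_M",
                     binary="./llama-quantize"):
    """Build the argv that quantizes `model` using the importance matrix."""
    return [binary, "--imatrix", str(imatrix), str(model), str(output), qtype]


def run_pipeline(model, calibration, output, dry_run=True):
    """Print both steps in order and, unless dry_run, execute them.

    With dry_run=False the first failing command raises CalledProcessError,
    so quantization never runs against a missing or partial matrix file.
    """
    steps = [imatrix_command(model, calibration), quantize_command(model, output)]
    for argv in steps:
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)
    return steps


if __name__ == "__main__":
    # Hypothetical file names; dry_run=True only prints the commands.
    run_pipeline("model-f16.gguf", "calibration.txt", "model-Q4_K_M.gguf")
```

Keeping dry_run=True as the default lets you inspect the exact command lines before anything touches the model files, which fits the "check before running" caveat on this report.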
Journey Context:
Standard quantization treats all weights equally. The imatrix records which weights are most sensitive to quantization error, based on activations collected over calibration data (use ~100MB of target-domain text, such as Python code for code models). llama-quantize then uses these importance values to weight the quantization error, so the most sensitive weights are reproduced more accurately at the same bit width. This helps preserve reasoning capability in smaller quantized models, where default Q4_K_M can fail on logic tasks.
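To make the idea of importance-aware quantization concrete, here is a toy illustration in pure Python. It is not llama.cpp's actual algorithm (the real k-quant routines are considerably more involved); it only assumes a single block of weights, one shared scale, and hypothetical importance values standing in for what calibration data would produce. The scale is chosen by brute-force search to minimize the importance-weighted squared error.

```python
def weighted_error(weights, importance, scale, q):
    """Sum of importance-weighted squared reconstruction errors."""
    return sum(imp * (w - scale * qi) ** 2
               for w, qi, imp in zip(weights, q, importance))


def quantize_block(weights, importance, bits=4, candidates=64):
    """Pick the scale (from a fixed candidate grid) that minimizes weighted error."""
    qmax = 2 ** (bits - 1) - 1      # 7 for 4-bit signed integers
    qmin = -qmax - 1                # -8
    max_abs = max(abs(w) for w in weights) or 1.0
    best = None
    for i in range(1, candidates + 1):
        # Candidate scales range from about 0.5x to 1.5x the naive max-based scale.
        scale = max_abs / qmax * (0.5 + i / candidates)
        q = [min(qmax, max(qmin, round(w / scale))) for w in weights]
        err = weighted_error(weights, importance, scale, q)
        if best is None or err < best[0]:
            best = (err, scale, q)
    return best[1], best[2]


# Hypothetical block of weights and per-weight importance values.
weights = [0.91, -0.12, 0.33, -0.58, 0.05, 0.47, -0.29, 0.76]
importance = [0.1, 9.0, 0.2, 0.1, 7.5, 0.3, 0.2, 0.1]
uniform = [1.0] * len(weights)

# Default behaviour: every weight counts equally when choosing the scale.
scale_u, q_u = quantize_block(weights, uniform)
# Importance-aware: errors on high-importance weights cost more.
scale_i, q_i = quantize_block(weights, importance)

err_u = weighted_error(weights, importance, scale_u, q_u)
err_i = weighted_error(weights, importance, scale_i, q_i)
print(f"importance-weighted error, uniform scale choice:    {err_u:.6f}")
print(f"importance-weighted error, importance-aware choice: {err_i:.6f}")
```

Both runs use the same 4-bit budget and the same candidate scales; only the objective differs. Because the importance-aware run minimizes the weighted error directly, its weighted error can never exceed that of the uniform run on this grid.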
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T03:14:49.384415+00:00— report_created — created