Report #21679
[tooling] GGUF Q4\_K\_M quantized model produces gibberish on a specific domain
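Generate an importance matrix \(imatrix\) by running llama-imatrix on representative target-domain text, then pass it to llama-quantize with --imatrix matrix.bin. A few MB of calibration text is typically enough; processing 100-500MB is rarely necessary and is very slow. Prefer Q4\_K\_S with an imatrix over Q4\_K\_M without one; where accuracy is critical, use Q5\_K\_S with an imatrix rather than baseline Q4\_K\_M.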
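A minimal sketch of the two-step workflow, written as a Python helper that only assembles the command lines. The file names \(model-f16.gguf, domain-calibration.txt, model-Q4\_K\_S.gguf\) and the helper name are hypothetical placeholders; the flags shown \(-m, -f, -o for llama-imatrix and --imatrix for llama-quantize\) are the commonly documented ones, but check them against the --help output of your llama.cpp build before running anything.

```python
import shlex


def build_imatrix_commands(model_f16, calib_text, out_gguf,
                           quant_type="Q4_K_S", imatrix_path="matrix.bin"):
    """Return the llama-imatrix and llama-quantize command lines as argument lists.

    All paths are placeholders chosen for illustration, not values from the report.
    """
    # Step 1: run the unquantized model over the calibration text and
    # write the collected activation statistics to imatrix_path.
    imatrix_cmd = ["llama-imatrix", "-m", model_f16,
                   "-f", calib_text, "-o", imatrix_path]
    # Step 2: quantize, letting the importance matrix weight the rounding error.
    quantize_cmd = ["llama-quantize", "--imatrix", imatrix_path,
                    model_f16, out_gguf, quant_type]
    return imatrix_cmd, quantize_cmd


if __name__ == "__main__":
    commands = build_imatrix_commands(
        "model-f16.gguf", "domain-calibration.txt", "model-Q4_K_S.gguf")
    for cmd in commands:
        # Print instead of executing so each command can be reviewed first.
        print(shlex.join(cmd))
```

Printing rather than executing keeps the sketch in line with the warning on this report: review each command, then run it by hand or pass the list to subprocess.run.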
Journey Context:
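Without an importance matrix, k-quant methods have no information about which weights matter for your inputs, even though activation magnitudes vary widely across channels and depend on the domain. The imatrix captures activation magnitudes from the calibration text, and the quantizer uses them to weight its rounding error, so precision is preserved on the weights that matter most and the error is pushed onto unimportant ones. It does not change the bit width of the chosen type, but it makes even aggressive types \(such as Q3\_K\_M\) more usable. Using generic calibration data \(WikiText\) for medical/legal models can severely degrade domain terminology, because the weights that terminology depends on are not prioritized. Q4\_K\_S \+ imatrix often beats baseline Q4\_K\_M in perplexity while using somewhat less VRAM \(Q4\_K\_S files are roughly 5% smaller\), because the imatrix compensates for the lower precision Q4\_K\_S gives some tensors.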
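A toy illustration of why importance weighting helps, under stated assumptions: this is not llama.cpp's k-quant algorithm, just one block of weights quantized to a symmetric integer grid with a single scale. The importance values are made up to stand in for imatrix statistics, and all function names are hypothetical. The scale is chosen once treating every weight equally and once using the importance values, and both results are scored with the importance-weighted error.

```python
import random


def quantize_block(weights, scale, levels=7):
    """Round each weight to the nearest step and clamp to [-levels, levels]."""
    return [max(-levels, min(levels, round(w / scale))) * scale for w in weights]


def weighted_error(weights, recon, importance):
    """Importance-weighted squared reconstruction error."""
    return sum(i * (w - r) ** 2 for w, r, i in zip(weights, recon, importance))


def best_scale(weights, importance, levels=7, candidates=200):
    """Grid-search the scale that minimizes the importance-weighted error."""
    max_scale = max(abs(w) for w in weights) / levels
    grid = [max_scale * (k / candidates) for k in range(1, candidates + 1)]
    return min(grid, key=lambda s: weighted_error(
        weights, quantize_block(weights, s, levels), importance))


def compare(seed=0, n=32):
    """Return (error without importance, error with importance) for one block."""
    rng = random.Random(seed)
    weights = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Made-up importance: a few weights see large activations, most see small ones.
    importance = [100.0 if rng.random() < 0.1 else 1.0 for _ in range(n)]
    uniform = [1.0] * n

    scale_plain = best_scale(weights, uniform)      # every weight treated equally
    scale_imx = best_scale(weights, importance)     # importance-aware choice

    # Score both choices with the error that reflects what the model actually uses.
    err_plain = weighted_error(weights, quantize_block(weights, scale_plain), importance)
    err_imx = weighted_error(weights, quantize_block(weights, scale_imx), importance)
    return err_plain, err_imx


if __name__ == "__main__":
    err_plain, err_imx = compare()
    print(f"weighted error, uniform scale choice:    {err_plain:.4f}")
    print(f"weighted error, importance-aware choice: {err_imx:.4f}")
```

Because both searches use the same candidate grid, the importance-aware choice can never score worse on the weighted error. How much better it scores depends on how uneven the importance values are, which is why calibration text from the wrong domain gives the quantizer the wrong priorities.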
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T14:47:52.861483+00:00— report_created — created