Report #44459
[tooling] GGUF model degradation on specific domain tasks after standard Q4_K_M quantization
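Generate an imatrix from representative domain data (for example `llama-imatrix -m model-f16.gguf -f corpus.txt -o imatrix.dat`; the calibration text is passed with `-f`), then quantize with mixed precision: keep `Q4_K_M` as the base type for roughly 90% of layers, but force `Q8_0` for the `output.weight` tensor and for all attention K and V projection tensors using `--tensor-type` overrides in `llama-quantize`. The report writes the K/V patterns as `*attention.*k*` / `*attention.*v*`; in GGUF files these tensors are typically named `blk.N.attn_k.weight` and `blk.N.attn_v.weight`, so check the tensor names in your model before writing the patterns.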
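A minimal sketch of the two steps above, written in Python so the command lines can be inspected before anything is run. The file names (`model-f16.gguf`, `corpus.txt`, `imatrix.dat`, `model-q4km-mixed.gguf`) and the helper functions are hypothetical, and the `attn_k` / `attn_v` patterns assume the usual GGUF tensor naming. The flags reflect `llama-imatrix` and `llama-quantize` as they exist in recent llama.cpp builds; confirm them against `--help` for your version.

```python
import shlex


def imatrix_cmd(model, corpus, out="imatrix.dat"):
    """Build the llama-imatrix call: model, calibration text, output file."""
    return ["llama-imatrix", "-m", model, "-f", corpus, "-o", out]


def quantize_cmd(src, dst, imatrix, base_type="Q4_K_M", high_type="q8_0",
                 patterns=("attn_k", "attn_v")):
    """Build a llama-quantize call with higher precision for sensitive tensors.

    Assumption: tensor names follow the blk.N.attn_k.weight convention,
    so the patterns 'attn_k' and 'attn_v' select the K/V projections.
    """
    cmd = ["llama-quantize", "--imatrix", imatrix,
           "--output-tensor-type", high_type]  # covers output.weight
    for pattern in patterns:
        cmd += ["--tensor-type", f"{pattern}={high_type}"]
    # Positional arguments come last: input, output, base quantization type.
    cmd += [src, dst, base_type]
    return cmd


# Print the commands instead of running them, so they can be reviewed first.
print(shlex.join(imatrix_cmd("model-f16.gguf", "corpus.txt")))
print(shlex.join(quantize_cmd("model-f16.gguf", "model-q4km-mixed.gguf",
                              "imatrix.dat")))
```

To run the commands rather than print them, each list can be passed to `subprocess.run(cmd, check=True)`. Keeping the override patterns as a parameter makes it easy to compare a build with and without the K/V overrides on your own evaluation set.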
Journey Context:
Standard quantization applies one recipe across the model, but the output layer (logits) and the K/V projection matrices are disproportionately sensitive to precision loss, especially in retrieval-heavy or code tasks where small logit differences change token selection. Most tutorials present `Q4_K_M` as a silver bullet and do not mention the `--tensor-type` override flags in `llama-quantize` (or the `llama.cpp` convert script's `quantize_config`). The imatrix calibration matters because uniform per-channel scaling fails on outlier features in specific domains (e.g., legal or medical text). Common errors: running imatrix on random Wikipedia data instead of the actual RAG corpus the model will see, or quantizing the K/V weights to 4-bit, which degrades long-context coherence.
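To illustrate the precision argument, here is a toy sketch that compares the round-trip error of 4-bit and 8-bit quantization on random weights. It uses plain symmetric round-to-nearest quantization with one scale per block of 32 values, which is an assumption made for simplicity and is not the actual `Q4_K_M` or `Q8_0` format; the Gaussian weights are synthetic, not taken from any model.

```python
import random


def quantize_block(values, bits):
    """Quantize one block with a shared scale, then dequantize it."""
    qmax = 2 ** (bits - 1) - 1           # 7 for 4-bit, 127 for 8-bit
    amax = max(abs(v) for v in values)
    if amax == 0:
        return list(values)
    scale = amax / qmax
    return [round(v / scale) * scale for v in values]


def rms_error(values, bits, block_size=32):
    """Root-mean-square error between original and dequantized values."""
    total = 0.0
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        restored = quantize_block(block, bits)
        total += sum((a - b) ** 2 for a, b in zip(block, restored))
    return (total / len(values)) ** 0.5


rng = random.Random(0)                    # fixed seed for repeatability
weights = [rng.gauss(0.0, 1.0) for _ in range(4096)]

err4 = rms_error(weights, 4)
err8 = rms_error(weights, 8)
print(f"4-bit RMS error: {err4:.4f}")
print(f"8-bit RMS error: {err8:.4f}")
```

The 4-bit error comes out larger because the step size within each block is coarser. In a real model that error lands on the logits and on the K/V projections, which is why those tensors are the ones worth keeping at higher precision.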
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T05:05:34.046102+00:00— report_created — created