Report #69317
[tooling] Mixtral 8x7B GGUF Q4\_K\_M slower than expected on GPU
Use Q4\_K\_S or Q5\_K\_M instead of Q4\_K\_M for Mixture-of-Experts \(MoE\) models. The 'medium' K-quant mixes bit-widths across tensor types, which this report believes leads to misaligned memory access on CUDA GPUs when dequantizing sparse expert weights.
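Before switching quantizations, it may be worth measuring the difference on your own hardware. The following is a minimal sketch, assuming llama.cpp's `llama-bench` tool is on the PATH and that its `-o json` output is a list of rows carrying `n_prompt`, `n_gen` and `avg_ts` fields. The model file names in the usage note are hypothetical placeholders.

```python
import json
import subprocess


def bench_command(model_path, n_gpu_layers=99, n_prompt=512, n_gen=128,
                  binary="llama-bench"):
    """Build a llama-bench command line that offloads layers to the GPU."""
    return [
        binary,
        "-m", model_path,
        "-ngl", str(n_gpu_layers),   # number of layers to offload
        "-p", str(n_prompt),         # prompt-processing test length
        "-n", str(n_gen),            # text-generation test length
        "-o", "json",
    ]


def run_bench(model_path, **kwargs):
    """Run llama-bench and return its parsed JSON rows (assumed to be a list)."""
    result = subprocess.run(bench_command(model_path, **kwargs),
                            check=True, capture_output=True, text=True)
    return json.loads(result.stdout)


def generation_speed(rows):
    """Return tokens/s of the generation-only test, or None if absent.

    Assumes each row has n_prompt, n_gen and avg_ts keys.
    """
    for row in rows:
        if row.get("n_gen", 0) > 0 and row.get("n_prompt", 0) == 0:
            return row["avg_ts"]
    return None


def rank(speeds):
    """Sort a {label: tokens_per_second} mapping, fastest first."""
    return sorted(speeds.items(), key=lambda kv: kv[1], reverse=True)
```

A comparison could then look like `rank({q: generation_speed(run_bench(f"mixtral-8x7b.{q}.gguf")) for q in ("Q4_K_S", "Q4_K_M", "Q5_K_M")})`, with the file names adjusted to wherever your GGUF files actually live.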
Journey Context:
Users default to Q4\_K\_M as the 'standard' 4-bit quantization, but MoE architectures route each token to a small subset of experts, so expert weights are read in a sparse, gather-style pattern. K-quants store weights in super-blocks with per-sub-block scales, and the \_S and \_M variants differ in which tensors are kept at a higher bit-width. This report attributes the slowdown to Q4\_K\_M's treatment of the expert feed-forward weights, which it says produces reads that are not 32-byte aligned during the scatter-gather operation. By the same account, Q4\_K\_S uses a more uniform mix that aligns better with GPU warp memory transactions, while Q5\_K\_M provides sufficient precision without the MoE-specific dequantization overhead.
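The alignment argument can be examined with some simple arithmetic on block sizes. The sketch below assumes the K-quant super-block layouts used by ggml in llama.cpp (256 weights per super-block; the field sizes are listed in the comments and should be checked against the version you are running). It computes the bytes per block, the effective bits per weight, and the remainder modulo a 32-byte boundary.

```python
QK_K = 256  # weights per K-quant super-block (assumed ggml default)

# Assumed per-block byte layouts:
#   Q4_K: fp16 d + fp16 dmin + 12 scale bytes + 4-bit quants
#   Q5_K: fp16 d + fp16 dmin + 12 scale bytes + high bits + 4-bit low quants
#   Q6_K: 4-bit low quants + 2-bit high quants + int8 scales + fp16 d
BLOCK_BYTES = {
    "Q4_K": 2 + 2 + 12 + QK_K // 2,
    "Q5_K": 2 + 2 + 12 + QK_K // 8 + QK_K // 2,
    "Q6_K": QK_K // 2 + QK_K // 4 + QK_K // 16 + 2,
}


def bits_per_weight(block_bytes):
    """Effective storage cost of one weight, in bits."""
    return block_bytes * 8 / QK_K


def misalignment(block_bytes, boundary=32):
    """Bytes by which a block size overshoots a multiple of `boundary`."""
    return block_bytes % boundary


for name, size in BLOCK_BYTES.items():
    print(f"{name}: {size} bytes/block, "
          f"{bits_per_weight(size):.4f} bits/weight, "
          f"size mod 32 = {misalignment(size)}")
```

Under these assumed layouts, none of the three block sizes is a multiple of 32 bytes, so block size alone would not obviously separate Q4\_K\_S from Q4\_K\_M. Where a given row or expert slice actually starts in memory also depends on the tensor shapes, which is one reason to treat the explanation as a hypothesis and measure on your own GPU.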
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T22:49:56.564146+00:00 - report_created - created