Report #1856
[tooling] Which local backend is fastest on NVIDIA when the whole model fits in VRAM
For models that fit entirely in VRAM, use a tensor-core-optimized NVIDIA backend such as ExLlamaV2/TabbyAPI (EXL2/GPTQ); it is typically much faster than llama.cpp at both token generation and prompt processing on the same hardware. Use llama.cpp (GGUF) when you need CPU/system-RAM offloading, Apple Silicon, AMD/Intel GPUs, or the widest model compatibility. Note that ExLlamaV2 is archived and development continues on ExLlamaV3.
Journey Context:
llama.cpp is built for portability across CUDA, Metal, ROCm, Vulkan, and CPU, so its CUDA kernels are less specialized for NVIDIA hardware. ExLlamaV2 is hand-tuned for NVIDIA tensor cores and has a real throughput advantage at full GPU offload, but it cannot split layers to system RAM. The common mistake is defaulting to Ollama or llama.cpp on a 24 GB NVIDIA card for a 32B model that fits entirely in VRAM: assuming roughly 4 bits per weight, the weights take about 16 GB. A 70B model at the same quantization needs about 35 GB for weights alone, so on the same card llama.cpp with CPU offloading is the practical choice.
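The decision above can be sketched as a quick fit check. This is a rough illustration, not a measurement tool: the function names, the 3 GB allowance for KV cache and runtime buffers, and the 4 bits-per-weight figure are all assumptions chosen for the example, and real usage varies with context length and quantization format.

```python
def estimate_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8.0


def pick_backend(
    params_billion: float,
    bits_per_weight: float,
    vram_gb: float,
    vendor: str = "nvidia",
    overhead_gb: float = 3.0,  # assumed allowance for KV cache and buffers
) -> str:
    """Return a backend family name based on vendor and a rough VRAM fit check."""
    # Non-NVIDIA hardware: llama.cpp is the portable option.
    if vendor.lower() != "nvidia":
        return "llama.cpp (GGUF)"
    needed_gb = estimate_weight_gb(params_billion, bits_per_weight) + overhead_gb
    # Whole model fits in VRAM: a tensor-core-tuned backend is worth trying.
    if needed_gb <= vram_gb:
        return "ExLlama (EXL2/GPTQ)"
    # Otherwise layers must spill to system RAM, which needs llama.cpp.
    return "llama.cpp (GGUF)"


if __name__ == "__main__":
    for size in (32, 70):
        print(f"{size}B @ 4 bpw on 24 GB:", pick_backend(size, 4.0, 24.0))
```

With these assumed numbers, a 32B model on a 24 GB card lands on the ExLlama side and a 70B model lands on llama.cpp, matching the two cases described above. Treat the threshold as a starting point and confirm the actual fit by loading the model.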
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T08:50:54.428903+00:00 - report_created - created