Report #9709
[tooling] Serving separate embedding and completion models wastes VRAM; how to use one GGUF for both RAG embedding and chat?
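Launch the llama.cpp server with `--embedding --pooling mean` (or `cls`/`last`). This exposes the `/embedding` endpoint on the same loaded model weights: the server pools the prompt's per-token hidden states into one vector (mean over tokens, the CLS token, or the last token), with no unloading or reloading of weights. Choose the `pooling` type that matches how the model was trained (BERT-style: mean or CLS; GTE: mean). Behavior differs across llama.cpp versions: some builds document `--embedding` as restricting the server to embedding requests, so confirm that the completion endpoints still respond on your build.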
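A minimal Python sketch of the launch line and a matching request, for illustration only. The helper names `server_command` and `embedding_request` are hypothetical, and several details are assumptions not stated in this report: the binary name `llama-server`, the `-m` and `--port` arguments, the default address, and the `{"content": ...}` request body. Check them against the README of your llama.cpp build. Nothing is launched or sent here; the code only assembles the command and the request.

```python
import json
import shlex
import urllib.request

# Pooling values named in this report; other builds may accept more.
POOLING_TYPES = {"mean", "cls", "last"}


def server_command(model_path, pooling="mean", port=8080):
    """Build the launch command as an argument list (not executed here)."""
    if pooling not in POOLING_TYPES:
        raise ValueError(f"unsupported pooling type: {pooling!r}")
    # Assumption: the server binary is on PATH as `llama-server`.
    return [
        "llama-server", "-m", model_path,
        "--embedding", "--pooling", pooling,
        "--port", str(port),
    ]


def embedding_request(text, base_url="http://127.0.0.1:8080"):
    """Prepare a POST to /embedding. The JSON body shape is an assumption."""
    body = json.dumps({"content": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/embedding",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Print a copy-pasteable shell command for a hypothetical model file.
    print(shlex.join(server_command("model.gguf", pooling="last")))
    req = embedding_request("What is retrieval-augmented generation?")
    print(req.get_method(), req.full_url)
    # To send it against a running server:
    #   with urllib.request.urlopen(req) as resp:
    #       print(json.load(resp))
```

The response layout is deliberately not parsed, since it can differ between server versions; inspect the raw JSON once before writing code that depends on it.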
Journey Context:
Standard practice loads a dedicated embedding model (e.g., BGE) alongside the LLM, doubling the VRAM footprint. llama.cpp can extract embeddings from any causal LM, but the default `pooling` varies by architecture, and a mismatch yields poor retrieval (e.g., last-token pooling on a BERT-style model). Using one model for both consolidates infrastructure. The tradeoff is that causal LMs underperform dedicated embedders on benchmarks, but for in-domain RAG the gap is often acceptable compared with the 2x VRAM cost.
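To make the pooling mismatch concrete, here is a toy Python sketch of the three strategies applied to the same per-token vectors. The `pool` and `cosine` helpers are hypothetical, and the token vectors are made-up numbers, not output from any model. It only shows that the choice of pooling can produce very different sentence vectors from identical hidden states.

```python
import math


def pool(token_vectors, mode):
    """Reduce per-token vectors (one list of floats per token) to one vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    if mode == "cls":
        # First position, where BERT-style models place the CLS token.
        return list(token_vectors[0])
    if mode == "last":
        # Final position, the usual choice for causal LMs.
        return list(token_vectors[-1])
    if mode == "mean":
        n = len(token_vectors)
        return [sum(column) / n for column in zip(*token_vectors)]
    raise ValueError(f"unknown pooling mode: {mode!r}")


def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy hidden states for a three-token prompt (illustrative values only).
tokens = [[1.0, 0.0], [0.0, 1.0], [0.0, 3.0]]

cls_vec = pool(tokens, "cls")
last_vec = pool(tokens, "last")
mean_vec = pool(tokens, "mean")

print("cls :", cls_vec)
print("last:", last_vec)
print("mean:", mean_vec)
print("cosine(cls, last):", cosine(cls_vec, last_vec))
```

In this toy case the CLS and last-token vectors are orthogonal, which is the kind of divergence that degrades retrieval when documents and queries are embedded with a pooling type the model was not trained for.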
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T08:50:21.360782+00:00— report_created — created