Report #13835
[tooling] Running a separate embedding model alongside the llama.cpp LLM wastes VRAM and context window
Start `llama-server` with `--embedding --pooling mean` (or `--pooling cls`) to expose the `/embedding` endpoint on the same model instance that serves chat, enabling high-throughput vector generation without loading a second model.
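As a minimal sketch of how a client might call that endpoint, the Python below builds and sends a POST request to `/embedding`. The base URL `http://127.0.0.1:8080`, the `content` request field, and the helper names `build_embedding_request` and `fetch_embedding` are assumptions for illustration, not taken from the report. The response shape differs between llama.cpp versions, so the sketch returns the parsed JSON untouched rather than assuming a layout.

```python
import json
import urllib.request


def build_embedding_request(text, base_url="http://127.0.0.1:8080"):
    """Build (but do not send) a POST request for the /embedding endpoint.

    Assumption: the server accepts a JSON body of the form {"content": "..."}.
    """
    payload = json.dumps({"content": text}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/embedding",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_embedding(text, base_url="http://127.0.0.1:8080", timeout=30.0):
    """Send the request and return the parsed JSON response as-is.

    The response layout is version-dependent, so inspect it before
    indexing into it.
    """
    req = build_embedding_request(text, base_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With a server running, `fetch_embedding("hello world")` would return whatever JSON the server produces. Print it once to see where the vector lives in your build.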
Journey Context:
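Deployments often load a dedicated embedding model (e.g., BGE-M3) alongside the LLM, which adds that model's VRAM footprint and complicates orchestration. Since causal LMs like Llama/Mistral already produce rich hidden states, llama.cpp can extract embeddings by mean-pooling or CLS-pooling the final layer. The `--pooling` flag is critical: mean pooling averages the hidden states of all tokens and works best for general similarity, while cls pooling takes the first token's hidden state and suits sentence classification. In a causal model the first token has not attended to any later tokens, so check cls results for quality before relying on them. According to this report, the same Q4_K_M 70B model can then serve both chat and RAG retrieval, so VRAM is not split between two models and no data has to move between separate processes. Check the `llama-server` documentation for your build: some versions describe `--embedding` as restricting the server to embedding use only, in which case chat would not be available on that instance.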
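To make the difference between the two pooling modes concrete, here is a small pure-Python sketch. It illustrates the arithmetic only and is not llama.cpp's implementation. The function names and the toy hidden-state values are hypothetical.

```python
import math


def mean_pool(token_states):
    """Average the per-token hidden states into a single vector."""
    n = len(token_states)
    dim = len(token_states[0])
    return [sum(tok[i] for tok in token_states) / n for i in range(dim)]


def cls_pool(token_states):
    """Use the first token's hidden state as the sequence vector."""
    return list(token_states[0])


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy final-layer states for a 2-token sequence with hidden size 2.
states = [[1.0, 2.0], [3.0, 4.0]]
mean_vec = mean_pool(states)  # averages both tokens
cls_vec = cls_pool(states)    # keeps only the first token
```

Mean pooling blends information from every token, while cls pooling depends entirely on one position. That is why the choice of `--pooling` changes which inputs end up close together under `cosine`.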
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T19:51:14.690210+00:00— report_created — created