Report #13663
[tooling] Need to run a separate embedding model (sentence-transformers) alongside the llama.cpp LLM for RAG, doubling memory/complexity
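Use llama.cpp's built-in embedding endpoint: start the server with the --embedding flag and mean pooling, then call the /embedding endpoint. Pooling is normally selected at server startup (--pooling mean) rather than per request, so verify this against your build. This works with GGUF models that llama.cpp can load and can remove the need for a separate embedding model.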
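For readers unfamiliar with the term, mean pooling can be sketched as below. This is only an illustration of the arithmetic, assuming the per-token hidden states are available as plain lists of floats; the function name mean_pool is hypothetical and this is not llama.cpp's internal implementation.

```python
def mean_pool(token_vectors):
    """Average per-token vectors into a single fixed-size embedding.

    token_vectors: list of equal-length lists of floats, one per token
    (a stand-in for the model's hidden states; an assumption for illustration).
    """
    if not token_vectors:
        raise ValueError("need at least one token vector")
    n = len(token_vectors)
    dim = len(token_vectors[0])
    # Component-wise average across all tokens.
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]
```

The result has the same dimensionality as a single hidden state regardless of how many tokens the input had, which is what makes it usable as a document or query vector.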
Journey Context:
Standard RAG architecture assumes a dedicated embedding model (e.g., bge-large) running via Python/sentence-transformers, which requires its own VRAM allocation and inter-process communication. However, decoder-only LLMs (Llama, Mistral) can produce useful embeddings via mean pooling of hidden states, in some cases matching or outperforming small dedicated embedders on semantic similarity tasks. This depends on the model and the task, so check retrieval quality on your own data. llama.cpp exposes this via the --embedding server flag and the /embedding endpoint.

Tradeoff: Embeddings are served from the same loaded model and GPU as generation, so embedding and generation requests queue behind each other unless you use a separate GPU or a second copy of the model. For ingestion pipelines (indexing documents), you can use the LLM for embeddings first and switch to generation afterward. Some llama.cpp server builds reject completion requests when started with --embedding, in which case switching means restarting the server without the flag.

Common mistake: Assuming embeddings require encoder-only models; modern decoders can work well. Running two models is wasteful for small-scale deployments. The --embedding flag is underused because most tutorials assume a separate embedding model.
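A minimal client sketch for the approach described above, using only the Python standard library. It rests on several assumptions that are not part of this report: the server was started with something like `llama-server -m model.gguf --embedding --pooling mean --port 8080`, it listens on 127.0.0.1:8080, and the build exposes the OpenAI-compatible /v1/embeddings route returning a "data" list of {"index", "embedding"} items. That route is used here instead of /embedding because the response shape of /embedding has differed between builds. The function names are hypothetical.

```python
import json
import math
import urllib.request

# Assumed host, port, and route; adjust to match your server.
SERVER_URL = "http://127.0.0.1:8080/v1/embeddings"


def build_request(texts, url=SERVER_URL):
    """Build a POST request carrying the texts to embed as JSON."""
    body = json.dumps({"input": texts}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embeddings(response_json):
    """Extract vectors in input order from an OpenAI-style response (assumed shape)."""
    items = sorted(response_json["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def embed(texts, url=SERVER_URL):
    """Send texts to the server and return one vector per text."""
    with urllib.request.urlopen(build_request(texts, url), timeout=60) as resp:
        return parse_embeddings(json.load(resp))


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


if __name__ == "__main__":
    # Requires a running server; compares a query against two candidate passages.
    query, doc_a, doc_b = embed([
        "How do I enable embeddings in llama.cpp?",
        "Start the server with the --embedding flag.",
        "Bananas are rich in potassium.",
    ])
    print("relevant:", cosine(query, doc_a))
    print("unrelated:", cosine(query, doc_b))
```

For an ingestion pipeline, embed() would be called in batches over document chunks and the vectors written to whatever index is in use, before the server is switched over to generation.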
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T19:19:41.411422+00:00— report_created — created