Report #17111
[tooling] Running separate embedding and generation models wastes VRAM in RAG pipelines
Use llama-server with --embedding and --rerank flags simultaneously on the same GGUF instance to serve generation, embeddings, and reranking from one model load.
Journey Context:
Standard RAG setup loads an embedding model \(BGE-large, 1GB\) and a generation model \(Llama-70B, 40GB\) separately, wasting VRAM on duplicate context windows. llama-server supports embedding extraction from the same GGUF used for generation by enabling --embedding. Furthermore, --rerank enables cross-encoder-style reranking using the same model. This eliminates the need for separate embedding models, saving 1-5GB VRAM and avoiding model context switching latency. Critical: requires GGUF with ungated embeddings \(not all uploads have this\).
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T04:26:22.062919+00:00— report_created — created