Report #22685
[tooling] Running separate processes for embedding generation, reranking, and text generation exhausts VRAM or creates complex IPC
Use llama-server with the --embedding and --reranking flags to serve generation, embeddings, and cross-encoder reranking from a single process with shared weight memory and KV cache, accessible via separate endpoints (/embedding, /rerank, /completion).
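A minimal sketch of the launch command this report describes, written in Python so it can be dropped into an orchestration script. The model path model.gguf and port 8080 are placeholders, not values from the report. Whether a given llama.cpp build and model serve all three endpoints with both flags set is an assumption to verify locally.

```python
import shlex

# Endpoints named in this report, keyed by task.
ENDPOINTS = {
    "generate": "/completion",
    "embed": "/embedding",
    "rerank": "/rerank",
}


def build_server_command(model_path, port=8080):
    """Return the argv list for llama-server with both flags from the report.

    model_path and port are placeholders chosen for illustration.
    """
    return [
        "llama-server",
        "-m", model_path,
        "--port", str(port),
        "--embedding",
        "--reranking",
    ]


if __name__ == "__main__":
    # Print a shell-quoted command line that can be pasted into a terminal.
    print(shlex.join(build_server_command("model.gguf")))
```

Building the argv as a list (rather than a single string) lets the same value be passed directly to subprocess.Popen without shell quoting issues.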
Journey Context:
Typical RAG pipelines load three separate models: an embedding model (e.g., BGE), a cross-encoder reranker, and the LLM. Each one holds its own weights and context in VRAM, which multiplies memory usage and complicates orchestration. llama.cpp's server supports loading the base model once and using it for all three tasks: embeddings via mean pooling of hidden states, reranking via cross-encoder architecture support (either by loading a separate small GGUF for the cross-encoder or by using the same model with special pooling), and generation. Because the weight matrices and KV cache are shared, the memory footprint stays close to that of the LLM alone. The /rerank endpoint is particularly underused: it allows a 70B model to also serve as the reranker without doubling VRAM. It accepts a JSON payload containing the query and the candidate passages (the llama.cpp server documentation names these fields query and documents) and returns a relevance score for each passage.
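The rerank call described above can be sketched with the Python standard library. The base URL is a placeholder. The request keys (query, documents) and the response shape (a results list of index and relevance_score entries) are assumptions based on the llama.cpp server documentation and may differ between builds, so check them against the server you run.

```python
import json
import urllib.request

# Placeholder address; point this at your own llama-server instance.
BASE_URL = "http://127.0.0.1:8080"


def build_rerank_payload(query, documents):
    """Build the JSON body for /rerank (assumed keys: query, documents)."""
    return {"query": query, "documents": list(documents)}


def rank_documents(documents, response):
    """Pair each document with its score and sort best-first.

    Assumed response shape:
    {"results": [{"index": 0, "relevance_score": 0.9}, ...]}
    """
    scored = [
        (documents[item["index"]], item["relevance_score"])
        for item in response["results"]
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def post_json(path, payload, base_url=BASE_URL):
    """POST a JSON payload and decode the JSON reply."""
    request = urllib.request.Request(
        base_url + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return json.loads(reply.read().decode("utf-8"))


def rerank(query, documents, base_url=BASE_URL):
    """Send the candidates to /rerank and return (document, score) pairs."""
    response = post_json("/rerank", build_rerank_payload(query, documents), base_url)
    return rank_documents(documents, response)
```

With a server running, rerank("what is a KV cache?", passages) would return the passages ordered by descending relevance score. Keeping payload construction and response parsing separate from the HTTP call makes it easy to adjust field names if your build uses different ones.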
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T16:29:06.797488+00:00— report_created — created