Report #81928
[tooling] Loading separate embedding model wastes RAM alongside generation model
Enable the llama.cpp server's embedding endpoint so the already-loaded GGUF serves both generation and embeddings. Launch with ./server -m model.gguf --embedding --pooling mean (recent llama.cpp builds name the binary llama-server). This exposes an OpenAI-compatible /v1/embeddings endpoint backed by the model's mean-pooled hidden states, so a separate embedding model such as BGE-large (1.3GB+) does not need to be held in memory.
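A minimal client sketch for the endpoint described above, using only the Python standard library. The base URL http://localhost:8080, the placeholder model label "local-gguf", and the helper names are assumptions for illustration and do not come from the report; the response shape assumed here is the OpenAI-style list of items with "index" and "embedding" fields.

```python
import json
import urllib.request

# Assumption: the server listens on localhost:8080; adjust to your launch options.
BASE_URL = "http://localhost:8080"


def build_embedding_request(texts, model="local-gguf"):
    """Build the JSON body for an OpenAI-style embeddings request.

    The "model" value is a hypothetical placeholder label; whether the
    server uses or ignores it is an assumption to verify on your build.
    """
    return {"input": list(texts), "model": model}


def parse_embedding_response(payload):
    """Return one vector per input text, ordered by each item's 'index'."""
    items = sorted(payload["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def embed(texts, base_url=BASE_URL):
    """POST the texts to /v1/embeddings and return the parsed vectors."""
    body = json.dumps(build_embedding_request(texts)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return parse_embedding_response(json.load(response))
```

With the server running, embed(["hello world", "goodbye world"]) should return two vectors of equal length. Keeping the request building and response parsing separate from the HTTP call makes both easy to check without a live server.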
Journey Context:
Modern generative models (Llama 3, Qwen2, Mistral) can produce usable sentence embeddings via mean or max pooling of their final-layer hidden states, in some cases competitive with dedicated embedding models on MTEB; measure quality on your own task before relying on this. Running a separate embedding service means loading a second model into RAM/VRAM, which increases the memory footprint and complicates deployment.
The --embedding flag enables the endpoint, but an explicit --pooling value (such as mean or cls) is also needed, because GGUF metadata often does not specify a pooling type. Without explicit pooling, the server may return unpooled per-token embeddings or reject the request. Some llama.cpp versions document --embedding as restricting the server to the embedding use case, so confirm that completion requests still work on your build.
Common pitfall: using the embedding endpoint with a causal model that has not been instruction-tuned for embeddings, which yields poor retrieval quality. Solution: use an instruct-based generative model, or prepend a 'Represent this sentence for searching...' style prefix in client code.
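The two client-side ideas above, mean pooling and instruction prefixes, can be sketched in plain Python. This is an illustration under stated assumptions, not llama.cpp's implementation: mean_pool shows what averaging per-token vectors means conceptually, with_prefix is a hypothetical helper, and the exact prefix wording is left to the caller because the report elides it.

```python
import math


def with_prefix(text, prefix):
    """Prepend an instruction prefix to a query before embedding it.

    The prefix string is supplied by the caller; the report truncates
    its example wording, so no specific text is assumed here.
    """
    return f"{prefix}{text}"


def mean_pool(token_vectors):
    """Average per-token vectors into a single sentence vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

A quick way to judge whether a prefix helps is to embed a few known query/document pairs with and without it and compare the cosine similarities of matching versus non-matching pairs.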
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T20:06:23.544403+00:00— report_created — created