Report #39707
[tooling] Slow embedding generation or high memory usage with local embedding models
Start `llama.cpp/server` with the flags `--embedding --pooling mean` (or `--pooling cls` for BERT-style models) to enable the `/embedding` endpoint. This path skips the autoregressive generation logic and encodes inputs in batches using optimized matrix operations.
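The launch flags and the endpoint call above can be sketched as a small Python client. This is a minimal sketch under stated assumptions, not a reference implementation: the binary name `llama-server` (newer builds; older builds ship it as `server`), the port, the model path, and the helper names `server_command`, `build_embedding_request`, and `fetch_embedding` are all illustrative. The request body uses the `content` field, and the response is returned as parsed JSON without assuming its shape, since the layout has differed between server versions.

```python
import json
import urllib.request


def server_command(model_path, pooling="mean", port=8080, binary="llama-server"):
    """Return an argv list that starts the server in embedding mode.

    `binary`, `port`, and `model_path` are assumptions; adjust to your build.
    """
    return [binary, "-m", model_path, "--embedding",
            "--pooling", pooling, "--port", str(port)]


def build_embedding_request(text, base_url="http://127.0.0.1:8080"):
    """Build (but do not send) a POST request for the /embedding endpoint."""
    payload = json.dumps({"content": text}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/embedding",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_embedding(text, base_url="http://127.0.0.1:8080", timeout=30.0):
    """Send the request and return the parsed JSON response unchanged."""
    request = build_embedding_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server already running, `fetch_embedding("some text")` returns whatever JSON the server sends; inspect it once to see where the vector sits in your version before indexing into it.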
Journey Context:
Users frequently repurpose the `/completion` endpoint with `n_predict=0`, or use external Python libraries (SentenceTransformers), to extract embeddings from local models. Both routes add cost: Python GIL overhead and unnecessary memory copies. The llama.cpp server has a dedicated embedding mode that extracts hidden states directly from the relevant layer (usually the last layer before the LM head) and applies pooling (mean, CLS, or last token) via optimized CUDA/Metal kernels. Without the `--embedding` flag, the server does not set up pooling and will not serve `/embedding` requests. The `--pooling` flag must match the model, because models are trained with a specific pooling strategy (e.g., BERT uses CLS, GTE uses mean) and the wrong choice degrades embedding quality. The native path lets the GPU process large batches of sequences without Python overhead; this report puts the gain at roughly 10x the throughput of the completion-endpoint workaround, a figure that will vary with model, hardware, and batch size.
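To make the three pooling strategies concrete, here is a minimal pure-Python sketch of what each one computes from per-token hidden states. It illustrates the idea only; it is not how llama.cpp implements pooling, and the function name `pool` and the toy vectors are hypothetical.

```python
def pool(hidden_states, mode="mean"):
    """Collapse per-token vectors (one list per token) into one embedding.

    mode="cls"  -> vector of the first token (BERT-style [CLS] position)
    mode="last" -> vector of the final token
    mode="mean" -> element-wise average over all tokens
    """
    if not hidden_states:
        raise ValueError("hidden_states must contain at least one token")
    if mode == "cls":
        return list(hidden_states[0])
    if mode == "last":
        return list(hidden_states[-1])
    if mode == "mean":
        count = len(hidden_states)
        # zip(*rows) walks the matrix column by column, one dimension at a time.
        return [sum(column) / count for column in zip(*hidden_states)]
    raise ValueError(f"unknown pooling mode: {mode!r}")


# Three tokens, two hidden dimensions each (toy values).
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]]
print(pool(tokens, "mean"))  # [3.0, 5.0]
print(pool(tokens, "cls"))   # [1.0, 2.0]
print(pool(tokens, "last"))  # [5.0, 9.0]
```

The three modes produce different vectors from the same hidden states, which is why a pooling mode that differs from the one the model was trained with gives embeddings that compare poorly.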
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T21:07:26.180205+00:00— report_created — created