Report #44132
[tooling] Poor sentence embedding quality or OOM when using llama-server for embeddings on large batches
Launch llama-server with `--embedding` and an explicit `--pooling` value that matches how the model was trained: `--pooling mean` for mean-pooled sentence-embedding models (e.g., nomic-embed-text) or `--pooling cls` for BERT-style models that use the [CLS] token (e.g., bge-base). Send texts as one batched JSON array, `{"input": ["text1", "text2", ...]}`, rather than as sequential calls, and make sure `--batch-size` (default 2048) is at least the total number of tokens in the batch. Each input may also need to fit within the physical batch, so `--ubatch-size` (default 512) may need raising as well.
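A minimal client-side sketch of the batching advice above, using only the Python standard library. The URL, the `/v1/embeddings` route, the token budget, and the four-characters-per-token estimate are assumptions for illustration; the helper names are hypothetical, and a real client would count tokens with the model's own tokenizer.

```python
import json
import urllib.request

# Assumed values for illustration; adjust to your own deployment.
SERVER_URL = "http://localhost:8080/v1/embeddings"  # assumed host, port, and route
TOKEN_BUDGET = 2048  # assumed to match the batch size the server was started with


def estimate_tokens(text):
    """Rough heuristic (about 4 characters per token); not the model's tokenizer."""
    return max(1, len(text) // 4)


def chunk_by_budget(texts, budget=TOKEN_BUDGET):
    """Group texts into batches whose estimated token total stays within budget."""
    batches, current, used = [], [], 0
    for text in texts:
        cost = estimate_tokens(text)
        # Start a new batch when adding this text would exceed the budget.
        if current and used + cost > budget:
            batches.append(current)
            current, used = [], 0
        current.append(text)
        used += cost
    if current:
        batches.append(current)
    return batches


def embed_batch(texts, url=SERVER_URL):
    """POST one batch as a JSON array and return the vectors in input order."""
    body = json.dumps({"input": texts}).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    # OpenAI-style responses carry an index per item; sort to preserve input order.
    items = sorted(payload["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]
```

With a server running, something like `vectors = [v for batch in chunk_by_budget(docs) for v in embed_batch(batch)]` would issue one request per batch instead of one per document. A single text whose estimate exceeds the budget still ends up alone in its own oversized batch, so very long documents need splitting beforehand.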
Journey Context:
Many users repurpose a chat server for embeddings by adding `--embedding` and then get poor results because they never set `--pooling`. When the effective pooling type is `none` (or is left unspecified and the model metadata does not supply one), the server returns raw, unpooled hidden states instead of a mean-pooled sentence embedding, which badly degrades semantic-similarity performance. The fix is to set the pooling type explicitly to match the model's training: `--pooling mean` for mean-pooled models such as nomic-embed-text, or `--pooling cls` for models that use the BERT [CLS] token, such as bge-base.
The second problem is throughput. Sending one request per sentence from a Python `for` loop adds per-request overhead and leaves the GPU idle between calls. The hard-won insight: llama-server exposes an OpenAI-compatible embeddings endpoint (`/v1/embeddings`, alongside the native `/embedding` route) that accepts batches. Pack your sentences into a single POST with `"input": ["sentence1", "sentence2", ...]`, up to the batch size, and the server computes the embeddings in parallel on the GPU instead of waiting on the client between requests.
Common mistakes: forgetting that embedding models often use different pooling than generation models, so an unset flag degrades quality silently, with no error; and leaving `--batch-size` (default 2048) and `--ubatch-size` (default 512) at their defaults when processing thousands of documents, so that oversized batches are truncated or rejected.
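To make the pooling distinction concrete, here is a small pure-Python sketch of what each strategy does to a sequence of per-token hidden states. The function names and the toy 2-dimensional vectors are hypothetical and only illustrate the arithmetic; they are not llama-server internals.

```python
def mean_pool(token_vectors):
    """Average every token vector into one sentence vector."""
    count = len(token_vectors)
    return [sum(column) / count for column in zip(*token_vectors)]


def cls_pool(token_vectors):
    """Use the first token's vector (the [CLS] position in BERT-style inputs)."""
    return list(token_vectors[0])


def last_pool(token_vectors):
    """Use the final token's vector, shown here only for contrast."""
    return list(token_vectors[-1])


# Hypothetical hidden states for a three-token input.
hidden_states = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

print(mean_pool(hidden_states))  # [1.0, 1.0]
print(cls_pool(hidden_states))   # [1.0, 0.0]
print(last_pool(hidden_states))  # [2.0, 2.0]
```

The three strategies give three different vectors for the same input, which is why a model trained with one pooling type tends to produce poor similarity scores when served with another.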
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T04:32:57.079264+00:00— report_created — created