Report #84103
[tooling] llama-server embedding endpoint returns poor sentence-similarity scores or crashes on batch inputs
Launch the server with `--embedding --pooling mean` (critical: the default is often `cls`, which is wrong for generative models) and send batches using `input: ["text1", "text2"]` JSON array syntax; do NOT pass a batch as a single string.
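A minimal sketch of the batch request, assuming the server listens on `http://127.0.0.1:8080`, exposes the OpenAI-style `/v1/embeddings` route, and returns a `data` list whose items carry `index` and `embedding` fields. The helper names (`build_payload`, `embed_batch`, `cosine_similarity`) and the model file name in the comment are hypothetical; check them against your own setup before relying on this.

```python
import json
import math
import urllib.request

# Assumption: the server was started with something like
#   llama-server -m model.gguf --embedding --pooling mean
# where model.gguf is a placeholder, and it listens on the address below.
BASE_URL = "http://127.0.0.1:8080"


def build_payload(texts):
    """Build one request body that carries the whole batch as a JSON array."""
    if isinstance(texts, str):
        # A bare string is a single input, not a batch.
        raise TypeError("pass a list of strings, not a single string")
    return {"input": list(texts)}


def embed_batch(texts, base_url=BASE_URL):
    """POST the batch once and return the vectors in input order."""
    request = urllib.request.Request(
        f"{base_url}/v1/embeddings",
        data=json.dumps(build_payload(texts)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # Assumed response shape: {"data": [{"index": 0, "embedding": [...]}, ...]}
    rows = sorted(body["data"], key=lambda row: row["index"])
    return [row["embedding"] for row in rows]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

With the server running, `first, second = embed_batch(["text1", "text2"])` followed by `cosine_similarity(first, second)` gives one score for the pair from a single request.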
Journey Context:
When generative models (Llama, Mistral) are used as embedders via llama-server, users often get poor retrieval results (similarity scores that look random) and assume the endpoint is broken. The root cause is that llama-server often defaults to `cls` pooling (taking the hidden state of the first token), which is correct for BERT-style encoders but fails for autoregressive models: the first token is typically the beginning-of-sequence token or otherwise unrelated to the sentence meaning, and under causal attention its hidden state cannot see the rest of the sentence. The fix is to set `--pooling mean` explicitly, which averages all token embeddings (some models work better with `last`). Batching is also non-obvious: send a JSON array in the `input` field rather than repeating the request, and make sure the server was started with `--embedding` (which locks the model into embedding mode and disables completions). This is a common pitfall because users expect OpenAI API compatibility, but the pooling default differs and the mode lock is strict.
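The difference between the pooling modes can be illustrated with toy numbers. This is a sketch only: the per-token vectors below are invented, not real model output, and the function names are hypothetical. Both toy sentences share the same first-row vector, standing in for a shared leading token.

```python
def cls_pool(token_vectors):
    """Keep only the first token's vector."""
    return list(token_vectors[0])


def last_pool(token_vectors):
    """Keep only the final token's vector."""
    return list(token_vectors[-1])


def mean_pool(token_vectors):
    """Average every token's vector, dimension by dimension."""
    count = len(token_vectors)
    return [sum(column) / count for column in zip(*token_vectors)]


# Invented hidden states: row 0 plays the shared leading token,
# the remaining rows play the tokens that actually differ.
sentence_a = [[1.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
sentence_b = [[1.0, 0.0], [6.0, 0.0], [2.0, 0.0]]

# First-token pooling cannot tell the two sentences apart.
same_under_cls = cls_pool(sentence_a) == cls_pool(sentence_b)

# Mean and last pooling both reflect the tokens that differ.
differs_under_mean = mean_pool(sentence_a) != mean_pool(sentence_b)
differs_under_last = last_pool(sentence_a) != last_pool(sentence_b)
```

In this toy setup first-token pooling maps both sentences to the same vector, which is the kind of collapse that would make similarity scores uninformative.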
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T23:45:37.261694+00:00— report_created — created