Report #36082
[tooling] Poor sentence embedding quality when using a chat LLM for embeddings via llama.cpp server
Use the --embedding and --pooling mean (or cls) flags with a dedicated embedding model (such as nomic-embed-text-v1.5), NOT a chat LLM. Set --pooling explicitly to override the default 'none', which produces token-level embeddings instead of a sentence-level representation.
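As a rough illustration of the flags above, here is a minimal Python sketch that assembles the server command line. The binary name `llama-server`, the `-m` and `--port` flags, the helper name `build_server_command`, and the model filename are assumptions for illustration and may differ in your build; restricting the pooling value to mean or cls is a choice of this sketch that mirrors the recommendation, not a limit of llama.cpp.

```python
import shlex


def build_server_command(model_path, pooling="mean", port=8080):
    """Return an argv list for a llama.cpp server in embedding mode.

    Hypothetical helper: it only builds the argument list and does not
    start the server.
    """
    # This sketch accepts only the two pooling modes the report recommends.
    if pooling not in ("mean", "cls"):
        raise ValueError("pooling must be 'mean' or 'cls' for sentence embeddings")
    return [
        "llama-server",        # assumed binary name
        "-m", model_path,      # path to a dedicated embedding model (GGUF)
        "--embedding",         # enable embedding output
        "--pooling", pooling,  # set explicitly instead of relying on the default
        "--port", str(port),
    ]


if __name__ == "__main__":
    # The model filename is a placeholder, not a verified artifact name.
    print(shlex.join(build_server_command("nomic-embed-text-v1.5.gguf")))
```

The resulting list can be passed to `subprocess.Popen` once you have checked that the flags match your llama.cpp version.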
Journey Context:
Many users mistakenly use chat models (Llama-3, Mistral) for embeddings. Embedding models apply mean or cls pooling across the token embeddings to produce a single vector per sentence. The llama.cpp server defaults to 'none' pooling for generative models, so it outputs one vector per token instead of one vector per sentence. Without --pooling mean, the result is a set of per-token embeddings that cannot be used as a sentence representation. Additionally, models like nomic-embed-text require specific pooling configurations and should not be prompted like chat models.
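To make the difference between per-token output and a pooled sentence vector concrete, here is a small pure-Python sketch of mean and cls pooling. The function name `pool` and the toy vectors are illustrative assumptions; this is not llama.cpp's implementation, only the arithmetic the two modes are commonly understood to perform.

```python
def pool(token_vectors, mode="mean"):
    """Collapse per-token vectors into one sentence vector.

    token_vectors: list of equal-length lists of floats, one per token
    mode: "mean" averages each dimension over all tokens,
          "cls" keeps only the first token's vector
    """
    if not token_vectors:
        raise ValueError("need at least one token vector")
    if mode == "cls":
        # The first position plays the role of the [CLS] token.
        return list(token_vectors[0])
    if mode == "mean":
        n = len(token_vectors)
        # zip(*...) groups values by dimension across tokens.
        return [sum(dim) / n for dim in zip(*token_vectors)]
    raise ValueError("unsupported pooling mode: " + repr(mode))


if __name__ == "__main__":
    # Two tokens with two dimensions each (toy data).
    tokens = [[1.0, 2.0], [3.0, 4.0]]
    print(pool(tokens, "mean"))  # one vector for the whole input
    print(pool(tokens, "cls"))   # first token's vector only
```

With 'none' pooling, a client receives something shaped like `tokens` above (one vector per token) and would have to do this reduction itself, which is why setting the pooling mode on the server is the simpler fix.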
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T15:02:20.540269+00:00— report_created — created