Report #53467
[tooling] llama.cpp server returns incorrect embedding dimensions or 'pooling type mismatch' when generating sentence embeddings via /embedding endpoint
Start the llama.cpp server with an explicit pooling flag, '--pooling mean' (or 'cls', 'last'), matching the pooling the model was trained with (e.g., 'mean' for GTE-base, 'cls' for BERT). If pooling resolves to 'none' (reported here as the default when the flag is omitted), the server returns per-token embeddings instead of a single pooled sentence vector, which causes dimension mismatches in downstream RAG pipelines.
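A minimal sketch of the launch command described above, built as an argument list so the pooling value is validated before the server starts. The binary name 'llama-server', the helper name build_server_command, the model path, and the port are assumptions for illustration; check the flags against the --help output of your llama.cpp build before running.

```python
# Hypothetical helper: assembles a llama.cpp server command line with an
# explicit pooling type. The binary name and model path are assumptions.
ALLOWED_POOLING = {"none", "mean", "cls", "last"}


def build_server_command(model_path, pooling, port=8080):
    """Return the argv list for starting the server in embedding mode."""
    if pooling not in ALLOWED_POOLING:
        raise ValueError(f"unsupported pooling type: {pooling!r}")
    return [
        "llama-server",        # assumed binary name; older builds used 'server'
        "-m", model_path,
        "--embedding",         # enable the embedding endpoint
        "--pooling", pooling,  # must match the model card
        "--port", str(port),
    ]


if __name__ == "__main__":
    # Print the command instead of executing it, so it can be reviewed first.
    print(" ".join(build_server_command("gte-base.gguf", "mean")))
```

The list can be passed to subprocess.run once the flags have been checked against the installed version.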
Journey Context:
Embedding models like bge-large-en-v1.5 or GTE-base require a specific pooling strategy at inference time. With no pooling, llama.cpp returns every token embedding, which breaks RAG pipelines that expect a single vector per input (1024-dim for bge-large-en-v1.5, 768-dim for GTE-base). Users often conclude the model file is corrupted when the real cause is an omitted CLI flag. The choice between 'mean' and 'cls' depends on the model card; using 'cls' on a mean-trained model is reported to cost roughly 5-10% on retrieval benchmarks. The flag only affects the /embedding endpoint, not /completion. An alternative is to set the pooling type in the GGUF metadata at conversion time, but that choice can only be changed by reconverting the model. Critical: if you pass --embedding without --pooling, the result depends on what the server falls back to, and it may return the last-token embedding, which is incorrect for bidirectional encoders.
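The symptom described above can be checked on the client side before vectors reach the index. The sketch below classifies an embedding value as pooled or per-token by its shape and, for diagnosis only, mean-pools per-token output. The function names are hypothetical, and the exact JSON layout of the /embedding response varies between llama.cpp versions, so the code works on the already-extracted embedding value rather than the raw response. Client-side pooling here does not account for any normalization or masking the server may apply.

```python
# Diagnostic sketch: detect whether an embedding value is one pooled vector
# or a list of per-token vectors. Names are hypothetical, not a llama.cpp API.

def describe_embedding(embedding, expected_dim):
    """Return 'pooled', 'per_token', or 'unexpected' based on shape."""
    if not embedding:
        return "unexpected"
    first = embedding[0]
    if isinstance(first, (int, float)):
        # Flat list of numbers: a single sentence vector.
        return "pooled" if len(embedding) == expected_dim else "unexpected"
    if isinstance(first, (list, tuple)):
        # Nested list: one row per token, or a single pooled row.
        if all(len(row) == expected_dim for row in embedding):
            return "pooled" if len(embedding) == 1 else "per_token"
    return "unexpected"


def mean_pool(token_embeddings):
    """Average per-token vectors into one vector (diagnostic only)."""
    count = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(row[i] for row in token_embeddings) / count for i in range(dim)]


if __name__ == "__main__":
    per_token = [[1.0, 2.0], [3.0, 4.0]]
    print(describe_embedding(per_token, expected_dim=2))
    print(mean_pool(per_token))
```

If describe_embedding reports 'per_token' for a model that should yield one vector per input, the likely fix is the server-side --pooling flag rather than pooling in the client.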
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T20:14:31.352615+00:00— report_created — created