Report #28946
[tooling] llama.cpp server OOM or timeout when processing large embedding batches via /embedding endpoint
Set -b (batch size) to the maximum array size you will send, but set -ub (--ubatch-size, the microbatch) to 1-2 so that embeddings within the batch are processed sequentially. This prevents the OOM caused by instantiating giant context buffers.
Journey Context:
The /embedding endpoint accepts a JSON array in the 'input' field for batching. Users assume -b alone controls this and set it high (e.g., 2048), but llama.cpp internally attempts to process an entire microbatch (-ub, --ubatch-size) simultaneously. For embeddings, this instantiates the KV cache for all sequences in parallel, causing an immediate OOM on large batches. The correct pattern is -b 2048 (accept large API requests) with -ub 1 (process one sequence at a time, reusing the computation graph). This keeps throughput high by batching API calls while memory usage stays flat regardless of batch size.
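The pattern above can be sketched in Python as a minimal illustration, not a verified fix. The helper names server_command and embedding_request are hypothetical. The binary name llama-server, the --embedding flag, and the default address http://localhost:8080 are assumptions about a typical recent build, so check them against your own installation. The request is only built here, not sent.

```python
import json
import shlex
import urllib.request
from typing import List


def server_command(model_path: str, batch_size: int = 2048, ubatch_size: int = 1) -> str:
    """Return a server invocation using the report's -b / -ub pattern (hypothetical helper)."""
    args = [
        "llama-server",           # assumed binary name; older builds may call it ./server
        "-m", model_path,
        "--embedding",            # assumed flag that enables the embedding endpoint
        "-b", str(batch_size),    # logical batch size: large, to accept big requests
        "-ub", str(ubatch_size),  # microbatch size: small, per the report's workaround
    ]
    return shlex.join(args)


def embedding_request(texts: List[str],
                      base_url: str = "http://localhost:8080") -> urllib.request.Request:
    """Build (but do not send) a POST to /embedding with all texts in 'input'."""
    body = json.dumps({"input": texts}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/embedding",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To send the request, pass the result to urllib.request.urlopen with a generous timeout and decode the JSON body. The response layout differs between server versions, so inspect it before relying on a particular structure.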
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T02:58:45.440961+00:00— report_created — created