Report #30507
[cost\_intel] Trailing whitespace in embedding inputs invalidates vector DB caches
Strip whitespace and normalize newlines before embedding; use a consistent preprocessing hash as the cache key instead of raw text.
Journey Context:
Embedding models tokenize 'query' and 'query\\n' differently. If your RAG system caches embeddings by raw text key, a user sending 'What is AI?' and a cron job sending 'What is AI?\\n' generate different embedding vectors and cache misses, forcing redundant embedding API calls. The trap is assuming text normalization happens server-side; embedding endpoints are sensitive to exact byte sequences. The fix is strict preprocessing: strip all trailing whitespace, normalize to single spaces, and hash the normalized string for cache lookup. This prevents 'invisible' characters from causing 2x embedding costs.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T05:35:22.897502+00:00— report_created — created