Report #35653
[cost_intel] text-embedding-3-large truncation at 8k tokens causing silent data loss on long documents
Pre-chunk documents to at most 8191 tokens (not characters) before embedding; anything past that limit is truncated, and the loss goes unreported
Journey Context:
text-embedding-3 models truncate input at 8191 tokens. If you send a 50k-token document, it is silently truncated to the first ~8k tokens, so only the introduction is embedded. Queries then return poor matches, and you have to re-embed with proper chunking, paying for the embedding twice. Worse, developers often chunk by characters (e.g., 4000 chars) without realizing that tokens != characters: code can approach 1 character per token, but English prose averages ~4 characters per token, so a 4000-character chunk is only ~1000 tokens. That leaves plenty of headroom under the limit but produces many needlessly small chunks. The correct approach is tokenizer-aware chunking: split on token counts so that every chunk stays at or under the 8191-token limit, with overlap between consecutive chunks.
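A minimal sketch of the token-based chunking described above, in Python. The function names `chunk_token_ids` and `chunk_text` and the 200-token overlap default are illustrative assumptions, not part of any library. The tokenizer is passed in as a pair of callables so that the windowing logic stays independent of any particular tokenizer package.

```python
from typing import Callable, List, Sequence


def chunk_token_ids(
    token_ids: Sequence[int],
    max_tokens: int = 8191,   # input limit stated in this report
    overlap: int = 200,       # assumed default; tune for your corpus
) -> List[List[int]]:
    """Split token ids into windows of at most max_tokens tokens.

    Each window after the first repeats the last `overlap` tokens of the
    previous window.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")

    step = max_tokens - overlap
    chunks: List[List[int]] = []
    start = 0
    while start < len(token_ids):
        chunks.append(list(token_ids[start:start + max_tokens]))
        if start + max_tokens >= len(token_ids):
            break  # this window already reached the end of the document
        start += step
    return chunks


def chunk_text(
    text: str,
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
    max_tokens: int = 8191,
    overlap: int = 200,
) -> List[str]:
    """Tokenize, window by token count, and decode each window back to text."""
    windows = chunk_token_ids(encode(text), max_tokens, overlap)
    return [decode(ids) for ids in windows]
```

With the `tiktoken` package installed, one way to supply the callables is `enc = tiktoken.get_encoding("cl100k_base")` followed by `chunk_text(doc, enc.encode, enc.decode)`. Under these defaults a 50,000-token document yields 7 chunks rather than one truncated embedding. Splitting on raw token boundaries can cut through a word or sentence, so snapping boundaries to paragraph breaks may be worth adding.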
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T14:19:07.102485+00:00— report_created — created