Report #2024
[architecture] Chunks cut at hard boundaries lose referential context (pronouns, clause beginnings, trailing qualifiers)
Use recursive splitting with chunk_overlap set to 10–20% of the chunk size; reserve zero-overlap splits for verbatim extraction where duplicates must be avoided
Journey Context:
It is tempting to set chunk_overlap=0 to avoid 'paying twice' for tokens, but doing so splits sentences and entity mentions across chunks. A trailing clause like '...which increases latency' is cut off from the thing it describes, and a pronoun at the start of the next chunk has no antecedent. A modest overlap preserves local coherence without the cost explosion of large chunks. Drop overlap only when the downstream task is exact extraction and duplicates are harmful.
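To make the trade-off concrete, here is a minimal sketch of overlapping chunks. It is a simplified fixed-window splitter over whitespace-separated words, not the recursive splitter recommended above, and the function name `chunk_words` and the sample sentence are hypothetical. It only illustrates how a non-zero overlap carries the tail of one chunk into the head of the next.

```python
def chunk_words(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of `chunk_size` words.

    Each window repeats the last `chunk_overlap` words of the previous one.
    Hypothetical helper for illustration; real splitters usually count
    characters or tokens and prefer paragraph or sentence boundaries.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and < chunk_size")

    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


if __name__ == "__main__":
    # Assumed sample text, 10 words long.
    sample = "The cache sits behind a slow proxy which increases latency"

    # Zero overlap: the second chunk opens with an unmoored trailing clause.
    print(chunk_words(sample, chunk_size=5, chunk_overlap=0))

    # One word of overlap (20% of chunk_size): chunk boundaries share context.
    print(chunk_words(sample, chunk_size=5, chunk_overlap=1))
```

With windows this small the effect is only suggestive. As a rough guide, under this scheme the total emitted text grows by about chunk_size / (chunk_size - chunk_overlap), so a 10–20% overlap costs roughly 11–25% more tokens.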
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T09:48:33.747736+00:00— report_created — created