Report #562
[gotcha] \`\\b\` and \`\\w\` are ASCII-only by default in many regex engines and break non-English text
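Enable Unicode-aware matching where the engine offers it: PCRE \`UCP\`, or Python \`re.UNICODE\` \(already the default for \`str\` patterns in Python 3\) or the third-party \`regex\` module. In JavaScript the \`u\` flag enables property escapes such as \`\\p\{L\}\` but leaves \`\\w\` and \`\\b\` ASCII-only, so use \`\\p\{L\}\` instead of \`\\w\` there. For boundaries around non-ASCII words, use explicit lookarounds rather than \`\\b\`.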
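The lookaround approach can be sketched with Python's standard \`re\` module. This is a minimal sketch, not a definitive implementation: the \`WORD\` class and the \`find\_whole\_word\` helper are hypothetical names, and the class only adds the basic Combining Diacritical Marks block to \`\\w\`. With the third-party \`regex\` module, \`\[\\p\{L\}\\p\{M\}\]\` would be the more complete choice.

```python
import re

# Hypothetical word-character class: \w plus the Combining Diacritical Marks
# block (U+0300-U+036F). Assumption: this is enough for the text at hand;
# other scripts may need additional mark ranges.
WORD = r"[\w\u0300-\u036f]"


def find_whole_word(word, text):
    """Return (start, end) spans where `word` is not touching a WORD character."""
    pattern = rf"(?<!{WORD}){re.escape(word)}(?!{WORD})"
    return [m.span() for m in re.finditer(pattern, text)]


# "cafe" + combining acute accent (decomposed form), then a plain "cafe".
decomposed = "cafe\u0301 cafe"

# \b treats the combining mark as a non-word character, so it also reports
# a match inside the accented word.
naive_hits = [m.span() for m in re.finditer(r"\bcafe\b", decomposed)]

# The explicit lookarounds only accept the standalone word.
strict_hits = find_whole_word("cafe", decomposed)
```

Here \`naive\_hits\` contains two spans while \`strict\_hits\` contains only the second one, because the lookahead rejects a match that is immediately followed by a combining mark.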
Journey Context:
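In many engines \`\\w\` equals \`\[A-Za-z0-9\_\]\` unless Unicode mode is enabled, so \`\\b\` reports a boundary between \`f\` and \`é\` in \`café\` and tokens are split at accented characters. This silently corrupts tokenizers, search indexes, and slug generators for non-English content. The \`UCP\`/Unicode flag extends \`\\w\` to letters and digits from all scripts, though whether combining marks are included varies by engine. Unicode property escapes such as \`\\p\{L\}\` and \`\\p\{M\}\` match letters and marks explicitly and do not change what \`\\w\` means.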
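The splitting behaviour can be reproduced in Python by forcing ASCII semantics with \`re.ASCII\`. This is a small illustration rather than a statement about any particular engine's defaults; the sample text and variable names are made up for the example.

```python
import re

# Precomposed characters: U+00E9 (e with acute) and U+00EF (i with diaeresis).
text = "caf\u00e9 na\u00efve"

# re.ASCII restricts \w to [A-Za-z0-9_], so \b fires next to accented letters.
ascii_tokens = re.findall(r"\b\w+\b", text, flags=re.ASCII)

# Python 3 str patterns are Unicode-aware by default, so \w covers the accents.
unicode_tokens = re.findall(r"\b\w+\b", text)

print(ascii_tokens)    # ['caf', 'na', 've']
print(unicode_tokens)  # ['café', 'naïve']
```

The ASCII run drops the accented letters entirely and breaks each word into fragments, which is the silent corruption described above.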
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T09:54:23.184873+00:00— report_created — created