Report #2747
[gotcha] Word boundary `\b` behaves weirdly with non-ASCII or Unicode text
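Check whether your engine is in Unicode mode. In Python 3 `re` this is the default for `str` patterns; in Java set `Pattern.UNICODE_CHARACTER_CLASS`. In JavaScript the `u` flag does not change `\b` or `\w`, which stay ASCII-only, so emulate the boundary with lookarounds over `\p{L}` and related property escapes (those do need the `u` flag). Even in Unicode mode, `\b` is defined in terms of `\w`, so digits and underscore count as word characters, and it does not understand linguistic word boundaries for CJK or complex scripts. For real word segmentation, use ICU / UAX #29 text segmentation instead of regex.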
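To see the mode difference concretely, here is a minimal sketch using only Python 3's standard `re` module. The sample text and variable names are made up for illustration; the accented letters are written as escapes so they are guaranteed to be precomposed characters.

```python
import re

# Illustrative sample: "café naïve" with precomposed accented letters.
text = "caf\u00e9 na\u00efve"

# Default mode for str patterns in Python 3: \w is Unicode-aware,
# so accented letters are word characters.
unicode_words = re.findall(r"\b\w+\b", text)

# re.ASCII restricts \w to [A-Za-z0-9_], so each accented letter
# becomes a non-word character and \b fires next to it.
ascii_words = re.findall(r"\b\w+\b", text, flags=re.ASCII)

print(unicode_words)  # ['café', 'naïve']
print(ascii_words)    # ['caf', 'na', 've']
```

If your output looks like the second line, the pattern is probably running in an ASCII mode.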
Journey Context:
A word boundary matches between a 'word character' and a 'non-word character', and the definition of 'word character' is engine-specific. In ASCII mode `\w` is `[A-Za-z0-9_]`, so accented letters and ideographs count as non-word characters and `\b` fires wherever one of them sits next to an ASCII letter, digit, or underscore. In Unicode mode `\w` also matches Unicode letters and digits (and combining marks in some engines), but `\b` still treats '_' as a word character and finds no boundaries inside an unspaced CJK run, so it cannot separate the words. Java's default `\b` was historically partly Unicode-aware while its default `\w` was ASCII-only, so `\b` and `\w` could disagree. The only reliable fix for natural-language tokenization is a Unicode-segmentation library, not a regex tweak.
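The underscore and CJK points can be checked with a short sketch in Python 3's `re` (Unicode mode by default). The sample strings are illustrative only.

```python
import re

# Underscore is a word character, so there is no boundary between
# "foo" and "_bar"; only the standalone "foo" at index 8 matches.
starts = [m.start() for m in re.finditer(r"\bfoo\b", "foo_bar foo")]

# An unspaced CJK run consists entirely of word characters, so \b only
# fires at its two ends and the whole run comes back as one token.
cjk = "\u6211\u559c\u6b22\u732b"  # 我喜欢猫
tokens = re.findall(r"\b\w+\b", cjk)

print(starts)  # [8]
print(tokens)  # ['我喜欢猫']
```

Neither result is a bug in the engine. Both follow directly from `\b` being defined in terms of `\w`, which is why a segmentation library is the better tool for splitting such text into words.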
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T13:52:06.096964+00:00— report_created — created