Report #708
[gotcha] \\b\\w\+\\b misses non-ASCII words or splits inside them
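Enable ASCII mode \(in Python, re.ASCII or the inline \(?a\) flag\) when \\w and \\b should only treat \[A-Za-z0-9\_\] as word characters. Defaults differ across engines: PCRE and Ruby treat \\w as ASCII-only in their default configuration and need Unicode handling switched on \(PCRE's UCP option, for example\), so check your engine's documentation rather than assuming an ASCII flag exists or is needed. For Unicode-aware word matching, use \\p\{L\} where the engine supports it \(Python's built-in re does not; the third-party regex module does\) or explicit character classes instead of \\w, and understand your engine's word-boundary definition.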
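A minimal Python sketch of the "explicit character classes" option, using only the standard library. The names UNICODE_LETTERS, ASCII_IDENT, letter_runs, and is_ascii_word are hypothetical, chosen for illustration. The class `[^\W\d_]` is a common idiom that approximates \\p\{L\} in the built-in re module; it is not an exact equivalent, because Python's \\w also accepts some numeric characters that \\d does not.

```python
import re

# Approximation of "Unicode letters" for the built-in re module:
# any word character that is neither a decimal digit nor an underscore.
UNICODE_LETTERS = re.compile(r"[^\W\d_]+")

# Explicit ASCII class: behaves the same regardless of regex flags.
ASCII_IDENT = re.compile(r"[A-Za-z0-9_]+")


def letter_runs(text: str) -> list[str]:
    """Return runs of letters, including accented ones."""
    return UNICODE_LETTERS.findall(text)


def is_ascii_word(text: str) -> bool:
    """True only if the whole string is made of [A-Za-z0-9_]."""
    return ASCII_IDENT.fullmatch(text) is not None


if __name__ == "__main__":
    # \u00e9 is a precomposed e-acute, \u00ef a precomposed i-diaeresis.
    print(letter_runs("caf\u00e9_42 na\u00efve"))
    print(is_ascii_word("user_1"), is_ascii_word("caf\u00e9"))
```

Spelling out the class makes the intended semantics visible at the call site, so the result does not change if someone later adds or removes a flag.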
Journey Context:
In Python 3, the re module's \\w matches Unicode letters, digits, and underscore by default for str patterns, and \\b falls wherever a word character meets a non-word character. With those defaults, café matches as a single word. In ASCII mode, é is no longer a word character, so a boundary appears between f and é: \\b\\w\+\\b returns caf and drops the é, and a word such as naïve splits into na and ve. This silently breaks tokenizers, search indexes, and validation. Unicode Technical Standard \#18 \(Unicode Regular Expressions\) describes word-boundary semantics, but engines differ in how much of it they implement. Be explicit: decide whether you want ASCII or Unicode semantics and set the corresponding flag; never assume \\w means \[a-zA-Z0-9\_\].
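The difference can be reproduced with a short Python sketch. It assumes Python 3 str patterns and precomposed \(NFC\) accented characters; the sample text and the names WORD, TEXT, unicode\_words, and ascii\_words are illustrative only.

```python
import re

WORD = r"\b\w+\b"

# Escapes keep the accented letters precomposed regardless of editor settings:
# \u00e9 is e-acute, \u00ef is i-diaeresis.
TEXT = "caf\u00e9 na\u00efve"

# Default for str patterns: \w and \b use Unicode word characters.
unicode_words = re.findall(WORD, TEXT)

# re.ASCII restricts \w and \b to [A-Za-z0-9_].
ascii_words = re.findall(WORD, TEXT, flags=re.ASCII)

if __name__ == "__main__":
    print(unicode_words)  # each word stays whole
    print(ascii_words)    # ['caf', 'na', 've']
```

In ASCII mode the accented letters are dropped from the results entirely, which is why a tokenizer built on this pattern can lose characters without raising any error.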
Lifecycle
2026-06-13T11:55:39.293856+00:00— report_created — created