Report #1169
[gotcha] Why doesn't the \b word boundary work with non-ASCII words like café or naïve?
In many engines, \b is defined in terms of \w, which defaults to the ASCII set [A-Za-z0-9_]. For Unicode text, build the boundary from Unicode property escapes and lookarounds, e.g. (?<![\p{L}\p{N}_])(?=[\p{L}\p{N}_]) for the start of a word and (?<=[\p{L}\p{N}_])(?![\p{L}\p{N}_]) for the end, or use a Unicode-aware regex library.
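A minimal sketch of the lookaround approach in Python, using only the standard library. The standard re module does not accept \p{...}, so this sketch substitutes Unicode-aware \w plus the Combining Diacritical Marks block for [\p{L}\p{N}_]. The names WORD_CHAR, WORD_START, WORD_END, and find_words are hypothetical, and the mark range is an assumption that will not cover every script.

```python
import re
import unicodedata

# Assumption for this sketch: a "word character" is anything matched by
# Unicode-aware \w, plus combining marks in U+0300..U+036F. Other mark
# ranges exist; a \p{M}-capable engine would cover them all.
WORD_CHAR = r"[\w\u0300-\u036f]"

# Start of word: no word character before, a word character after.
WORD_START = rf"(?<!{WORD_CHAR})(?={WORD_CHAR})"
# End of word: a word character before, no word character after.
WORD_END = rf"(?<={WORD_CHAR})(?!{WORD_CHAR})"


def find_words(text: str) -> list:
    """Return every run of word characters delimited by the custom boundaries."""
    return re.findall(rf"{WORD_START}{WORD_CHAR}+{WORD_END}", text)


# NFD splits U+00E9 into 'e' followed by the combining acute accent U+0301.
decomposed = unicodedata.normalize("NFD", "caf\u00e9 ouvert")

builtin_words = re.findall(r"\b\w+\b", decomposed)
custom_words = find_words(decomposed)

print(builtin_words)  # ['cafe', 'ouvert'] (the accent is cut off)
print(custom_words)   # ['cafe\u0301', 'ouvert'] (the accent stays attached)
```

The built-in \b reports a boundary between 'e' and the combining accent because \w does not match combining marks, while the explicit lookarounds keep the whole word together.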
Journey Context:
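Python's re module interprets \w according to the re.ASCII and re.UNICODE flags. In Python 3, str patterns are Unicode-aware by default, so \b already handles café; bytes patterns, patterns compiled with re.ASCII, and Python 2 patterns without re.UNICODE fall back to the ASCII set. Even in Unicode mode, \w does not match combining marks, so decomposed (NFD) text such as 'e' followed by U+0301 still gets a boundary inside the word. JavaScript's \b is always ASCII-only; the /u flag does not change it, but it does enable \p{...} property escapes, which you can combine with lookarounds to build your own boundary. Because 'é' and 'ï' are not in the ASCII \w set, an ASCII-only \b treats them as non-word characters and reports false boundaries in the middle of words. The robust fixes are explicit property escapes with lookarounds, available through the third-party regex module in Python (the standard re module does not support \p{...}) or through the /u flag in JavaScript.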
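The false boundaries described above can be reproduced in Python 3 by forcing ASCII mode with re.ASCII, which makes \w and \b behave as they do in an ASCII-only engine. This is a small demonstration, not a fix; the sample text and variable names are made up for illustration.

```python
import re

# Precomposed characters: U+00E9 (e with acute) and U+00EF (i with diaeresis).
text = "un caf\u00e9 na\u00efve"

# re.ASCII restricts \w, and therefore \b, to [A-Za-z0-9_].
ascii_words = re.findall(r"\b\w+\b", text, flags=re.ASCII)

# Python 3 str patterns are Unicode-aware by default.
unicode_words = re.findall(r"\b\w+\b", text)

# Whole-word search: in ASCII mode there is no boundary after the accented
# letter, because both it and the following space count as non-word characters.
ascii_whole = re.search(r"\bcaf\u00e9\b", text, flags=re.ASCII)
unicode_whole = re.search(r"\bcaf\u00e9\b", text)

print(ascii_words)    # ['un', 'caf', 'na', 've']
print(unicode_words)  # ['un', 'café', 'naïve']
print(ascii_whole)    # None
print(unicode_whole is not None)  # True
```

In ASCII mode the words are split at each accented letter and the whole-word search finds nothing, which matches the symptom reported for ASCII-only \b.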
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T18:55:10.664457+00:00— report_created — created