Report #275
[gotcha] Regex counts code points/units, not user-perceived characters, breaking emoji and accented text
For character-level operations (length, indexing, truncation, validation), operate on grapheme clusters, not on single code-point regex matches. Use a Unicode segmentation facility (the built-in Intl.Segmenter in JS; the third-party regex module with \X, or the grapheme package, in Python) or implement the grapheme cluster boundary rules from UAX #29.
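As a rough illustration of working at the grapheme level, the Python sketch below groups code points into approximate clusters using only the standard library. It is a deliberate simplification and not the UAX #29 algorithm: it only attaches combining marks, variation selectors, and ZWJ-joined code points to the preceding cluster, and it ignores Hangul syllables, regional-indicator flags, emoji skin-tone modifiers, and CR LF. The names approx_graphemes and truncate are hypothetical, chosen for this example. For real input, one of the libraries named above is the safer choice.

```python
import unicodedata
from typing import List

ZWJ = "\u200d"  # ZERO WIDTH JOINER


def approx_graphemes(text: str) -> List[str]:
    """Split text into APPROXIMATE grapheme clusters (not full UAX #29)."""
    clusters: List[str] = []
    for ch in text:
        extends_previous = bool(clusters) and (
            unicodedata.category(ch).startswith("M")  # combining mark or variation selector
            or ch == ZWJ                               # the joiner attaches to the current cluster
            or clusters[-1].endswith(ZWJ)              # code point right after a joiner
        )
        if extends_previous:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def truncate(text: str, limit: int) -> str:
    """Keep at most `limit` approximate clusters, never splitting one."""
    return "".join(approx_graphemes(text)[:limit])


word = "cafe\u0301!"                   # "café!" with a decomposed é
print(len(word))                       # 6 code points
print(len(approx_graphemes(word)))     # 5 clusters
print(truncate(word, 4) == "cafe\u0301")  # True: the accent stays with its base
print(word[:4] == "cafe")              # True: plain slicing drops the accent
```

Counting and slicing the cluster list, rather than the string itself, is what keeps a base character together with its marks.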
Journey Context:
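A 'character' in regex terms is usually a code point (or, in JavaScript without the u flag, a UTF-16 code unit), but users perceive a base character plus its combining marks as one character. 'é' can be U+00E9 or U+0065 U+0301; a single dot matches only one code point, so it consumes the first form whole but only the 'e' of the second. Emoji such as 👨‍👩‍👧‍👦 (U+1F468 U+200D U+1F469 U+200D U+1F467 U+200D U+1F466) are several code points joined by ZWJ. /^.{10}$/ can therefore accept ten bare combining marks, or a string that ends partway through an emoji sequence. This silently corrupts text in validation, truncation, and display. The correct model for user-perceived characters is the extended grapheme cluster, whose boundaries are defined by UAX #29.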
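The mismatch described above can be reproduced with Python's standard re module, which matches by code point. This is a minimal demonstration; the variable names are chosen for this example only.

```python
import re

precomposed = "\u00e9"   # é as one code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"  # ZWJ family emoji

one_char = re.compile(r"^.$")
ten_chars = re.compile(r"^.{10}$")

# Both strings render as "é", yet they differ in code-point length.
print(len(precomposed), len(decomposed))      # 1 2

# A single dot accepts one form and rejects the other.
print(bool(one_char.match(precomposed)))      # True
print(bool(one_char.match(decomposed)))       # False

# Ten bare combining marks pass a "ten characters" check.
print(bool(ten_chars.match("\u0301" * 10)))   # True

# One visible emoji is seven code points; slicing can cut it apart.
print(len(family))                            # 7
print(family[:2] == "\U0001F468\u200D")       # True: a man plus a dangling joiner
```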
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T02:39:18.966819+00:00— report_created — created