regex [X-Z] with non-ascii char returns different results from (X|Y|Z)

yury.t tptlab at tuta.io
Sat Aug 24 07:39:10 PDT 2019


Although this thread might now be off-topic, let me send a follow-up.
Searching with C-related terms, I found some articles about this issue.  It seems to be a common problem with regex + multibyte strings in C (e.g. https://stackoverflow.com/a/15895746).

On Wed, Aug 21, 2019 at 12:58:04PM +0000, tptlab at tuta.io wrote:
> - [1] (U+FF11) is treated as [\x{F000}-\x{FFFF}]

Actually, it becomes [\xef\xbc\x91], i.e. a class of three separate bytes.  That's why it matches U+Fxxx (whose UTF-8 encoding starts with \xef).  And without ^, it matches a single byte inside a character, for example U+4444 (\xe4\x91\x84) or U+5C11 (\xe5\xb0\x91).
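
To make the byte-level behaviour concrete, here is a small sketch in Python (not the C code path that notmuch actually uses, so treat it only as an illustration).  It assumes that a byte-oriented engine sees the UTF-8 encoding of U+FF11 inside the brackets as three independent bytes.  The helper names matches_bytewise and matches_charwise are made up for this example.

```python
import re

# Byte-oriented view: the UTF-8 bytes of U+FF11 (EF BC 91) inside the brackets
# become a class of three single-byte alternatives.
byte_class = re.compile(b"[" + "\uff11".encode("utf-8") + b"]")

# Character-oriented view: the class contains exactly one code point, U+FF11.
char_class = re.compile("[\uff11]")


def matches_bytewise(text: str) -> bool:
    """Search the UTF-8 bytes of text for any of the bytes EF, BC or 91."""
    return byte_class.search(text.encode("utf-8")) is not None


def matches_charwise(text: str) -> bool:
    """Search text for the code point U+FF11 itself."""
    return char_class.search(text) is not None


if __name__ == "__main__":
    for ch in ["\uff11", "\uf000", "\u4444", "\u5c11", "a"]:
        encoded = ch.encode("utf-8").hex(" ")
        print(f"U+{ord(ch):04X} ({encoded}): "
              f"bytewise={matches_bytewise(ch)} charwise={matches_charwise(ch)}")
```

With the byte-oriented pattern, U+F000 matches through its leading \xef byte, and U+4444 and U+5C11 match through the \x91 byte in their encodings.  The character-oriented pattern matches only U+FF11.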

I'm not familiar with C and don't know whether pcre or \k solves this issue, but it might be hard to fix if the root cause is how C handles multibyte strings.

