regex [X-Z] with non-ascii char returns different results from (X|Y|Z)
yury.t
tptlab at tuta.io
Sat Aug 24 07:39:10 PDT 2019
Although this thread now might be offtopic, let me send a follow-up.
By searching with C-related terms, I found some articles about this issue. It seems to be a common problem with regex and multibyte characters in C (e.g. https://stackoverflow.com/a/15895746).
On Wed, Aug 21, 2019 at 12:58:04PM +0000, tptlab at tuta.io wrote:
> - [1] (U+FF11) is treated as [\x{F000}-\x{FFFF}]
Actually, it becomes [\xef\xbc\x91], a class of three separate bytes. That's why it matches U+Fxxx (whose UTF-8 encoding starts with \xef). And without ^, it matches a single byte in the middle of a character, for example U+4444 (\xe4\x91\x84) or U+5C11 (\xe5\xb0\x91).
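To illustrate the byte-level behaviour, here is a minimal Python sketch. It is not notmuch's code and does not call the C regex library; it only simulates a byte-wise bracket expression with Python's re module on bytes, on the assumption that the C matcher sees the pattern one byte at a time. The names byte_class, char_class and byte_hits are made up for this example.

```python
import re

# Byte-wise class: the three UTF-8 bytes of U+FF11 (ef bc 91) become
# three independent alternatives, which is the suspected behaviour.
byte_class = re.compile(b"[" + "\uff11".encode("utf-8") + b"]")

# Character-wise class for comparison: matches only U+FF11 itself.
char_class = re.compile("[\uff11]")


def byte_hits(ch):
    """Return the UTF-8 bytes of ch that the byte-wise class accepts."""
    return [hex(b) for b in ch.encode("utf-8") if byte_class.match(bytes([b]))]


if __name__ == "__main__":
    for ch in ["\uff11", "\uf000", "\u4444", "A"]:
        print(f"U+{ord(ch):04X}", ch.encode("utf-8").hex(" "),
              "byte-wise hits:", byte_hits(ch),
              "char-wise match:", bool(char_class.search(ch)))
```

In this simulation U+F000 is hit through its leading \xef byte and U+4444 through its middle \x91 byte, while the character-wise class matches only U+FF11.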
I'm not familiar with C and don't know whether pcre or \k solves this issue, but it might be hard to fix if the root cause is how C handles multibyte strings.