regex [X-Z] with non-ascii char returns different results from (X|Y|Z)

yury.t tptlab at tuta.io
Wed Aug 21 05:58:04 PDT 2019


Some regular expression returns incorrect results if the pattern contains multibyte characters in square brackets.  The following bracket expression matches subjects not starting with `[1-9]` and returns more results than the parenthesis expression.

(Please note that digits are full width, unicode characters.)




    notmuch count -- 'subject:"/^[1-9]/"' # 961


    notmuch count -- 'subject:"/^(1|2|3|4|5|6|7|8|9)/"' # 32





Somehow non-ascii characters in brackets match with any characters start with same hex code point.  For example:





- [1] (U+FF11) is treated as [\x{F000}-\x{FFFF}]


- ^[倀] (U+5000), ^[啕] (U+5555) and ^[忿] (U+5fff) return same results since they are all "U+5xxx".


Without ^, their results are vary but still contain unrelated subjects.





And curly brackets for repetition also have weird behavior.


If there are two emails whose subject is (A) "1人" and (B) "12人":



- ^(1|2...|9)人 - match A, unmatch B (expected)


- ^(1|2...|9){2}人 - unmatch A, match B (expected)


- ^[1-9]人 and ^[1-9]{2}人 - unmatch both


- ^[1-9]{3}人, {4} and {5} - match A, unmatch B


- ^[1-9]{6}人, {7} and {8} - unmatch A, match B





As noted in manpage of notmuch-search-terms, I surely wrap regular expression with double quotes and entire query with single quotes.  I also increase/decrease $XAPIAN_CJK_NGRAM and rebuild index, but the situation won't change.







More information about the notmuch mailing list