[PATCH] test: Add test for searching of uncommonly encoded messages

Serge Z triumhiz at yandex.ru
Thu Feb 23 23:57:00 PST 2012


Quoting Michal Sojka (2012-02-24 11:00:02)
>On Fri, 24 Feb 2012, Serge Z wrote:
>> 
>> Quoting Michal Sojka (2012-02-24 04:33:15)
>> >Emails that are encoded differently than as ASCII or UTF-8 are not
>> >indexed properly by notmuch. It is not possible to search for non-ASCII
>> >words within those messages.
>> 
>> Ok. But we can preprocess each incoming message right after 'getmail' to
>> convert it from html to text and to utf8 encoding. One solution is to create a
>> separate script for this and make getmail pipe all messages to this script, and
>> then to notmuch. But it would be better if the maildir contained original
>> messages only, so the question is: can we make the notmuch indexing engine
>> index a preprocessed message while the maildir keeps the original message as
>> it was obtained?
>
>Hi,
>
>I'm not a big fan of adding a "preprocessor". First, I think that both
>reasons you mention are actually bugs and it would be better to fix them
>for everybody than to require each user to configure some preprocessor.
>Second, depending on what your preprocessor does and how, the initial
>mail indexing could be much slower, which is also not something that
>people want.
>
>Do you have any other use case for the preprocessor besides utf8 and
>html->text conversions?
>
>Cheers,
>-Michal

Well, I don't want to add any external preprocessor either.

This may be considered an architectural decision: the search engine should not
access messages directly, but through some preprocessing layer which would
handle the case of different encodings in body and headers, RFC2047-encoded
headers (if this is not handled yet) etc.

Anyway, imho it would be nice for this solution to live inside a separate
library which would be useful for notmuch clients as well as other mail
indexing engines. Or we should look for an existing library that does this.
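
To illustrate the kind of layer I mean, here is a rough sketch in Python using
only the standard library. It is not notmuch code and says nothing about how
notmuch parses mail internally; the function name text_for_indexing and the
choice of headers are made up for the example. The file in the maildir is only
read, never rewritten; the indexer would be handed the decoded text instead.

```python
import email
from email import policy
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def text_for_indexing(raw_bytes):
    """Return (headers, body) as Unicode strings for a raw message.

    Hypothetical helper: the name and the returned shape are assumptions
    made for this sketch. The input bytes are not modified.
    """
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)

    # With policy.default, RFC 2047 encoded words are decoded on access.
    headers = {name: str(msg.get(name, "")) for name in ("From", "To", "Subject")}

    # Prefer a text/plain part and fall back to text/html.
    part = msg.get_body(preferencelist=("plain", "html"))
    if part is None:
        return headers, ""

    # get_content() decodes the payload using the part's declared charset.
    body = part.get_content()
    if part.get_content_type() == "text/html":
        extractor = _TextExtractor()
        extractor.feed(body)
        body = " ".join(extractor.chunks)
    return headers, body
```

With something like this, a koi8-r message would reach the indexer as ordinary
Unicode text, so searching for non-ASCII words would not depend on the
encoding the sender happened to use.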


