[PATCH v1 0/3] Improve the acquisition of text parts.

Mon Mar 14 04:49:36 PDT 2016

David Edmondson <dme at dme.org> writes:

> On Sun, Mar 13 2016, Mark Walters wrote:
>> However, it would be sensible to get testing in a greater variety of
>> charsets/encodings
>
> Agreed. Does anyone have suggestions on how we might achieve this? A
> corpus of mail that we could use?

Maybe the notmuch performance corpus, particularly the lkml sample.

grep -R charset= performance-test/corpus/mail/lkml | sed -e 's/^.*charset=//' -e 's/;.*//' -e 's/"//g' | tr '[A-Z]' '[a-z]' | sort -u

gives

euc-kr
gb2312
iso-2022-jp
iso-2022-jp-2
iso-8859-1
iso-8859-14
iso 8859-15
iso-8859-15
iso-8859-1
iso-8859-2
iso-8859-6
iso-8859-7
iso-8859-9
koi8-r
koi8-u
ks_c_5601-1987
shift_jis
unknown
unknown-8bit
us-ascii
utf8
utf-8
windows-1250
windows-1251
windows-1252
windows-1255
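
To see which of these charsets are common enough to matter for testing, the same pipeline can count occurrences instead of de-duplicating. This is only a sketch: charset_counts is a hypothetical helper name, not something shipped with notmuch, and it inherits the quirks of the grep above (any line containing "charset=" is counted, whether or not it is a real Content-Type header).

```shell
# Hypothetical helper: count how often each charset label appears
# under a directory, most frequent first.
charset_counts() {
    grep -R charset= "$1" |
        sed -e 's/^.*charset=//' -e 's/;.*//' -e 's/"//g' |
        tr 'A-Z' 'a-z' |
        sort | uniq -c | sort -rn
}

# Usage, assuming the corpus has been unpacked under performance-test:
#   charset_counts performance-test/corpus/mail/lkml
```

Rare labels at the bottom of that output would be the ones to pick out by hand as test messages.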

To unpack the corpus:

cd performance-test
make download-corpus
./T00-new.sh --large

You will probably want to interrupt the test once notmuch-new starts running.