[PATCH v1 0/3] Improve the acquisition of text parts.
David Bremner
david at tethera.net
Mon Mar 14 04:49:36 PDT 2016
David Edmondson <dme at dme.org> writes:
> On Sun, Mar 13 2016, Mark Walters wrote:
>> However, it would be sensible to get testing in a greater variety of
>> charsets/encodings
>
> Agreed. Does anyone have suggestions on how we might achieve this? A
> corpus of mail that we could use?
Maybe the notmuch performance corpus, particularly the lkml sample.

    grep -R charset= performance-test/corpus/mail/lkml | sed -e 's/^.*charset=//' -e 's/;.*//' -e 's/"//g' | tr '[A-Z]' '[a-z]' | sort -u

gives
euc-kr
gb2312
iso-2022-jp
iso-2022-jp-2
iso-8859-1
iso-8859-14
iso 8859-15
iso-8859-15
iso-8859-1
iso-8859-2
iso-8859-6
iso-8859-7
iso-8859-9
koi8-r
koi8-u
ks_c_5601-1987
shift_jis
unknown
unknown-8bit
us-ascii
utf8
utf-8
windows-1250
windows-1251
windows-1252
windows-1255
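
The list above only shows which charsets occur, not how often. As a rough sketch of how one might rank them by frequency (to decide which to test first), the same pipeline can end in `uniq -c` instead of `sort -u`. The function name `charset_counts` is made up for illustration and is not part of notmuch; like the original command, it matches `charset=` case-sensitively and counts matching lines, not messages.

```shell
# Hypothetical helper: count lines mentioning each charset under directory $1.
# Same extraction as the command above, but tallying instead of de-duplicating.
charset_counts() {
    grep -R charset= "$1" |
        sed -e 's/^.*charset=//' -e 's/;.*//' -e 's/"//g' |
        tr '[:upper:]' '[:lower:]' |
        sort | uniq -c | sort -rn
}
```

Assuming the corpus has been unpacked, it could be run as `charset_counts performance-test/corpus/mail/lkml`.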
To unpack the corpus:

    cd performance-test
    make download-corpus
    ./T00-new.sh --large

You will probably want to interrupt the test once notmuch-new starts running.