[PATCH v1 0/3] Improve the acquisition of text parts.

Sat Mar 26 02:18:20 PDT 2016

Hi

Sorry this email ended up rather long:

Summary: I have run a test (see below) on all of the lkml part of the
performance-corpus, and all the changes look expected. So this series
looks good to me.

First note how we do the bodypart insertion: for a MIME type of
text/plain we try the text/plain handler first, then a text/* handler,
and finally a */* handler, stopping at the first one that succeeds.
Before this series, when the part is application/octet-stream but is
detected as text/plain, the text/plain handler fails with a "bodypart
insertion error" because notmuch-get-bodypart-text can't get the text
(because it's not officially text). Thus we fall back on the */* handler
and that inserts the part.

With this series notmuch-get-bodypart-text succeeds and we stop.

Thus in most cases the only change is that we don't get a "bodypart
insertion error", but all the text looks the same. In a couple of cases
the text/plain handler wraps lines or replaces ^M by unix newlines,
whereas the */* handler does not. This is an improvement.

There is one more "difference" but I think this is actually something
random. Sometimes when the part is application/tar or application/zip I
get "Bodypart insert error: Symbol's function definition is void:
gnus-recursive-directory-files". If I load gnus this goes away. In my
first batch of tests this only occurred when using this series, but
since then I have reproduced it on mainline. I think something else I
did when setting up the test on mainline caused gnus to be loaded, but I
have not worked out what is going on there.

Finally, the test was as follows. I downloaded the performance corpus,
configured a separate notmuch config file to use the
performance-test/corpus/mail/lkml as the mailstore, went into
notmuch-emacs and to the inbox (which contained all messages) and ran
the following lisp function

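(defun my-save-all-show ()
  "Save the notmuch-show rendering of every thread in the search buffer.
Each thread is written to a file named OUTPUT-<thread-id>."
  (interactive)
  (goto-char (point-min))
  (let ((count 0))
    ;; Loop until there is no thread at point.
    (while (notmuch-search-find-thread-id)
      (let ((thread-id (notmuch-search-find-thread-id)))
        (setq count (1+ count))
        (message "Thread %s: %s" count thread-id)
        (notmuch-show thread-id)
        ;; Write the show buffer contents out without any conversion.
        (let ((text (buffer-string))
              (coding-system-for-write 'no-conversion))
          (with-temp-file (concat "OUTPUT-" thread-id) (insert text)))
        (kill-buffer))
      (notmuch-search-next-thread))))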

I moved the OUTPUT files elsewhere, repeated the run with this series
applied, and then ran diff on the two sets of output. This gave 7
threads with a change (each in a single message) out of the 16000
threads / 100000 messages. I looked at each of these individually, as
described above.
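
In case anyone wants to repeat the comparison, here is a minimal sketch
of the diff step. The directory names before/ and after/ and the
function name changed_threads are placeholders for illustration only;
the sketch just assumes the OUTPUT-* files from the two runs were moved
into two separate directories. It is not necessarily the exact command
I ran.

```shell
# Hypothetical layout: $1 holds the OUTPUT-* files from mainline,
# $2 holds the OUTPUT-* files produced with the series applied.
changed_threads() {
    # diff -rq prints one line per file that differs or that exists
    # on only one side. diff exits non-zero when it finds differences,
    # so "|| true" keeps the function usable under "set -e".
    LC_ALL=C diff -rq "$1" "$2" || true
}
```

Running "changed_threads before after" lists the threads to inspect by
hand, and a plain "diff -u before/FILE after/FILE" on any listed file
shows what changed.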

Best wishes

Mark

On Mon, 14 Mar 2016, David Bremner <david at tethera.net> wrote:
> David Edmondson <dme at dme.org> writes:
>
>> On Sun, Mar 13 2016, Mark Walters wrote:
>>> However, it would be sensible to get testing in a greater variety of
>>> charsets/encodings
>>
>> Agreed. Does anyone have suggestions on how we might achieve this? A
>> corpus of mail that we could use?
>
> Maybe the notmuch performance corpus, particularly the lkml sample.
>
> grep -R charset= performance-test/corpus/mail/lkml | sed -e 's/^.*charset=//' -e 's/;.*//' -e 's/"//g' | tr '[A-Z]' '[a-z]' | sort -u
>
> gives
>
> euc-kr
> gb2312
> iso-2022-jp
> iso-2022-jp-2
> iso-8859-1
> iso-8859-14
> iso 8859-15
> iso-8859-15
> iso-8859-1
> iso-8859-2
> iso-8859-6
> iso-8859-7
> iso-8859-9
> koi8-r
> koi8-u
> ks_c_5601-1987
> shift_jis
> unknown
> unknown-8bit
> us-ascii
> utf8
> utf-8
> windows-1250
> windows-1251
> windows-1252
> windows-1255
>
>
> to unpack the corpus
>
> cd performance-test
> make download-corpus
> ./T00-new.sh --large
>
> probably interrupt the test once notmuch-new starts running.