emacs complains about encoding?

Tomi Ollila tomi.ollila at iki.fi
Tue May 22 06:21:41 PDT 2012


Michal Sojka <sojkam1 at fel.cvut.cz> writes:

> Hello Adam,
>
> Adam Wolfe Gordon <awg+notmuch at xvx.ca> writes:
>> It turns out it's actually not the emacs side, but an interaction
>> between our JSON reply format and emacs.
>>
>> The JSON reply (and show) code includes part content for all text/*
>> parts except text/html. Because all JSON is required to be UTF-8, it
>> handles the encoding itself, puts UTF-8 text in, and omits a
>> content-charset field from the output. Emacs passes on the
>> content-charset field to mm-display-part-inline if it's available, but
>> for text/plain parts it's not, leaving mm-display-part-inline to its
>> own devices for figuring out what the charset is. It seems
>> mm-display-part-inline correctly figures out that it's UTF-8, and puts
>> in the series of ugly \nnn characters because that's what emacs does
>> with UTF-8 sometimes.
>>
>> In the original reply stuff (pre-JSON reply format) emacs used the
>> output of notmuch reply verbatim, so all the charset stuff was handled
>> in notmuch. Before f6c170fabca8f39e74705e3813504137811bf162, emacs was
>> using the JSON reply format, but was inserting the text itself instead
>> of using mm-display-part-inline, so emacs still wasn't trying to do
>> any charset manipulation. Using mm-display-part-inline is desirable
>> because it lets us handle non-text/plain (e.g. text/html) parts
>> correctly in reply, and makes the display more consistent (since we
>> use it for show). But, it leads to this problem.
>>
>> So, there are a couple of solutions I can see:
>>
>> 1) Have the JSON formats include the original content-charset even
>> though they're actually outputting UTF-8. Of the solutions I tried,
>> this is the best, even though it doesn't sound like a good thing to
>> do.
>>
>> 2) Have the JSON formats include content only if it's actually UTF-8.
>> This means that for non-UTF-8 parts (including ASCII parts), the emacs
>> interface has to do more work to display the part content, since it
>> must fetch it from outside first. When I tried this, it worked but
>> caused the \nnn to show up when viewing messages in emacs. I suspect
>> this is because it sets a charset for the whole buffer, and can't
>> accommodate messages with different charsets in the same buffer
>> properly. Reply works correctly, though.
>>
>> 3) Have the JSON formats include the charset for all parts, but make
>> it UTF-8 for all parts they include content for (since we're actually
>> outputting UTF-8). This doesn't seem to fix the problem, even though
>> it seems like it should.
>>
>> If no one has a better idea or a strong reason not to, I'll send a
>> patch for solution (1).
>
> Thank you very much for your analysis. It encouraged me to dig into the
> problem and I've found another solution, which might be better than
> those you suggested.
>
> I traced what Emacs does with the text inside
> notmuch-mm-display-part-inline and the wrong charset conversion happens
> deep in the elisp code, in mm-with-part, called by mm-get-part, which
> is in turn called by mm-inline-text. There is a way to make
> mm-inline-text not call mm-get-part at all: set the charset to
> 'gnus-decoded. This sounds like something that applies to our
> situation, where the part is already decoded.

You've dug deeper than I did... :)
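
For reference, if I read mm-view.el correctly, the branch Michal found
looks roughly like this (paraphrased from memory, so the details may be
off):

    ;; Inside mm-inline-text, for text parts (sketch, not exact code):
    (if (eq charset 'gnus-decoded)
        ;; The part is already decoded; insert the handle's buffer
        ;; contents verbatim, bypassing mm-get-part entirely.
        (mm-insert-inline handle
                          (with-current-buffer (mm-handle-buffer handle)
                            (buffer-string)))
      ;; Otherwise mm-get-part re-decodes the content via mm-with-part,
      ;; which is where our already-decoded UTF-8 text gets mangled.
      (mm-insert-inline handle (mm-get-part handle)))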

>
> The following patch (apply it with git am -c) solves the problem for
> me. However, I'm not sure it is a universal solution. It sets the
> charset only if it is not defined in the notmuch JSON output, and I'm
> not sure that this is correct. text/html parts seem to have the
> charset defined, but as you wrote, the JSON is always UTF-8, so it
> might be that we need 'gnus-decoded always, independently of the JSON
> output. What do you think?

No -- when non-inlined content is fetched by executing the command
notmuch show --format=raw --part=n --decrypt id:"<message-id>", the content
is received in its original charset -- and then the mm-* components need to
have the correct charset set (well, I think; I have not tested ;).
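
For illustration, the fetch-and-decode step would look something like
this hypothetical helper (names are made up; this is not the actual
notmuch-show code):

    (defun my-notmuch-get-part (message-id n charset)
      "Fetch raw part N of MESSAGE-ID and decode it using CHARSET."
      (with-temp-buffer
        ;; Read the raw bytes without any conversion...
        (let ((coding-system-for-read 'no-conversion))
          (call-process notmuch-command nil t nil
                        "show" "--format=raw" (format "--part=%d" n)
                        "--decrypt" (concat "id:" message-id)))
        ;; ...and decode them with the charset the message declares,
        ;; assuming names like "iso-8859-1" map to Emacs coding systems.
        (decode-coding-string (buffer-string)
                              (intern (downcase charset)))))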

Also, we cannot rely on the JSON output never containing content-charset
information in the future...

I'm currently applying this patch to my build tree whenever I rebuild notmuch for
my own use: id:"1337533094-5467-1-git-send-email-tomi.ollila at iki.fi"


I think the current plan is to use the same decoding lookup table in
reply that notmuch-show is already using. That is a good plan from a
consistency point of view. It just requires some code to be moved from
notmuch-show.el to some other file (maybe a new one).
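
Something along these lines, perhaps (all names here are placeholders I
made up, not the actual plan):

    ;; Hypothetical sketch of a shared charset-lookup helper; both
    ;; show and reply could call this instead of keeping private
    ;; copies of the table.
    (defvar notmuch-parts-charset-alist
      '(("us-ascii"   . us-ascii)
        ("utf-8"      . utf-8)
        ("iso-8859-1" . iso-8859-1))
      "Mapping from MIME charset names to Emacs coding systems.")

    (defun notmuch-parts-coding-system (charset)
      "Return the Emacs coding system for the MIME charset CHARSET.
    Fall back to `undecided' when the charset is unknown."
      (or (cdr (assoc (downcase charset) notmuch-parts-charset-alist))
          'undecided))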

> -Michal

Tomi


>
> ----8<-------
> diff --git a/emacs/notmuch-lib.el b/emacs/notmuch-lib.el
> index 7fa441a..8070f05 100644
> --- a/emacs/notmuch-lib.el
> +++ b/emacs/notmuch-lib.el
> @@ -244,7 +244,7 @@ the given type."
>  current buffer, if possible."
>    (let ((display-buffer (current-buffer)))
>      (with-temp-buffer
> -      (let* ((charset (plist-get part :content-charset))
> +      (let* ((charset (or (plist-get part :content-charset) 'gnus-decoded))
>              (handle (mm-make-handle (current-buffer) `(,content-type (charset . ,charset)))))
>         ;; If the user wants the part inlined, insert the content and
>         ;; test whether we are able to inline it (which includes both

