interaction between --format=raw and multipart handling [was: Re: Do not attempt to output part raw if part is not GMimePart.]

Mon Jun 27 19:12:30 PDT 2011

On Mon, Jun 27, 2011 at 6:41 PM, Daniel Kahn Gillmor
<dkg at fifthhorseman.net> wrote:
> On 06/27/2011 06:07 PM, Austin Clements wrote:
>> Oh, right, of course.  show_message_part will walk into the parts, so
>> format_part_content_raw will still be called on the leafs of a
>> requested multipart.  Though, this approach results in each leaf being
>> transfer decoded and printed individually, so if you ask for a
>> multipart, you won't get the "raw" contents of the multipart (unless
>> it's part 0), so much as you get the concatenated "raw" contents of
>> each part in the multipart.
>
>
> let's take two labeled examples:
>
> A└┬╴multipart/signed 58292 bytes
> B ├┬╴multipart/mixed 56553 bytes
> C │├╴text/plain 1278 bytes
> D │├╴text/plain attachment [grub-install.out] 54109 bytes
> E │└╴text/x-diff attachment [597538.patch] 496 bytes
> F └╴application/pgp-signature attachment [signature.asc] 900 bytes
>
>
> X└┬╴multipart/signed 3863 bytes
> Y ├╴text/plain 1857 bytes
> Z └╴application/pgp-signature attachment [signature.asc] 900 bytes
>
> (i know, you won't use "A" or "Z" as part IDs once we have hierarchical
> part numbers, but consider them placeholders).
>
> if parts F or Z are ever going to be useful (e.g. to some external
> process that wants to validate the signature by hand), then the tool
> needs to provide some way of producing parts B and Y in a pristine form
> (that is, including MIME headers and without interpreting/applying any
> transfer encodings).
>
> Perhaps this means there are two flavors of "raw" that we should be
> distinguishing, something like:
>
>  0) "source" -- the equivalent to viewing the source of the message,
> with headers and without attempting to reverse transfer-encodings, etc.
>
>  1) "rare" -- (not entirely raw, but still bloody, ha ha) strip headers,
> reverse transfer encodings, etc.
>
> I think our current implementation of --format=raw emits "source" when
> applied to the entire message, but "rare" when applied to one of the parts.

Yes.
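
To make the distinction concrete, here is a minimal sketch using
Python's standard email module (not notmuch's GMime-based code; the
names part_source and part_rare are made up for illustration).  Note
that as_bytes() re-serializes the parsed part, so it only approximates
"source": a real implementation would presumably have to slice the
original message bytes for the result to be usable in signature
verification.

```python
from email import message_from_bytes

# A tiny hand-written message: one multipart with a single base64 leaf.
RAW = b"""\
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="b"

--b
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: base64

aGVsbG8gd29ybGQK
--b--
"""


def part_source(part):
    """'source' flavor: MIME headers plus the still-encoded body.

    Assumption: re-serializing is close enough for illustration; it is
    not guaranteed to be byte-identical to the original message.
    """
    return part.as_bytes()


def part_rare(part):
    """'rare' flavor: headers stripped, transfer encoding reversed.

    For a multipart container get_payload(decode=True) returns None,
    i.e. there is no decoded form for non-leaf parts.
    """
    return part.get_payload(decode=True)


msg = message_from_bytes(RAW)
leaf = msg.get_payload()[0]

source = part_source(leaf)   # headers + base64 text
rare = part_rare(leaf)       # decoded bytes of the leaf
container_rare = part_rare(msg)
```

In this sketch the "rare" form of the multipart itself comes back as
None, which matches the suspicion that "rare" is not particularly
meaningful for non-leaf parts.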

> I'm suggesting that it might be useful to be able to get "source" of a
> part.  (and perhaps it might also be useful to get the whole message
> "rare" sometimes?)
>
> My first instinct was: if it's multipart, provide "source", if it's
> single-part, provide "rare".  But that fails for the XYZ case above --
> we'd need Y (which is single-part) to be provided as "source" if we were
> ever to be able to make use of Z on its own, so i don't think it'll be
> that simple.
>
> OTOH, i'm not sure that "rare" is particularly meaningful for non-leaf
> parts.
>
>> That if you ask for a multipart, you should effectively get a slice
>> out of the original message bytes (since multipart/* parts can't have
>> non-identity transfer encodings).  Are you also saying that should
>> extend to transfer encoded leaf parts, too?
>
> hmm.  is it true that multipart/* parts can't have non-identity transfer
> encodings?  that would simplify some things, but i don't have a
> reference handy that says it's the case.

RFC 2045, section 6.4: "If an entity is of type "multipart" the
Content-Transfer-Encoding is not permitted to have any value other
than "7bit", "8bit" or "binary"."  (And, for completeness, section
6.2: "The Content-Transfer-Encoding values "7bit", "8bit", and
"binary" all mean that the identity (i.e. NO) encoding transformation
has been performed.")
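
As a rough illustration of that rule, the following sketch (again
Python's standard email module, with hypothetical helper names) checks
whether a part uses an identity encoding and lists any multipart
containers that violate section 6.4.  It assumes the RFC 2045 default
of "7bit" when the header is absent and compares values
case-insensitively.

```python
from email import message_from_string

# RFC 2045, section 6.2: these three values all mean "no transformation".
IDENTITY_ENCODINGS = {"7bit", "8bit", "binary"}


def transfer_encoding(part):
    """Return the part's Content-Transfer-Encoding, lower-cased.

    A missing header is treated as "7bit" (the RFC 2045 default).
    """
    return str(part.get("Content-Transfer-Encoding", "7bit")).strip().lower()


def is_identity_encoded(part):
    """True if the part's body is stored without any transfer encoding."""
    return transfer_encoding(part) in IDENTITY_ENCODINGS


def multipart_violations(msg):
    """Content types of multipart parts that declare a non-identity encoding."""
    return [
        part.get_content_type()
        for part in msg.walk()
        if part.is_multipart() and not is_identity_encoded(part)
    ]


GOOD = """\
Content-Type: multipart/mixed; boundary="b"

--b
Content-Type: text/plain
Content-Transfer-Encoding: base64

aGVsbG8gd29ybGQK
--b--
"""

# Same message, but the container itself (wrongly) claims base64.
BAD = "Content-Transfer-Encoding: base64\n" + GOOD

good_msg = message_from_string(GOOD)
bad_msg = message_from_string(BAD)
```

If every multipart passes this check, then the "source" of a multipart
is just a slice of the original bytes, and only leaf parts can differ
between their encoded and decoded forms.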

> At any rate, i'm not sure it affects the need for being able to emit
> both "rare" and "source" forms of at least the leaf (non-multipart) parts.
>
> i hope this is all at least somewhat clarifying and not just adding to
> the confusion,

Thanks.  That's actually very informative and solidifies some of
what's been slowly coagulating in my mind.

I was also thinking about the two output variants you describe
(though, being less clever, I was thinking "raw" and "decoded").  The
fact that multipart/* parts can only have identity encodings makes me
wonder if the two could be merged by thinking of the decoded content
of a leaf part as a child/body to the original, encoded part.  On the
other hand, that doesn't make sense for other output formats, so
perhaps that's not a fruitful approach.