How does notmuch detect the presence of attachments?

Daniel Kahn Gillmor dkg at fifthhorseman.net
Thu Aug 25 07:21:21 PDT 2011


On 08/03/2011 06:01 AM, moabi2000 wrote:
> 1) How does notmuch detect the presence of attachments? I have some
> messages that have attachments (which I can see and open when reading
> the message), but for which the 'attachment' flag is not set (and
> therefore don't show up in a search like "from:myfriend AND
> attachment:pdf"). How can I try to work out what is going on?

According to lib/index.cc (around line 366 in the current version), the
tag "attachment" is added to an e-mail only if one of the MIME parts of
the message has an explicit "Content-Disposition: attachment" MIME
subheader.

So some mail clients may be attaching files with "Content-Disposition:
inline" (i do this sometimes when attaching text/* files) or without a
Content-Disposition: header on the MIME part at all.
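
To work out what a particular message is doing, one option is to list
the Content-Type and Content-Disposition of each MIME part.  Below is a
minimal sketch using Python's standard email module; it is not
notmuch's own code, and the function names (part_dispositions,
would_be_tagged) are made up for illustration.  It only mirrors the
rule described above: tag if some part says "attachment".

```python
from email import message_from_bytes, policy


def part_dispositions(raw_message):
    """Return (content_type, disposition, filename) for each leaf MIME part.

    disposition is "attachment", "inline", or None when the part carries
    no Content-Disposition header at all.
    """
    msg = message_from_bytes(raw_message, policy=policy.default)
    rows = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers hold no content of their own
        rows.append((part.get_content_type(),
                     part.get_content_disposition(),
                     part.get_filename()))
    return rows


def would_be_tagged(raw_message):
    """Approximate the rule above: any part with an explicit
    "Content-Disposition: attachment" earns the tag."""
    return any(disp == "attachment"
               for _ctype, disp, _name in part_dispositions(raw_message))
```

Feeding it the raw bytes of one of the problem messages (for example a
file path reported by "notmuch search --output=files") should show
whether the parts are marked "inline" or have no disposition at all.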

Perhaps notmuch could keep a (configurable?) list of Content-Types that
should be tagged with "attachment" no matter what Content-Disposition is
used?  I could imagine an initial list like:

 application/pdf
 application/vnd.oasis.opendocument.text
 application/vnd.oasis.opendocument.spreadsheet

Or maybe just any MIME part with "application" as the major
Content-Type?  That would be a relatively easy (though non-general)
heuristic to implement.  Want to take a crack at it?
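
To make the two heuristics concrete, here is a rough sketch in Python
rather than in notmuch's C++.  Everything in it is an assumption for
illustration: the names ALWAYS_ATTACHMENT_TYPES, looks_like_attachment
and message_has_attachment do not exist in notmuch, and the list simply
repeats the three types suggested above.

```python
from email import message_from_bytes, policy

# Hypothetical configurable list: types tagged regardless of disposition.
ALWAYS_ATTACHMENT_TYPES = frozenset({
    "application/pdf",
    "application/vnd.oasis.opendocument.text",
    "application/vnd.oasis.opendocument.spreadsheet",
})


def looks_like_attachment(part, always_types=ALWAYS_ATTACHMENT_TYPES,
                          tag_all_application=False):
    """Decide whether one leaf MIME part should count as an attachment."""
    # Current behaviour: an explicit "Content-Disposition: attachment".
    if part.get_content_disposition() == "attachment":
        return True
    # Proposed: a list of Content-Types that always count.
    if part.get_content_type() in always_types:
        return True
    # Proposed alternative: anything under the "application" major type.
    return tag_all_application and part.get_content_maintype() == "application"


def message_has_attachment(raw_message, **options):
    """True if any leaf part of the raw message looks like an attachment."""
    msg = message_from_bytes(raw_message, policy=policy.default)
    return any(looks_like_attachment(part, **options)
               for part in msg.walk() if not part.is_multipart())
```

One thing the sketch makes visible: the broad "application" rule would
also tag signed mail, since an OpenPGP signature travels as an
application/pgp-signature part.  That may be a reason to prefer the
explicit list.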

> 2) Is there an option for notmuch to also index the text of
> attachments (like recoll does, which also uses xapian)? People tend to
> save attachments with really useless filenames (report2.pdf...), what
> I'd like to be able to do is a search like "from:mycolleague AND
> attachment:pdf AND attachmentcontains:ourproject"

This is another great suggestion for improvement, i think.  There is
even a comment in the code (around the same part referenced above) that says:

	/* XXX: Would be nice to call out to something here to parse
	 * the attachment into text and then index that. */

A generic shim here, with a configurable index that associates
Content-Types with safe convert-to-text functions would be quite nice.

This would probably be a new section in ~/.notmuch-config,
[textconverters], where each key would be a specific Content-Type and
each value would be a command that reads the file on stdin and writes
plain text to index on stdout, like so:

 [textconverters]
 application/pdf=pdf2txt /dev/stdin

Starting with an initially empty set of textconverters seems reasonable
and safe to me, and people could set up their own if they're interested.
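
As a rough illustration of how such a shim might behave, here is a
small Python sketch.  Nothing in it exists in notmuch: the
[textconverters] section is only the proposal above, the function names
load_converters and convert_to_text are hypothetical, and the 30-second
timeout is an arbitrary choice.  It looks up the command for a
Content-Type, feeds the attachment bytes to it on stdin, and returns
whatever plain text comes back on stdout.

```python
import configparser
import shlex
import subprocess


def load_converters(config_text):
    """Parse a (proposed) [textconverters] section into {content_type: argv}."""
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    parser.read_string(config_text)
    if not parser.has_section("textconverters"):
        return {}  # empty by default, as suggested above
    # configparser lowercases keys, which suits case-insensitive Content-Types.
    return {ctype: shlex.split(command)
            for ctype, command in parser.items("textconverters")}


def convert_to_text(converters, content_type, payload, timeout=30):
    """Run the converter for content_type on payload (bytes).

    Returns the converter's stdout as text, or None when no converter is
    configured or the command fails, so the caller can skip indexing.
    """
    argv = converters.get(content_type.lower())
    if not argv:
        return None
    try:
        # No shell is involved: argv is executed directly.
        result = subprocess.run(argv, input=payload,
                                stdout=subprocess.PIPE,
                                stderr=subprocess.DEVNULL,
                                timeout=timeout, check=False)
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        return None
    return result.stdout.decode("utf-8", errors="replace")
```

Treating every failure as "nothing to index" keeps a broken or missing
converter from aborting the indexing of the rest of the message.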

You'd need to re-index your message store after modifying the config,
though, if you wanted to have pre-existing messages get indexed this
way.  Is there a way to tell notmuch to re-index a particular message?

The above proposal isn't implemented at all, i'm just throwing it out
for consideration.

	--dkg
