notmuch for documents

Jameson Rollins jrollins at finestructure.net
Sat Nov 6 13:12:17 PDT 2010


A little while ago on #notmuch, madduck and ojwb mentioned that they
thought notmuch was overly focused on mail.  At the time I thought this
was a silly criticism, and I defended notmuch on the grounds that it does
what it does *really* well and that we shouldn't expect it to be all
things to all people.

Yesterday, however, I had the profound realization that madduck and ojwb
are right.

Notmuch stores database entries for email messages.  However, these
messages are nothing more than simple rfc5322 [0] structured documents:
a set of headers followed by a text body.

Imagine now that I have a collection of ebooks, each stored in a single
rfc5322-formatted text file:

------------------------------------------------------------
From: Italo Calvino <italo at calvino.com>
Subject: If on a winter's night a traveler
Date: 1979

You are about to begin reading Italo Calvino's new novel,
...
------------------------------------------------------------

I store them all in a directory.  I now create a NOTMUCH_CONFIG with a
database.path that points to that directory, and run notmuch.  Notmuch
works *out of the box* (almost) perfectly to index my collection of
ebooks.  All the notmuch commands work exactly as expected.  I can
search through the bodies, search the titles, search for an author,
search for a publication date, etc.  The emacs interface even works as
expected.  Try it: it really works!  Only a couple of small things are
a little funky:

  * the "headers" in my ebooks aren't exactly intuitive ("From" instead
    of "Author", "Subject" instead of "Title", etc.) and there are some
    missing headers ("Publisher").  I also had to format some of them in
    a strange way (I had to add "<italo at calvino.com>" in the "From"
    field in order to get it to index properly for some reason).

  * The documentation keeps referring to "messages", even though my
    documents are books.  And there are some subcommands that don't seem
    to make sense ("reply" to a book?).

But that's it!  Everything else works as a perfect ebook indexer.  I can
of course even add tags to my books.  Beautiful.  It's really quite
incredible how well it works for this out of the box.  The only other
issue is that my ebooks don't come in rfc5322-formatted files.  I have
to translate them for notmuch to work.
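
A minimal sketch of that translation step, assuming the book's metadata
and text are already in hand.  It uses Python's standard email package
to write one rfc5322-style file per book, plus a config file whose
[database] path points at the directory.  The function names, file
names and config location are hypothetical, chosen for illustration
only:

```python
import configparser
from email.message import Message
from pathlib import Path


def book_to_rfc5322(author, address, title, date, text):
    """Return the book as headers, a blank line, then the body."""
    msg = Message()  # legacy API: header values are kept as plain strings
    msg["From"] = f"{author} <{address}>"  # "From" stands in for "Author"
    msg["Subject"] = title                 # "Subject" stands in for "Title"
    msg["Date"] = date
    msg.set_payload(text)
    return msg.as_string()


def write_library(book_dir, config_path, books):
    """Write one file per book and a config pointing at book_dir."""
    book_dir = Path(book_dir)
    book_dir.mkdir(parents=True, exist_ok=True)
    for filename, fields in books.items():
        (book_dir / filename).write_text(
            book_to_rfc5322(**fields), encoding="utf-8")

    # Only the database.path setting mentioned above is written here.
    config = configparser.ConfigParser(interpolation=None)
    config["database"] = {"path": str(book_dir)}
    with open(config_path, "w", encoding="utf-8") as fh:
        config.write(fh)
    return Path(config_path)
```

With NOTMUCH_CONFIG set to the returned path, running notmuch new
should then pick up the files in that directory.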

So what would have to be tweaked in notmuch to make it work even better
as an ebook indexer?

  * add some sort of translator to extract the "headers" and "body" from
    my non-rfc5322-formatted ebook files

  * allow me to specify which "headers" from my ebooks I want indexed
    ("Author", "Publisher", etc.)

  * tweak notmuch show to just open the ebook itself in an ebook reader
    instead of outputting it to stdout

  * tweak the documentation

Those are not very big changes.  And yet, with these changes notmuch
could work for *many* other large classes of structured documents.
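
To make the first two tweaks concrete, here is a rough sketch of what a
translator layer could look like.  Nothing here is existing notmuch
code: the Document type, the TRANSLATORS registry, the extract function
and the "Key: value" text format are all assumptions made up for
illustration.  Each file type gets a translator that returns headers and
a body, and the caller chooses which headers to keep for indexing:

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Dict, Iterable


@dataclass
class Document:
    headers: Dict[str, str]
    body: str


# Hypothetical registry: file suffix -> function that reads such a file.
TRANSLATORS: Dict[str, Callable[[Path], Document]] = {}


def translator(suffix):
    """Decorator that registers a translator for one file suffix."""
    def register(func):
        TRANSLATORS[suffix] = func
        return func
    return register


@translator(".txt")
def plain_text(path):
    """Assumed format: 'Key: value' lines, a blank line, then the body."""
    head, _, body = path.read_text(encoding="utf-8").partition("\n\n")
    headers = dict(
        line.split(": ", 1) for line in head.splitlines() if ": " in line)
    return Document(headers, body)


def extract(path, indexed_headers: Iterable[str]):
    """Translate a file and keep only the headers selected for indexing."""
    path = Path(path)
    doc = TRANSLATORS[path.suffix.lower()](path)
    wanted = {name.lower() for name in indexed_headers}
    kept = {k: v for k, v in doc.headers.items() if k.lower() in wanted}
    return Document(kept, doc.body)
```

Calling extract(path, ["Author", "Publisher"]) would hand the indexer
just those two headers and the body; supporting another class of
document would mean registering one more translator.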

Another real world example:

I have hundreds of scientific journal articles on my computer.  They are
all pdf files and each has a corresponding bibtex entry in a flat text
file.  If notmuch could read the headers from the bibtex file and the
body from the text in the pdf (extracted with ps2ascii, say), notmuch
would work *perfectly* as an indexer for my scientific journal articles.
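
As a rough illustration of the bibtex half, the sketch below pulls
fields out of a single entry and renames them to the mail-style headers
used earlier.  The regular expression is deliberately naive (one field
per line, no values spanning lines), and the names FIELD, HEADER_NAMES,
bibtex_to_headers and pdf_to_text are hypothetical.  The pdf_to_text
helper assumes a ps2ascii binary is installed and that it writes the
extracted text to stdout when given only an input file:

```python
import re
import subprocess

# One "name = {value}", "name = "value"" or "name = bare" field per line.
FIELD = re.compile(
    r'^[ \t]*(\w+)[ \t]*=[ \t]*(?:\{(.*)\}|"(.*)"|(\w+))[ \t]*,?[ \t]*$',
    re.MULTILINE,
)

# Assumed mapping from bibtex fields to the headers notmuch already indexes.
HEADER_NAMES = {"author": "From", "title": "Subject", "year": "Date"}


def bibtex_to_headers(entry):
    """Return a header dict for one bibtex entry given as a string."""
    headers = {}
    for name, braced, quoted, bare in FIELD.findall(entry):
        value = braced or quoted or bare
        header = HEADER_NAMES.get(name.lower(), name.capitalize())
        headers[header] = value
    return headers


def pdf_to_text(pdf_path):
    """Extract the body text of a pdf by shelling out to ps2ascii."""
    result = subprocess.run(
        ["ps2ascii", str(pdf_path)],
        capture_output=True, text=True, check=True)
    return result.stdout
```

A real implementation would more likely use a proper bibtex parser;
the point is only that the headers and the body can come from two
different files.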

So what do people think about this idea?  Does it make sense to look
into extending notmuch to handle non-mail documents?  We definitely
would *not* want to compromise notmuch as a mail indexer/reader.
Notmuch is the best damn mail system there ever was and we wouldn't want
to mess with that.  Does abstracting everything in notmuch from
"messages" to "documents" hurt it as a mail system?  What if just the
back-end were abstracted, to allow for different front-ends for
different classes of documents, i.e. "messages", "articles", "books",
"rss feeds", etc.?  Are there any big problems with this proposal that
I'm overlooking?

I'm very interested to hear what others think about this idea.

jamie.

[0] http://tools.ietf.org/html/rfc5322