[PATCH 0/3] Speed up notmuch new for unchanged directories

Mon Jun 25 18:47:54 PDT 2012

Quoth Sascha Silbe on Jun 26 at 12:13 am:
> Austin Clements <amdragon at MIT.EDU> writes:
> > On Sun, 24 Jun 2012, Sascha Silbe <sascha-pgp at silbe.org> wrote:
> 
> ["notmuch new" listing every directory, even if it's unchanged]
> > I haven't looked over your patches yet, but this result surprises me.
> > Could you explain your setup a little more?  How much mail do you have
> > and across how many directories?  What file system are you using?
> 
> As mentioned in passing already, I have a total of about 900k unique
> mails (sometimes several copies of them, received over different paths,
> e.g. mailing list and a direct CC). Most of that is "old" mails, in
> directories that are not getting updated. If notmuch would support mbox,
> I'd use that instead for those old mails. The total number of
> directories in the mail store is about 29k and the total number of files
> (including the git repository and mbox files that sup used) is about
> 1.25M.
> 
> Since a housekeeping job last weekend, the number of mails in
> directories that are still getting updated is about 4k, i.e. about 5‰ of
> the total number of mails or 3‰ of the total number of files. The number
> of directories getting updated is 104, i.e. about 4‰ of the total number
> of directories.
> 
> Ideally, we'd get the run-time of "notmuch new" down by a similar
> factor. With just plain POSIX and no additional information that won't
> be possible, but providing a way to channel information about updates
> into notmuch (rather than having it scan everything over and over again)
> should help. That information is already available as output from the
> mail fetching process (rsync in my case). Of course, it would be purely
> optional: "notmuch new" without additional information would simply
> continue to scan everything.

This would be great.  I've been thinking along similar lines for a
while (in my case, I want to feed notmuch new from inotify), though I
haven't written any code for it.

> > I'm also surprised that your new approach helps.  This directory listing
> > has to be read off disk one way or the other, but listing directories is
> > the bread-and-butter of file systems, whereas I would think that Xapian
> > would require more IO to accomplish the same effect.
> 
> "notmuch new" needs to iterate over a list of all directories to find
> those with new mails (and potentially new subdirectories). However, it
> does not need to list the *contents* of those folders. I'm surprised as
> well, but rather in the opposite direction: Based on a naive
> calculation, we'd expect to see a speedup on the order of
> (1.25M+29k)/29k = 44. The actual results suggest that stat()ing (done
> 29k times both before and after the patch) is taking about 19 times as
> long as listing a directory entry (before the patch we listed 1M
> entries, now we list none if nothing has changed). (*)

For a cold cache, these aren't the numbers that matter.  With an HDD
and how few files your directories contain on average, only seeks will
matter.  I would expect your workload without your patch to have at
least 1 but closer to 2 seeks per directory: one to stat the directory
and one to get the directory contents block.  Some of the stat seeks
will be eliminated by the buffer cache, even starting cold, because of
inode locality (absolute best case is 16x reduction, but if you
created the directories over time, then this locality is probably
quite poor).  There are a few other potential seeks to get the
directory document from Xapian and to get its mtime value, but those
should exhibit strong locality, so they probably don't contribute
much.  NewEgg says your drive has an average seek time of 8.9ms, so
with 29k directories and assuming your directories are sequential on
disk, that's at least 258s (one seek per directory) and closer to
516s (two seeks per directory), which agrees with your benchmark
results.
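
To make the arithmetic behind that estimate explicit, here is a rough
sketch in Python.  It assumes the scan is entirely seek-bound and uses
only the two figures quoted above (8.9ms average seek, 29k
directories); the function and constant names are made up for
illustration and have nothing to do with notmuch itself.

```python
# Back-of-the-envelope estimate of a cold-cache scan, counting seeks only.
# Assumption: transfer time and CPU time are negligible next to seek time.

AVG_SEEK_S = 8.9e-3   # average seek time of the drive, in seconds
NUM_DIRS = 29_000     # number of directories in the mail store


def scan_time(seeks_per_dir, num_dirs=NUM_DIRS, seek_s=AVG_SEEK_S):
    """Return seconds spent seeking if each directory costs seeks_per_dir seeks."""
    return seeks_per_dir * num_dirs * seek_s


low = scan_time(1)    # one seek per directory: the stat only
high = scan_time(2)   # two seeks per directory: stat + directory contents block

print(f"lower bound: {low:.0f}s, upper estimate: {high:.0f}s")
```

Any stat seeks absorbed by the buffer cache would pull the real number
down from the upper estimate toward the lower bound.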

I'm surprised by your results because I would expect your workload
with your patches to exhibit about the same number of seeks: one to
stat the directory (same as before) and one for
notmuch_directory_get_child_files, which has to seek in the term index
to get the child directories.  My guess is that this exhibits better
locality because the child directory terms are stored contiguously in
the database's key space (though not necessarily sequentially on disk
unless this is a fresh database).

Unfortunately, I'm not sure of a good way to test this hypothesis.
Any thoughts?