[PATCH 0/3] Speed up notmuch new for unchanged directories
Sascha Silbe
sascha-ml-reply-to-2012-3 at silbe.org
Mon Jun 25 15:13:40 PDT 2012
Austin Clements <amdragon at MIT.EDU> writes:
> On Sun, 24 Jun 2012, Sascha Silbe <sascha-pgp at silbe.org> wrote:
["notmuch new" listing every directory, even if it's unchanged]
> I haven't looked over your patches yet, but this result surprises me.
> Could you explain your setup a little more? How much mail do you have
> and across how many directories? What file system are you using?
As mentioned in passing already, I have a total of about 900k unique
mails (sometimes in several copies, received over different paths,
e.g. a mailing list and a direct CC). Most of those are "old" mails, in
directories that are no longer being updated. If notmuch supported mbox,
I'd use that instead for those old mails. The total number of
directories in the mail store is about 29k and the total number of files
(including the git repository and mbox files that sup used) is about
1.25M.
Since a housekeeping job last weekend, the number of mails in
directories that are still getting updated is about 4k, i.e. about 5‰ of
the total number of mails or 3‰ of the total number of files. The number
of directories getting updated is 104, i.e. about 4‰ of the total number
of directories.
Ideally, we'd get the run-time of "notmuch new" down by a similar
factor. With just plain POSIX and no additional information that won't
be possible, but providing a way to channel information about updates
into notmuch (rather than having it scan everything over and over again)
should help. That information is already available as output from the
mail fetching process (rsync in my case). Of course, it would be purely
optional: "notmuch new" without additional information would simply
continue to scan everything.
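As a rough illustration of what "channelling" that information could
look like, here is a minimal Python sketch. It assumes the fetching
process can emit one changed path per line, relative to the mail store
(for rsync, something like --out-format='%n' should do that). The
function name changed_directories is made up for this example, and how
the resulting set would be handed to notmuch is deliberately left open,
since no such interface exists yet.

```python
import posixpath


def changed_directories(changed_paths):
    """Return the set of directories containing at least one changed path.

    changed_paths: iterable of paths relative to the mail store, one per
    entry, e.g. the lines printed by the mail fetching process
    (assumption: that is the only input we have).
    """
    dirs = set()
    for raw in changed_paths:
        path = raw.strip()
        if not path:
            continue  # skip blank lines in the fetcher's output
        # dirname() of "a/b/file" is "a/b"; for a directory reported with
        # a trailing slash ("a/b/") it is the directory itself.
        dirs.add(posixpath.dirname(path) or ".")
    return dirs
```

Only the directories in the returned set would need to be listed; all
others could be skipped without even a stat().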
> I'm also surprised that your new approach helps. This directory listing
> has to be read off disk one way or the other, but listing directories is
> the bread-and-butter of file systems, whereas I would think that Xapian
> would require more IO to accomplish the same effect.
"notmuch new" needs to iterate over a list of all directories to find
those with new mails (and potentially new subdirectories). However, it
does not need to list the *contents* of those folders. I'm surprised as
well, but rather in the opposite direction: Based on a naive
calculation, we'd expect to see a speedup on the order of
(1.25M+29k)/29k = 44. The actual results suggest that stat()ing (done
29k times both before and after the patch) is taking about 19 times as
long as listing a directory entry (before the patch we listed about
1.25M entries, now we list none if nothing has changed). (*)
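For those who would rather check the arithmetic in Python than in
maxima, a small sketch follows. The directory and file counts are the
ones given above; the factor 3.3 is the one used in the footnote's
equation. The variable names are mine, chosen for this example only.

```python
directories = 29_000      # stat() calls, same before and after the patch
files = 1_250_000         # directory entries listed before the patch
measured_speedup = 3.3    # factor used in the footnote's equation

# Naive expectation: every stat() and every listed entry costs the same.
naive_speedup = (files + directories) / directories

# Footnote equation, with x = cost of one stat() and y = cost of listing
# one entry:
#     directories*x + files*y = measured_speedup * directories*x
# Solving for x/y gives:
stat_to_list_ratio = files / ((measured_speedup - 1) * directories)

print(f"naive speedup:      {naive_speedup:.1f}")
print(f"stat/list cost:     {stat_to_list_ratio:.1f}")
```

This gives about 44 for the naive speedup and about 19 for the cost
ratio, matching the figures above.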
In practice, the speedup achieved by my patch is larger than what the
benchmark suggests because there are other processes running that use
RAM. If we need to read a lot from disk (like "notmuch new" did before
my patch), there's a good chance that data has already been evicted
from the cache since the last run. The less we need to read, the more
likely it is to still be in the cache. Similarly, reading lots of data
from disk
will displace other data in the cache. These effects are not covered by
the pure "hot cache" and "cold cache" timings.
> Does your patch win because you can specifically list subdirectories
> out of Xapian, making the IO proportional to the number of
> subdirectories instead of the number of subdirectories and files (even
> though the constant factors probably favor reading from the file
> system)?
It wins because the factor is the number of files in each directory, not
just some low constant based on file system overhead vs. Xapian
overhead.
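To make the difference concrete, here is a minimal Python sketch of the
general idea, not of the patch itself (which is C inside notmuch). It
assumes we already know every directory and the mtime recorded for it
at the last scan; the dict stored_mtimes is a hypothetical stand-in for
what would come out of the database. The cost is one stat() per
directory, and a directory's contents are only listed afterwards if it
shows up in the result.

```python
import os


def directories_to_rescan(stored_mtimes):
    """Return the directories whose mtime differs from the recorded one.

    stored_mtimes: dict mapping directory path -> mtime in nanoseconds as
    recorded at the previous scan (hypothetical stand-in for the
    database).
    """
    changed = []
    for path, old_mtime in stored_mtimes.items():
        try:
            current = os.stat(path).st_mtime_ns  # one stat() per directory
        except FileNotFoundError:
            changed.append(path)  # directory vanished: needs handling too
            continue
        if current != old_mtime:
            changed.append(path)  # only these get their contents listed
    return changed
```

With 29k directories of which 104 change, this means 29k stat() calls
plus listing 104 directories, instead of 29k stat() calls plus listing
all 1.25M entries.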
> I like the idea of these patches, I just want to make sure I have a firm
> grip on what's being optimized and why it wins.
Certainly a good idea. Thanks for taking the time!
Sascha
(*) float(linsolve([29000*x + 1250000*y = 3.3 * 29000*x], [x])); in
maxima, if you'd like to check the math.
--
http://sascha.silbe.org/
http://www.infra-silbe.de/