Mail archives in Git using ssoma
Eric Wong
e at 80x24.org
Sun Aug 21 14:14:55 PDT 2016
"W. Trevor King" <wking at tremily.us> wrote:
> On Sun, Aug 21, 2016 at 06:37:04PM +0000, Eric Wong wrote:
> > Btw, for public-inbox, I'm using git-fast-import now, so imports are
> > a bit faster and $GIT_DIR/ssoma.index is no longer used. This was
> > crucial for getting git at vger archives imported in a reasonable time.
>
> ssoma-mda imports 22k notmuch messages in around 15 minutes (with
> profiling enabled), and:
In contrast, git at vger is around 300K messages. LKML is well
into the millions, and I hope public-inbox (and git!) can handle
that one day, even on cheap hardware (haven't tried).
One problem I noticed with ssoma-mda is that it gets slower as
more messages get imported, since all those files sit in the
index, and the git index format is bad for incremental updates
with big, flat trees. Big trees are a general problem with git:
I'm now storing blob IDs directly in Xapian and will be
using them more to avoid tree lookups. Tree creation and
lookups degrade the same way the index does as trees
get bigger.
Currently the tree layout uses a 2/38 split of the SHA-1, like
git loose objects; a goal might be to move towards supporting
2/2/36 (or deeper), as Jeff noted substantial object traversal
improvements:
https://public-inbox.org/git/20160805092805.w3nwv2l6jkbuwlzf@sigill.intra.peff.net/
Of course, support for 2/38 will be retained for old
archives/messages.
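
To make the two layouts concrete, here is a minimal sketch of how a 40-character hex ID maps to a path under 2/38 and under 2/2/36. The function name fanout_path and its widths parameter are made up for illustration and are not part of ssoma or public-inbox; the hex string is only an example value.

```python
def fanout_path(hex_id, widths=(2,)):
    """Split a hex object ID into directory components.

    widths=(2,) gives the 2/38 layout; widths=(2, 2) gives 2/2/36.
    (Hypothetical helper, not taken from ssoma or public-inbox.)
    """
    parts = []
    pos = 0
    for w in widths:
        parts.append(hex_id[pos:pos + w])
        pos += w
    parts.append(hex_id[pos:])  # whatever is left becomes the file name
    return "/".join(parts)


# Any 40-character hex string works here; this one is only an example.
example = "da39a3ee5e6b4b0d3255bfef95601890afd80709"
print(fanout_path(example))          # 2/38 layout
print(fanout_path(example, (2, 2)))  # 2/2/36 layout
```

With 2/38 there are at most 256 top-level trees, so each one keeps growing with the archive; the extra level in 2/2/36 spreads the same messages over up to 65536 smaller trees.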
> $ python -m cProfile -o profile import.py notmuch.mbox
> $ python -c "import pstats; p=pstats.Stats('profile'); p.sort_stats('cumulative').print_stats(10)"
> Sun Aug 21 12:56:49 2016 profile
>
> 101823722 function calls (99078415 primitive calls) in 885.069 seconds
>
> Ordered by: cumulative time
> List reduced from 1145 to 10 due to restriction <10>
>
> ncalls tottime percall cumtime percall filename:lineno(function)
> 70/1 0.002 0.000 885.069 885.069 {built-in method exec}
> 1 0.111 0.111 885.069 885.069 /home/wking/src/notmuch/notmuch-archives.git/import.py:9(<module>)
> 1 0.400 0.400 884.915 884.915 /home/wking/src/notmuch/notmuch-archives.git/import.py:17(import_mbox)
> 22875 0.601 0.000 863.371 0.038 /home/wking/src/notmuch/notmuch-archives.git/ssoma_mda.py:362(deliver)
> 22875 8.943 0.000 810.459 0.035 /home/wking/src/notmuch/notmuch-archives.git/ssoma_mda.py:207(append)
> 22875 0.418 0.000 308.353 0.013 /home/wking/.local/lib64/python3.4/site-packages/pygit2/index.py:146(write_tree)
> 22875 307.855 0.013 307.855 0.013 {built-in method git_index_write_tree}
> 22874 0.575 0.000 279.293 0.012 /home/wking/.local/lib64/python3.4/site-packages/pygit2/index.py:238(diff_to_tree)
> 22874 278.501 0.012 278.501 0.012 {built-in method git_diff_tree_to_index}
It looks like writing the index is already the slowest part
here in terms of total time, too. It might be interesting if you
profiled each *-mda invocation to see the degradation from the
first to last message.
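
One way to look for that degradation, sketched under assumptions: time each delivery and average per bucket of messages, so a slowdown shows up as rising means. The deliver argument stands in for whatever callable performs one delivery; the names time_deliveries and bucket are invented for this example.

```python
import time


def time_deliveries(deliver, messages, bucket=1000):
    """Call deliver(msg) for each message and return the mean
    wall-clock seconds per call for each bucket of messages.

    `deliver` is a placeholder for the real per-message delivery
    function (an assumption; pass in whatever does the import).
    """
    means = []
    total = 0.0
    count = 0
    for msg in messages:
        start = time.perf_counter()
        deliver(msg)
        total += time.perf_counter() - start
        count += 1
        if count == bucket:
            means.append(total / count)
            total, count = 0.0, 0
    if count:  # partial final bucket
        means.append(total / count)
    return means
```

If the per-bucket means climb steadily from the first bucket to the last, the cost per message depends on archive size rather than on fixed overhead.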
> 22875 0.088 0.000 80.413 0.004 /home/wking/.local/lib64/python3.4/site-packages/pygit2/index.py:99(read)
>
> 38 ms per ssoma delivery is probably fast enough, especially if you
Not even close for me :)
> are invoking ssoma-mda once per message, since process setup will take a similar amount of time:
>
> $ time python -c 'print("hello")'
> hello
>
> real 0m0.016s
> user 0m0.013s
> sys 0m0.003s
>
> It's possible that fast-import would shave a few ms off the pygit2
> addition (I'm not sure, and maybe pygit2 is faster than fast-import).
> But I doubt it matters enough either way to be worth changing unless
> you are dealing with a really large corpus.
One key feature is fast-import avoids writing an index entirely.
I think pygit2 would have to learn that, too.
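
For illustration, a minimal sketch of what a fast-import stream for message delivery could look like: one commit per message, each adding a single file inline, with no index involved. This is not the public-inbox code; the function name, the committer identity, the timestamp, and the commit message text are all placeholders, and it assumes a fresh branch and paths that need no quoting.

```python
def fast_import_stream(messages, ref="refs/heads/master",
                       ident="Example <e@example.com>",
                       when="1471800000 +0000"):
    """Build a git fast-import stream with one commit per (path, body).

    `messages` is an iterable of (str path, bytes body) pairs.
    ident and when are placeholder values, not real metadata.
    Paths are assumed to be plain (e.g. hex digits and slashes),
    so no path quoting is done.
    """
    out = []
    for n, (path, body) in enumerate(messages, start=1):
        log = b"add " + path.encode()
        out.append(b"commit " + ref.encode() + b"\n")
        out.append(b"mark :%d\n" % n)
        out.append(b"committer " + ident.encode() + b" "
                   + when.encode() + b"\n")
        out.append(b"data %d\n" % len(log) + log + b"\n")
        if n > 1:
            # Chain each commit onto the previous one via its mark.
            out.append(b"from :%d\n" % (n - 1))
        # "inline" puts the blob data right after the M command.
        out.append(b"M 100644 inline " + path.encode() + b"\n")
        out.append(b"data %d\n" % len(body) + body + b"\n")
    return b"".join(out)
```

The resulting bytes can be written to the standard input of `git fast-import` run inside a repository. Since fast-import builds trees and commits directly from the stream, there is no index file to rewrite after every message.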