[PATCH 3/4] Optimize thread search using matched docid sets.

Tue Dec 7 17:20:22 PST 2010

On Thu, 18 Nov 2010 02:38:29 -0500, Austin Clements <amdragon at MIT.EDU> wrote:
> Currently this code uses a bitmap indexed by docid as a simple, fast
> set structure.  This is quite memory-efficient if the docid space is
> dense, even if the largest docid is quite large.  Is there a danger
> that the docid space will be large and sparse?  Is it worth replacing
> this with a smarter bit set structure?

When simply adding messages, the docid space is optimally dense, (see
_notmuch_database_generate_doc_id which generates sequential docid
values).

As messages are removed, the space will become less dense as there is
currently never any reuse of docid values. We could imagine adding some
sort of packing operation to get back to dense packing, (but forcing
that kind of thing on the user isn't so kind, of course).

Meanwhile, Xapian can very efficiently tell us how packed the space is,
(since we can query document count and the last docid allocated). So we
definitely have the ability to tune things automatically if needed.

We're currently using GHashTable in several places, but I've thought of
incorporating a nice, little hash-table implementation that Eric Anholt
wrote some time ago. If we did that, we could intelligently choose
whichever data structure is more memory-efficient depending on how
packed the docid space is.

Personally, though, I'm not that much of a micro-optimizer. As can be
seen in the current thread, I generally leave the optimization to
others. Thanks again, Austin!

-Carl

-- 
carl.d.worth at intel.com
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://notmuchmail.org/pipermail/notmuch/attachments/20101207/f50eeb67/attachment.pgp>