[RFC patch 2/2] lib: index message files with duplicate message-ids

Sat Mar 18 14:31:44 PDT 2017

Daniel Kahn Gillmor <dkg at fifthhorseman.net> writes:

> On Thu 2017-03-16 20:34:22 -0400, David Bremner wrote:
>> Daniel Kahn Gillmor <dkg at fifthhorseman.net> writes:
>>>  0) what happens when one of the files gets deleted from the message
>>>     store? do the terms it contributes get removed from the index?
>>
>> That's a good question, and an issue I hadn't thought about.
>> Currently there's no way to do this short of deleting all the terms for
>> all the files (excepting tags and properties, presumably) and
>> reindexing. This will require some more thought, I think.
>
> i didn't mean to raise the concern to drag this work down, i just want
> to make sure the problem is on the table.  dropping all terms on
> deletion and re-indexing remaining files with the same message ID isn't
> terribly efficient, but i don't think it's going to be terribly costly
> either.  we're not talking about hundreds of files per message-id in
> most normal cases; usually only two (sent-to-self,
> recvd-from-mailing-list), and maybe a half-dozen at most (messages sent
> to multiple mailboxes that all forward to me).

I can think of 3 general approaches at the moment. They each have (at
least) one gotcha; more precisely they each require some added
complexity somewhere else in the codebase.

The first is the one in this patch: just add all the terms to one xapian
document. The gotcha is needing some reindexing facility (we want this
for other reasons, so that might not be so bad).
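
To make the gotcha concrete, here is a toy model in plain Python (sets
and dicts only; ToyIndex and tokenize are made-up names, not notmuch or
Xapian API). Once terms from several files are merged into one document
there is no record of which file contributed which term, so removing a
file means dropping the terms and reindexing the remaining copies.

```python
# Toy model only: a "document" is the set of terms for one message-id.
# ToyIndex and tokenize() are hypothetical stand-ins, not notmuch/Xapian calls.

def tokenize(text):
    """Split text into lowercase whitespace-separated terms."""
    return set(text.lower().split())


class ToyIndex:
    def __init__(self):
        self.files = {}  # message-id -> {path: file text}
        self.terms = {}  # message-id -> union of terms from all files

    def add_file(self, mid, path, text):
        """Record the file and merge its terms into the shared document."""
        self.files.setdefault(mid, {})[path] = text
        self.terms.setdefault(mid, set()).update(tokenize(text))

    def remove_file(self, mid, path):
        """Forget the file, then rebuild the terms from the remaining copies."""
        del self.files[mid][path]
        rebuilt = set()
        for text in self.files[mid].values():
            rebuilt |= tokenize(text)
        self.terms[mid] = rebuilt
```

The rebuild cost is proportional to the number of remaining files for
that message-id, which matches the "usually only two" estimate above.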

The second approach that occurs to me is to still add the terms to one
xapian document, but to prefix them with a number identifying the file
copy (1, 2, etc.). The complexity here is in the generation of queries:
each term needs to be ORed with its prefixed variants, e.g. SUBJECT:foo
or 1#SUBJECT:foo or 2#SUBJECT:foo. I'm not really sure offhand how to do
that without field processors. I'm also not sure about the performance
impact.
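
A rough sketch of that expansion, done as plain string manipulation in
Python. The function name expand_term and the max_copies bound are
assumptions for illustration; nothing here is existing notmuch code, and
a real implementation would have to hook into query parsing somehow.

```python
# Hypothetical helper: expand one term into the unprefixed form plus one
# variant per file-copy number, joined with OR.

def expand_term(term, max_copies):
    """Return e.g. '(SUBJECT:foo OR 1#SUBJECT:foo OR 2#SUBJECT:foo)'."""
    alternatives = [term]
    for n in range(1, max_copies + 1):
        alternatives.append(f"{n}#{term}")
    return "(" + " OR ".join(alternatives) + ")"
```

Note that the expanded query grows linearly with max_copies for every
term in the original query, which is one place the performance question
could show up.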

The third approach is to create extra xapian documents per file, which have
a different document type (from the notmuch point of view). Here the
complexity will be dealing with the returned documents from a xapian
query. We can probably use a wildcard search on the type (mail, mail1,
mail2, etc...) to make the queries reasonably easy. My gut feeling is
that this is the "right" approach, although it will be a bit more
complicated to get started.  It will also require changing our idea of
threads in the "structured output" where a thread looks something like

(thread
       (message
          (instance/file)
          (instance/file))
       (message
          (instance/file))