Automatic suppression of non-duplicate messages
Eirik Byrkjeflot Anonsen
eirik at eirikba.org
Sun Nov 4 02:06:18 PST 2012
David Bremner <david at tethera.net> writes:
> Eirik Byrkjeflot Anonsen <eirik at eirikba.org> writes:
>
>> That's not what I see. If I search for a term that only appears in
>> one of the "copies", none of the copies are included in the search
>> result.
>
> The offending code is at line 1813 of lib/database.cc; the message is
> only indexed if the message-id is new.
>
> It might be sensible to move _notmuch_message_index_file into the other
> branch of the if, but even if that works fine, something more
> sophisticated is needed for the call to
> _notmuch_message_set_header_values; the invariant that each message has
> a single subject seems reasonable.
Hmm, depends. Assuming indexing is intended to be used for searching,
one might want to search for something that occurs in one subject but
not the other. In practice I doubt it matters.
> Offhand I'm not sure of a good method of automatically deciding what is
> the same message (with e.g. headers and footer text added by a mailing
> list).
I don't think the real problem here is the duplicate detection algorithm
itself. It is rather that notmuch forces a particular duplicate
detection algorithm on its users. Duplicate detection should really be
delegated to a different application, thus allowing people to experiment
with whatever algorithm works best for them. (Just like notmuch
delegates the choice of initial tags on messages to an external
application.)
But first notmuch must be modified so it can sensibly treat multiple
instances having the same message-id as separate messages. That seems
to me to be the hard part. (And some way for external applications to
join and split copies, of course.)
However, if you want an algorithm that is likely to get rid of most
duplicates while keeping most non-duplicates separate, here's a quick
suggestion:
Just to clarify: The goal is to suppress most copies of the same message
while not suppressing a single instance of a different message. It
isn't important if a few duplicate messages make it through, but it is
imperative that no "real" message is dropped.
To check whether two instances are duplicates, I suspect something like
this algorithm would be "good enough":
- Message-Id must be the same. This isn't actually necessary, but it
makes sense to require it anyway.
- From and Date must be the same. These form important context that may
change the meaning of the message (e.g. "me too" depends heavily on
From, and "let's meet tomorrow" depends heavily on Date). (Are there
more context-supplying headers we should worry about?)
- If Subject and body are also the same, the instances are duplicates.
- Otherwise, if neither of the messages comes from a mailing list,
they're probably not duplicates.
- Otherwise, grab a few other (recent) mails from the same mailing list.
If all the bodies end with the same text, ignore that text when
comparing the bodies.
- For the Subject, again use a few other (recent) mails from the same
mailing list for comparison. But this time only look for one of the
well-known common patterns. If all the mails match the same
pattern, ignore that pattern when comparing the Subject.
- For both of the above, it would be good to pick messages from
different threads, to avoid accidental similarities. I suspect this
is more important for subjects than bodies, though.
- Also, leading and trailing whitespace should probably be dropped.
- (Some other transformations may make sense, such as reflowing text or
converting between character sets. In practice I doubt that will make
much of a difference.)
- If the "canonicalized" body and Subject are the same, the messages are
duplicates. At least there's now pretty much no chance that there is
anything interesting that will be missed by dropping one of the
messages.
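The steps above could be sketched roughly like this in Python. This is only an illustration under assumptions: messages are plain dicts with hypothetical keys ("message-id", "from", "date", "subject", "body", "list-id"), the reference bodies and subjects are supplied by the caller, and the helper names, the MIN_FOOTER_LEN threshold and the "[listname]" tag regex are made up for the example. None of it is notmuch API.

```python
import os
import re

MIN_FOOTER_LEN = 20                   # assumed guard against accidental matches
LIST_TAG = re.compile(r"\[[^\]]+\]")  # assumed "well-known pattern": "[listname]"


def common_suffix(texts):
    """Longest string that every element of texts ends with."""
    return os.path.commonprefix([t[::-1] for t in texts])[::-1]


def canonical_body(body, ref_bodies):
    """Strip whitespace and any footer shared by all reference bodies."""
    body = body.strip()
    footer = common_suffix([r.strip() for r in ref_bodies])
    if len(footer) >= MIN_FOOTER_LEN and body.endswith(footer):
        body = body[: -len(footer)].strip()
    return body


def canonical_subject(subject, ref_subjects):
    """Remove a list tag if every reference subject carries the same one."""
    tags = set()
    for ref in ref_subjects:
        match = LIST_TAG.search(ref)
        tags.add(match.group(0) if match else None)
    if len(tags) == 1:
        tag = tags.pop()
        if tag is not None:
            subject = subject.replace(tag, "", 1)
    return " ".join(subject.split())  # also normalizes whitespace


def is_duplicate(a, b, ref_bodies=(), ref_subjects=()):
    """Conservative check: True only if nothing interesting would be lost."""
    for header in ("message-id", "from", "date"):
        if a[header] != b[header]:
            return False
    if a["subject"] == b["subject"] and a["body"] == b["body"]:
        return True
    if not (a.get("list-id") or b.get("list-id")):
        return False  # no mailing list involved: probably not duplicates
    return (
        canonical_subject(a["subject"], ref_subjects)
        == canonical_subject(b["subject"], ref_subjects)
        and canonical_body(a["body"], ref_bodies)
        == canonical_body(b["body"], ref_bodies)
    )
```

The reference mails should be picked by the caller from the same list, preferably from different threads. With no references at all, the sketch falls back to comparing the whitespace-trimmed subject and body, which errs on the side of keeping both copies.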
(I'm assuming that identifying mailing lists is usually
straightforward, e.g. using the List-Id header.)
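For instance, reading List-Id with Python's standard email module might look like the sketch below. The function name is hypothetical, and lists that do not set List-Id would need some other heuristic.

```python
import re
from email import message_from_string


def mailing_list_id(raw_message):
    """Return the List-Id identifier of a raw message, or None if absent."""
    msg = message_from_string(raw_message)
    value = msg.get("List-Id")
    if not value:
        return None
    # The identifier normally sits in angle brackets after an optional phrase.
    match = re.search(r"<([^>]+)>", value)
    return match.group(1) if match else value.strip()
```

Two instances would then count as coming from the same list when this returns the same non-None value for both.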
eirik