Deduplication?

Mon Jun 2 11:29:26 PDT 2014

On Mon, 02 Jun 2014, David Edmondson <david.edmondson at oracle.com> wrote:
> On Mon, Jun 02 2014, Jani Nikula wrote:
>>>> One should also have some message content heuristics to determine that the
>>>> content is indeed duplicate and not something totally different (not that
>>>> we can see the different content anyway... but...)
>>>
>>> That would be nice.
>>
>> And quite hard.
>
> Thinking about this a bit...
>
> The headers are likely to be different, so you could remove them (get
> rid of everything up to the first empty line).
>
> Various mailing lists add footers, so you would need to remove them (a
> regular expression based approach would catch most of them easily).

This may work for text/plain messages, but for MIME messages (and I
think text/html too) the list usually adds an extra layer of MIME
structure. The problem then becomes matching a subtree of the MIME
structure and deciding that the non-matching layer is noise that can be
ignored. The mailing list manager adding the extra layer may also decode
and reconstruct the existing parts instead of using them as-is.
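
A rough sketch of one way to sidestep the extra layer, in Python with
the standard email module: ignore the tree shape, decode every
text/plain leaf, drop anything that looks like a list footer, and hash
what is left. The function name, the footer pattern and the whitespace
normalization are all assumptions made up for illustration, and SHA-256
is used here in place of the md5 suggested in the quoted text. This is
not anything notmuch does.

```python
import email
import hashlib
import re

# Hypothetical footer pattern: a line of underscores or a "-- " signature
# separator, plus everything after it. Real lists vary; this is a guess.
FOOTER_RE = re.compile(r"\n(?:_{5,}|-- ?)\n.*\Z", re.DOTALL)


def body_fingerprint(raw: str) -> str:
    """Hash the decoded text/plain leaves, ignoring how they are wrapped."""
    msg = email.message_from_string(raw)
    chunks = []
    for part in msg.walk():
        # Skip container nodes and anything that is not plain text.
        if part.is_multipart() or part.get_content_type() != "text/plain":
            continue
        # decode=True undoes base64 / quoted-printable transfer encoding.
        payload = part.get_payload(decode=True) or b""
        charset = part.get_content_charset() or "utf-8"
        try:
            text = payload.decode(charset, errors="replace")
        except LookupError:  # unknown charset label
            text = payload.decode("utf-8", errors="replace")
        text = text.replace("\r\n", "\n")
        # Prepend a newline so a part that is nothing but a footer matches.
        text = FOOTER_RE.sub("", "\n" + text)
        # Collapse whitespace so re-wrapped lines compare equal (lossy).
        text = " ".join(text.split())
        if text:
            chunks.append(text)
    return hashlib.sha256("\n".join(chunks).encode("utf-8")).hexdigest()
```

With this, a direct copy and a list copy whose body was re-encoded and
wrapped in multipart/mixed with a footer part should get the same
fingerprint. It is still only a heuristic: a body that legitimately
contains a line of underscores gets truncated, and text/html parts are
ignored entirely.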

> The remaining content should be the same for identical messages, so a
> sensible hash (md5) could be used to compare.
>
> Although, some MTAs modify the body of the message when manipulating
> encoding. I don't know how to address this.

Let's assume we can figure it all out and find the duplicates. The
question remains, which one to save and which ones to remove? For list
mail, perhaps you'd like to save the copy you received through the list
so you know it's list mail (and you could search for it using the list-id:
header *cough* if we indexed that *cough*). Or perhaps you'd like to
save the copy you received directly, because some lists let people have
their addresses filtered from the cc: header before distributing.

More useful would probably be raising some flags if the heuristics
detect messages with the same message-id that are clearly *different*
messages. (Perhaps that's what Tomi was after to begin with?)
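
The flagging idea can be sketched in a few lines of Python, assuming
some content fingerprint is already available for each copy (how that
fingerprint is computed is the hard part and is left open here). The
function name and the (message-id, fingerprint) input format are
hypothetical.

```python
from collections import defaultdict


def conflicting_message_ids(records):
    """Return message-ids that were seen with more than one fingerprint.

    records: iterable of (message_id, fingerprint) pairs, one per copy.
    """
    seen = defaultdict(set)
    for message_id, fingerprint in records:
        seen[message_id].add(fingerprint)
    # One fingerprint means all copies agree; more than one is suspicious.
    return sorted(mid for mid, prints in seen.items() if len(prints) > 1)
```

Nothing is removed in this scheme; copies that agree are simply not
reported, and ids with disagreeing copies are surfaced for a closer
look.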

Finally, I personally wouldn't want any duplicates removed; rather I'd
like notmuch to index information across all duplicates, and provide UI
features to see the alternatives if desired.

BR,
Jani.