Deduplication?

David Edmondson david.edmondson at oracle.com
Mon Jun 2 06:43:43 PDT 2014


On Mon, Jun 02 2014, Vladimir Marek wrote:
> Hi,
>
> I want to import a bigger chunk of archived messages into my notmuch
> database, about 100k messages. The problem is that I most probably
> already have quite a lot of those messages in the DB. Basically, I
> would like to add only those I don't have already.
>
> There are two possibilities:
>
> a) Add all 100k messages and then remove the duplicates.
>
> b) Write a script which parses the message IDs of the
>    to-be-added messages and tries to match them against the notmuch DB,
>    adding only the files I can't find already.
>
> Option b) might be the better one, but I started to play with the idea
> of deduplication. I'm thinking about listing all the message IDs stored
> in the DB, listing all files belonging to each ID, and deleting all but
> one. I'm also thinking about implementing a simple algorithm that tells
> me whether the messages are really very similar, just to be sure I don't
> delete something I want to keep.
>
> Has anyone played with this idea?
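
The "are these copies really very similar" check described above could
be sketched roughly as follows. This is only an illustration, not tested
against a real store: the function names (body_text, similarity,
removable_copies) and the 0.95 threshold are made up for the example. It
assumes you already have the list of files sharing one message ID, for
instance from `notmuch search --output=files id:<message-id>`. It only
reports candidates and never deletes anything itself.

```python
import difflib
import email
from pathlib import Path


def body_text(raw_bytes):
    """Concatenate the text/plain parts of a message (best effort)."""
    msg = email.message_from_bytes(raw_bytes)
    parts = []
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True) or b""
            parts.append(payload.decode("utf-8", errors="replace"))
    return "\n".join(parts)


def similarity(raw_a, raw_b):
    """Ratio in [0, 1]; 1.0 means the extracted bodies are identical."""
    matcher = difflib.SequenceMatcher(None, body_text(raw_a), body_text(raw_b))
    return matcher.ratio()


def removable_copies(paths, threshold=0.95):
    """Given files sharing one Message-ID, keep the first one and return
    the others whose body is at least `threshold` similar to it.
    The threshold is an arbitrary illustrative value."""
    if not paths:
        return []
    keep, *rest = paths
    reference = Path(keep).read_bytes()
    return [p for p in rest
            if similarity(reference, Path(p).read_bytes()) >= threshold]
```

Only the bodies are compared, so copies that differ in Received: headers
still count as identical. A copy that went through a mailing list and
gained a footer will score lower, which is why anything below the
threshold is left alone for manual review. SequenceMatcher can be slow on
very large bodies.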

notsync[1] used the (lack of) existence of a message id in the store to
decide whether to add something from an IMAP server, but it is old,
crufty, unused and unloved code.
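
For illustration, the same "is this message ID already in the store"
test could be sketched in a few lines of Python. This is not how notsync
does it, just a minimal sketch under some assumptions: the function names
are invented, and the known IDs are assumed to come from something like
`notmuch search --output=messages '*'`, read as one plain `id:...` entry
per line (quoted IDs are not handled here).

```python
import email
from pathlib import Path


def message_id_of(raw_bytes):
    """Return the Message-ID header without angle brackets, or None."""
    msg = email.message_from_bytes(raw_bytes)
    value = msg.get("Message-ID")
    if value is None:
        return None
    return str(value).strip().strip("<>")


def parse_known_ids(search_output):
    """Turn lines like 'id:foo@example.com' into a set of bare IDs.
    Assumes one unquoted ID per line."""
    known = set()
    for line in search_output.splitlines():
        line = line.strip()
        if line.startswith("id:"):
            known.add(line[3:])
    return known


def files_to_import(paths, known_ids):
    """Yield the paths whose Message-ID is not known yet.
    Files without a Message-ID header are always yielded, since there is
    nothing to match them on."""
    seen = set(known_ids)
    for path in paths:
        mid = message_id_of(Path(path).read_bytes())
        if mid is not None and mid in seen:
            continue
        if mid is not None:
            seen.add(mid)  # also skip repeats inside the archive itself
        yield path
```

The files this yields could then be copied into the mail store before
running `notmuch new`, so the duplicates never reach the database.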

> -- 
> 	Vlad

Footnotes: 
[1]  https://github.com/dme/notsync


