Deduplication ?

Mark Walters markwalters1009 at gmail.com
Mon Jun 2 07:15:39 PDT 2014



Mark Walters <markwalters1009 at gmail.com> writes:

> Vladimir Marek <Vladimir.Marek at oracle.com> writes:
>
>>> > I want to import a bigger chunk of archived messages into my notmuch
>>> > database. It's about 100k messages. The problem is that I most probably
>>> > already have quite a lot of those messages in the DB. Basically I would
>>> > like to add only those I don't have already.
>>> >
>>> > There are two possibilities
>>> >
>>> > a) I will add all the 100k messages and then remove the duplicates.
>>> >
>>> > b) I will write a script which will parse the message IDs of the
>>> >    to-be-added messages and try to match them against the notmuch DB,
>>> >    adding only files I can't find already.
>>> >
>>> > Option b) might be the better one, but I started to play with the idea
>>> > of deduplication. I'm thinking about listing all the message IDs stored
>>> > in the DB, listing all files belonging to each ID and deleting all but
>>> > one. I'm also thinking about implementing some simple algorithm telling
>>> > me whether the messages are really very similar, just to be sure I
>>> > don't delete something I don't want to.
>>> >
>>> > Has anyone played with the idea?
>>> 
>>> notsync[1] used the (lack of) existence of a message id in the store to
>>> decide whether to add something from an IMAP server, but it is old,
>>> crufty, unused and unloved code.
>>
>> I see, that's close to my b) solution, thanks!
>
> Did you mean a) here? The idea was to add them all first and then run
> this script to delete the duplicates.
>

Sorry: the messages arrived out of order and I did not read them carefully enough.

MW

> Best wishes
>
> Mark
>
>> -- 
>> 	Vlad
>> _______________________________________________
>> notmuch mailing list
>> notmuch at notmuchmail.org
>> http://notmuchmail.org/mailman/listinfo/notmuch
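
The two options discussed in the thread could be sketched roughly as follows. This is only an illustration under stated assumptions, not code from notsync or from anyone in the thread: the function names (read_message_id, select_new_files, surplus_copies) are hypothetical, and the set of already-known message IDs is assumed to have been collected beforehand, for example from the output of `notmuch search --output=messages '*'` with the leading `id:` stripped. Files without a Message-ID header are kept rather than dropped, since the sketch cannot decide whether they are duplicates.

```python
from email.parser import BytesParser


def read_message_id(path):
    """Return the Message-ID of the mail file at `path` (no angle brackets), or None."""
    with open(path, "rb") as handle:
        # headersonly=True avoids parsing the (possibly large) body.
        headers = BytesParser().parse(handle, headersonly=True)
    raw = headers.get("Message-ID")
    if raw is None:
        return None
    return str(raw).strip().strip("<>") or None


def select_new_files(candidate_paths, known_ids):
    """Option b): keep only files whose Message-ID is not already known.

    `known_ids` is assumed to hold the IDs already in the database.
    Duplicates inside the candidate batch itself are skipped as well.
    """
    seen = set(known_ids)
    new_files = []
    for path in candidate_paths:
        message_id = read_message_id(path)
        if message_id is None:
            # No Message-ID header: keep the file for manual review.
            new_files.append(path)
        elif message_id not in seen:
            seen.add(message_id)
            new_files.append(path)
    return new_files


def surplus_copies(files_by_id):
    """Option a): for each message ID, return every file except the first (sorted) one.

    `files_by_id` is assumed to map a message ID to the list of files
    that carry it; nothing is deleted here, the list is only reported.
    """
    surplus = []
    for paths in files_by_id.values():
        surplus.extend(sorted(paths)[1:])
    return surplus
```

Neither function deletes anything; the returned lists are meant to be inspected first, which fits the concern above about not removing messages that only look like duplicates.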

