Deduplication?

Mon Jun 2 10:06:09 PDT 2014

On Mon, 02 Jun 2014, Mark Walters <markwalters1009 at gmail.com> wrote:
> Tomi Ollila <tomi.ollila at iki.fi> writes:
>
>> On Mon, Jun 02 2014, Mark Walters <markwalters1009 at gmail.com> wrote:
>>
>>> Vladimir Marek <Vladimir.Marek at oracle.com> writes:
>>> If you want to save disk space then you could delete the duplicates
>>> after with something like
>>>
>>> notmuch search --output=files --format=text0 --duplicate=2 '*' piped to
>>> xargs -0
>>
>> What if there are 3 duplicates (or 4... ;)
>
> I was assuming that it was merging 2 duplicate-free bunches of messages,
> but I guess the new 100000 might not be. In that case running the above
> repeatedly (i.e. until it is a no-op) would be fine.

With 'notmuch new' in between the runs, obviously.

Alternatively, find the largest --duplicate=N that still outputs
something, and run the command once for each value from N down to 2.
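
A rough, untested-against-real-mail sketch of that loop follows. The
function names max_duplicate and prune_duplicates are made up for
illustration, and the rm and the final 'notmuch new' are my assumptions
about what the command should do. It deletes files, so try it on a copy
of the mail store first, and note that it does nothing about the content
check mentioned below.

```shell
#!/bin/sh
# Sketch only. Assumes every file beyond the first one recorded for a
# message-id really is a redundant copy; nothing here compares content.

# Print the largest N for which --duplicate=N still lists a file
# (prints 1 when no message has more than one file).
max_duplicate() {
    n=1
    while [ -n "$(notmuch search --output=files --duplicate=$((n + 1)) '*' | head -n 1)" ]; do
        n=$((n + 1))
    done
    echo "$n"
}

# Remove the Nth file of every message for N = max down to 2, so that
# only the first file of each message is left on disk.
prune_duplicates() {
    n=$(max_duplicate)
    while [ "$n" -ge 2 ]; do
        notmuch search --output=files --format=text0 --duplicate="$n" '*' |
            xargs -0 rm -f --
        n=$((n - 1))
    done
    # Let the database notice the removed files (assumed final step).
    notmuch new
}
```

After sourcing the file, calling prune_duplicates does the whole pass.
The database is only refreshed once at the end, since each pass asks for
a different N and so does not depend on the previous deletions having
been indexed.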

>> One should also have some message content heuristics to determine that the
>> content is indeed duplicate and not something totally different (not that
>> we can see the different content anyway... but...)
>
> That would be nice.

And quite hard.

BR,
Jani.