[PATCH] Add configurable changed tag to messages that have been changed on disk

Fri Apr 11 09:03:38 PDT 2014

David Bremner <david at tethera.net> writes:

>> Exactly.  It could be a tick, or just the current time of day if your
>> clock does not go backwards.  (I'd be willing to do a full scan if the
>> clock ever goes backwards.)  The advantage of time is that you don't
>> have to synchronously update some counter.
>
> I think I'd lean towards global time so that one could use it to resolve
> conflicts between changes to multiple copies of the database.

I, too, would prefer to use time.  However, I'm doubtful it would help
resolve conflicts.  On the plus side, I'm not sure it is even needed to
resolve conflicts.  My mail synchronizer has an algorithm for resolving
conflicts that always works without human intervention and in my limited
experience does exactly what I want:

   * If there's a conflict between two replicas, ensure that each
     maildir ends up with the maximum number of the number copies of the
     message in each of the two databases being reconciled.  [Example:
     If replica A deletes a message and replica B moves it from folder
     INBOX to folder SPAM, you end up with a copy in spam.  If replica A
     moves a message to folder IMPORTANT and replica B moves it to SPAM,
     then you get two hard links to the same file, one in IMPORTANT and
     one in SPAM.]

   * If there's a conflict and two replicas have different tags on the
     same message, then the tags in notmuch's new.tags directive get
     logically ANDed, while all other tags get logically ORed.

Granted, I've only been using this system for a week.  On the other
hand, all I was doing was starting to test something I had written, yet
it ended up being so much better than my old system that I couldn't go
back and ended up using my system in production far earlier than
anticipated...

>> Making sure the write-operations update the time should be easy.  Most
>> or all of the changes are probably funneled through
>> _notmuch_message_sync.  Worst case, there are only 9 places in the
>> source code that make use of a Xapian:WritableDatabase, so I'm pretty
>> confident total changes wouldn't be much more than 50 lines of code.
>
> Maybe. Don't forget upgrading the database, updating the test suite, and
> presumably some changes to the CLI so the new mtime can actually be
> used. Not to be discouraging ;).

The CLI is trivial.  We'll just add another search keyword ctime
analogous to date.

As far as updating the test suite, etc., it's almost certain that the
core notmuch developers would be unsatisfied with whatever I've done,
since the code base is very clean and has a very uniform style.  So when
I say I'd want some "indication that such a change could be upstreamed,"
I mean more specifically that someone would be willing to shepherd the
process of getting the code into shape.

> In the ensuing time, nothing better has developed for tag
> synchronization (my pet use case) so maybe it's time to pursue this
> again.

I do have something pretty good for tag synchronization.  It requires a
full database scan each time to detect changes, but I've heavily
optimized it to be very fast by skipping over the notmuch library and
directly scanning the underlying Xapian Btrees.  Currently my bottleneck
is indexing messages (e.g., running notmuch new or calling
notmuch_database_add_message), which are painfully slow on 32-bit
machines.  (Unfortunately my mail server is a 32-bit machine.)

To give you an idea, on a 32 bit machine, if I get a handful of new mail
(e.g., 6 messages), running "notmuch new" takes 19 seconds, while
scanning the database to check for renames and changed tags adds another
1.4 seconds.  On a 64-bit machine, "notmuch new" might take 1 second,
while scanning the database adds 350 msec.

So full database scan's might not be the end of the world.  The biggest
performance bottleneck at this point is notmuch's painful indexing
performance.  It kills me that it takes 10 minutes to index 100,000 mail
messages on a 16-core machine with 48 GiB of RAM.  But the library is
non-reentrant and allocates thread IDs in such a way that it's hard to
create parallel databases and later merge them.  Basically I can't
figure out how to make productive use of more than one CPU core even
when synchronizing across 1GB Ethernet!

It's pretty beta, but my intention is to open-source my code, so glad
for beta testers if you are interested in testing tag synchronization.

> It would be good to have some preliminary idea about the time
> and space costs of adding document mtimes.  I guess database bloat
> should not be too bad, since it's only 64bits (?) per mail message.

Plus a Btree to index it, so figure at least 24 bytes per message.
Another issue is that values are always brought into memory with a
document, so it will consume more RAM.  But yeah, I don't think it
should be that bad.

David