[RFC2 Patch 5/5] lib: iterator API for message properties

Thu Jun 2 10:33:54 PDT 2016

Hi Bremner--

thanks for the response!  I didn't mean my post to be a wet-blanket,
just wanted to think through the tradeoffs...

On Wed 2016-06-01 19:29:59 -0400, David Bremner wrote:
> I guess if you don't care about the possibility of iterating all pairs
> with given key prefix (which I admit makes more sense for the config
> API), then the code could be simplified to look more like the tag list
> handling code.  C is pretty crap at generics, but I guess looking at
> tags.c, it's really about iterators for notmuch_string_list_t. So it
> could probably be generalized to serve here.
>
> For each such prefix, one would need to roughly duplicate patches 1/5
> and 3/5.  It took me a little while to figure 1/5 out, but now that I
> know, it would be less trouble.  I guess my thinking here was that I
> would provide a low level interface that people using the C API or
> bindings could use without hacking xapian.
 [...]
> XPROPERTY is an internal prefix, which means it isn't added to the query
> parser.  As it happens, I didn't plan on CLI access to these terms
> either. Both of those choices are tradeoffs to say that these are
> internal metadata, suitable for manipulation by programs. Such programs
> could be scripts using python or ruby.

I think this makes sense, and makes me more comfortable with the overall
idea of this patch series.  maybe it'd be useful to clearly document the
intended scope?

>> If we add new specific features, we could potentially augment the dump
>> format explicitly for them, without having the property abstraction.
>
> We could, but I think should change the dump format quite rarely, since
> we risk breaking people's scripts. So if we did it for one prefix, I'd
> like to do in an extensible way so that adding new prefixes is somewhat
> transparent. It also means some duplication of effort/code in notmuch
> dump/restore to dump/restore each new prefix.
>
> It's probably true that per-prefix dump format would be more compact,
> since the keys would be implicit, rather than repeated for every pair.

true, though i'm not sure how much compactness is necessary.  presumably
people are compressing their dumpfiles, and regularly repeated strings
are the easiest thing to compress.
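
As a rough illustration of that point (a sketch only: both line formats below are invented for the comparison and are not notmuch's real dump format, and the key name "session-key" is hypothetical), one can measure what the repeated key costs before and after zlib:

```python
import zlib


def generic_dump(n):
    # Hypothetical generic format: the key is spelled out on every line.
    lines = (
        f"#= msg{i:06d}@example.org session-key=value{i:06d}\n" for i in range(n)
    )
    return "".join(lines).encode()


def per_prefix_dump(n):
    # Hypothetical per-prefix format: the key is implicit, so it is omitted.
    lines = (f"msg{i:06d}@example.org value{i:06d}\n" for i in range(n))
    return "".join(lines).encode()


def sizes(n=10000):
    """Return (raw generic, raw per-prefix, compressed generic, compressed per-prefix)."""
    generic, per_prefix = generic_dump(n), per_prefix_dump(n)
    return (
        len(generic),
        len(per_prefix),
        len(zlib.compress(generic)),
        len(zlib.compress(per_prefix)),
    )


if __name__ == "__main__":
    raw_g, raw_p, comp_g, comp_p = sizes()
    print("raw overhead of repeated keys:       ", raw_g - raw_p)
    print("compressed overhead of repeated keys:", comp_g - comp_p)
```

The synthetic data is far more regular than a real dump, so the numbers only show the direction of the effect, not its size in practice.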

>> We already have some explicit features for each message (subject,
>> from, to, attachment, mimetype, thread id, etc), and most of them are
>> derived from the message itself, with the hope that it could be
>> re-derived given just the message body.  Is there a distinction
>> between properties that can be derived from the message body and
>> properties that need to be additionally derived from some other data?
>
> As Tomi always says, naming is the hardest thing; properties is a bit
> generic. I'm not sure the distinction you make between the "message" and
> the "message body" here. I think most of our derived terms are from the
> message header.  My intent here is that "properties" are used for things
> that cannot be derived from the message (header or body).

To be clear, i didn't mean to distinguish between "message" and
"message body" -- i don't think of the headers as being significantly
different from the body (and indeed, if we can get memoryhole working,
then some headers might be derived from or influenced by the body).

maybe it's worth thinking through each of these per-message features,
and where they come from -- are they from the message itself (header,
body, etc), from the message's position(s) in the filesystem, or
somewhere else entirely?

From the message:

 * message-id
 * subject
 * mimetype
 * attachment
 * references
 * from
 * to
 * replyto

From the filesystem itself:

 * filenames
 * folder

From elsewhere:

 * for messages which have multiple files, which file is actually indexed
 * thread-id
 * tag

we're now talking about adding properties, which are in the "elsewhere"
category, right?

It's worth noticing that the stuff in "elsewhere" is the stuff that
won't propagate across a dump/restore unless it's explicitly in the dump
somehow.  We currently fail to restore thread-id and which file is
actually indexed across a dump/restore :/
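
A toy model of the three categories above may make this concrete.  It is not notmuch code: the set names, the "indexed-file" label, and the assumption that a restore is preceded by re-indexing the message files are all mine, for illustration only.

```python
# Toy model, not notmuch code: which per-message features survive a
# dump/restore cycle, assuming the mail files are re-indexed on restore.

FROM_MESSAGE = {
    "message-id", "subject", "mimetype", "attachment",
    "references", "from", "to", "replyto",
}
FROM_FILESYSTEM = {"filenames", "folder"}
# "indexed-file" stands for "which file is actually indexed".
FROM_ELSEWHERE = {"indexed-file", "thread-id", "tag"}


def survives_restore(feature, dumped):
    """A feature survives if it can be re-derived or is written to the dump."""
    rederivable = feature in FROM_MESSAGE or feature in FROM_FILESYSTEM
    return rederivable or feature in dumped


def lost_after_restore(dumped):
    """List the 'elsewhere' features that a dump with this content would lose."""
    return sorted(f for f in FROM_ELSEWHERE if not survives_restore(f, dumped))


if __name__ == "__main__":
    # Model a dump that carries only tags.
    print(lost_after_restore({"tag"}))
```

In this model, anything new that lands in the "elsewhere" set (properties included) is lost unless it is also added to the dumped set.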

>      - per prefix requires new code in the library and dump/restore
>        for every prefix
>      + the dump format might be more compact if done in a per prefix way.
>      + this code would be simpler than the generic properties code,
>        mainly because it would not need key value pairs,
>      - the library and dump/restore are parts of notmuch that have the
>        potential to "break the world".  Not too many people are
>        comfortable hacking on them.
>      - changing the dump format is something like an ABI change for
>        people whose scripts rely on dump / restore.

I think you've convinced me that it's good to go ahead with the
properties, assuming it's scoped as defined above.  I still think that
we need a better story for upgrades to the dump format in general, but
maybe this isn't the place to make that particular case.

            --dkg