[RFC] [PATCH] lib/database.cc: change how the parent of a message is calculated

Jani Nikula jani at nikula.org
Fri Mar 1 09:06:28 PST 2013


On Thu, 28 Feb 2013, Jani Nikula <jani at nikula.org> wrote:
> Hi Aaron -
>
> On Tue, 26 Feb 2013, Aaron Ecay <aaronecay at gmail.com> wrote:
>> Presently, the code which finds the parent of a message as it is being
>> added to the database assumes that the first Message-ID-like substring
>> of the In-Reply-To header is the parent Message ID.  Some mail clients,
>> however, put stuff other than the Message-ID of the parent in the
>> In-Reply-To header, such as the email address of the sender of the
>> parent.  This can fool notmuch.
>
> I think the background is that RFC 822 defines In-Reply-To (and
> References too for that matter) as *(phrase / msg-id), while RFC 2822
> defines them as 1*msg-id. I'd like something about RFC 822 being
> mentioned in the commit message.
>
> The problem in the gmane message you link to in
> id:87liaa3luc.fsf at gmail.com is likely related to the FAQ item 05.26 "How
> do I fix a bogus In-Reply-To or missing References field?" in the MH FAQ
> http://www.newt.com/faq/mh.html.
>
>> The updated algorithm prefers the last Message ID in the References
>> header.  The References header lists messages oldest-first, so the last
>> Message ID is the parent (RFC2822, p. 24).  The References header is
>> also less likely to be in a non-standard
>> syntax (http://cr.yp.to/immhf/thread.html,
>> http://www.jwz.org/doc/threading.html).  In case the References header
>> is not to be found, fall back to the old behavior.
>> ---
>>
>> I especially notice this problem on public mailing lists, where
>> certain people's messages always cause an "out-dent" of the threading,
>> instead of being nested under whichever message they are replies to.
>>
>> Technically, putting non-Message-ID crud in the In-Reply-To field is a
>> violation of RFC2822, but it appears that in practice the References
>> header is respected more often than the In-Reply-To one.
>>
>>  lib/database.cc | 30 ++++++++++++++++++++++--------
>>  1 file changed, 22 insertions(+), 8 deletions(-)
>>
>> diff --git a/lib/database.cc b/lib/database.cc
>> index 91d4329..cbf33ae 100644
>> --- a/lib/database.cc
>> +++ b/lib/database.cc
>> @@ -501,8 +501,10 @@ _parse_message_id (void *ctx, const char *message_id, const char **next)
>>   * 'message_id' in the result (to avoid mass confusion when a single
>>   * message references itself cyclically---and yes, mail messages are
>>   * not infrequent in the wild that do this---don't ask me why).
>> + *
>> + * Return the last reference parsed.
>>  */
>> -static void
>> +static char *
>>  parse_references (void *ctx,
>>  		  const char *message_id,
>>  		  GHashTable *hash,
>> @@ -511,7 +513,7 @@ parse_references (void *ctx,
>>      char *ref;
>>
>>      if (refs == NULL || *refs == '\0')
>> -	return;
>> +	return NULL;
>>
>>      while (*refs) {
>>  	ref = _parse_message_id (ctx, refs, &refs);
>> @@ -519,6 +521,8 @@ parse_references (void *ctx,
>>  	if (ref && strcmp (ref, message_id))
>>  	    g_hash_table_insert (hash, ref, NULL);
>>      }
>> +
>> +    return ref;
>
> As the comment for the function says, we explicitly avoid including
> self-references. I think I'd err on the safe side and return NULL if the
> last ref equals message-id.
>
>>  }
>>
>>  notmuch_status_t
>> @@ -1365,7 +1369,7 @@ _notmuch_database_generate_doc_id (notmuch_database_t *notmuch)
>>      notmuch->last_doc_id++;
>>
>>      if (notmuch->last_doc_id == 0)
>> -	INTERNAL_ERROR ("Xapian document IDs are exhausted.\n");
>> +	INTERNAL_ERROR ("Xapian document IDs are exhausted.\n");
>
> I don't know how you got this non-change hunk here, but please remove
> it. :)
>
>>
>>      return notmuch->last_doc_id;
>>  }
>> @@ -1509,7 +1513,7 @@ _notmuch_database_link_message_to_parents (notmuch_database_t *notmuch,
>>  					   const char **thread_id)
>>  {
>>      GHashTable *parents = NULL;
>> -    const char *refs, *in_reply_to, *in_reply_to_message_id;
>> +    const char *refs, *in_reply_to, *in_reply_to_message_id, *last_ref_message_id;
>>      GList *l, *keys = NULL;
>>      notmuch_status_t ret = NOTMUCH_STATUS_SUCCESS;
>>
>> @@ -1517,21 +1521,31 @@ _notmuch_database_link_message_to_parents (notmuch_database_t *notmuch,
>>  				     _my_talloc_free_for_g_hash, NULL);
>>
>>      refs = notmuch_message_file_get_header (message_file, "references");
>> -    parse_references (message, notmuch_message_get_message_id (message),
>> -		      parents, refs);
>> +    last_ref_message_id = parse_references (message,
>> +					    notmuch_message_get_message_id (message),
>> +					    parents, refs);
>>
>>      in_reply_to = notmuch_message_file_get_header (message_file, "in-reply-to");
>>      parse_references (message, notmuch_message_get_message_id (message),
>>  		      parents, in_reply_to);
>>
>> -    /* Carefully avoid adding any self-referential in-reply-to term. */
>>      in_reply_to_message_id = _parse_message_id (message, in_reply_to, NULL);
>
> I wonder if you should reuse your parse_references() change here, so
> you'd set in_reply_to_message_id to the last message-id in
> In-Reply-To. This might tackle some of the problematic cases directly,
> but should still be all right per RFC 2822. I didn't verify how the
> parser handles an RFC 2822 violating free form header though.

Strike that based on http://www.jwz.org/doc/threading.html:

"If there are multiple things in In-Reply-To that look like Message-IDs,
only use the first one of them: odds are that the later ones are
actually email addresses, not IDs."

>> +    /* If the parent message ID from the Reply-To and References
>> +     * headers are different, use the References one.  This is because
>> +     * the Reply-To header is more likely to be in an non-standard
>> +     * format. */
>> +    if (in_reply_to_message_id &&
>> +	last_ref_message_id &&
>> +	strcmp (last_ref_message_id, in_reply_to_message_id)) {
>> +	in_reply_to_message_id = last_ref_message_id;
>> +    }

I talked to Austin (CC) about the patch on IRC, and his comment was,
perceptive as always:

 23:38        amdragon   Is the logic in that patch equivalent to always using
                         the last message ID in references unless there is no
                         references header?  Seems like it is, but in a
                         convoluted way.

And that's actually the case, isn't it? To make the code reflect that,
you should use last_ref_message_id, and if that's NULL, fallback to
in_reply_to_message_id.

> I suggest adding an else if branch (or revamp the above if condition) to
> tackle the missing In-Reply-To header:
>
>     else if (!in_reply_to_message_id && last_ref_message_id) {
>         in_reply_to_message_id = last_ref_message_id;
>     }

Strike that, it should be the other way round.

>
>> +    /* Carefully avoid adding any self-referential in-reply-to term. */
>>      if (in_reply_to_message_id &&
>>  	strcmp (in_reply_to_message_id,
>>  		notmuch_message_get_message_id (message)))
>
> If you change parse_references() to be careful about never returning a
> self-reference, and set in_reply_to_message_id from there, I think you
> can drop the strcmp here. And move the comment to an appropriate place.

Strike that.

> Thanks for the patch, I think we should do this. But this is an area
> where I think we need to be careful, so another reviewer wouldn't
> harm. Some tests for this would be good too, obviously.

I got this part right. :)


Jani.

>
>
> BR,
> Jani.
>
>>      {
>>  	_notmuch_message_add_term (message, "replyto",
>> -			     _parse_message_id (message, in_reply_to, NULL));
>> +			     in_reply_to_message_id);
>>      }
>>
>>      keys = g_hash_table_get_keys (parents);
>> --
>> 1.8.1.4
>> _______________________________________________
>> notmuch mailing list
>> notmuch at notmuchmail.org
>> http://notmuchmail.org/mailman/listinfo/notmuch


More information about the notmuch mailing list