[RFC] [PATCH] lib/database.cc: change how the parent of a message is calculated
Jani Nikula
jani at nikula.org
Thu Feb 28 11:41:16 PST 2013
Hi Aaron -
On Tue, 26 Feb 2013, Aaron Ecay <aaronecay at gmail.com> wrote:
> Presently, the code which finds the parent of a message as it is being
> added to the database assumes that the first Message-ID-like substring
> of the In-Reply-To header is the parent Message ID. Some mail clients,
> however, put stuff other than the Message-ID of the parent in the
> In-Reply-To header, such as the email address of the sender of the
> parent. This can fool notmuch.
I think the background is that RFC 822 defines In-Reply-To (and
References too for that matter) as *(phrase / msg-id), while RFC 2822
defines them as 1*msg-id. I'd like something about RFC 822 being
mentioned in the commit message.
The problem in the gmane message you link to in
id:87liaa3luc.fsf at gmail.com is likely related to the FAQ item 05.26 "How
do I fix a bogus In-Reply-To or missing References field?" in the MH FAQ
http://www.newt.com/faq/mh.html.
> The updated algorithm prefers the last Message ID in the References
> header. The References header lists messages oldest-first, so the last
> Message ID is the parent (RFC2822, p. 24). The References header is
> also less likely to be in a non-standard
> syntax (http://cr.yp.to/immhf/thread.html,
> http://www.jwz.org/doc/threading.html). In case the References header
> is not to be found, fall back to the old behavior.
> ---
>
> I especially notice this problem on public mailing lists, where
> certain people's messages always cause an "out-dent" of the threading,
> instead of being nested under whichever message they are replies to.
>
> Technically, putting non-Message-ID crud in the In-Reply-To field is a
> violation of RFC2822, but it appears that in practice the References
> header is respected more often than the In-Reply-To one.
>
> lib/database.cc | 30 ++++++++++++++++++++++--------
> 1 file changed, 22 insertions(+), 8 deletions(-)
>
> diff --git a/lib/database.cc b/lib/database.cc
> index 91d4329..cbf33ae 100644
> --- a/lib/database.cc
> +++ b/lib/database.cc
> @@ -501,8 +501,10 @@ _parse_message_id (void *ctx, const char *message_id, const char **next)
> * 'message_id' in the result (to avoid mass confusion when a single
> * message references itself cyclically---and yes, mail messages are
> * not infrequent in the wild that do this---don't ask me why).
> + *
> + * Return the last reference parsed.
> */
> -static void
> +static char *
> parse_references (void *ctx,
> const char *message_id,
> GHashTable *hash,
> @@ -511,7 +513,7 @@ parse_references (void *ctx,
> char *ref;
>
> if (refs == NULL || *refs == '\0')
> - return;
> + return NULL;
>
> while (*refs) {
> ref = _parse_message_id (ctx, refs, &refs);
> @@ -519,6 +521,8 @@ parse_references (void *ctx,
> if (ref && strcmp (ref, message_id))
> g_hash_table_insert (hash, ref, NULL);
> }
> +
> + return ref;
As the comment for the function says, we explicitly avoid including
self-references. I think I'd err on the safe side and return NULL if the
last ref equals message-id.
> }
>
> notmuch_status_t
> @@ -1365,7 +1369,7 @@ _notmuch_database_generate_doc_id (notmuch_database_t *notmuch)
> notmuch->last_doc_id++;
>
> if (notmuch->last_doc_id == 0)
> - INTERNAL_ERROR ("Xapian document IDs are exhausted.\n");
> + INTERNAL_ERROR ("Xapian document IDs are exhausted.\n");
I don't know how you got this non-change hunk here, but please remove
it. :)
>
> return notmuch->last_doc_id;
> }
> @@ -1509,7 +1513,7 @@ _notmuch_database_link_message_to_parents (notmuch_database_t *notmuch,
> const char **thread_id)
> {
> GHashTable *parents = NULL;
> - const char *refs, *in_reply_to, *in_reply_to_message_id;
> + const char *refs, *in_reply_to, *in_reply_to_message_id, *last_ref_message_id;
> GList *l, *keys = NULL;
> notmuch_status_t ret = NOTMUCH_STATUS_SUCCESS;
>
> @@ -1517,21 +1521,31 @@ _notmuch_database_link_message_to_parents (notmuch_database_t *notmuch,
> _my_talloc_free_for_g_hash, NULL);
>
> refs = notmuch_message_file_get_header (message_file, "references");
> - parse_references (message, notmuch_message_get_message_id (message),
> - parents, refs);
> + last_ref_message_id = parse_references (message,
> + notmuch_message_get_message_id (message),
> + parents, refs);
>
> in_reply_to = notmuch_message_file_get_header (message_file, "in-reply-to");
> parse_references (message, notmuch_message_get_message_id (message),
> parents, in_reply_to);
>
> - /* Carefully avoid adding any self-referential in-reply-to term. */
> in_reply_to_message_id = _parse_message_id (message, in_reply_to, NULL);
I wonder if you should reuse your parse_references() change here, so
you'd set in_reply_to_message_id to the last message-id in
In-Reply-To. This might tackle some of the problematic cases directly,
but should still be all right per RFC 2822. I didn't verify how the
parser handles an RFC 2822 violating free form header though.
> + /* If the parent message ID from the Reply-To and References
> + * headers are different, use the References one. This is because
> + * the Reply-To header is more likely to be in an non-standard
> + * format. */
> + if (in_reply_to_message_id &&
> + last_ref_message_id &&
> + strcmp (last_ref_message_id, in_reply_to_message_id)) {
> + in_reply_to_message_id = last_ref_message_id;
> + }
I suggest adding an else if branch (or revamp the above if condition) to
tackle the missing In-Reply-To header:
else if (!in_reply_to_message_id && last_ref_message_id) {
in_reply_to_message_id = last_ref_message_id;
}
> + /* Carefully avoid adding any self-referential in-reply-to term. */
> if (in_reply_to_message_id &&
> strcmp (in_reply_to_message_id,
> notmuch_message_get_message_id (message)))
If you change parse_references() to be careful about never returning a
self-reference, and set in_reply_to_message_id from there, I think you
can drop the strcmp here. And move the comment to an appropriate place.
Thanks for the patch, I think we should do this. But this is an area
where I think we need to be careful, so another reviewer wouldn't
harm. Some tests for this would be good too, obviously.
BR,
Jani.
> {
> _notmuch_message_add_term (message, "replyto",
> - _parse_message_id (message, in_reply_to, NULL));
> + in_reply_to_message_id);
> }
>
> keys = g_hash_table_get_keys (parents);
> --
> 1.8.1.4
> _______________________________________________
> notmuch mailing list
> notmuch at notmuchmail.org
> http://notmuchmail.org/mailman/listinfo/notmuch
More information about the notmuch
mailing list