[PATCH 1/1] Store and search for canonical Unicode text [WIP]

Sun Aug 30 09:21:16 PDT 2015

WARNING: this version is very preliminary, and might eat your data.

Unicode has multiple sequences representing what should normally be
considered the same text.  For example here's a combining Á and a
noncombining Á.

Depending on how you view this, you may or may not see a difference.
The former is the canonical (decomposed) form, represented by two
Unicode code points: a capital A (U+0041) followed by a "combining
acute accent" (U+0301).  The latter is the single code point U+00C1,
which is probably what most people would type.
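
To make the difference concrete, here is a minimal sketch (not part of
the patch) that spells out the UTF-8 bytes of both forms and compares
them.  The names decomposed_a, precomposed_a and same_bytes are made up
for illustration.

```c
#include <string.h>

/* UTF-8 encoding of U+0041 U+0301: 'A' followed by the two bytes of the
 * combining acute accent. */
static const char decomposed_a[] = "A\xcc\x81";

/* UTF-8 encoding of the single code point U+00C1. */
static const char precomposed_a[] = "\xc3\x81";

/* Hypothetical helper: nonzero when two strings are byte-for-byte equal,
 * which is the only kind of equality a plain term lookup sees. */
static int
same_bytes (const char *a, const char *b)
{
    return strcmp (a, b) == 0;
}
```

The two strings render alike but differ in both length and content, so
same_bytes (decomposed_a, precomposed_a) is false.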

Before this change, notmuch would index two strings that differ only
with respect to canonicalization, like tóken and tóken, as separate
terms, even though they may be visually indistinguishable, and do (for
most purposes) represent the same text.  After indexing, searching for
one would not find the other, and which one you present to notmuch
when you search depends on your tools.  See test/T570-normalization.sh
for a working example.

Since we're talking about differing representations that one wouldn't
normally want to distinguish, this patch unifies the various
representations by converting all incoming text to its canonical form
before indexing, and canonicalizing all query strings.

Up to now, notmuch has let Xapian handle converting the incoming bytes
to UTF-8.  Xapian treats any byte sequence as UTF-8, and interprets
any invalid UTF-8 bytes as Latin-1.  This patch maintains the existing
behavior (excepting the new canonicalization) by using Xapian's
Utf8Iterator to handle the initial Unicode character parsing.
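
As a rough illustration of that fallback, the sketch below decodes one
code point at a time and treats any byte that does not start a
well-formed sequence as Latin-1.  It is a simplified stand-in written
for this message, not Xapian's actual Utf8Iterator code, and
decode_lenient is a hypothetical name.

```c
#include <stddef.h>

/* Decode one code point from s, where n >= 1 bytes remain.  The number
 * of bytes consumed is stored in *used.  A byte that does not begin a
 * well-formed UTF-8 sequence is returned as its Latin-1 value.
 * Simplification: overlong forms and surrogates are not rejected. */
static unsigned
decode_lenient (const unsigned char *s, size_t n, size_t *used)
{
    unsigned cp;
    size_t need, i;

    *used = 1;
    if (s[0] < 0x80)
        return s[0];                    /* plain ASCII */
    else if ((s[0] & 0xe0) == 0xc0) {
        need = 1;
        cp = s[0] & 0x1f;
    } else if ((s[0] & 0xf0) == 0xe0) {
        need = 2;
        cp = s[0] & 0x0f;
    } else if ((s[0] & 0xf8) == 0xf0) {
        need = 3;
        cp = s[0] & 0x07;
    } else
        return s[0];                    /* stray byte: Latin-1 */

    if (need + 1 > n)
        return s[0];                    /* truncated: Latin-1 */

    for (i = 1; i <= need; i++) {
        if ((s[i] & 0xc0) != 0x80)
            return s[0];                /* bad continuation: Latin-1 */
        cp = (cp << 6) | (s[i] & 0x3f);
    }

    *used = need + 1;
    return cp;
}
```

With this sketch, both the UTF-8 bytes 0xc3 0x81 and a lone Latin-1 byte
0xc1 decode to U+00C1.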

Note that the parsing approach in this patch is not particularly
efficient, for two reasons.  First, it traverses the incoming bytes
three times:

   - once to determine how long the input is (currently the iterator
     can't directly handle null terminated char*'s),

   - once to determine how long the final UTF-8 allocation needs to
     be,

   - and once for the conversion.

Second, when the input is already UTF-8, it blindly converts from
UTF-8 to Unicode code points and then back to UTF-8 (after
canonicalization) during each of the last two passes.  There are
certainly opportunities to optimize, though it may be worth discussing
the detection of data encodings more broadly first.
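
The measure-then-fill pattern behind the last two passes can be sketched
in isolation as follows.  This is an illustration only: it uses malloc
where the patch uses talloc, it skips decomposition entirely, and
utf8_put and code_points_to_utf8 are hypothetical names.

```c
#include <stdlib.h>

/* Encode cp as UTF-8 and return the byte count.  When out is NULL
 * nothing is written, so the same function can measure. */
static size_t
utf8_put (unsigned cp, char *out)
{
    if (cp < 0x80) {
        if (out)
            out[0] = (char) cp;
        return 1;
    }
    if (cp < 0x800) {
        if (out) {
            out[0] = (char) (0xc0 | (cp >> 6));
            out[1] = (char) (0x80 | (cp & 0x3f));
        }
        return 2;
    }
    if (cp < 0x10000) {
        if (out) {
            out[0] = (char) (0xe0 | (cp >> 12));
            out[1] = (char) (0x80 | ((cp >> 6) & 0x3f));
            out[2] = (char) (0x80 | (cp & 0x3f));
        }
        return 3;
    }
    if (out) {
        out[0] = (char) (0xf0 | (cp >> 18));
        out[1] = (char) (0x80 | ((cp >> 12) & 0x3f));
        out[2] = (char) (0x80 | ((cp >> 6) & 0x3f));
        out[3] = (char) (0x80 | (cp & 0x3f));
    }
    return 4;
}

/* Pass one sums the encoded lengths, pass two fills the buffer.  The
 * caller frees the result; NULL is returned if allocation fails. */
static char *
code_points_to_utf8 (const unsigned *cps, size_t n)
{
    size_t len = 0, pos = 0, i;
    char *result;

    for (i = 0; i < n; i++)
        len += utf8_put (cps[i], NULL);

    result = (char *) malloc (len + 1);
    if (result == NULL)
        return NULL;

    for (i = 0; i < n; i++)
        pos += utf8_put (cps[i], result + pos);
    result[pos] = '\0';
    return result;
}
```

A single growing buffer would avoid the measuring pass, at the cost of
reallocations; which is cheaper here has not been measured.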

FIXME: document current encoding behavior clearly in
new/insert/search-terms.

FIXME: what about existing indexed text?
---

 Posted for preliminary discussion, and as a milestone (it appears to
 mostly work now).  Though I doubt I'm handling things correctly
 everywhere notmuch-wise, wrt talloc, etc.

 lib/Makefile.local         |  1 +
 lib/database.cc            | 17 ++++++++--
 lib/message.cc             | 51 +++++++++++++++++++---------
 lib/notmuch.h              |  3 ++
 lib/query.cc               |  6 ++--
 lib/text-util.cc           | 82 ++++++++++++++++++++++++++++++++++++++++++++++
 test/Makefile.local        | 10 ++++--
 test/T150-tagging.sh       | 54 +++++++++++++++++++++++-------
 test/T240-dump-restore.sh  |  4 +--
 test/T480-hex-escaping.sh  |  4 +--
 test/T570-normalization.sh | 28 ++++++++++++++++
 test/corpus/cur/52:2,      |  6 ++--
 test/to-utf8.c             | 44 +++++++++++++++++++++++++
 13 files changed, 267 insertions(+), 43 deletions(-)
 create mode 100644 lib/text-util.cc
 create mode 100755 test/T570-normalization.sh
 create mode 100644 test/to-utf8.c

diff --git a/lib/Makefile.local b/lib/Makefile.local
index 3a07090..41fd1e1 100644
--- a/lib/Makefile.local
+++ b/lib/Makefile.local
@@ -48,6 +48,7 @@ libnotmuch_cxx_srcs =		\
 	$(dir)/index.cc		\
 	$(dir)/message.cc	\
 	$(dir)/query.cc		\
+	$(dir)/text-util.cc	\
 	$(dir)/thread.cc
 
 libnotmuch_modules := $(libnotmuch_c_srcs:.c=.o) $(libnotmuch_cxx_srcs:.cc=.o)
diff --git a/lib/database.cc b/lib/database.cc
index 6a15174..7a01f95 100644
--- a/lib/database.cc
+++ b/lib/database.cc
@@ -436,6 +436,7 @@ find_document_for_doc_id (notmuch_database_t *notmuch, unsigned doc_id)
 char *
 _notmuch_message_id_compressed (void *ctx, const char *message_id)
 {
+    // Assumes message_id is normalized utf-8.
     char *sha1, *compressed;
 
     sha1 = _notmuch_sha1_of_string (message_id);
@@ -457,12 +458,20 @@ notmuch_database_find_message (notmuch_database_t *notmuch,
     if (message_ret == NULL)
 	return NOTMUCH_STATUS_NULL_POINTER;
 
-    if (strlen (message_id) > NOTMUCH_MESSAGE_ID_MAX)
-	message_id = _notmuch_message_id_compressed (notmuch, message_id);
+    const char *u8_id = notmuch_bytes_to_utf8 (notmuch, message_id, -1);
+
+    // Is strlen still appropriate?
+    if (strlen (u8_id) > NOTMUCH_MESSAGE_ID_MAX)
+    {
+	message_id = _notmuch_message_id_compressed (notmuch, u8_id);
+	talloc_free ((char *) u8_id);
+    } else
+	message_id = u8_id;
 
     try {
 	status = _notmuch_database_find_unique_doc_id (notmuch, "id",
 						       message_id, &doc_id);
+	talloc_free ((char *) message_id);
 
 	if (status == NOTMUCH_PRIVATE_STATUS_NO_DOCUMENT_FOUND)
 	    *message_ret = NULL;
@@ -1910,6 +1919,7 @@ _notmuch_database_generate_thread_id (notmuch_database_t *notmuch)
 static char *
 _get_metadata_thread_id_key (void *ctx, const char *message_id)
 {
+    // Assumes message_id is normalized utf-8.
     if (strlen (message_id) > NOTMUCH_MESSAGE_ID_MAX)
 	message_id = _notmuch_message_id_compressed (ctx, message_id);
 
@@ -2011,7 +2021,8 @@ _resolve_message_id_to_thread_id_old (notmuch_database_t *notmuch,
      * generate a new thread ID and store it there.
      */
     db = static_cast <Xapian::WritableDatabase *> (notmuch->xapian_db);
-    metadata_key = _get_metadata_thread_id_key (ctx, message_id);
+    const char *mid = notmuch_message_get_message_id (message);
+    metadata_key = _get_metadata_thread_id_key (ctx, mid);
     thread_id_string = notmuch->xapian_db->get_metadata (metadata_key);
 
     if (thread_id_string.empty()) {
diff --git a/lib/message.cc b/lib/message.cc
index 1ddce3c..afd0264 100644
--- a/lib/message.cc
+++ b/lib/message.cc
@@ -225,20 +225,28 @@ _notmuch_message_create_for_message_id (notmuch_database_t *notmuch,
     unsigned int doc_id;
     char *term;
 
-    *status_ret = (notmuch_private_status_t) notmuch_database_find_message (notmuch,
-									    message_id,
-									    &message);
-    if (message)
+    const char *u8_id = notmuch_bytes_to_utf8 (notmuch, message_id, -1);
+    *status_ret =
+	(notmuch_private_status_t) notmuch_database_find_message (notmuch,
+								  u8_id,
+								  &message);
+    if (message) {
+	talloc_free ((char *) u8_id);
 	return talloc_steal (notmuch, message);
-    else if (*status_ret)
+    } else if (*status_ret) {
+	talloc_free ((char *) u8_id);
 	return NULL;
+    }
 
     /* If the message ID is too long, substitute its sha1 instead. */
-    if (strlen (message_id) > NOTMUCH_MESSAGE_ID_MAX)
-	message_id = _notmuch_message_id_compressed (message, message_id);
-
-    term = talloc_asprintf (NULL, "%s%s",
-			    _find_prefix ("id"), message_id);
+    // Strlen still OK?
+    if (strlen (u8_id) > NOTMUCH_MESSAGE_ID_MAX) {
+	message_id = _notmuch_message_id_compressed (message, u8_id);
+	talloc_free ((char *) u8_id);
+    } else
+	message_id = u8_id;
+
+    term = talloc_asprintf (NULL, "%s%s", _find_prefix ("id"), message_id);
     if (term == NULL) {
 	*status_ret = NOTMUCH_PRIVATE_STATUS_OUT_OF_MEMORY;
 	return NULL;
@@ -252,6 +260,7 @@ _notmuch_message_create_for_message_id (notmuch_database_t *notmuch,
 	talloc_free (term);
 
 	doc.add_value (NOTMUCH_VALUE_MESSAGE_ID, message_id);
+	talloc_free ((char *) message_id);
 
 	doc_id = _notmuch_database_generate_doc_id (notmuch);
     } catch (const Xapian::Error &error) {
@@ -1109,13 +1118,14 @@ _notmuch_message_gen_terms (notmuch_message_t *message,
     if (text == NULL)
 	return NOTMUCH_PRIVATE_STATUS_NULL_POINTER;
 
+    const char *u8_text = notmuch_bytes_to_utf8 (NULL, text, -1);
     term_gen->set_document (message->doc);
 
     if (prefix_name) {
 	const char *prefix = _find_prefix (prefix_name);
 
 	term_gen->set_termpos (message->termpos);
-	term_gen->index_text (text, 1, prefix);
+	term_gen->index_text (u8_text, 1, prefix);
 	/* Create a gap between this an the next terms so they don't
 	 * appear to be a phrase. */
 	message->termpos = term_gen->get_termpos () + 100;
@@ -1124,10 +1134,11 @@ _notmuch_message_gen_terms (notmuch_message_t *message,
     }
 
     term_gen->set_termpos (message->termpos);
-    term_gen->index_text (text);
+    term_gen->index_text (u8_text);
     /* Create a term gap, as above. */
     message->termpos = term_gen->get_termpos () + 100;
 
+    talloc_free ((char *) u8_text);
     return NOTMUCH_PRIVATE_STATUS_SUCCESS;
 }
 
@@ -1184,10 +1195,14 @@ notmuch_message_add_tag (notmuch_message_t *message, const char *tag)
     if (tag == NULL)
 	return NOTMUCH_STATUS_NULL_POINTER;
 
-    if (strlen (tag) > NOTMUCH_TAG_MAX)
+    const char *u8_tag = notmuch_bytes_to_utf8 (message, tag, -1);
+    if (strlen (u8_tag) > NOTMUCH_TAG_MAX) {
+	talloc_free ((char *) u8_tag);
 	return NOTMUCH_STATUS_TAG_TOO_LONG;
+    }
 
-    private_status = _notmuch_message_add_term (message, "tag", tag);
+    private_status = _notmuch_message_add_term (message, "tag", u8_tag);
+    talloc_free ((char *) u8_tag);
     if (private_status) {
 	INTERNAL_ERROR ("_notmuch_message_add_term return unexpected value: %d\n",
 			private_status);
@@ -1212,10 +1227,14 @@ notmuch_message_remove_tag (notmuch_message_t *message, const char *tag)
     if (tag == NULL)
 	return NOTMUCH_STATUS_NULL_POINTER;
 
-    if (strlen (tag) > NOTMUCH_TAG_MAX)
+    const char *u8_tag = notmuch_bytes_to_utf8 (message, tag, -1);
+    if (strlen (u8_tag) > NOTMUCH_TAG_MAX) {
+	talloc_free ((char *) u8_tag);
 	return NOTMUCH_STATUS_TAG_TOO_LONG;
+    }
 
-    private_status = _notmuch_message_remove_term (message, "tag", tag);
+    private_status = _notmuch_message_remove_term (message, "tag", u8_tag);
+    talloc_free ((char *) u8_tag);
     if (private_status) {
 	INTERNAL_ERROR ("_notmuch_message_remove_term return unexpected value: %d\n",
 			private_status);
diff --git a/lib/notmuch.h b/lib/notmuch.h
index b1f5bfa..6e13eb1 100644
--- a/lib/notmuch.h
+++ b/lib/notmuch.h
@@ -1759,6 +1759,9 @@ notmuch_filenames_move_to_next (notmuch_filenames_t *filenames);
 void
 notmuch_filenames_destroy (notmuch_filenames_t *filenames);
 
+char *
+notmuch_bytes_to_utf8 (const void *ctx, const char *bytes, const size_t len);
+
 /* @} */
 
 NOTMUCH_END_DECLS
diff --git a/lib/query.cc b/lib/query.cc
index 5275b5a..e48f06a 100644
--- a/lib/query.cc
+++ b/lib/query.cc
@@ -86,7 +86,7 @@ notmuch_query_create (notmuch_database_t *notmuch,
 
     query->notmuch = notmuch;
 
-    query->query_string = talloc_strdup (query, query_string);
+    query->query_string = notmuch_bytes_to_utf8 (query, query_string, -1);
 
     query->sort = NOTMUCH_SORT_NEWEST_FIRST;
 
@@ -125,7 +125,9 @@ notmuch_query_get_sort (notmuch_query_t *query)
 void
 notmuch_query_add_tag_exclude (notmuch_query_t *query, const char *tag)
 {
-    char *term = talloc_asprintf (query, "%s%s", _find_prefix ("tag"), tag);
+    const char *u8_tag = notmuch_bytes_to_utf8 (query, tag, -1);
+    char *term = talloc_asprintf (query, "%s%s", _find_prefix ("tag"), u8_tag);
+    talloc_free ((char *) u8_tag);
     _notmuch_string_list_append (query->exclude_terms, term);
 }
 
diff --git a/lib/text-util.cc b/lib/text-util.cc
new file mode 100644
index 0000000..9dfd31f
--- /dev/null
+++ b/lib/text-util.cc
@@ -0,0 +1,82 @@
+/* text-util.cc - notmuch text processing utility functions
+ *
+ * Copyright (C) 2015 Rob Browning <rlb at defaultvalue.org>
+ *
+ * This program is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see http://www.gnu.org/licenses/ .
+ *
+ * Author: Rob Browning <rlb at defaultvalue.org>
+ *
+ */
+
+#include "notmuch.h"
+
+#include <assert.h>
+#include <glib.h>
+#include <string.h>
+#include <talloc.h>
+#include <xapian.h>
+
+static gsize
+_notmuch_decompose_to_utf8 (const gunichar uc, gchar *out)
+{
+    gunichar dc[G_UNICHAR_MAX_DECOMPOSITION_LENGTH];
+    // This currently performs canonical decomposition.
+    const gsize dcn =
+	g_unichar_fully_decompose (uc, FALSE, dc,
+				   G_UNICHAR_MAX_DECOMPOSITION_LENGTH);
+    gsize utf8_len = 0;
+    for (gsize i = 0; i < dcn; i++)
+    {
+	const gint dc_bytes = g_unichar_to_utf8 (dc[i], out);
+	utf8_len += dc_bytes;
+	if (out != NULL)
+	    out += dc_bytes;
+    }
+    return utf8_len;
+}
+
+/* Convert bytes to canonically decomposed UTF-8, decoding input as
+ * Xapian does.  A len of (size_t) -1 means bytes is NUL-terminated.
+ * The result is allocated from ctx. */
+char *
+notmuch_bytes_to_utf8 (const void *ctx, const char *bytes, const size_t len)
+{
+    // FIXME: try/catch to convert to error status messages?  Can the
+    // iterator throw?
+    Xapian::Utf8Iterator it;
+    gsize u8_len = 0;
+
+    // Compute the utf-8 length
+    if (len == (size_t) -1)
+	it.assign (bytes, strlen(bytes));
+    else
+	it.assign (bytes, len);
+    for (unsigned uc = *it; uc != unsigned(-1); it++, uc = *it)
+	u8_len += _notmuch_decompose_to_utf8 (uc, NULL);
+
+    // Convert to utf-8
+    if (len == (size_t) -1)
+	it.assign (bytes, strlen(bytes));
+    else
+	it.assign (bytes, len);
+    char *result = talloc_array (ctx, char, u8_len + 1);
+    gsize u8_i = 0;
+    for (unsigned uc = *it; uc != unsigned(-1); it++, uc = *it) {
+	const gsize dc_bytes = _notmuch_decompose_to_utf8 (uc, &(result[u8_i]));
+	u8_i += dc_bytes;
+    }
+    assert (u8_i == u8_len);
+    result[u8_i] = '\0';
+    return result;
+}
diff --git a/test/Makefile.local b/test/Makefile.local
index 2331ceb..fd6d06d 100644
--- a/test/Makefile.local
+++ b/test/Makefile.local
@@ -15,8 +15,11 @@ smtp_dummy_modules = $(smtp_dummy_srcs:.c=.o)
 $(dir)/arg-test: $(dir)/arg-test.o command-line-arguments.o util/libutil.a
 	$(call quiet,CC) $^ -o $@ $(LDFLAGS)
 
-$(dir)/hex-xcode: $(dir)/hex-xcode.o command-line-arguments.o util/libutil.a
-	$(call quiet,CC) $^ -o $@ $(LDFLAGS) $(TALLOC_LDFLAGS)
+$(dir)/hex-xcode: $(dir)/hex-xcode.o command-line-arguments.o lib/libnotmuch.a util/libutil.a
+	$(call quiet,CC) $^ -o $@ $(LDFLAGS) $(TALLOC_LDFLAGS) $(CONFIGURE_LDFLAGS)
+
+$(dir)/to-utf8: $(dir)/to-utf8.o command-line-arguments.o lib/libnotmuch.a util/libutil.a
+	$(call quiet,CC) $^ -o $@ $(LDFLAGS) $(TALLOC_LDFLAGS) $(CONFIGURE_LDFLAGS)
 
 random_corpus_deps =  $(dir)/random-corpus.o  $(dir)/database-test.o \
 			notmuch-config.o command-line-arguments.o \
@@ -46,7 +49,8 @@ test_main_srcs=$(dir)/arg-test.c \
 	      $(dir)/parse-time.c \
 	      $(dir)/smtp-dummy.c \
 	      $(dir)/symbol-test.cc \
-	      $(dir)/make-db-version.cc \
+	      $(dir)/to-utf8.c \
+	      $(dir)/make-db-version.cc
 
 test_srcs=$(test_main_srcs) $(dir)/database-test.c
 
diff --git a/test/T150-tagging.sh b/test/T150-tagging.sh
index 821d393..d983fe0 100755
--- a/test/T150-tagging.sh
+++ b/test/T150-tagging.sh
@@ -2,6 +2,14 @@
 test_description='"notmuch tag"'
 . ./test-lib.sh || exit 1
 
+canonicalize_encoding()
+{
+  local decoded u8
+  decoded=$($TEST_DIRECTORY/hex-xcode --direction=decode "$1") || return 1
+  u8=$($TEST_DIRECTORY/to-utf8 "$decoded") || return 1
+  $TEST_DIRECTORY/hex-xcode --direction=encode "$u8"
+}
+
 add_message '[subject]=One'
 add_message '[subject]=Two'
 
@@ -191,23 +199,45 @@ test_expect_equal_file EXPECTED OUTPUT
 test_begin_subtest '--batch: unicode tags'
 notmuch dump --format=batch-tag > BACKUP
 
+# FIXME: test canonical and non-canonical output?
+
+enctag1='%2a@%7d%cf%b5%f4%85%80%adO3%da%a7'
+enctag2='=%e0%ac%95%c8%b3+%ef%aa%95%c8%a64w%c7%9d%c9%a2%cf%b3%d6%82%24B%c4%a9%c5%a1UX%ee%99%b0%27E7%ca%a4%d0%8b%5d'
+enctag3='A%e1%a0%bc%de%8b%d5%b2V%d9%9b%f3%b5%a2%a3M%d8%a1u@%f0%a0%ac%948%7e%f0%ab%86%af%27'
+enctag4='%da%88=f%cc%b9I%ce%af%7b%c9%97%e3%b9%8bH%cb%92X%d2%8c6'
+enctag5='%dc%9crh%d2%86B%e5%97%a2%22t%ed%99%82d'
+enctag6='L%df%85%ef%a1%a5m@%d3%96%c2%ab%d4%9f%ca%b8%f3%b3%a2%bf%c7%b1_u%d7%b4%c7%b1'
+enctag7='P%c4%98%2f'
+enctag8='%7e%d1%8b%25%ec%a0%ae%d1%a0M%3b%e3%b6%b7%e9%a4%87%3c%db%9a%cc%a8%e1%96%9d'
+enctag9='%c4%bf7%c7%ab9H%c4%99k%ea%91%bd%c3%8ck%e2%b3%8dk%c5%952V%e4%99%b2%d9%b3%e4%8b%bda%5b%24%c7%9b'
+
 notmuch tag --batch <<EOF
-+%2a@%7d%cf%b5%f4%85%80%adO3%da%a7 -- One
-+=%e0%ac%95%c8%b3+%ef%aa%95%c8%a64w%c7%9d%c9%a2%cf%b3%d6%82%24B%c4%a9%c5%a1UX%ee%99%b0%27E7%ca%a4%d0%8b%5d -- One
-+A%e1%a0%bc%de%8b%d5%b2V%d9%9b%f3%b5%a2%a3M%d8%a1u@%f0%a0%ac%948%7e%f0%ab%86%af%27 -- One
++$enctag1 -- One
++$enctag2 -- One
++$enctag3 -- One
 +R -- One
-+%da%88=f%cc%b9I%ce%af%7b%c9%97%e3%b9%8bH%cb%92X%d2%8c6 -- One
-+%dc%9crh%d2%86B%e5%97%a2%22t%ed%99%82d -- One
-+L%df%85%ef%a1%a5m@%d3%96%c2%ab%d4%9f%ca%b8%f3%b3%a2%bf%c7%b1_u%d7%b4%c7%b1 -- One
-+P%c4%98%2f -- One
-+%7e%d1%8b%25%ec%a0%ae%d1%a0M%3b%e3%b6%b7%e9%a4%87%3c%db%9a%cc%a8%e1%96%9d -- One
-+%c4%bf7%c7%ab9H%c4%99k%ea%91%bd%c3%8ck%e2%b3%8dk%c5%952V%e4%99%b2%d9%b3%e4%8b%bda%5b%24%c7%9b -- One
-+%2a@%7d%cf%b5%f4%85%80%adO3%da%a7  +=%e0%ac%95%c8%b3+%ef%aa%95%c8%a64w%c7%9d%c9%a2%cf%b3%d6%82%24B%c4%a9%c5%a1UX%ee%99%b0%27E7%ca%a4%d0%8b%5d  +A%e1%a0%bc%de%8b%d5%b2V%d9%9b%f3%b5%a2%a3M%d8%a1u@%f0%a0%ac%948%7e%f0%ab%86%af%27  +R  +%da%88=f%cc%b9I%ce%af%7b%c9%97%e3%b9%8bH%cb%92X%d2%8c6  +%dc%9crh%d2%86B%e5%97%a2%22t%ed%99%82d  +L%df%85%ef%a1%a5m@%d3%96%c2%ab%d4%9f%ca%b8%f3%b3%a2%bf%c7%b1_u%d7%b4%c7%b1  +P%c4%98%2f  +%7e%d1%8b%25%ec%a0%ae%d1%a0M%3b%e3%b6%b7%e9%a4%87%3c%db%9a%cc%a8%e1%96%9d  +%c4%bf7%c7%ab9H%c4%99k%ea%91%bd%c3%8ck%e2%b3%8dk%c5%952V%e4%99%b2%d9%b3%e4%8b%bda%5b%24%c7%9b -- Two
++$enctag4 -- One
++$enctag5 -- One
++$enctag6 -- One
++$enctag7 -- One
++$enctag8 -- One
++$enctag9 -- One
++$enctag1  +$enctag2  +$enctag3  +R  +$enctag4  +$enctag5  +$enctag6  +$enctag7  +$enctag8  +$enctag9 -- Two
 EOF
 
+# FIXME: double-check that we need all of these, or do we want to do everything?
+cetag1=$(canonicalize_encoding "$enctag1") || exit 1
+cetag2=$(canonicalize_encoding "$enctag2") || exit 1
+cetag4=$(canonicalize_encoding "$enctag4") || exit 1
+cetag5=$(canonicalize_encoding "$enctag5") || exit 1
+cetag6=$(canonicalize_encoding "$enctag6") || exit 1
+cetag7=$(canonicalize_encoding "$enctag7") || exit 1
+cetag8=$(canonicalize_encoding "$enctag8") || exit 1
+cetag9=$(canonicalize_encoding "$enctag9") || exit 1
+
 cat <<EOF > EXPECTED
-+%2a@%7d%cf%b5%f4%85%80%adO3%da%a7 +=%e0%ac%95%c8%b3+%ef%aa%95%c8%a64w%c7%9d%c9%a2%cf%b3%d6%82%24B%c4%a9%c5%a1UX%ee%99%b0%27E7%ca%a4%d0%8b%5d +A%e1%a0%bc%de%8b%d5%b2V%d9%9b%f3%b5%a2%a3M%d8%a1u@%f0%a0%ac%948%7e%f0%ab%86%af%27 +L%df%85%ef%a1%a5m@%d3%96%c2%ab%d4%9f%ca%b8%f3%b3%a2%bf%c7%b1_u%d7%b4%c7%b1 +P%c4%98%2f +R +inbox +tag4 +tag5 +unread +%7e%d1%8b%25%ec%a0%ae%d1%a0M%3b%e3%b6%b7%e9%a4%87%3c%db%9a%cc%a8%e1%96%9d +%c4%bf7%c7%ab9H%c4%99k%ea%91%bd%c3%8ck%e2%b3%8dk%c5%952V%e4%99%b2%d9%b3%e4%8b%bda%5b%24%c7%9b +%da%88=f%cc%b9I%ce%af%7b%c9%97%e3%b9%8bH%cb%92X%d2%8c6 +%dc%9crh%d2%86B%e5%97%a2%22t%ed%99%82d -- id:msg-002 at notmuch-test-suite
-+%2a@%7d%cf%b5%f4%85%80%adO3%da%a7 +=%e0%ac%95%c8%b3+%ef%aa%95%c8%a64w%c7%9d%c9%a2%cf%b3%d6%82%24B%c4%a9%c5%a1UX%ee%99%b0%27E7%ca%a4%d0%8b%5d +A%e1%a0%bc%de%8b%d5%b2V%d9%9b%f3%b5%a2%a3M%d8%a1u@%f0%a0%ac%948%7e%f0%ab%86%af%27 +L%df%85%ef%a1%a5m@%d3%96%c2%ab%d4%9f%ca%b8%f3%b3%a2%bf%c7%b1_u%d7%b4%c7%b1 +P%c4%98%2f +R +inbox +tag5 +unread +%7e%d1%8b%25%ec%a0%ae%d1%a0M%3b%e3%b6%b7%e9%a4%87%3c%db%9a%cc%a8%e1%96%9d +%c4%bf7%c7%ab9H%c4%99k%ea%91%bd%c3%8ck%e2%b3%8dk%c5%952V%e4%99%b2%d9%b3%e4%8b%bda%5b%24%c7%9b +%da%88=f%cc%b9I%ce%af%7b%c9%97%e3%b9%8bH%cb%92X%d2%8c6 +%dc%9crh%d2%86B%e5%97%a2%22t%ed%99%82d -- id:msg-001 at notmuch-test-suite
++$cetag1 +$cetag2 +$enctag3 +$cetag6 +$cetag7 +R +inbox +tag4 +tag5 +unread +$cetag8 +$cetag9 +$cetag4 +$cetag5 -- id:msg-002 at notmuch-test-suite
++$cetag1 +$cetag2 +$enctag3 +$cetag6 +$cetag7 +R +inbox +tag5 +unread +$cetag8 +$cetag9 +$cetag4 +$cetag5 -- id:msg-001 at notmuch-test-suite
 EOF
 
 notmuch dump --format=batch-tag | sort > OUTPUT
diff --git a/test/T240-dump-restore.sh b/test/T240-dump-restore.sh
index e6976ff..37722fb 100755
--- a/test/T240-dump-restore.sh
+++ b/test/T240-dump-restore.sh
@@ -164,7 +164,7 @@ enc1=$($TEST_DIRECTORY/hex-xcode --direction=encode "$tag1")
 tag2=$(printf 'this\n tag\t has\n spaces')
 enc2=$($TEST_DIRECTORY/hex-xcode --direction=encode "$tag2")
 
-enc3='%c3%91%c3%a5%c3%b0%c3%a3%c3%a5%c3%a9-%c3%8f%c3%8a'
+enc3='N%cc%83a%cc%8a%c3%b0a%cc%83a%cc%8ae%cc%81-I%cc%88E%cc%82'
 tag3=$($TEST_DIRECTORY/hex-xcode --direction=decode $enc3)
 
 notmuch dump --format=batch-tag > BACKUP
@@ -218,7 +218,7 @@ test_expect_equal_file EXPECTED.$test_count OUTPUT.$test_count
 
 test_begin_subtest 'format=batch-tag, checking encoded output'
 notmuch dump --format=batch-tag -- from:cworth |\
-	 awk "{ print \"+$enc1 +$enc2 +$enc3 -- \" \$5 }" > EXPECTED.$test_count
+	 awk "{ print \"+$enc3 +$enc1 +$enc2 -- \" \$5 }" > EXPECTED.$test_count
 notmuch dump --format=batch-tag -- from:cworth  > OUTPUT.$test_count
 test_expect_equal_file EXPECTED.$test_count OUTPUT.$test_count
 
diff --git a/test/T480-hex-escaping.sh b/test/T480-hex-escaping.sh
index 10527b1..b9c5eac 100755
--- a/test/T480-hex-escaping.sh
+++ b/test/T480-hex-escaping.sh
@@ -19,7 +19,7 @@ $TEST_DIRECTORY/hex-xcode --direction=encode  < EXPECTED.$test_count |\
 test_expect_equal_file EXPECTED.$test_count OUTPUT.$test_count
 
 test_begin_subtest "round trip 8bit chars"
-echo '%c3%91%c3%a5%c3%b0%c3%a3%c3%a5%c3%a9-%c3%8f%c3%8a' > EXPECTED.$test_count
+echo 'N%cc%83a%cc%8a%c3%b0a%cc%83a%cc%8ae%cc%81-I%cc%88E%cc%82' > EXPECTED.$test_count
 $TEST_DIRECTORY/hex-xcode --direction=decode  < EXPECTED.$test_count |\
     $TEST_DIRECTORY/hex-xcode --direction=encode > OUTPUT.$test_count
 test_expect_equal_file EXPECTED.$test_count OUTPUT.$test_count
@@ -42,7 +42,7 @@ $TEST_DIRECTORY/hex-xcode --in-place --direction=encode  < EXPECTED.$test_count
 test_expect_equal_file EXPECTED.$test_count OUTPUT.$test_count
 
 test_begin_subtest "round trip 8bit chars (in-place)"
-echo '%c3%91%c3%a5%c3%b0%c3%a3%c3%a5%c3%a9-%c3%8f%c3%8a' > EXPECTED.$test_count
+echo 'N%cc%83a%cc%8a%c3%b0a%cc%83a%cc%8ae%cc%81-I%cc%88E%cc%82' > EXPECTED.$test_count
 $TEST_DIRECTORY/hex-xcode --in-place --direction=decode  < EXPECTED.$test_count |\
     $TEST_DIRECTORY/hex-xcode --in-place --direction=encode > OUTPUT.$test_count
 test_expect_equal_file EXPECTED.$test_count OUTPUT.$test_count
diff --git a/test/T570-normalization.sh b/test/T570-normalization.sh
new file mode 100755
index 0000000..ee3fa94
--- /dev/null
+++ b/test/T570-normalization.sh
@@ -0,0 +1,28 @@
+#!/usr/bin/env bash
+
+test_description="text normalization"
+
+. ./test-lib.sh || exit 1
+
+combining_a='Á'
+noncombining_a='Á'
+
+# FIXME: these are extraneous/vestigial, remove from the final patch if still
+# unneeded.
+combining_o='ó' # should be U+006f U+0301
+noncombining_o='ó' # U+00f3 latin small letter o with acute
+# utf-8:
+#   combining: o b11001100 b10000001 (o 0xcc 0x81)
+#   non-combining: b11000011 b10110011 (0xc3 0xb3)
+combining_token='tóken' # should be U+006f U+0301
+normalized_token='tóken' # should be U+00f3
+
+test_begin_subtest "Term with combining characters"
+add_message '[content-type]="text/plain; charset=unknown-8bit"' \
+	    '[subject]="reproduc$noncombining_a"' \
+	    '[body]="reproduc$noncombining_a"'
+output=$(notmuch count "reproduc$combining_a" 2>&1 | notmuch_show_sanitize_all)
+
+test_expect_equal "$output" 1
+
+test_done
diff --git a/test/corpus/cur/52:2, b/test/corpus/cur/52:2,
index 6028340..852e2bd 100644
--- a/test/corpus/cur/52:2,
+++ b/test/corpus/cur/52:2,
@@ -12,8 +12,8 @@ Content-Type: text/plain; charset=ISO-8859-1
 Content-Transfer-Encoding: 8bit
 Subject: Re: [aur-general] Guidelines: cp, mkdir vs install
 
-Le 29/12/2011 11:13, Allan McRae a écrit :
-> On 29/12/11 19:56, François Boulogne wrote:
+Le 29/12/2011 11:13, Allan McRae a écrit :
+> On 29/12/11 19:56, François Boulogne wrote:
 >> Hi,
 >>
 >> Looking to improve the quality of my packages, I read again the guidelines.
@@ -35,5 +35,5 @@ Thank you Allan
 
 
 -- 
-François Boulogne.
+François Boulogne.
 https://www.sciunto.org
diff --git a/test/to-utf8.c b/test/to-utf8.c
new file mode 100644
index 0000000..17bf40d
--- /dev/null
+++ b/test/to-utf8.c
@@ -0,0 +1,44 @@
+/* to-utf8.c - convert bytes to UTF-8 as notmuch would
+ *
+ * usage:
+ * to-utf8 [bytes ...]
+ *
+ * Copyright (C) 2015 Rob Browning <rlb at defaultvalue.org>
+ *
+ * This program is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see http://www.gnu.org/licenses/ .
+ *
+ * Author: Rob Browning <rlb at defaultvalue.org>
+ *
+ */
+
+#include "notmuch.h"
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <talloc.h>
+
+int
+main (int argc, char **argv)
+{
+    void *ctx = talloc_new (NULL);
+
+    for (int i = 1; i < argc; i++) {
+	char *u8 = notmuch_bytes_to_utf8(ctx, argv[i], -1);
+	fputs (u8, stdout);
+	talloc_free (u8);
+    }
+
+    talloc_free (ctx);
+    return 0;
+}
-- 
2.5.0