locales and notmuch

Wed Jun 19 06:09:34 PDT 2019

(sorry for the late reply to this thread)

On Thu 2019-02-21 15:11:48 -0400, David Bremner wrote:
> to be unique case-insensitively, so I decided to convert them to lower
> case on input. This turns out to be "fun", if we try to handle things
> other than ASCII.  So one option is to just insist prefixes are ASCII.
>
> Otherwise we could insist they are UTF-8, ignoring the locale. The
> fullest generality (I think) is to first convert from the users locale
> to utf8, as in the attached sample program.

I don't think this discussion fully covers just how "fun" this
conversion is.

Even if we assume UTF-8 in the database (which i think we should),
making something all lower-case is locale-dependent.  The classic
example, iirc, is that in most UTF-8 locales, U+0049 LATIN CAPITAL
LETTER I downcases to U+0069 LATIN SMALL LETTER I, but in tr_TR
(Turkish), it downcases to U+0131 LATIN SMALL LETTER DOTLESS I.  (and
upper-casing U+0069 LATIN SMALL LETTER I in tr_TR yields U+0130 LATIN
CAPITAL LETTER I WITH DOT ABOVE)
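
To make the Turkish case concrete, here is a minimal Python sketch.
Python's str.lower() applies Unicode's default, locale-independent case
mapping, so the Turkish rules are spelled out by hand in a hypothetical
helper, lower_turkish().  Neither function is notmuch code; they only
illustrate how the same prefix can lower-case to two different strings.

```python
def lower_default(s: str) -> str:
    """Locale-independent lowercasing (Unicode default case mapping)."""
    return s.lower()


def lower_turkish(s: str) -> str:
    """Hypothetical helper: lowercasing with the Turkish dotted/dotless I rules."""
    # U+0130 (capital I with dot above) -> U+0069,
    # U+0049 (capital I) -> U+0131 (dotless i).
    s = s.replace("\u0130", "i").replace("I", "\u0131")
    # Every other character follows the default mapping.
    return s.lower()


prefix = "ID"
print(lower_default(prefix))                           # id
print(lower_turkish(prefix))                           # ıd
print(lower_default(prefix) == lower_turkish(prefix))  # False
```

A prefix stored as "id" under one set of rules would therefore not match
the same input lower-cased under the other.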

Similarly, if there's anything that the DB cares about collation for,
that also varies dramatically across UTF-8 locales.
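
As a rough illustration of the collation point, the sketch below
compares plain code point ordering with locale-aware ordering via
Python's locale.strxfrm().  The helper name sorted_in_locale() is
hypothetical, and the locale-aware result depends on which locales are
installed on the system, so no output is claimed for it.

```python
import locale

names = ["zebra", "Zulu", "apple", "\u00e9lan"]

# Plain code point order: uppercase sorts before lowercase, non-ASCII last.
by_codepoint = sorted(names)
print(by_codepoint)  # ['Zulu', 'apple', 'zebra', 'élan']


def sorted_in_locale(items, locale_name):
    """Sort items with the collation rules of locale_name, if it is installed."""
    old = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
        return sorted(items, key=locale.strxfrm)
    except locale.Error:
        return None  # locale not available on this system
    finally:
        locale.setlocale(locale.LC_COLLATE, old)  # restore the previous setting
```

Calling sorted_in_locale(names, ...) with two different UTF-8 locale
names may return differently ordered lists, which is the hazard for
anything in the database that relies on a stable order.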

sigh.

I have no problem with asserting that all character strings in the
notmuch database are UTF-8.  That's just the only sane thing to do in
2019.  But if we build any feature into notmuch that makes assumptions
or requirements about upper-casing, lower-casing, or collating strings,
and that feature interacts between the currently-running locale and
whatever locale was used to store data in the database in the past,
and those locales can differ, we may be inflicting some subtle pain on
users.

(note that i'm assuming in this discussion that we're *just* talking
about metadata -- notmuch configuration options, explicit xapian terms,
etc, but *not* the indexed text of the messages, which is an entirely
different kettle of fish)

I see two protective approaches that handle this simply while being
clear about our concerns.  Both introduce an explicit dependency on some
UTF-8 locale, in the same way that we have explicit dependencies on
GMime or Xapian.

 a) assert that all text strings in the notmuch db's metadata are
    C.UTF-8, and enforce this explicitly in the codebase.

or,

 b) upon database initialization, select a UTF-8 locale (probably based
    on the user's locale during "notmuch setup") and store it in the
    database (perhaps reporting and displaying it via a "notmuch config"
    value).  If any locale-dependent function is used against
    in-database metadata while a *different* locale is active in the
    environment, warn that this mismatch is happening, and prefer the
    locale stored in the db.
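
The two guardrails above could be sketched roughly as follows, in
Python for brevity.  Everything here is an assumption for illustration:
the function names are hypothetical, and how the stored locale would be
recorded or read back (for example, through a config value) is left
open.

```python
import warnings


def check_metadata(raw: bytes) -> str:
    """Option (a): accept only valid UTF-8 and never consult the locale."""
    # Raises UnicodeDecodeError if raw is not well-formed UTF-8.
    return raw.decode("utf-8", errors="strict")


def choose_locale(stored: str, current: str) -> str:
    """Option (b): prefer the locale recorded in the database, warning on mismatch."""
    if stored != current:
        warnings.warn(
            f"database locale is {stored!r} but the environment uses "
            f"{current!r}; using {stored!r} for metadata operations"
        )
    return stored
```

For example, choose_locale("tr_TR.UTF-8", "en_US.UTF-8") emits a warning
and returns "tr_TR.UTF-8", so locale-dependent operations on metadata
keep using the rules the database was created with.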

I don't have the capacity to work on this kind of safeguard right now,
but someone who wants to learn more about locales and notmuch could try
to implement it and we could see what happens.  Being explicit about the
concern like this might help to raise the profile of the specific risky
codepaths, which in turn could prompt someone to make a more
sophisticated and useful fix than either of the guardrails described
above.

        --dkg