accented characters

David Bremner david at tethera.net
Mon Nov 13 05:22:36 PST 2017


Bruno Deremble <bruno.deremble at ens.fr> writes:

> A way to handle this could be to only index non accented words which
> requires to add a filter before the indexing process. I looked at the code
> and it seems that this should be handled by gmime? 
> there are also libraries that are supposed to do that such as 'unac'.
>
> Is it something that you have been exploring already?

We have discussed it a bit (with another francophone, in copy ;) ), but
I think no-one got very far.

I guess the ideal case would be to support both accented and
accent-free search. That would require adding some more terms to the
index (both the accented and the unaccented version of each word). It's
not clear to me yet what kind of performance impact that would have.
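
To make the "both versions" idea concrete, here is a rough sketch in
Python. It is not notmuch or Xapian code; strip_accents and index_terms
are made-up names, and it assumes that dropping Unicode combining marks
is an acceptable definition of "unaccented".

```python
import unicodedata


def strip_accents(word):
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def index_terms(text):
    """Return the terms to index: each word plus its unaccented form."""
    terms = set()
    for word in text.lower().split():
        terms.add(word)
        # For a word without accents this adds the same term again,
        # so only accented words cost an extra index entry.
        terms.add(strip_accents(word))
    return terms
```

With this scheme the extra index size depends on how many distinct
accented words the mail store contains, which is the part I can't
estimate without measuring.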

Xapian already has something called "stemmers" (in xapian-core/languages
in the source tree), which do, among other things, strip accents. Those
are generally targeted at a single language, which I suspect is not
very useful for notmuch (even I as a mostly-unilingual person have a
fair amount of English, French, and German in my mailstore). Nonetheless
a custom stemmer might be the right way to go, since that step is
happening anyway.  Or perhaps people would be happy enough with being
able to set the stemmer (currently it is hardcoded to English). That
would be a relatively easy change to notmuch, but I don't know how many
people would find it a good tradeoff to lose English stemming
(i.e. searches for 'stem' and 'stemming' being equivalent) for
de-accenting.

I'm not sure if the query language would need to support the
distinction between accented and unaccented searches. I imagine that
people naturally type the non-accented versions in a search, but I do
wonder about cases like (German) München. Should that be stemmed to
Munchen or Muenchen?
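
The two candidate answers for München correspond to two different
folding rules. The sketch below only illustrates the difference;
GERMAN_MAP and german_fold are hypothetical names, and the
transliteration table is my assumption of the usual German convention,
not something notmuch or Xapian provides.

```python
import unicodedata

# Assumed German transliteration convention (umlaut -> vowel + "e").
GERMAN_MAP = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
})


def strip_accents(word):
    """Language-neutral folding: drop all combining marks."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def german_fold(word):
    """German-style folding first, then strip any remaining accents."""
    # Compose first (NFC) so a decomposed "u" + diaeresis also matches the map.
    composed = unicodedata.normalize("NFC", word)
    return strip_accents(composed.translate(GERMAN_MAP))
```

Here strip_accents("München") gives "Munchen" while
german_fold("München") gives "Muenchen", so the choice of rule decides
which spelling a user has to type.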

The other thing I don't know is how many people would be happy with just
stripping all accents. That could be done in a gmime filter, as you
suggest. That would be more likely to require changes to the query
language. Offhand I don't know how to transparently de-accent all query
words.
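
One naive possibility, purely as a sketch, would be to fold the whole
query string before it reaches the query parser. fold_query is a
hypothetical helper, not part of notmuch, and this assumes that prefixes
and boolean operators are plain ASCII so that folding leaves them
untouched.

```python
import unicodedata


def fold_query(query):
    """Strip combining marks from every character of a query string."""
    decomposed = unicodedata.normalize("NFD", query)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

For example, fold_query("from:rené AND été") returns
"from:rene AND ete". This only helps if the index was built with the
same folding, and it removes any way to ask for the accented form
specifically.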

d




