multilingual notmuch (and Content-Language)
Servilio Afre Puentes
afrepues at mcmaster.ca
Wed Mar 21 08:05:58 PDT 2018
On Sun, Mar 18 2018, Daniel Kahn Gillmor wrote:
> https://tools.ietf.org/html/rfc3282 describes a Content-Language:
> header. https://tools.ietf.org/html/rfc8255 describes
> a multipart/multilingual Content-Type.
>
> notmuch currently uses xapian with a hard-coded English stemmer which
> works great for me as a monolingual American, but limits the
> applicability of notmuch to Anglophones (people who speak English).
> That makes me sad.
>
> AIUI, xapian is pretty much committed to being a single-language
> indexer.
Have you seen the different stemmers it already has? Reference:
https://xapian.org/docs/sourcedoc/html/dir_430c089e7e18d7ac6ff937a35cc3312c.html
> But i just wanted to point out that it's possible that we
> could be smarter about this in notmuch, and wanted to make a space for
> possible design discussion.
>
> a few concrete suggestions (intended as brainstorming, feedback welcome):
>
> * if we know our index expects english, and we have a message part that
> *is not* english (e.g. Content-Language: es), we could avoid indexing
> that part.
I'd prefer leaving the choice of default stemmer to the user.
> * during indexing, we could add a property to each message when we
> discover a Content-Language header. this would let you do something
> like "notmuch search property:lang=es" to find all messages
> explicitly tagged as spanish.
>
> * (pretty crazy) If we're willing to search in another language we
> could add an additional xapian database configured for that language, and
> we could index identified parts in that language.
Do we need a separate DB if we can use different stemmers dynamically?
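
Picking a stemmer per part first requires knowing each part's language. Below is a minimal sketch of that step. It assumes Python's standard `email` package rather than notmuch's actual indexing code, and the function name `parts_by_language` and the "und" fallback are hypothetical, chosen only for illustration:

```python
from collections import defaultdict
from email import message_from_string
from email.policy import default


def parts_by_language(raw_message, fallback="und"):
    """Group the text parts of a message by their Content-Language value.

    Hypothetical helper: parts without their own header inherit the
    top-level header, and fall back to "und" (undetermined) otherwise.
    """
    msg = message_from_string(raw_message, policy=default)
    top_level = msg.get("Content-Language")
    groups = defaultdict(list)
    for part in msg.walk():
        # Only text/* parts carry indexable prose.
        if part.get_content_maintype() != "text":
            continue
        header = part.get("Content-Language") or top_level or fallback
        # The header may hold a comma-separated list of tags; keep the
        # first one and reduce e.g. "en-US" to its primary subtag "en".
        tag = str(header).split(",")[0].strip().lower().split("-")[0]
        groups[tag].append(part.get_content())
    return dict(groups)
```

Each group could then be handed to a stemmer for that language, or recorded as a property such as the lang=es example quoted above.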
> * for text parts without a Content-Language: header, we could do some
> concrete heuristics to guess the language. For example, choose the
> 1000 most popular words for each language we might know about, and
> look for their presence in the text. Choose the language that is
> most heavily represented, and store it in the index as a property.
> this could be combined with the suggestions above.
+1 for heuristics.
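
The word-frequency heuristic described above can be sketched roughly as follows. This is only an illustration: the word lists are tiny stand-ins for the roughly 1000 words per language suggested in the quote, and `COMMON_WORDS`, `guess_language`, and the `min_hits` threshold are hypothetical names, not anything notmuch provides:

```python
import re

# Tiny illustrative lists; a real index would use far larger ones.
COMMON_WORDS = {
    "en": {"the", "and", "of", "to", "is", "that", "for", "with", "this", "not"},
    "es": {"el", "la", "los", "y", "de", "que", "es", "para", "con", "no"},
}


def guess_language(text, min_hits=3):
    """Return the best-matching language code, or None if unsure."""
    # Split into runs of Unicode letters so accented words stay intact.
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(word in vocab for word in words)
        for lang, vocab in COMMON_WORDS.items()
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best, best_score), (_, runner_up) = ranked[0], ranked[1]
    # Refuse to guess on too little evidence or on a tie.
    if best_score < min_hits or best_score == runner_up:
        return None
    return best
```

Returning None instead of a weak guess keeps short or mixed-language parts from being labelled wrongly; such parts could simply stay with the default stemmer.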
> what do you think? what ideas are missing from the brainstorm above? I'd
> love to hear from people with multilingual mailboxes about how we might
> be able to make notmuch work better for them.
As an actively bilingual person (English and Spanish), I love this idea.
Servilio
--
Servilio Afre Puentes
Programmer/Analyst, SHARCNET project
RHPCS | http://www.rhpcs.mcmaster.ca
SHARCNET | https://sharcnet.ca
Compute Ontario | http://computeontario.ca
Compute/Calcul Canada | http://computecanada.ca
905-525-9140, x22540