multilingual notmuch (and Content-Language)

Daniel Kahn Gillmor dkg at fifthhorseman.net
Mon Mar 19 00:38:07 PDT 2018


On Sun 2018-03-18 21:32:35 +0200, Jani Nikula wrote:
> On Sun, 18 Mar 2018, Daniel Kahn Gillmor <dkg at fifthhorseman.net> wrote:
>>  * if we know our index expects english, and we have a message part that
>>    *is not* english (e.g. Content-Language: es), we could avoid indexing
>>    that part.
>
> Why would we do that? Search mostly works just fine for non-English
> languages, it's just that the *stemming* is not right.
>
>> what do you think?  what ideas are missing from the brainstorm above?  I'd
>> love to hear from people with multilingual mailboxes about how we might
>> be able to make notmuch work better for them.
>
> With my limited understanding of this, stemming happens both at indexing
> and searching. Basically at indexing, the term generator indexes both
> the full and the stemmed version of words. I'm wondering if we could
> look at Content-Language (and missing that, heuristics), and (if the
> user so desires) use multiple term generators with different stemmers on
> a per document basis. Or, use non-stemming indexing for unidentified or
> unsupported languages. How far would that take us? Then, perhaps, we
> could also perform language specific queries?
>
> I don't know how feasible that is, or if it would require Xapian
> changes.

thanks, this is exactly the kind of promising idea i was hoping my dumb
questions and half-baked suggestions would provoke :)

Maybe Olly or someone else with deeper knowledge of Xapian can weigh in
about the feasibility of this proposal?
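
To make the per-document idea a bit more concrete, here is a rough sketch
of just the selection step: read Content-Language from a message part and
pick which stemmer(s) to use, falling back to no stemming when the
language is missing or unsupported. It is plain Python and does not touch
Xapian or notmuch at all. The helper name pick_stemmers and the mapping
table are made up for illustration; the names on the right-hand side are
language names that Xapian::Stem accepts, and "none" means no stemming.
Whether several term generators can sensibly feed one index is exactly
the open question above.

```python
from email import message_from_string

# Hypothetical mapping from Content-Language primary subtags to stemmer
# language names. This table is an assumption for illustration only.
STEMMER_FOR_LANG = {
    "en": "english",
    "es": "spanish",
    "fi": "finnish",
    "de": "german",
    "fr": "french",
}


def pick_stemmers(part, default="none"):
    """Return the stemmer names to use for one message part.

    Hypothetical helper: a part may list several languages
    (e.g. "Content-Language: en-US, fi"), so a list is returned.
    Missing or unsupported languages fall back to `default`.
    """
    header = part.get("Content-Language", "")
    names = []
    for tag in header.split(","):
        # Keep only the primary subtag: "en-US" -> "en".
        primary = tag.strip().lower().split("-")[0]
        name = STEMMER_FOR_LANG.get(primary)
        if name and name not in names:
            names.append(name)
    return names or [default]


if __name__ == "__main__":
    part = message_from_string("Content-Language: es\n\nhola a todos\n")
    print(pick_stemmers(part))
```

An indexer could then run one term generator per returned name over that
part, and a query could be restricted to (or expanded across) the same
set of stemmers.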

          --dkg

