multilingual notmuch (and Content-Language)
Daniel Kahn Gillmor
dkg at fifthhorseman.net
Mon Mar 19 00:38:07 PDT 2018
On Sun 2018-03-18 21:32:35 +0200, Jani Nikula wrote:
> On Sun, 18 Mar 2018, Daniel Kahn Gillmor <dkg at fifthhorseman.net> wrote:
>> * if we know our index expects english, and we have a message part that
>> *is not* english (e.g. Content-Language: es), we could avoid indexing
>> that part.
>
> Why would we do that? Search mostly works just fine for non-English
> languages, it's just that the *stemming* is not right.
>
>> what do you think? what ideas are missing from the brainstorm above? I'd
>> love to hear from people with multilingual mailboxes about how we might
>> be able to make notmuch work better for them.
>
> With my limited understanding of this, stemming happens both at indexing
> and searching. Basically at indexing, the term generator indexes both
> the full and the stemmed version of words. I'm wondering if we could
> look at Content-Language (and missing that, heuristics), and (if the
> user so desires) use multiple term generators with different stemmers on
> a per document basis. Or, use non-stemming indexing for unidentified or
> unsupported languages. How far would that take us? Then, perhaps, we
> could also perform language specific queries?
>
> I don't know how feasible that is, or if it would require Xapian
> changes.
thanks, this is exactly the kind of promising idea i was hoping my dumb
questions and half-baked suggestions would provoke :)
Maybe Olly or someone else with deeper knowledge of Xapian can weigh in
on the feasibility of this proposal?
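
To make the per-document idea concrete, here is a rough sketch of just the
language-selection step: read Content-Language from a message part and map it
to a stemmer name, falling back to no stemming for unidentified or unsupported
languages. Everything here is an assumption for illustration: the STEMMERS
table and the stemmer_for() helper are hypothetical, the names are assumed to
match what Xapian's Stem class accepts, and nothing like this exists in
notmuch today.

```python
from email import message_from_string

# Hypothetical table: primary language subtag -> stemmer name.
# The names are assumed to be ones Xapian's Stem class would accept.
STEMMERS = {"en": "english", "es": "spanish", "fi": "finnish", "de": "german"}


def stemmer_for(part, default="none"):
    """Pick a stemmer name for one message part from Content-Language."""
    header = part.get("Content-Language")
    if not header:
        # No header: a language-guessing heuristic could be plugged in here.
        return default
    # The header may list several tags ("es-MX, en"); use only the first one.
    first_tag = header.split(",")[0].strip().lower()
    # Reduce a tag like "es-mx" to its primary subtag "es".
    primary = first_tag.split("-")[0]
    return STEMMERS.get(primary, default)


msg = message_from_string("Content-Language: es-MX\n\nhola mundo\n")
print(stemmer_for(msg))  # prints "spanish"
```

Whether the term generator can then be switched per document, and how the
query side would choose a stemmer, is exactly the part that needs input from
someone who knows Xapian.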
--dkg