how to search for hyphenated words? (was: how to search for Morse code?)

Gregor Zattler telegraph at gmx.net
Tue Mar 12 00:34:19 PDT 2019


Hi David, Matt, Carl, notmuch developers,
* David Bremner <david at tethera.net> [2019-03-11; 22:13]:
> Matt Armstrong <marmstrong at google.com> writes:
>> Carl Worth <cworth at cworth.org> writes:
>>> The trick here is that when notmuch is indexing body text it feeds it
>>> into a Xapian function that parses the text by finding "terms" in the
>>> text. And this parser considers both punctuation and whitespace as
>>> separators between terms.
>>
>> I notice that Xapian supports something called "phrase searches",
>> documented as:
>>
>>   "A phrase surrounded with double quotes ("") matches documents
>>   containing that exact phrase. Hyphenated words are also treated as
>>   phrases, as are cases such as filenames and email addresses
>>   (e.g. /etc/passwd or president at whitehouse.gov)."
>>
>> I assume that this particular Xapian feature is unavailable in notmuch?
>> If so, I wonder if enabling has ever been considered?
>
> It is enabled, and documented in notmuch-search-terms(7). Unfortunately
> I don't think it's related to the original request. The mention of
> hyphenated words is about the input to the query parser, not
> (necessarily) the retrieved text.

What I do not understand is that it doesn't matter whether I search for

org-notmuch

or

"org-notmuch"

'"org-notmuch"'

or even

org ADJ/1 notmuch

$ notmuch count --output=messages '"org-notmuch"'
581
$ notmuch count --output=messages 'org-notmuch'
581
$ notmuch count --output=messages org-notmuch
581
$ notmuch count --output=messages org ADJ/1 notmuch
581
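
As a rough illustration of the term splitting Carl describes above,
the following sketch treats every non-alphanumeric character as a
separator. The helper name split_terms is hypothetical and this is
only an approximation with standard tools, not Xapian's actual
parser, but it suggests why the quoted and unquoted forms end up as
the same two adjacent terms:

```shell
# Hypothetical helper: approximate term splitting by treating every
# non-alphanumeric character as a separator.
# This is NOT Xapian's real tokenizer, only an illustration.
split_terms() {
    printf '%s\n' "$1" \
        | tr -cs '[:alnum:]' '\n' \
        | grep . \
        | paste -sd ' ' -
}

split_terms 'org-notmuch'     # prints: org notmuch
split_terms '"org-notmuch"'   # prints: org notmuch
```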

A typical example of a matched message is the attached one.
Somehow the search matches the address of this very mailing list
in the body of the email (I assume).


But obviously there are many more emails with this address in
them:

$ notmuch count --output=messages 'notmuch@notmuchmail.org'
27396
$ notmuch count --output=messages '"notmuch@notmuchmail.org"'
27396

Or with a naive search (no decoding of possible base64 encoded
parts) there are

$ find /home/grfz/Mail/~ml/emacs-orgmode@gnu.org /home/grfz/Mail/~ml/notmuch@notmuchmail.org* -type f -print0 | xargs -0r grep -l -- 'notmuch@notmuchmail.org' | xargs -IXXXX sh -c "cat XXXX | sed -e '1,/^$/ d' | grep -c notmuch@notmuchmail.org " | egrep -c "1|2|3|4|5|6|7|8|9"
16795

emails with the address at least once in the body.


Therefore I wonder why notmuch matches 581 messages.



A naive search for org-notmuch on the files (no decoding of
possible base64 encoded parts) only shows 79 files (77 unique
emails):

$ mkdir -vp /tmp/kolp/{cur,new,tmp}

$ find /home/grfz/Mail/~ml/emacs-orgmode@gnu.org /home/grfz/Mail/~ml/notmuch@notmuchmail.org* -type f -print0 | xargs -0r grep -l -- 'org-notmuch' | xargs ln -vs --target-directory=/tmp/kolp/cur/ | wc -l
79
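
One way to get from matching files to unique emails, sketched under
the assumption that "unique" means distinct Message-ID headers. The
function name count_unique_messages is hypothetical, and like the
searches above it does no decoding of base64 encoded parts:

```shell
# Hypothetical helper: count distinct Message-IDs among the files
# that contain a fixed string.
# $1 is the string to search for, remaining arguments are directories.
count_unique_messages() {
    pattern=$1
    shift
    grep -rlF -- "$pattern" "$@" | while IFS= read -r file; do
        # Print only the value of the first Message-ID header, then stop.
        sed -n '/^[Mm]essage-[Ii][Dd]:/{s/^[^:]*:[[:space:]]*//;p;q;}' "$file"
    done | sort -u | wc -l
}
```

It would be called with the search string first and the mail
directories after it, e.g. count_unique_messages 'org-notmuch'
followed by the two directory arguments used in the find command.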


Therefore I wonder why notmuch matches 581 messages, not 16795
messages or 77 messages.


Somehow these numbers do not fit!?


Ciao; Gregor
-- 
 -... --- .-. . -.. ..--.. ...-.-
-------------- next part --------------
An embedded message was scrubbed...
From: root at len.workgroup (Cron Daemon)
Subject: Cron <grfz at len> ~/bin/mailwiederdurchschleusen
Date: Fri, 29 Dec 2017 17:00:09 +0100
Size: 1571
URL: <http://notmuchmail.org/pipermail/notmuch/attachments/20190312/1a8976d3/attachment.mht>

