[Patch v4] lib: regexp matching in 'subject' and 'from'

Tomi Ollila tomi.ollila at iki.fi
Wed Jan 25 11:40:08 PST 2017


On Sat, Jan 21 2017, David Bremner <david at tethera.net> wrote:

> the idea is that you can run
>
> % notmuch search subject:/<your-favourite-regexp>/
> % notmuch search from:/<your-favourite-regexp>/

I like this interface.

>
> or
>
> % notmuch search subject:"your usual phrase search"
> % notmuch search from:"usual phrase search"
>
> This should also work with bindings, since it extends the query parser.
>
> This is trivial to extend for other value slots, but currently the only
> value slots are date, message_id, from, subject, and last_mod. Date is
> already searchable, and message_id is not obviously useful to regex
> match.

Why would message_id not be useful to regex match? I can come up with quite
a few use cases... but if there are technical difficulties, then those
should be mentioned instead.

Maybe this commit message should mention that Xapian with field processors
(1.4.x) is required for this feature -- and the manual page should
emphasize it a bit better?

Probably '//' is used to escape '/' -- should such a character ever be
needed in a regex search.

>
> This was originally written by Austin Clements, and ported to Xapian
> field processors (from Austin's custom query parser) by yours truly.
> ---
>
> This version implements the use of // to delimit regular expressions.
> I have not tested the code paths with old (pre field processor) xapian.

Fedora 25 has Xapian 1.2.24, so the T630 tests are skipped there. It looks
like these changes did not increase the failure count.

Some (mostly whitespace nitpicking) comments below:


>
>  doc/man7/notmuch-search-terms.rst |  27 +++++++-
>  lib/Makefile.local                |   1 +
>  lib/database-private.h            |   2 +
>  lib/database.cc                   |  29 +++++++-
>  lib/regexp-fields.cc              | 142 ++++++++++++++++++++++++++++++++++++++
>  lib/regexp-fields.h               |  77 +++++++++++++++++++++
>  test/T630-regexp-query.sh         |  82 ++++++++++++++++++++++
>  7 files changed, 354 insertions(+), 6 deletions(-)
>  create mode 100644 lib/regexp-fields.cc
>  create mode 100644 lib/regexp-fields.h
>  create mode 100755 test/T630-regexp-query.sh
>
> diff --git a/doc/man7/notmuch-search-terms.rst b/doc/man7/notmuch-search-terms.rst
> index de93d733..d8527e18 100644
> --- a/doc/man7/notmuch-search-terms.rst
> +++ b/doc/man7/notmuch-search-terms.rst
> @@ -34,10 +34,14 @@ indicate user-supplied values):
>  
>  -  from:<name-or-address>
>  
> +-  from:/<regex>/
> +
>  -  to:<name-or-address>
>  
>  -  subject:<word-or-quoted-phrase>
>  
> +-  subject:/<regex>/
> +
>  -  attachment:<word>
>  
>  -  mimetype:<word>
> @@ -71,6 +75,17 @@ subject of an email. Searching for a phrase in the subject is supported
>  by including quotation marks around the phrase, immediately following
>  **subject:**.
>  
> +The **from:** and **subject** prefix can be also used to restrict the
> +results to those whose from/subject value matches a regular
> +expression (see **regex(7)**) delimited with //.
> +
> +::
> +
> +   notmuch search 'from:/bob at .*[.]example[.]com/'
> +
> +Regular expression searches are only available if notmuch is built
> +with **Xapian Field Processors** (see below).

And the poor user stopped reading far before this line, desperately trying
the regex searches... >;/ so IMO this requirement should be stated earlier.

> +
>  The **attachment:** prefix can be used to search for specific filenames
>  (or extensions) of attachments to email messages.
>  
> @@ -220,13 +235,18 @@ Boolean and Probabilistic Prefixes
>  ----------------------------------
>  
>  Xapian (and hence notmuch) prefixes are either **boolean**, supporting
> -exact matches like "tag:inbox"  or **probabilistic**, supporting a more flexible **term** based searching. The prefixes currently supported by notmuch are as follows.
> -
> +exact matches like "tag:inbox" or **probabilistic**, supporting a more
> +flexible **term** based searching. Certain **special** prefixes are
> +processed by notmuch in a way not stricly fitting either of Xapian's
> +built in styles. The prefixes currently supported by notmuch are as
> +follows.
>  
>  Boolean
>     **tag:**, **id:**, **thread:**, **folder:**, **path:**, **property:**
>  Probabilistic
> -   **from:**, **to:**, **subject:**, **attachment:**, **mimetype:**
> +  **to:**, **attachment:**, **mimetype:**
> +Special
> +   **from:**, **query:**, **subject:**
>  
>  Terms and phrases
>  -----------------
> @@ -396,6 +416,7 @@ Currently the following features require field processor support:
>  
>  - non-range date queries, e.g. "date:today"
>  - named queries e.g. "query:my_special_query"
> +- regular expression searches, e.g. "subject:/^\\[SPAM\\]/"
>  
>  SEE ALSO
>  ========
> diff --git a/lib/Makefile.local b/lib/Makefile.local
> index b77e5780..ff812b5f 100644
> --- a/lib/Makefile.local
> +++ b/lib/Makefile.local
> @@ -52,6 +52,7 @@ libnotmuch_cxx_srcs =		\
>  	$(dir)/query.cc		\
>  	$(dir)/query-fp.cc      \
>  	$(dir)/config.cc	\
> +	$(dir)/regexp-fields.cc     \

Spaces instead of a TAB above -- TAB is used more often (and the \:s are
usually aligned)

>  	$(dir)/thread.cc
>  
>  libnotmuch_modules := $(libnotmuch_c_srcs:.c=.o) $(libnotmuch_cxx_srcs:.cc=.o)
> diff --git a/lib/database-private.h b/lib/database-private.h
> index ccc1e9a1..9f5659a9 100644
> --- a/lib/database-private.h
> +++ b/lib/database-private.h
> @@ -190,6 +190,8 @@ struct _notmuch_database {
>  #if HAVE_XAPIAN_FIELD_PROCESSOR
>      Xapian::FieldProcessor *date_field_processor;
>      Xapian::FieldProcessor *query_field_processor;
> +    Xapian::FieldProcessor *from_field_processor;
> +    Xapian::FieldProcessor *subject_field_processor;
>  #endif
>      Xapian::ValueRangeProcessor *last_mod_range_processor;
>  };
> diff --git a/lib/database.cc b/lib/database.cc
> index 2d19f20c..8a9ad251 100644
> --- a/lib/database.cc
> +++ b/lib/database.cc
> @@ -21,6 +21,7 @@
>  #include "database-private.h"
>  #include "parse-time-vrp.h"
>  #include "query-fp.h"
> +#include "regexp-fields.h"
>  #include "string-util.h"
>  
>  #include <iostream>
> @@ -272,12 +273,16 @@ static prefix_t BOOLEAN_PREFIX_EXTERNAL[] = {
>      { "folder",			"XFOLDER:" },
>  };
>  
> -static prefix_t PROBABILISTIC_PREFIX[]= {
> +static prefix_t REGEX_PREFIX[]= {
>      { "from",			"XFROM" },
> +    { "subject",		"XSUBJECT"},
> +};
> +
> +static prefix_t PROBABILISTIC_PREFIX[]= {
> +

empty line ^

>      { "to",			"XTO" },
>      { "attachment",		"XATTACHMENT" },
>      { "mimetype",		"XMIMETYPE"},
> -    { "subject",		"XSUBJECT"},
>  };
>  
>  const char *
> @@ -295,6 +300,11 @@ _find_prefix (const char *name)
>  	    return BOOLEAN_PREFIX_EXTERNAL[i].prefix;
>      }
>  
> +    for (i = 0; i < ARRAY_SIZE (REGEX_PREFIX); i++) {
> +	if (strcmp (name, REGEX_PREFIX[i].name) == 0)
> +	    return REGEX_PREFIX[i].prefix;
> +    }
> +
>      for (i = 0; i < ARRAY_SIZE (PROBABILISTIC_PREFIX); i++) {
>  	if (strcmp (name, PROBABILISTIC_PREFIX[i].name) == 0)
>  	    return PROBABILISTIC_PREFIX[i].prefix;
> @@ -1042,6 +1052,10 @@ notmuch_database_open_verbose (const char *path,
>  	notmuch->query_parser->add_boolean_prefix("date", notmuch->date_field_processor);
>  	notmuch->query_field_processor = new QueryFieldProcessor (*notmuch->query_parser, notmuch);
>  	notmuch->query_parser->add_boolean_prefix("query", notmuch->query_field_processor);
> +	notmuch->from_field_processor = new RegexpFieldProcessor ("from", *notmuch->query_parser, notmuch);
> +	notmuch->subject_field_processor = new RegexpFieldProcessor ("subject", *notmuch->query_parser, notmuch);
> +	notmuch->query_parser->add_boolean_prefix("from", notmuch->from_field_processor);
> +	notmuch->query_parser->add_boolean_prefix("subject", notmuch->subject_field_processor);
>  #endif
>  	notmuch->last_mod_range_processor = new Xapian::NumberValueRangeProcessor (NOTMUCH_VALUE_LAST_MOD, "lastmod:");
>  
> @@ -1058,7 +1072,12 @@ notmuch_database_open_verbose (const char *path,
>  	    notmuch->query_parser->add_boolean_prefix (prefix->name,
>  						       prefix->prefix);
>  	}
> -
> +#if !HAVE_XAPIAN_FIELD_PROCESSOR
> +	for (i = 0; i < ARRAY_SIZE (REGEX_PREFIX); i++) {
> +	    prefix_t *prefix = &REGEX_PREFIX[i];
> +	    notmuch->query_parser->add_prefix (prefix->name, prefix->prefix);
> +	}
> +#endif
>  	for (i = 0; i < ARRAY_SIZE (PROBABILISTIC_PREFIX); i++) {
>  	    prefix_t *prefix = &PROBABILISTIC_PREFIX[i];
>  	    notmuch->query_parser->add_prefix (prefix->name, prefix->prefix);
> @@ -1138,6 +1157,10 @@ notmuch_database_close (notmuch_database_t *notmuch)
>      notmuch->date_field_processor = NULL;
>      delete notmuch->query_field_processor;
>      notmuch->query_field_processor = NULL;
> +    delete notmuch->from_field_processor;
> +    notmuch->from_field_processor = NULL;
> +    delete notmuch->subject_field_processor;
> +    notmuch->subject_field_processor = NULL;
>  #endif
>  
>      return status;
> diff --git a/lib/regexp-fields.cc b/lib/regexp-fields.cc
> new file mode 100644
> index 00000000..8cb1cada
> --- /dev/null
> +++ b/lib/regexp-fields.cc
> @@ -0,0 +1,142 @@
> +/* regexp-fields.cc - field processor glue for regex supporting fields
> + *
> + * This file is part of notmuch.
> + *
> + * Copyright © 2015 Austin Clements
> + * Copyright © 2016 David Bremner
> + *
> + * This program is free software: you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation, either version 3 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program.  If not, see https://www.gnu.org/licenses/ .
> + *
> + * Author: Austin Clements <aclements at csail.mit.edu>
> + *                David Bremner <david at tethera.net>
> + */
> +
> +#include "regexp-fields.h"
> +#include "notmuch-private.h"
> +#include "database-private.h"
> +#include <stdio.h>
> +
> +#if HAVE_XAPIAN_FIELD_PROCESSOR
> +static void
> +compile_regex (regex_t &regexp, const char *str)
> +{
> +    int err = regcomp (&regexp, str, REG_EXTENDED | REG_NOSUB);
> +
> +    if (err != 0) {
> +	size_t len = regerror (err, &regexp, NULL, 0);
> +	char *buffer = new char[len];
> +	std::string msg;
> +	(void) regerror (err, &regexp, buffer, len);
> +	msg.assign (buffer, len);
> +	delete buffer;
> +
> +	throw Xapian::QueryParserError (msg);
> +

empty line ^

> +    }
> +}
> +
> +RegexpPostingSource::RegexpPostingSource (Xapian::valueno slot, const std::string &regexp)
> +    : slot_ (slot)
> +{
> +

ditto

> +    compile_regex (regexp_, regexp.c_str ());
> +}
> +
> +RegexpPostingSource::~RegexpPostingSource ()
> +{
> +    regfree (&regexp_);
> +}
> +
> +void
> +RegexpPostingSource::init (const Xapian::Database &db)
> +{
> +    db_ = db;
> +    it_ = db_.valuestream_begin (slot_);
> +    end_ = db.valuestream_end (slot_);
> +    started_ = false;
> +}
> +
> +Xapian::doccount
> +RegexpPostingSource::get_termfreq_min () const
> +{
> +    return 0;
> +}
> +
> +Xapian::doccount
> +RegexpPostingSource::get_termfreq_est () const
> +{
> +    return get_termfreq_max () / 2;
> +}
> +
> +Xapian::doccount
> +RegexpPostingSource::get_termfreq_max () const
> +{
> +    return db_.get_value_freq (slot_);
> +}
> +
> +Xapian::docid
> +RegexpPostingSource::get_docid () const
> +{
> +    return it_.get_docid ();
> +}
> +
> +bool
> +RegexpPostingSource::at_end () const
> +{
> +    return it_ == end_;
> +}
> +
> +void
> +RegexpPostingSource::next (unused (double min_wt))
> +{
> +    if (started_ && ! at_end ())
> +	++it_;
> +    started_ = true;
> +
> +    for (; ! at_end (); ++it_) {
> +	std::string value = *it_;
> +	if (regexec (&regexp_, value.c_str (), 0, NULL, 0) == 0)
> +	    break;
> +    }
> +}
> +
> +static inline Xapian::valueno _find_slot (std::string prefix)
> +{
> +    if (prefix == "from")
> +	return NOTMUCH_VALUE_FROM;
> +    else if (prefix == "subject")
> +	return NOTMUCH_VALUE_SUBJECT;
> +    else
> +	throw Xapian::QueryParserError ("unsupported regexp field '" + prefix + "'");
> +}
> +
> +RegexpFieldProcessor::RegexpFieldProcessor (std::string prefix, Xapian::QueryParser &parser_, notmuch_database_t *notmuch_)
> +	: slot(_find_slot (prefix)), term_prefix(_find_prefix (prefix.c_str ())), parser(parser_), notmuch(notmuch_)
> +{
> +};
> +
> +Xapian::Query
> +RegexpFieldProcessor::operator() (const std::string & str)
> +{
> +    if (str.at (0) == '/' && str.at (str.size () - 1)){
> +	RegexpPostingSource *postings = new RegexpPostingSource (slot, str.substr(1,str.size () - 2));
> +	return Xapian::Query (postings->release ());
> +    } else {
> +	/* TODO replace this with a nicer API level triggering of
> +	 * phrase parsing, when possible */
> +	std::string quoted='"' + str + '"';
> +	return parser.parse_query (quoted, NOTMUCH_QUERY_PARSER_FLAGS, term_prefix);
> +    }
> +}
> +#endif
> diff --git a/lib/regexp-fields.h b/lib/regexp-fields.h
> new file mode 100644
> index 00000000..bac11999
> --- /dev/null
> +++ b/lib/regexp-fields.h
> @@ -0,0 +1,77 @@
> +/* regex-fields.h - xapian glue for semi-bruteforce regexp search
> + *
> + * This file is part of notmuch.
> + *
> + * Copyright © 2015 Austin Clements
> + * Copyright © 2016 David Bremner
> + *
> + * This program is free software: you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation, either version 3 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program.  If not, see https://www.gnu.org/licenses/ .
> + *
> + * Author: Austin Clements <aclements at csail.mit.edu>
> + *                David Bremner <david at tethera.net>
> + */
> +
> +#ifndef NOTMUCH_REGEXP_FIELDS_H
> +#define NOTMUCH_REGEXP_FIELDS_H
> +#if HAVE_XAPIAN_FIELD_PROCESSOR
> +#include <sys/types.h>
> +#include <regex.h>
> +#include "database-private.h"
> +#include "notmuch-private.h"
> +
> +/* A posting source that returns documents where a value matches a
> + * regexp.
> + */
> +class RegexpPostingSource : public Xapian::PostingSource
> +{
> + protected:
> +    const Xapian::valueno slot_;
> +    regex_t regexp_;
> +    Xapian::Database db_;
> +    bool started_;
> +    Xapian::ValueIterator it_, end_;
> +
> +/* No copying */
> +    RegexpPostingSource (const RegexpPostingSource &);
> +    RegexpPostingSource &operator= (const RegexpPostingSource &);
> +
> + public:
> +    RegexpPostingSource (Xapian::valueno slot, const std::string &regexp);
> +    ~RegexpPostingSource ();
> +    void init (const Xapian::Database &db);
> +    Xapian::doccount get_termfreq_min () const;
> +    Xapian::doccount get_termfreq_est () const;
> +    Xapian::doccount get_termfreq_max () const;
> +    Xapian::docid get_docid () const;
> +    bool at_end () const;
> +    void next (unused (double min_wt));
> +};
> +
> +
> +class RegexpFieldProcessor : public Xapian::FieldProcessor {
> + protected:
> +    Xapian::valueno slot;
> +    std::string term_prefix;
> +    Xapian::QueryParser &parser;
> +    notmuch_database_t *notmuch;
> +
> + public:
> +    RegexpFieldProcessor (std::string prefix, Xapian::QueryParser &parser_, notmuch_database_t *notmuch_);
> +
> +    ~RegexpFieldProcessor () { };
> +
> +    Xapian::Query operator()(const std::string & str);
> +};
> +#endif
> +#endif /* NOTMUCH_REGEXP_FIELDS_H */
> diff --git a/test/T630-regexp-query.sh b/test/T630-regexp-query.sh
> new file mode 100755
> index 00000000..722af715
> --- /dev/null
> +++ b/test/T630-regexp-query.sh
> @@ -0,0 +1,82 @@
> +#!/usr/bin/env bash
> +test_description='regular expression searches'
> +. ./test-lib.sh || exit 1
> +
> +add_email_corpus
> +
> +
> +if [ $NOTMUCH_HAVE_XAPIAN_FIELD_PROCESSOR -eq 1 ]; then
> +
> +    notmuch search --output=messages from:cworth > cworth.msg-ids
> +
> +    test_begin_subtest "regexp from search, case sensitive"
> +    notmuch search --output=messages from:/carl/ > OUTPUT
> +    test_expect_equal_file /dev/null OUTPUT
> +
> +    test_begin_subtest "empty regexp or query"
> +    notmuch search --output=messages from:/carl/ or from:/cworth/ > OUTPUT
> +    test_expect_equal_file cworth.msg-ids OUTPUT
> +
> +    test_begin_subtest "non-empty regexp and query"
> +    notmuch search  from:/cworth at cworth.org/ and subject:patch > OUTPUT
> +    cat <<EOF > EXPECTED
> +thread:0000000000000008   2009-11-18 [1/2] Carl Worth| Alex Botero-Lowry; [notmuch] [PATCH] Error out if no query is supplied to search instead of going into an infinite loop (attachment inbox unread)
> +thread:0000000000000007   2009-11-18 [1/2] Carl Worth| Ingmar Vanhassel; [notmuch] [PATCH] Typsos (inbox unread)
> +thread:0000000000000018   2009-11-18 [1/2] Carl Worth| Jan Janak; [notmuch] [PATCH] Older versions of install do not support -C. (inbox unread)
> +thread:0000000000000017   2009-11-18 [1/2] Carl Worth| Keith Packard; [notmuch] [PATCH] Make notmuch-show 'X' (and 'x') commands remove inbox (and unread) tags (inbox unread)
> +thread:0000000000000014   2009-11-18 [2/5] Carl Worth| Mikhail Gusarov, Keith Packard; [notmuch] [PATCH 1/2] Close message file after parsing message headers (inbox unread)
> +thread:0000000000000001   2009-11-18 [1/1] Stewart Smith; [notmuch] [PATCH] Fix linking with gcc to use g++ to link in C++ libs. (inbox unread)
> +EOF
> +    test_expect_equal_file EXPECTED OUTPUT
> +
> +    test_begin_subtest "regexp from search, duplicate term search"
> +    notmuch search --output=messages from:/cworth/ > OUTPUT
> +    test_expect_equal_file cworth.msg-ids OUTPUT
> +
> +    test_begin_subtest "long enough regexp matches only desired senders"
> +    notmuch search --output=messages 'from:"/C.* Wo/"' > OUTPUT
> +    test_expect_equal_file cworth.msg-ids OUTPUT
> +
> +    test_begin_subtest "shorter regexp matches one more sender"
> +    notmuch search --output=messages 'from:"/C.* W/"' > OUTPUT
> +    (echo id:1258544095-16616-1-git-send-email-chris at chris-wilson.co.uk ; cat cworth.msg-ids) > EXPECTED

The above doesn't need to be executed in a subshell:

  { echo id:1258544095-16616-1-git-send-email-chris at chris-wilson.co.uk; cat cworth.msg-ids; } > EXPECTED

does the same in the current shell.
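
To illustrate the difference between the two grouping forms, here is a
minimal standalone sketch. It is not part of the test suite; the variable
names and the temporary file are made up for the demonstration. Both forms
redirect the combined output of the grouped commands, but only the brace
group runs in the current shell.

```shell
#!/bin/sh
# Hypothetical demo: ( ... ) forks a subshell, { ...; } does not.

count=0
( count=1 )               # assignment happens in a subshell and is lost
after_subshell=$count

{ count=2; }              # assignment happens in the current shell
after_group=$count

# A brace group can be redirected as a unit, just like a subshell.
tmpfile=$(mktemp)         # scratch file, an assumption of this demo
{ echo first; echo second; } > "$tmpfile"
lines=$(grep -c '' "$tmpfile")   # count the lines written by the group
rm -f "$tmpfile"

echo "after subshell: $after_subshell, after group: $after_group, lines: $lines"
```

Note that the brace form needs a space after '{' and a ';' (or newline)
before '}'.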


> +    test_expect_equal_file EXPECTED OUTPUT
> +
> +    test_begin_subtest "regexp subject search, non-ASCII"
> +    notmuch search --output=messages subject:/accentué/ > OUTPUT
> +    echo id:877h1wv7mg.fsf at inf-8657.int-evry.fr > EXPECTED
> +    test_expect_equal_file EXPECTED OUTPUT
> +
> +    test_begin_subtest "regexp subject search, punctuation"
> +    notmuch search   subject:/\'X\'/ > OUTPUT
> +    cat <<EOF > EXPECTED
> +thread:0000000000000017   2009-11-18 [2/2] Keith Packard, Carl Worth; [notmuch] [PATCH] Make notmuch-show 'X' (and 'x') commands remove inbox (and unread) tags (inbox unread)
> +EOF
> +    test_expect_equal_file EXPECTED OUTPUT
> +
> +    test_begin_subtest "regexp subject search, no punctuation"
> +    notmuch search  subject:/X/ > OUTPUT
> +    cat <<EOF > EXPECTED
> +thread:0000000000000017   2009-11-18 [2/2] Keith Packard, Carl Worth; [notmuch] [PATCH] Make notmuch-show 'X' (and 'x') commands remove inbox (and unread) tags (inbox unread)
> +thread:000000000000000f   2009-11-18 [4/4] Jjgod Jiang, Alexander Botero-Lowry; [notmuch] Mac OS X/Darwin compatibility issues (inbox unread)
> +EOF
> +    test_expect_equal_file EXPECTED OUTPUT
> +
> +    test_begin_subtest "combine regexp from and subject"
> +    notmuch search  subject:/-C/ and from:/.an.k/ > OUTPUT
> +    cat <<EOF > EXPECTED
> +thread:0000000000000018   2009-11-17 [1/2] Jan Janak| Carl Worth; [notmuch] [PATCH] Older versions of install do not support -C. (inbox unread)
> +EOF
> +    test_expect_equal_file EXPECTED OUTPUT
> +
> +    test_begin_subtest "regexp error reporting"
> +    notmuch search 'from:/unbalanced[/' 1>OUTPUT 2>&1
> +    cat <<EOF > EXPECTED
> +notmuch search: A Xapian exception occurred
> +A Xapian exception occurred performing query: Invalid regular expression
> +Query string was: from:/unbalanced[/
> +EOF
> +    test_expect_equal_file EXPECTED OUTPUT
> +fi
> +
> +test_done
> -- 
> 2.11.0

