[patch v5 4/6] lib: regexp matching in 'subject' and 'from'

Jani Nikula jani at nikula.org
Sun Feb 26 11:46:40 PST 2017


On Thu, 16 Feb 2017, David Bremner <david at tethera.net> wrote:
> the idea is that you can run
>
> % notmuch search subject:/<your-favourite-regexp>/
> % notmuch search from:/<your-favourite-regexp>/
>
> or
>
> % notmuch search subject:"your usual phrase search"
> % notmuch search from:"usual phrase search"
>
> This feature is only available with recent Xapian, specifically
> support for field processors is needed.
>
> It should work with bindings, since it extends the query parser.
>
> This is easy to extend for other value slots, but currently the only
> value slots are date, message_id, from, subject, and last_mod. Date is
> already searchable;  message_id is left for a followup commit.
>
> This was originally written by Austin Clements, and ported to Xapian
> field processors (from Austin's custom query parser) by yours truly.
> ---
>  doc/man7/notmuch-search-terms.rst |  25 ++++++-
>  lib/Makefile.local                |   1 +
>  lib/database.cc                   |  11 +--
>  lib/regexp-fields.cc              | 144 ++++++++++++++++++++++++++++++++++++++
>  lib/regexp-fields.h               |  77 ++++++++++++++++++++
>  test/T630-regexp-query.sh         |  81 +++++++++++++++++++++
>  6 files changed, 332 insertions(+), 7 deletions(-)
>  create mode 100644 lib/regexp-fields.cc
>  create mode 100644 lib/regexp-fields.h
>  create mode 100755 test/T630-regexp-query.sh
>
> diff --git a/doc/man7/notmuch-search-terms.rst b/doc/man7/notmuch-search-terms.rst
> index de93d733..47cab48d 100644
> --- a/doc/man7/notmuch-search-terms.rst
> +++ b/doc/man7/notmuch-search-terms.rst
> @@ -34,10 +34,14 @@ indicate user-supplied values):
>  
>  -  from:<name-or-address>
>  
> +-  from:/<regex>/
> +
>  -  to:<name-or-address>
>  
>  -  subject:<word-or-quoted-phrase>
>  
> +-  subject:/<regex>/
> +
>  -  attachment:<word>
>  
>  -  mimetype:<word>
> @@ -71,6 +75,15 @@ subject of an email. Searching for a phrase in the subject is supported
>  by including quotation marks around the phrase, immediately following
>  **subject:**.
>  
> +If notmuch is built with **Xapian Field Processors** (see below) the
> +**from:** and **subject:** prefixes can also be used to restrict the
> +results to those whose from/subject value matches a regular expression
> +(see **regex(7)**) delimited with //.
> +
> +::
> +
> +   notmuch search 'from:/bob at .*[.]example[.]com/'
> +
>  The **attachment:** prefix can be used to search for specific filenames
>  (or extensions) of attachments to email messages.
>  
> @@ -220,13 +233,18 @@ Boolean and Probabilistic Prefixes
>  ----------------------------------
>  
>  Xapian (and hence notmuch) prefixes are either **boolean**, supporting
> -exact matches like "tag:inbox"  or **probabilistic**, supporting a more flexible **term** based searching. The prefixes currently supported by notmuch are as follows.
> -
> +exact matches like "tag:inbox" or **probabilistic**, supporting a more
> +flexible **term** based searching. Certain **special** prefixes are
> +processed by notmuch in a way not strictly fitting either of Xapian's
> +built-in styles. The prefixes currently supported by notmuch are as
> +follows.
>  
>  Boolean
>     **tag:**, **id:**, **thread:**, **folder:**, **path:**, **property:**
>  Probabilistic
> -   **from:**, **to:**, **subject:**, **attachment:**, **mimetype:**
> +   **to:**, **attachment:**, **mimetype:**
> +Special
> +   **from:**, **query:**, **subject:**
>  
>  Terms and phrases
>  -----------------
> @@ -396,6 +414,7 @@ Currently the following features require field processor support:
>  
>  - non-range date queries, e.g. "date:today"
>  - named queries e.g. "query:my_special_query"
> +- regular expression searches, e.g. "subject:/^\\[SPAM\\]/"
>  
>  SEE ALSO
>  ========
> diff --git a/lib/Makefile.local b/lib/Makefile.local
> index b77e5780..cd92fc79 100644
> --- a/lib/Makefile.local
> +++ b/lib/Makefile.local
> @@ -52,6 +52,7 @@ libnotmuch_cxx_srcs =		\
>  	$(dir)/query.cc		\
>  	$(dir)/query-fp.cc      \
>  	$(dir)/config.cc	\
> +	$(dir)/regexp-fields.cc	\
>  	$(dir)/thread.cc
>  
>  libnotmuch_modules := $(libnotmuch_c_srcs:.c=.o) $(libnotmuch_cxx_srcs:.cc=.o)
> diff --git a/lib/database.cc b/lib/database.cc
> index 450ee295..ee971f32 100644
> --- a/lib/database.cc
> +++ b/lib/database.cc
> @@ -21,6 +21,7 @@
>  #include "database-private.h"
>  #include "parse-time-vrp.h"
>  #include "query-fp.h"
> +#include "regexp-fields.h"
>  #include "string-util.h"
>  
>  #include <iostream>
> @@ -277,7 +278,8 @@ prefix_t prefix_table[] = {
>  						NOTMUCH_FIELD_PROCESSOR },
>  #endif
>      { "from",			"XFROM",	NOTMUCH_FIELD_EXTERNAL |
> -						NOTMUCH_FIELD_PROBABILISTIC },
> +						NOTMUCH_FIELD_PROBABILISTIC |
> +						NOTMUCH_FIELD_PROCESSOR },
>      { "to",			"XTO",		NOTMUCH_FIELD_EXTERNAL |
>  						NOTMUCH_FIELD_PROBABILISTIC },
>      { "attachment",		"XATTACHMENT",	NOTMUCH_FIELD_EXTERNAL |
> @@ -285,7 +287,8 @@ prefix_t prefix_table[] = {
>      { "mimetype",		"XMIMETYPE",	NOTMUCH_FIELD_EXTERNAL |
>  						NOTMUCH_FIELD_PROBABILISTIC },
>      { "subject",		"XSUBJECT",	NOTMUCH_FIELD_EXTERNAL |
> -						NOTMUCH_FIELD_PROBABILISTIC },
> +						NOTMUCH_FIELD_PROBABILISTIC |
> +						NOTMUCH_FIELD_PROCESSOR},
>  };
>  
>  #if HAVE_XAPIAN_FIELD_PROCESSOR
> @@ -295,8 +298,8 @@ _make_field_processor (const char *name, notmuch_database_t *notmuch) {
>  	return (new DateFieldProcessor())->release ();
>      else if (STRNCMP_LITERAL(name, "query") == 0)
>  	return (new QueryFieldProcessor (*notmuch->query_parser, notmuch))->release ();
> -
> -    INTERNAL_ERROR ("no field processor for prefix '%s'\n", name);
> +    else
> +	return (new RegexpFieldProcessor (name, *notmuch->query_parser, notmuch))->release ();
>  }
>  #else
>  #define _make_field_processor(name, db) NULL
> diff --git a/lib/regexp-fields.cc b/lib/regexp-fields.cc
> new file mode 100644
> index 00000000..b2b39504
> --- /dev/null
> +++ b/lib/regexp-fields.cc
> @@ -0,0 +1,144 @@
> +/* regexp-fields.cc - field processor glue for regex supporting fields
> + *
> + * This file is part of notmuch.
> + *
> + * Copyright © 2015 Austin Clements
> + * Copyright © 2016 David Bremner
> + *
> + * This program is free software: you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation, either version 3 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program.  If not, see https://www.gnu.org/licenses/ .
> + *
> + * Author: Austin Clements <aclements at csail.mit.edu>
> + *                David Bremner <david at tethera.net>
> + */
> +
> +#include "regexp-fields.h"
> +#include "notmuch-private.h"
> +#include "database-private.h"
> +
> +#if HAVE_XAPIAN_FIELD_PROCESSOR
> +static void
> +compile_regex (regex_t &regexp, const char *str)
> +{
> +    int err = regcomp (&regexp, str, REG_EXTENDED | REG_NOSUB);
> +
> +    if (err != 0) {
> +	size_t len = regerror (err, &regexp, NULL, 0);
> +	char *buffer = new char[len];
> +	std::string msg;
> +	(void) regerror (err, &regexp, buffer, len);
> +	msg.assign (buffer, len);
> +	delete[] buffer;
> +
> +	throw Xapian::QueryParserError (msg);
> +    }
> +}
> +
> +RegexpPostingSource::RegexpPostingSource (Xapian::valueno slot, const std::string &regexp)
> +    : slot_ (slot)
> +{
> +    compile_regex (regexp_, regexp.c_str ());
> +}
> +
> +RegexpPostingSource::~RegexpPostingSource ()
> +{
> +    regfree (&regexp_);
> +}
> +
> +void
> +RegexpPostingSource::init (const Xapian::Database &db)
> +{
> +    db_ = db;
> +    it_ = db_.valuestream_begin (slot_);
> +    end_ = db_.valuestream_end (slot_);
> +    started_ = false;
> +}
> +
> +Xapian::doccount
> +RegexpPostingSource::get_termfreq_min () const
> +{
> +    return 0;
> +}
> +
> +Xapian::doccount
> +RegexpPostingSource::get_termfreq_est () const
> +{
> +    return get_termfreq_max () / 2;
> +}
> +
> +Xapian::doccount
> +RegexpPostingSource::get_termfreq_max () const
> +{
> +    return db_.get_value_freq (slot_);
> +}
> +
> +Xapian::docid
> +RegexpPostingSource::get_docid () const
> +{
> +    return it_.get_docid ();
> +}
> +
> +bool
> +RegexpPostingSource::at_end () const
> +{
> +    return it_ == end_;
> +}
> +
> +void
> +RegexpPostingSource::next (unused (double min_wt))
> +{
> +    if (started_ && ! at_end ())
> +	++it_;
> +    started_ = true;
> +
> +    for (; ! at_end (); ++it_) {
> +	std::string value = *it_;
> +	if (regexec (&regexp_, value.c_str (), 0, NULL, 0) == 0)
> +	    break;
> +    }
> +}
> +
> +static inline Xapian::valueno _find_slot (std::string prefix)
> +{
> +    if (prefix == "from")
> +	return NOTMUCH_VALUE_FROM;
> +    else if (prefix == "subject")
> +	return NOTMUCH_VALUE_SUBJECT;
> +    else
> +	throw Xapian::QueryParserError ("unsupported regexp field '" + prefix + "'");
> +}
> +
> +RegexpFieldProcessor::RegexpFieldProcessor (std::string prefix, Xapian::QueryParser &parser_, notmuch_database_t *notmuch_)
> +	: slot (_find_slot (prefix)), term_prefix (_find_prefix (prefix.c_str ())),
> +	  parser (parser_), notmuch (notmuch_)
> +{
> +}
> +
> +Xapian::Query
> +RegexpFieldProcessor::operator() (const std::string & str)
> +{
> +    if (str.at (0) == '/') {
> +	if (str.at (str.size () - 1) == '/'){
> +	    RegexpPostingSource *postings = new RegexpPostingSource (slot, str.substr(1,str.size () - 2));
> +	    return Xapian::Query (postings->release ());
> +	} else {
> +	    throw Xapian::QueryParserError ("unmatched regex delimiter in '" + str + "'");
> +	}
> +    } else {
> +	/* TODO replace this with a nicer API level triggering of
> +	 * phrase parsing, when possible */
> +	std::string quoted='"' + str + '"';
> +	return parser.parse_query (quoted, NOTMUCH_QUERY_PARSER_FLAGS, term_prefix);
> +    }
> +}
> +#endif
> diff --git a/lib/regexp-fields.h b/lib/regexp-fields.h
> new file mode 100644
> index 00000000..bac11999
> --- /dev/null
> +++ b/lib/regexp-fields.h
> @@ -0,0 +1,77 @@
> +/* regexp-fields.h - xapian glue for semi-bruteforce regexp search
> + *
> + * This file is part of notmuch.
> + *
> + * Copyright © 2015 Austin Clements
> + * Copyright © 2016 David Bremner
> + *
> + * This program is free software: you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation, either version 3 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program.  If not, see https://www.gnu.org/licenses/ .
> + *
> + * Author: Austin Clements <aclements at csail.mit.edu>
> + *                David Bremner <david at tethera.net>
> + */
> +
> +#ifndef NOTMUCH_REGEXP_FIELDS_H
> +#define NOTMUCH_REGEXP_FIELDS_H
> +#if HAVE_XAPIAN_FIELD_PROCESSOR
> +#include <sys/types.h>
> +#include <regex.h>
> +#include "database-private.h"
> +#include "notmuch-private.h"
> +
> +/* A posting source that returns documents where a value matches a
> + * regexp.
> + */
> +class RegexpPostingSource : public Xapian::PostingSource
> +{
> + protected:
> +    const Xapian::valueno slot_;
> +    regex_t regexp_;
> +    Xapian::Database db_;
> +    bool started_;
> +    Xapian::ValueIterator it_, end_;
> +
> +/* No copying */
> +    RegexpPostingSource (const RegexpPostingSource &);
> +    RegexpPostingSource &operator= (const RegexpPostingSource &);
> +
> + public:
> +    RegexpPostingSource (Xapian::valueno slot, const std::string &regexp);
> +    ~RegexpPostingSource ();
> +    void init (const Xapian::Database &db);
> +    Xapian::doccount get_termfreq_min () const;
> +    Xapian::doccount get_termfreq_est () const;
> +    Xapian::doccount get_termfreq_max () const;
> +    Xapian::docid get_docid () const;
> +    bool at_end () const;
> +    void next (unused (double min_wt));
> +};
> +
> +
> +class RegexpFieldProcessor : public Xapian::FieldProcessor {
> + protected:
> +    Xapian::valueno slot;
> +    std::string term_prefix;
> +    Xapian::QueryParser &parser;
> +    notmuch_database_t *notmuch;
> +
> + public:
> +    RegexpFieldProcessor (std::string prefix, Xapian::QueryParser &parser_, notmuch_database_t *notmuch_);
> +
> +    ~RegexpFieldProcessor () { };
> +
> +    Xapian::Query operator()(const std::string & str);
> +};
> +#endif
> +#endif /* NOTMUCH_REGEXP_FIELDS_H */
> diff --git a/test/T630-regexp-query.sh b/test/T630-regexp-query.sh
> new file mode 100755
> index 00000000..96bd8746
> --- /dev/null
> +++ b/test/T630-regexp-query.sh
> @@ -0,0 +1,81 @@
> +#!/usr/bin/env bash
> +test_description='regular expression searches'
> +. ./test-lib.sh || exit 1
> +
> +add_email_corpus
> +
> +
> +if [ $NOTMUCH_HAVE_XAPIAN_FIELD_PROCESSOR -eq 1 ]; then

Nitpick: I guess you could check for -eq 0 and call test_done inside the
if block? See T020-compact.sh. That would avoid indenting everything below.
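
Something along these lines is what I have in mind. This is only a
standalone sketch: test_done here is a stand-in for the real helper from
test-lib.sh, and the run_regexp_tests wrapper (a hypothetical name) exists
only so the snippet can be run outside the test suite. In the actual test
file the guard would sit at top level.

```shell
#!/usr/bin/env bash
# Stand-in for the test-lib.sh helper, so this sketch runs on its own.
test_done () {
    echo "test_done called"
    exit 0
}

# Hypothetical wrapper; the subshell body keeps "exit" from ending the caller.
run_regexp_tests () (
    # Bail out early when field processors are not available.
    if [ "${NOTMUCH_HAVE_XAPIAN_FIELD_PROCESSOR:-0}" -eq 0 ]; then
        test_done
    fi

    # The subtests follow here at top level, with no extra indentation.
    echo "running regexp subtests"

    test_done
)
```

With the guard inverted like this, the closing fi moves up next to the if
and the rest of the file reads like any other test script.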

Otherwise LGTM, with the caveat that I didn't really study how all the
Xapian field processor stuff is supposed to work...

BR,
Jani.



> +
> +    notmuch search --output=messages from:cworth > cworth.msg-ids
> +
> +    test_begin_subtest "regexp from search, case sensitive"
> +    notmuch search --output=messages from:/carl/ > OUTPUT
> +    test_expect_equal_file /dev/null OUTPUT
> +
> +    test_begin_subtest "empty regexp or query"
> +    notmuch search --output=messages from:/carl/ or from:/cworth/ > OUTPUT
> +    test_expect_equal_file cworth.msg-ids OUTPUT
> +
> +    test_begin_subtest "non-empty regexp and query"
> +    notmuch search  from:/cworth at cworth.org/ and subject:patch | notmuch_search_sanitize > OUTPUT
> +    cat <<EOF > EXPECTED
> +thread:XXX   2009-11-18 [1/2] Carl Worth| Alex Botero-Lowry; [notmuch] [PATCH] Error out if no query is supplied to search instead of going into an infinite loop (attachment inbox unread)
> +thread:XXX   2009-11-18 [1/2] Carl Worth| Ingmar Vanhassel; [notmuch] [PATCH] Typsos (inbox unread)
> +thread:XXX   2009-11-18 [1/2] Carl Worth| Jan Janak; [notmuch] [PATCH] Older versions of install do not support -C. (inbox unread)
> +thread:XXX   2009-11-18 [1/2] Carl Worth| Keith Packard; [notmuch] [PATCH] Make notmuch-show 'X' (and 'x') commands remove inbox (and unread) tags (inbox unread)
> +thread:XXX   2009-11-18 [2/5] Carl Worth| Mikhail Gusarov, Keith Packard; [notmuch] [PATCH 1/2] Close message file after parsing message headers (inbox unread)
> +EOF
> +    test_expect_equal_file EXPECTED OUTPUT
> +
> +    test_begin_subtest "regexp from search, duplicate term search"
> +    notmuch search --output=messages from:/cworth/ > OUTPUT
> +    test_expect_equal_file cworth.msg-ids OUTPUT
> +
> +    test_begin_subtest "long enough regexp matches only desired senders"
> +    notmuch search --output=messages 'from:"/C.* Wo/"' > OUTPUT
> +    test_expect_equal_file cworth.msg-ids OUTPUT
> +
> +    test_begin_subtest "shorter regexp matches one more sender"
> +    notmuch search --output=messages 'from:"/C.* W/"' > OUTPUT
> +    { echo id:1258544095-16616-1-git-send-email-chris at chris-wilson.co.uk; cat cworth.msg-ids; } > EXPECTED
> +    test_expect_equal_file EXPECTED OUTPUT
> +
> +    test_begin_subtest "regexp subject search, non-ASCII"
> +    notmuch search --output=messages subject:/accentué/ > OUTPUT
> +    echo id:877h1wv7mg.fsf at inf-8657.int-evry.fr > EXPECTED
> +    test_expect_equal_file EXPECTED OUTPUT
> +
> +    test_begin_subtest "regexp subject search, punctuation"
> +    notmuch search subject:/\'X\'/ | notmuch_search_sanitize > OUTPUT
> +    cat <<EOF > EXPECTED
> +thread:XXX   2009-11-18 [2/2] Keith Packard, Carl Worth; [notmuch] [PATCH] Make notmuch-show 'X' (and 'x') commands remove inbox (and unread) tags (inbox unread)
> +EOF
> +    test_expect_equal_file EXPECTED OUTPUT
> +
> +    test_begin_subtest "regexp subject search, no punctuation"
> +    notmuch search  subject:/X/ | notmuch_search_sanitize > OUTPUT
> +    cat <<EOF > EXPECTED
> +thread:XXX   2009-11-18 [2/2] Keith Packard, Carl Worth; [notmuch] [PATCH] Make notmuch-show 'X' (and 'x') commands remove inbox (and unread) tags (inbox unread)
> +thread:XXX   2009-11-18 [4/4] Jjgod Jiang, Alexander Botero-Lowry; [notmuch] Mac OS X/Darwin compatibility issues (inbox unread)
> +EOF
> +    test_expect_equal_file EXPECTED OUTPUT
> +
> +    test_begin_subtest "combine regexp from and subject"
> +    notmuch search  subject:/-C/ and from:/.an.k/ | notmuch_search_sanitize > OUTPUT
> +    cat <<EOF > EXPECTED
> +thread:XXX   2009-11-17 [1/2] Jan Janak| Carl Worth; [notmuch] [PATCH] Older versions of install do not support -C. (inbox unread)
> +EOF
> +    test_expect_equal_file EXPECTED OUTPUT
> +
> +    test_begin_subtest "regexp error reporting"
> +    notmuch search 'from:/unbalanced[/' 1>OUTPUT 2>&1
> +    cat <<EOF > EXPECTED
> +notmuch search: A Xapian exception occurred
> +A Xapian exception occurred performing query: Invalid regular expression
> +Query string was: from:/unbalanced[/
> +EOF
> +    test_expect_equal_file EXPECTED OUTPUT
> +fi
> +
> +test_done
> -- 
> 2.11.0