[Patch v2] lib: regexp matching in 'subject' and 'from'

Jani Nikula jani at nikula.org
Wed Jan 18 12:05:18 PST 2017


On Mon, 14 Nov 2016, David Bremner <david at tethera.net> wrote:
> the idea is that you can run
>
> % notmuch search re:subject:<your-favourite-regexp>
> % notmuch search re:from:<your-favourite-regexp>
>
> or
>
> % notmuch search subject:"your usual phrase search"
> % notmuch search from:"usual phrase search"
>
> This should also work with bindings, since it extends the query parser.
>
> This is trivial to extend for other value slots, but currently the only
> value slots are date, message_id, from, subject, and last_mod. Date is
> already searchable, and message_id is not obviously useful to regex
> match.
>
> This was originally written by Austin Clements, and ported to Xapian
> field processors (from Austin's custom query parser) by yours truly.

I can't say I did a detailed review of all the Xapian bits and pieces
here, but I didn't spot anything obviously wrong either.

I suppose I'd prefer the documentation to be more explicit about
"re:subject:" and "re:from:" instead of having the generic "re:<field>:"
that I think is bound to confuse people.

The _ suffixes (instead of prefixes) in variable names seemed a bit odd,
but I have no strong opinion on it.

I played around with this a bit, and it seemed to work. Unsurprisingly,
getting the quoting right was the hardest part. Even though I know how
the stuff works under the hood, it took me a while to realize that you
have to use 're:"subject:<regex with spaces>"' to make it work. (I kept
trying 're:subject:"<regex with spaces>"'.) I don't know if there's
anything we could really do about this.
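
To illustrate why the quotes have to wrap the whole "subject:<regex>"
part, here is a minimal sketch of the split-and-match step. It is not the
patch's code: split_field_regexp and value_matches are hypothetical names
made up for this example, and it assumes (as the patch's operator() does)
that the field processor is handed everything after "re:" as one string.
Unlike the patch, the sketch rejects input with no colon up front.

```cpp
#include <regex.h>
#include <stdexcept>
#include <string>

// Hypothetical holder for the two halves of "field:regex".
struct FieldRegexp {
    std::string field;
    std::string regexp;
};

// Hypothetical helper: split at the FIRST colon only, so any later
// colons (and spaces) stay part of the regexp.
static FieldRegexp
split_field_regexp (const std::string &str)
{
    size_t pos = str.find_first_of (':');
    if (pos == std::string::npos)
	throw std::invalid_argument ("missing field in '" + str + "'");
    return { str.substr (0, pos), str.substr (pos + 1) };
}

// Hypothetical helper: compile a POSIX extended regexp with the same
// flags the patch uses and test a single value against it.
static bool
value_matches (const std::string &regexp, const std::string &value)
{
    regex_t re;
    if (regcomp (&re, regexp.c_str (), REG_EXTENDED | REG_NOSUB) != 0)
	throw std::invalid_argument ("invalid regexp '" + regexp + "'");
    bool matched = regexec (&re, value.c_str (), 0, NULL, 0) == 0;
    regfree (&re);
    return matched;
}
```

With this, split_field_regexp ("subject:install do not") gives the field
"subject" and the regexp "install do not", spaces included. That only
works if the whole string reaches the processor in one piece, which is
what quoting everything after "re:" achieves.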

BR,
Jani.



> ---
>
> rebase of id:1467034387-16885-1-git-send-email-david at tethera.net against master
>
>  doc/man7/notmuch-search-terms.rst |  17 +++++-
>  lib/Makefile.local                |   1 +
>  lib/database-private.h            |   1 +
>  lib/database.cc                   |   5 ++
>  lib/regexp-fields.cc              | 125 ++++++++++++++++++++++++++++++++++++++
>  lib/regexp-fields.h               |  77 +++++++++++++++++++++++
>  test/T630-regexp-query.sh         |  91 +++++++++++++++++++++++++++
>  7 files changed, 316 insertions(+), 1 deletion(-)
>  create mode 100644 lib/regexp-fields.cc
>  create mode 100644 lib/regexp-fields.h
>  create mode 100755 test/T630-regexp-query.sh
>
> diff --git a/doc/man7/notmuch-search-terms.rst b/doc/man7/notmuch-search-terms.rst
> index de93d73..4c7afc2 100644
> --- a/doc/man7/notmuch-search-terms.rst
> +++ b/doc/man7/notmuch-search-terms.rst
> @@ -60,6 +60,8 @@ indicate user-supplied values):
>  
>  -  property:<key>=<value>
>  
> +- re:{subject,from}:<regex>
> +
>  The **from:** prefix is used to match the name or address of the sender
>  of an email message.
>  
> @@ -146,6 +148,12 @@ The **property:** prefix searches for messages with a particular
>  (and extensions) to add metadata to messages. A given key can be
>  present on a given message with several different values.
>  
> +The **re:<field>:** prefix can be used to restrict the results to
> +those whose <field> matches the given regular expression (see
> +**regex(7)**). Regular expression searches are only available if
> +notmuch is built with **Xapian Field Processors** (see below), and
> +currently only for the Subject and From fields.
> +
>  Operators
>  ---------
>  
> @@ -220,13 +228,19 @@ Boolean and Probabilistic Prefixes
>  ----------------------------------
>  
>  Xapian (and hence notmuch) prefixes are either **boolean**, supporting
> -exact matches like "tag:inbox"  or **probabilistic**, supporting a more flexible **term** based searching. The prefixes currently supported by notmuch are as follows.
> +exact matches like "tag:inbox" or **probabilistic**, supporting a more
> +flexible **term** based searching. Certain **special** prefixes are
> +processed by notmuch in a way not strictly fitting either of Xapian's
> +built-in styles. The prefixes currently supported by notmuch are as
> +follows.
>  
>  
>  Boolean
>     **tag:**, **id:**, **thread:**, **folder:**, **path:**, **property:**
>  Probabilistic
>     **from:**, **to:**, **subject:**, **attachment:**, **mimetype:**
> +Special
> +   **query:**, **re:<field>:**
>  
>  Terms and phrases
>  -----------------
> @@ -396,6 +410,7 @@ Currently the following features require field processor support:
>  
>  - non-range date queries, e.g. "date:today"
>  - named queries e.g. "query:my_special_query"
> +- regular expression searches, e.g. "re:subject:^\\[SPAM\\]"
>  
>  SEE ALSO
>  ========
> diff --git a/lib/Makefile.local b/lib/Makefile.local
> index 3d1030a..ccd32ab 100644
> --- a/lib/Makefile.local
> +++ b/lib/Makefile.local
> @@ -53,6 +53,7 @@ libnotmuch_cxx_srcs =		\
>  	$(dir)/query.cc		\
>  	$(dir)/query-fp.cc      \
>  	$(dir)/config.cc	\
> +	$(dir)/regexp-fields.cc     \
>  	$(dir)/thread.cc
>  
>  libnotmuch_modules := $(libnotmuch_c_srcs:.c=.o) $(libnotmuch_cxx_srcs:.cc=.o)
> diff --git a/lib/database-private.h b/lib/database-private.h
> index ca71a92..900a989 100644
> --- a/lib/database-private.h
> +++ b/lib/database-private.h
> @@ -186,6 +186,7 @@ struct _notmuch_database {
>  #if HAVE_XAPIAN_FIELD_PROCESSOR
>      Xapian::FieldProcessor *date_field_processor;
>      Xapian::FieldProcessor *query_field_processor;
> +    Xapian::FieldProcessor *re_field_processor;
>  #endif
>      Xapian::ValueRangeProcessor *last_mod_range_processor;
>  };
> diff --git a/lib/database.cc b/lib/database.cc
> index 2d19f20..851a62d 100644
> --- a/lib/database.cc
> +++ b/lib/database.cc
> @@ -21,6 +21,7 @@
>  #include "database-private.h"
>  #include "parse-time-vrp.h"
>  #include "query-fp.h"
> +#include "regexp-fields.h"
>  #include "string-util.h"
>  
>  #include <iostream>
> @@ -1042,6 +1043,8 @@ notmuch_database_open_verbose (const char *path,
>  	notmuch->query_parser->add_boolean_prefix("date", notmuch->date_field_processor);
>  	notmuch->query_field_processor = new QueryFieldProcessor (*notmuch->query_parser, notmuch);
>  	notmuch->query_parser->add_boolean_prefix("query", notmuch->query_field_processor);
> +	notmuch->re_field_processor = new RegexpFieldProcessor (*notmuch->query_parser, notmuch);
> +	notmuch->query_parser->add_boolean_prefix("re", notmuch->re_field_processor);
>  #endif
>  	notmuch->last_mod_range_processor = new Xapian::NumberValueRangeProcessor (NOTMUCH_VALUE_LAST_MOD, "lastmod:");
>  
> @@ -1138,6 +1141,8 @@ notmuch_database_close (notmuch_database_t *notmuch)
>      notmuch->date_field_processor = NULL;
>      delete notmuch->query_field_processor;
>      notmuch->query_field_processor = NULL;
> +    delete notmuch->re_field_processor;
> +    notmuch->re_field_processor = NULL;
>  #endif
>  
>      return status;
> diff --git a/lib/regexp-fields.cc b/lib/regexp-fields.cc
> new file mode 100644
> index 0000000..4d3d972
> --- /dev/null
> +++ b/lib/regexp-fields.cc
> @@ -0,0 +1,125 @@
> +/* regexp-fields.cc - "re:" field processor glue
> + *
> + * This file is part of notmuch.
> + *
> + * Copyright © 2015 Austin Clements
> + * Copyright © 2016 David Bremner
> + *
> + * This program is free software: you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation, either version 3 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program.  If not, see https://www.gnu.org/licenses/ .
> + *
> + * Author: Austin Clements <aclements at csail.mit.edu>
> + *                David Bremner <david at tethera.net>
> + */
> +
> +#include "regexp-fields.h"
> +#include "notmuch-private.h"
> +
> +#if HAVE_XAPIAN_FIELD_PROCESSOR
> +RegexpPostingSource::RegexpPostingSource (Xapian::valueno slot, const std::string &regexp)
> +    : slot_ (slot)
> +{
> +    int err = regcomp (&regexp_, regexp.c_str (), REG_EXTENDED | REG_NOSUB);
> +
> +    if (err != 0) {
> +	size_t len = regerror (err, &regexp_, NULL, 0);
> +	char *buffer = new char[len];
> +	std::string msg;
> +	(void) regerror (err, &regexp_, buffer, len);
> +	msg.assign (buffer, len);
> +	delete[] buffer;
> +
> +	throw Xapian::QueryParserError (msg);
> +    }
> +}
> +
> +RegexpPostingSource::~RegexpPostingSource ()
> +{
> +    regfree (&regexp_);
> +}
> +
> +void
> +RegexpPostingSource::init (const Xapian::Database &db)
> +{
> +    db_ = db;
> +    it_ = db_.valuestream_begin (slot_);
> +    end_ = db.valuestream_end (slot_);
> +    started_ = false;
> +}
> +
> +Xapian::doccount
> +RegexpPostingSource::get_termfreq_min () const
> +{
> +    return 0;
> +}
> +
> +Xapian::doccount
> +RegexpPostingSource::get_termfreq_est () const
> +{
> +    return get_termfreq_max () / 2;
> +}
> +
> +Xapian::doccount
> +RegexpPostingSource::get_termfreq_max () const
> +{
> +    return db_.get_value_freq (slot_);
> +}
> +
> +Xapian::docid
> +RegexpPostingSource::get_docid () const
> +{
> +    return it_.get_docid ();
> +}
> +
> +bool
> +RegexpPostingSource::at_end () const
> +{
> +    return it_ == end_;
> +}
> +
> +void
> +RegexpPostingSource::next (unused (double min_wt))
> +{
> +    if (started_ && ! at_end ())
> +	++it_;
> +    started_ = true;
> +
> +    for (; ! at_end (); ++it_) {
> +	std::string value = *it_;
> +	if (regexec (&regexp_, value.c_str (), 0, NULL, 0) == 0)
> +	    break;
> +    }
> +}
> +
> +static Xapian::valueno
> +_find_slot (std::string prefix)
> +{
> +    if (prefix == "from")
> +	return NOTMUCH_VALUE_FROM;
> +    else if (prefix == "subject")
> +	return NOTMUCH_VALUE_SUBJECT;
> +    else
> +	throw Xapian::QueryParserError ("unsupported regexp field '" + prefix + "'");
> +}
> +
> +Xapian::Query
> +RegexpFieldProcessor::operator() (const std::string & str)
> +{
> +    size_t pos = str.find_first_of (':');
> +    std::string prefix = str.substr (0, pos);
> +    std::string regexp = str.substr (pos + 1);
> +
> +    postings = new RegexpPostingSource (_find_slot (prefix), regexp);
> +    return Xapian::Query (postings);
> +}
> +#endif
> diff --git a/lib/regexp-fields.h b/lib/regexp-fields.h
> new file mode 100644
> index 0000000..2c9c2d7
> --- /dev/null
> +++ b/lib/regexp-fields.h
> @@ -0,0 +1,77 @@
> +/* regexp-fields.h - xapian glue for semi-bruteforce regexp search
> + *
> + * This file is part of notmuch.
> + *
> + * Copyright © 2015 Austin Clements
> + * Copyright © 2016 David Bremner
> + *
> + * This program is free software: you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation, either version 3 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program.  If not, see https://www.gnu.org/licenses/ .
> + *
> + * Author: Austin Clements <aclements at csail.mit.edu>
> + *                David Bremner <david at tethera.net>
> + */
> +
> +#ifndef NOTMUCH_REGEXP_FIELDS_H
> +#define NOTMUCH_REGEXP_FIELDS_H
> +#if HAVE_XAPIAN_FIELD_PROCESSOR
> +#include <sys/types.h>
> +#include <regex.h>
> +#include <xapian.h>
> +#include "notmuch-private.h"
> +
> +/* A posting source that returns documents where a value matches a
> + * regexp.
> + */
> +class RegexpPostingSource : public Xapian::PostingSource
> +{
> + protected:
> +    const Xapian::valueno slot_;
> +    regex_t regexp_;
> +    Xapian::Database db_;
> +    bool started_;
> +    Xapian::ValueIterator it_, end_;
> +
> +/* No copying */
> +    RegexpPostingSource (const RegexpPostingSource &);
> +    RegexpPostingSource &operator= (const RegexpPostingSource &);
> +
> + public:
> +    RegexpPostingSource (Xapian::valueno slot, const std::string &regexp);
> +    ~RegexpPostingSource ();
> +    void init (const Xapian::Database &db);
> +    Xapian::doccount get_termfreq_min () const;
> +    Xapian::doccount get_termfreq_est () const;
> +    Xapian::doccount get_termfreq_max () const;
> +    Xapian::docid get_docid () const;
> +    bool at_end () const;
> +    void next (unused (double min_wt));
> +};
> +
> +
> +class RegexpFieldProcessor : public Xapian::FieldProcessor {
> + protected:
> +    Xapian::QueryParser &parser;
> +    notmuch_database_t *notmuch;
> +    RegexpPostingSource *postings = NULL;
> +
> + public:
> +    RegexpFieldProcessor (Xapian::QueryParser &parser_, notmuch_database_t *notmuch_)
> +	: parser(parser_), notmuch(notmuch_) { };
> +
> +    ~RegexpFieldProcessor () { delete postings; };
> +
> +    Xapian::Query operator()(const std::string & str);
> +};
> +#endif
> +#endif /* NOTMUCH_REGEXP_FIELDS_H */
> diff --git a/test/T630-regexp-query.sh b/test/T630-regexp-query.sh
> new file mode 100755
> index 0000000..3bbe47c
> --- /dev/null
> +++ b/test/T630-regexp-query.sh
> @@ -0,0 +1,91 @@
> +#!/usr/bin/env bash
> +test_description='regular expression searches'
> +. ./test-lib.sh || exit 1
> +
> +add_email_corpus
> +
> +
> +if [ $NOTMUCH_HAVE_XAPIAN_FIELD_PROCESSOR -eq 1 ]; then
> +
> +    notmuch search --output=messages from:cworth > cworth.msg-ids
> +
> +    test_begin_subtest "regexp from search, case sensitive"
> +    notmuch search --output=messages re:from:carl > OUTPUT
> +    test_expect_equal_file /dev/null OUTPUT
> +
> +    test_begin_subtest "empty regexp or query"
> +    notmuch search --output=messages re:from:carl or from:cworth > OUTPUT
> +    test_expect_equal_file cworth.msg-ids OUTPUT
> +
> +    test_begin_subtest "non-empty regexp and query"
> +    notmuch search  re:from:cworth and subject:patch > OUTPUT
> +    cat <<EOF > EXPECTED
> +thread:0000000000000008   2009-11-18 [1/2] Carl Worth| Alex Botero-Lowry; [notmuch] [PATCH] Error out if no query is supplied to search instead of going into an infinite loop (attachment inbox unread)
> +thread:0000000000000007   2009-11-18 [1/2] Carl Worth| Ingmar Vanhassel; [notmuch] [PATCH] Typsos (inbox unread)
> +thread:0000000000000018   2009-11-18 [1/2] Carl Worth| Jan Janak; [notmuch] [PATCH] Older versions of install do not support -C. (inbox unread)
> +thread:0000000000000017   2009-11-18 [1/2] Carl Worth| Keith Packard; [notmuch] [PATCH] Make notmuch-show 'X' (and 'x') commands remove inbox (and unread) tags (inbox unread)
> +thread:0000000000000014   2009-11-18 [2/5] Carl Worth| Mikhail Gusarov, Keith Packard; [notmuch] [PATCH 1/2] Close message file after parsing message headers (inbox unread)
> +thread:0000000000000001   2009-11-18 [1/1] Stewart Smith; [notmuch] [PATCH] Fix linking with gcc to use g++ to link in C++ libs. (inbox unread)
> +EOF
> +    test_expect_equal_file EXPECTED OUTPUT
> +
> +    test_begin_subtest "regexp from search, duplicate term search"
> +    notmuch search --output=messages re:from:cworth > OUTPUT
> +    test_expect_equal_file cworth.msg-ids OUTPUT
> +
> +    test_begin_subtest "long enough regexp matches only desired senders"
> +    notmuch search --output=messages 're:"from:C.* Wo"' > OUTPUT
> +    test_expect_equal_file cworth.msg-ids OUTPUT
> +
> +    test_begin_subtest "shorter regexp matches one more sender"
> +    notmuch search --output=messages 're:"from:C.* W"' > OUTPUT
> +    (echo id:1258544095-16616-1-git-send-email-chris at chris-wilson.co.uk ; cat cworth.msg-ids) > EXPECTED
> +    test_expect_equal_file EXPECTED OUTPUT
> +
> +    test_begin_subtest "regexp subject search, non-ASCII"
> +    notmuch search --output=messages re:subject:accentué > OUTPUT
> +    echo id:877h1wv7mg.fsf at inf-8657.int-evry.fr > EXPECTED
> +    test_expect_equal_file EXPECTED OUTPUT
> +
> +    test_begin_subtest "regexp subject search, punctuation"
> +    notmuch search   re:subject:\'X\' > OUTPUT
> +    cat <<EOF > EXPECTED
> +thread:0000000000000017   2009-11-18 [2/2] Keith Packard, Carl Worth; [notmuch] [PATCH] Make notmuch-show 'X' (and 'x') commands remove inbox (and unread) tags (inbox unread)
> +EOF
> +    test_expect_equal_file EXPECTED OUTPUT
> +
> +    test_begin_subtest "regexp subject search, no punctuation"
> +    notmuch search  re:subject:X > OUTPUT
> +    cat <<EOF > EXPECTED
> +thread:0000000000000017   2009-11-18 [2/2] Keith Packard, Carl Worth; [notmuch] [PATCH] Make notmuch-show 'X' (and 'x') commands remove inbox (and unread) tags (inbox unread)
> +thread:000000000000000f   2009-11-18 [4/4] Jjgod Jiang, Alexander Botero-Lowry; [notmuch] Mac OS X/Darwin compatibility issues (inbox unread)
> +EOF
> +    test_expect_equal_file EXPECTED OUTPUT
> +
> +    test_begin_subtest "combine regexp from and subject"
> +    notmuch search  re:subject:-C and re:from:.an.k > OUTPUT
> +    cat <<EOF > EXPECTED
> +thread:0000000000000018   2009-11-17 [1/2] Jan Janak| Carl Worth; [notmuch] [PATCH] Older versions of install do not support -C. (inbox unread)
> +EOF
> +    test_expect_equal_file EXPECTED OUTPUT
> +
> +    test_begin_subtest "bad subprefix"
> +    notmuch search 're:unsupported:.*' 1>OUTPUT 2>&1
> +    cat <<EOF > EXPECTED
> +notmuch search: A Xapian exception occurred
> +A Xapian exception occurred performing query: unsupported regexp field 'unsupported'
> +Query string was: re:unsupported:.*
> +EOF
> +    test_expect_equal_file EXPECTED OUTPUT
> +
> +    test_begin_subtest "regexp error reporting"
> +    notmuch search 're:from:unbalanced[' 1>OUTPUT 2>&1
> +    cat <<EOF > EXPECTED
> +notmuch search: A Xapian exception occurred
> +A Xapian exception occurred performing query: Invalid regular expression
> +Query string was: re:from:unbalanced[
> +EOF
> +    test_expect_equal_file EXPECTED OUTPUT
> +fi
> +
> +test_done
> -- 
> 2.10.2
>
