[Patch v2] lib: regexp matching in 'subject' and 'from'
Jani Nikula
jani at nikula.org
Wed Jan 18 12:05:18 PST 2017
On Mon, 14 Nov 2016, David Bremner <david at tethera.net> wrote:
> the idea is that you can run
>
> % notmuch search re:subject:<your-favourite-regexp>
> % notmuch search re:from:<your-favourite-regexp>
>
> or
>
> % notmuch search subject:"your usual phrase search"
> % notmuch search from:"usual phrase search"
>
> This should also work with bindings, since it extends the query parser.
>
> This is trivial to extend for other value slots, but currently the only
> value slots are date, message_id, from, subject, and last_mod. Date is
> already searchable, and message_id is not obviously useful to regex
> match.
>
> This was originally written by Austin Clements, and ported to Xapian
> field processors (from Austin's custom query parser) by yours truly.

I can't say I did a detailed review of all the Xapian bits and pieces
here, but I didn't spot anything obviously wrong either.

I suppose I'd prefer the documentation to be more explicit about
"re:subject:" and "re:from:" instead of having the generic "re:<field>:"
that I think is bound to confuse people.

The _ suffixes instead of prefixes in variables seemed a bit odd, but I
have no strong opinions on it.

I played around with this a bit, and it seemed to work. Unsurprisingly,
getting the quoting right was the hardest part. Even though I know how
the stuff works under the hood, it took me a while to realize that you
have to use 're:"subject:<regex with spaces>"' to make it work. (I kept
trying 're:subject:"<regex with spaces>"'.) I don't know if there's
anything we could really do about this.
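
To illustrate why the quotes have to wrap the field name as well, here is a
minimal standalone sketch of what I understand happens once the whole
"subject:<regex with spaces>" string reaches the field processor: it is
split at the first colon, and the remainder is compiled as a POSIX extended
regexp. The helper names split_field_regexp and value_matches are
hypothetical and not part of the patch, and unlike the patch the sketch
rejects input without a colon and throws standard exceptions rather than
Xapian::QueryParserError.

```cpp
#include <regex.h>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: split "field:regexp" at the FIRST colon only, so the
// regexp part may itself contain colons and spaces.
std::pair<std::string, std::string>
split_field_regexp (const std::string &str)
{
    std::string::size_type pos = str.find (':');
    if (pos == std::string::npos)
        throw std::invalid_argument ("missing field name in '" + str + "'");
    return std::make_pair (str.substr (0, pos), str.substr (pos + 1));
}

// Hypothetical helper: match a single header value against a POSIX extended
// regexp, using the same regcomp flags as the patch.
bool
value_matches (const std::string &regexp, const std::string &value)
{
    regex_t re;
    int err = regcomp (&re, regexp.c_str (), REG_EXTENDED | REG_NOSUB);

    if (err != 0) {
        // Ask regerror for the required size first, then fetch the message.
        size_t len = regerror (err, &re, nullptr, 0);
        std::vector<char> buffer (len);
        regerror (err, &re, buffer.data (), len);
        throw std::runtime_error (std::string (buffer.data ()));
    }

    bool matched = regexec (&re, value.c_str (), 0, nullptr, 0) == 0;
    regfree (&re);
    return matched;
}
```

With this, split_field_regexp ("subject:regex with spaces") gives the field
"subject" and the regexp "regex with spaces", and value_matches ("C.* Wo",
"Carl Worth <cworth@cworth.org>") is true, while the regexp "carl" does not
match the same value because REG_ICASE is not set.
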
BR,
Jani.
> ---
>
> rebase of id:1467034387-16885-1-git-send-email-david at tethera.net against master
>
> doc/man7/notmuch-search-terms.rst | 17 +++++-
> lib/Makefile.local | 1 +
> lib/database-private.h | 1 +
> lib/database.cc | 5 ++
> lib/regexp-fields.cc | 125 ++++++++++++++++++++++++++++++++++++++
> lib/regexp-fields.h | 77 +++++++++++++++++++++++
> test/T630-regexp-query.sh | 91 +++++++++++++++++++++++++++
> 7 files changed, 316 insertions(+), 1 deletion(-)
> create mode 100644 lib/regexp-fields.cc
> create mode 100644 lib/regexp-fields.h
> create mode 100755 test/T630-regexp-query.sh
>
> diff --git a/doc/man7/notmuch-search-terms.rst b/doc/man7/notmuch-search-terms.rst
> index de93d73..4c7afc2 100644
> --- a/doc/man7/notmuch-search-terms.rst
> +++ b/doc/man7/notmuch-search-terms.rst
> @@ -60,6 +60,8 @@ indicate user-supplied values):
>
> - property:<key>=<value>
>
> +- re:{subject,from}:<regex>
> +
> The **from:** prefix is used to match the name or address of the sender
> of an email message.
>
> @@ -146,6 +148,12 @@ The **property:** prefix searches for messages with a particular
> (and extensions) to add metadata to messages. A given key can be
> present on a given message with several different values.
>
> +The **re:<field>:** prefix can be used to restrict the results to
> +those whose <field> matches the given regular expression (see
> +**regex(7)**). Regular expression searches are only available if
> +notmuch is built with **Xapian Field Processors** (see below), and
> +currently only for the Subject and From fields.
> +
> Operators
> ---------
>
> @@ -220,13 +228,19 @@ Boolean and Probabilistic Prefixes
> ----------------------------------
>
> Xapian (and hence notmuch) prefixes are either **boolean**, supporting
> -exact matches like "tag:inbox" or **probabilistic**, supporting a more flexible **term** based searching. The prefixes currently supported by notmuch are as follows.
> +exact matches like "tag:inbox" or **probabilistic**, supporting a more
> +flexible **term** based searching. Certain **special** prefixes are
> +processed by notmuch in a way not strictly fitting either of Xapian's
> +built in styles. The prefixes currently supported by notmuch are as
> +follows.
>
>
> Boolean
> **tag:**, **id:**, **thread:**, **folder:**, **path:**, **property:**
> Probabilistic
> **from:**, **to:**, **subject:**, **attachment:**, **mimetype:**
> +Special
> + **query:**, **re:<field>**
>
> Terms and phrases
> -----------------
> @@ -396,6 +410,7 @@ Currently the following features require field processor support:
>
> - non-range date queries, e.g. "date:today"
> - named queries e.g. "query:my_special_query"
> +- regular expression searches, e.g. "re:subject:^\\[SPAM\\]"
>
> SEE ALSO
> ========
> diff --git a/lib/Makefile.local b/lib/Makefile.local
> index 3d1030a..ccd32ab 100644
> --- a/lib/Makefile.local
> +++ b/lib/Makefile.local
> @@ -53,6 +53,7 @@ libnotmuch_cxx_srcs = \
> $(dir)/query.cc \
> $(dir)/query-fp.cc \
> $(dir)/config.cc \
> + $(dir)/regexp-fields.cc \
> $(dir)/thread.cc
>
> libnotmuch_modules := $(libnotmuch_c_srcs:.c=.o) $(libnotmuch_cxx_srcs:.cc=.o)
> diff --git a/lib/database-private.h b/lib/database-private.h
> index ca71a92..900a989 100644
> --- a/lib/database-private.h
> +++ b/lib/database-private.h
> @@ -186,6 +186,7 @@ struct _notmuch_database {
> #if HAVE_XAPIAN_FIELD_PROCESSOR
> Xapian::FieldProcessor *date_field_processor;
> Xapian::FieldProcessor *query_field_processor;
> + Xapian::FieldProcessor *re_field_processor;
> #endif
> Xapian::ValueRangeProcessor *last_mod_range_processor;
> };
> diff --git a/lib/database.cc b/lib/database.cc
> index 2d19f20..851a62d 100644
> --- a/lib/database.cc
> +++ b/lib/database.cc
> @@ -21,6 +21,7 @@
> #include "database-private.h"
> #include "parse-time-vrp.h"
> #include "query-fp.h"
> +#include "regexp-fields.h"
> #include "string-util.h"
>
> #include <iostream>
> @@ -1042,6 +1043,8 @@ notmuch_database_open_verbose (const char *path,
> notmuch->query_parser->add_boolean_prefix("date", notmuch->date_field_processor);
> notmuch->query_field_processor = new QueryFieldProcessor (*notmuch->query_parser, notmuch);
> notmuch->query_parser->add_boolean_prefix("query", notmuch->query_field_processor);
> + notmuch->re_field_processor = new RegexpFieldProcessor (*notmuch->query_parser, notmuch);
> + notmuch->query_parser->add_boolean_prefix("re", notmuch->re_field_processor);
> #endif
> notmuch->last_mod_range_processor = new Xapian::NumberValueRangeProcessor (NOTMUCH_VALUE_LAST_MOD, "lastmod:");
>
> @@ -1138,6 +1141,8 @@ notmuch_database_close (notmuch_database_t *notmuch)
> notmuch->date_field_processor = NULL;
> delete notmuch->query_field_processor;
> notmuch->query_field_processor = NULL;
> + delete notmuch->re_field_processor;
> + notmuch->re_field_processor = NULL;
> #endif
>
> return status;
> diff --git a/lib/regexp-fields.cc b/lib/regexp-fields.cc
> new file mode 100644
> index 0000000..4d3d972
> --- /dev/null
> +++ b/lib/regexp-fields.cc
> @@ -0,0 +1,125 @@
> +/* regexp-fields.cc - "re:" field processor glue
> + *
> + * This file is part of notmuch.
> + *
> + * Copyright © 2015 Austin Clements
> + * Copyright © 2016 David Bremner
> + *
> + * This program is free software: you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation, either version 3 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program. If not, see https://www.gnu.org/licenses/ .
> + *
> + * Author: Austin Clements <aclements at csail.mit.edu>
> + * David Bremner <david at tethera.net>
> + */
> +
> +#include "regexp-fields.h"
> +#include "notmuch-private.h"
> +
> +#if HAVE_XAPIAN_FIELD_PROCESSOR
> +RegexpPostingSource::RegexpPostingSource (Xapian::valueno slot, const std::string &regexp)
> + : slot_ (slot)
> +{
> + int err = regcomp (&regexp_, regexp.c_str (), REG_EXTENDED | REG_NOSUB);
> +
> + if (err != 0) {
> + size_t len = regerror (err, &regexp_, NULL, 0);
> + char *buffer = new char[len];
> + std::string msg;
> + (void) regerror (err, &regexp_, buffer, len);
> + msg.assign (buffer, len);
> + delete[] buffer;
> +
> + throw Xapian::QueryParserError (msg);
> + }
> +}
> +
> +RegexpPostingSource::~RegexpPostingSource ()
> +{
> + regfree (&regexp_);
> +}
> +
> +void
> +RegexpPostingSource::init (const Xapian::Database &db)
> +{
> + db_ = db;
> + it_ = db_.valuestream_begin (slot_);
> + end_ = db.valuestream_end (slot_);
> + started_ = false;
> +}
> +
> +Xapian::doccount
> +RegexpPostingSource::get_termfreq_min () const
> +{
> + return 0;
> +}
> +
> +Xapian::doccount
> +RegexpPostingSource::get_termfreq_est () const
> +{
> + return get_termfreq_max () / 2;
> +}
> +
> +Xapian::doccount
> +RegexpPostingSource::get_termfreq_max () const
> +{
> + return db_.get_value_freq (slot_);
> +}
> +
> +Xapian::docid
> +RegexpPostingSource::get_docid () const
> +{
> + return it_.get_docid ();
> +}
> +
> +bool
> +RegexpPostingSource::at_end () const
> +{
> + return it_ == end_;
> +}
> +
> +void
> +RegexpPostingSource::next (unused (double min_wt))
> +{
> + if (started_ && ! at_end ())
> + ++it_;
> + started_ = true;
> +
> + for (; ! at_end (); ++it_) {
> + std::string value = *it_;
> + if (regexec (&regexp_, value.c_str (), 0, NULL, 0) == 0)
> + break;
> + }
> +}
> +
> +static Xapian::valueno
> +_find_slot (std::string prefix)
> +{
> + if (prefix == "from")
> + return NOTMUCH_VALUE_FROM;
> + else if (prefix == "subject")
> + return NOTMUCH_VALUE_SUBJECT;
> + else
> + throw Xapian::QueryParserError ("unsupported regexp field '" + prefix + "'");
> +}
> +
> +Xapian::Query
> +RegexpFieldProcessor::operator() (const std::string & str)
> +{
> + size_t pos = str.find_first_of (':');
> + std::string prefix = str.substr (0, pos);
> + std::string regexp = str.substr (pos + 1);
> +
> + postings = new RegexpPostingSource (_find_slot (prefix), regexp);
> + return Xapian::Query (postings);
> +}
> +#endif
> diff --git a/lib/regexp-fields.h b/lib/regexp-fields.h
> new file mode 100644
> index 0000000..2c9c2d7
> --- /dev/null
> +++ b/lib/regexp-fields.h
> @@ -0,0 +1,77 @@
> +/* regexp-fields.h - xapian glue for semi-bruteforce regexp search
> + *
> + * This file is part of notmuch.
> + *
> + * Copyright © 2015 Austin Clements
> + * Copyright © 2016 David Bremner
> + *
> + * This program is free software: you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation, either version 3 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program. If not, see https://www.gnu.org/licenses/ .
> + *
> + * Author: Austin Clements <aclements at csail.mit.edu>
> + * David Bremner <david at tethera.net>
> + */
> +
> +#ifndef NOTMUCH_REGEXP_FIELDS_H
> +#define NOTMUCH_REGEXP_FIELDS_H
> +#if HAVE_XAPIAN_FIELD_PROCESSOR
> +#include <sys/types.h>
> +#include <regex.h>
> +#include <xapian.h>
> +#include "notmuch-private.h"
> +
> +/* A posting source that returns documents where a value matches a
> + * regexp.
> + */
> +class RegexpPostingSource : public Xapian::PostingSource
> +{
> + protected:
> + const Xapian::valueno slot_;
> + regex_t regexp_;
> + Xapian::Database db_;
> + bool started_;
> + Xapian::ValueIterator it_, end_;
> +
> +/* No copying */
> + RegexpPostingSource (const RegexpPostingSource &);
> + RegexpPostingSource &operator= (const RegexpPostingSource &);
> +
> + public:
> + RegexpPostingSource (Xapian::valueno slot, const std::string &regexp);
> + ~RegexpPostingSource ();
> + void init (const Xapian::Database &db);
> + Xapian::doccount get_termfreq_min () const;
> + Xapian::doccount get_termfreq_est () const;
> + Xapian::doccount get_termfreq_max () const;
> + Xapian::docid get_docid () const;
> + bool at_end () const;
> + void next (unused (double min_wt));
> +};
> +
> +
> +class RegexpFieldProcessor : public Xapian::FieldProcessor {
> + protected:
> + Xapian::QueryParser &parser;
> + notmuch_database_t *notmuch;
> + RegexpPostingSource *postings = NULL;
> +
> + public:
> + RegexpFieldProcessor (Xapian::QueryParser &parser_, notmuch_database_t *notmuch_)
> + : parser(parser_), notmuch(notmuch_) { };
> +
> + ~RegexpFieldProcessor () { delete postings; };
> +
> + Xapian::Query operator()(const std::string & str);
> +};
> +#endif
> +#endif /* NOTMUCH_REGEXP_FIELDS_H */
> diff --git a/test/T630-regexp-query.sh b/test/T630-regexp-query.sh
> new file mode 100755
> index 0000000..3bbe47c
> --- /dev/null
> +++ b/test/T630-regexp-query.sh
> @@ -0,0 +1,91 @@
> +#!/usr/bin/env bash
> +test_description='regular expression searches'
> +. ./test-lib.sh || exit 1
> +
> +add_email_corpus
> +
> +
> +if [ $NOTMUCH_HAVE_XAPIAN_FIELD_PROCESSOR -eq 1 ]; then
> +
> + notmuch search --output=messages from:cworth > cworth.msg-ids
> +
> + test_begin_subtest "regexp from search, case sensitive"
> + notmuch search --output=messages re:from:carl > OUTPUT
> + test_expect_equal_file /dev/null OUTPUT
> +
> + test_begin_subtest "empty regexp or query"
> + notmuch search --output=messages re:from:carl or from:cworth > OUTPUT
> + test_expect_equal_file cworth.msg-ids OUTPUT
> +
> + test_begin_subtest "non-empty regexp and query"
> + notmuch search re:from:cworth and subject:patch > OUTPUT
> + cat <<EOF > EXPECTED
> +thread:0000000000000008 2009-11-18 [1/2] Carl Worth| Alex Botero-Lowry; [notmuch] [PATCH] Error out if no query is supplied to search instead of going into an infinite loop (attachment inbox unread)
> +thread:0000000000000007 2009-11-18 [1/2] Carl Worth| Ingmar Vanhassel; [notmuch] [PATCH] Typsos (inbox unread)
> +thread:0000000000000018 2009-11-18 [1/2] Carl Worth| Jan Janak; [notmuch] [PATCH] Older versions of install do not support -C. (inbox unread)
> +thread:0000000000000017 2009-11-18 [1/2] Carl Worth| Keith Packard; [notmuch] [PATCH] Make notmuch-show 'X' (and 'x') commands remove inbox (and unread) tags (inbox unread)
> +thread:0000000000000014 2009-11-18 [2/5] Carl Worth| Mikhail Gusarov, Keith Packard; [notmuch] [PATCH 1/2] Close message file after parsing message headers (inbox unread)
> +thread:0000000000000001 2009-11-18 [1/1] Stewart Smith; [notmuch] [PATCH] Fix linking with gcc to use g++ to link in C++ libs. (inbox unread)
> +EOF
> + test_expect_equal_file EXPECTED OUTPUT
> +
> + test_begin_subtest "regexp from search, duplicate term search"
> + notmuch search --output=messages re:from:cworth > OUTPUT
> + test_expect_equal_file cworth.msg-ids OUTPUT
> +
> + test_begin_subtest "long enough regexp matches only desired senders"
> + notmuch search --output=messages 're:"from:C.* Wo"' > OUTPUT
> + test_expect_equal_file cworth.msg-ids OUTPUT
> +
> + test_begin_subtest "shorter regexp matches one more sender"
> + notmuch search --output=messages 're:"from:C.* W"' > OUTPUT
> + (echo id:1258544095-16616-1-git-send-email-chris at chris-wilson.co.uk ; cat cworth.msg-ids) > EXPECTED
> + test_expect_equal_file EXPECTED OUTPUT
> +
> + test_begin_subtest "regexp subject search, non-ASCII"
> + notmuch search --output=messages re:subject:accentué > OUTPUT
> + echo id:877h1wv7mg.fsf at inf-8657.int-evry.fr > EXPECTED
> + test_expect_equal_file EXPECTED OUTPUT
> +
> + test_begin_subtest "regexp subject search, punctuation"
> + notmuch search re:subject:\'X\' > OUTPUT
> + cat <<EOF > EXPECTED
> +thread:0000000000000017 2009-11-18 [2/2] Keith Packard, Carl Worth; [notmuch] [PATCH] Make notmuch-show 'X' (and 'x') commands remove inbox (and unread) tags (inbox unread)
> +EOF
> + test_expect_equal_file EXPECTED OUTPUT
> +
> + test_begin_subtest "regexp subject search, no punctuation"
> + notmuch search re:subject:X > OUTPUT
> + cat <<EOF > EXPECTED
> +thread:0000000000000017 2009-11-18 [2/2] Keith Packard, Carl Worth; [notmuch] [PATCH] Make notmuch-show 'X' (and 'x') commands remove inbox (and unread) tags (inbox unread)
> +thread:000000000000000f 2009-11-18 [4/4] Jjgod Jiang, Alexander Botero-Lowry; [notmuch] Mac OS X/Darwin compatibility issues (inbox unread)
> +EOF
> + test_expect_equal_file EXPECTED OUTPUT
> +
> + test_begin_subtest "combine regexp from and subject"
> + notmuch search re:subject:-C and re:from:.an.k > OUTPUT
> + cat <<EOF > EXPECTED
> +thread:0000000000000018 2009-11-17 [1/2] Jan Janak| Carl Worth; [notmuch] [PATCH] Older versions of install do not support -C. (inbox unread)
> +EOF
> + test_expect_equal_file EXPECTED OUTPUT
> +
> + test_begin_subtest "bad subprefix"
> + notmuch search 're:unsupported:.*' 1>OUTPUT 2>&1
> + cat <<EOF > EXPECTED
> +notmuch search: A Xapian exception occurred
> +A Xapian exception occurred performing query: unsupported regexp field 'unsupported'
> +Query string was: re:unsupported:.*
> +EOF
> + test_expect_equal_file EXPECTED OUTPUT
> +
> + test_begin_subtest "regexp error reporting"
> + notmuch search 're:from:unbalanced[' 1>OUTPUT 2>&1
> + cat <<EOF > EXPECTED
> +notmuch search: A Xapian exception occurred
> +A Xapian exception occurred performing query: Invalid regular expression
> +Query string was: re:from:unbalanced[
> +EOF
> + test_expect_equal_file EXPECTED OUTPUT
> +fi
> +
> +test_done
> --
> 2.10.2
>