[patch v5 4/6] lib: regexp matching in 'subject' and 'from'
Jani Nikula
jani at nikula.org
Sun Feb 26 11:46:40 PST 2017
On Thu, 16 Feb 2017, David Bremner <david at tethera.net> wrote:
> the idea is that you can run
>
> % notmuch search subject:/<your-favourite-regexp>/
> % notmuch search from:/<your-favourite-regexp>/
>
> or
>
> % notmuch search subject:"your usual phrase search"
> % notmuch search from:"usual phrase search"
>
> This feature is only available with recent Xapian, specifically
> support for field processors is needed.
>
> It should work with bindings, since it extends the query parser.
>
> This is easy to extend for other value slots, but currently the only
> value slots are date, message_id, from, subject, and last_mod. Date is
> already searchable; message_id is left for a followup commit.
>
> This was originally written by Austin Clements, and ported to Xapian
> field processors (from Austin's custom query parser) by yours truly.
> ---
> doc/man7/notmuch-search-terms.rst | 25 ++++++-
> lib/Makefile.local | 1 +
> lib/database.cc | 11 +--
> lib/regexp-fields.cc | 144 ++++++++++++++++++++++++++++++++++++++
> lib/regexp-fields.h | 77 ++++++++++++++++++++
> test/T630-regexp-query.sh | 81 +++++++++++++++++++++
> 6 files changed, 332 insertions(+), 7 deletions(-)
> create mode 100644 lib/regexp-fields.cc
> create mode 100644 lib/regexp-fields.h
> create mode 100755 test/T630-regexp-query.sh
>
> diff --git a/doc/man7/notmuch-search-terms.rst b/doc/man7/notmuch-search-terms.rst
> index de93d733..47cab48d 100644
> --- a/doc/man7/notmuch-search-terms.rst
> +++ b/doc/man7/notmuch-search-terms.rst
> @@ -34,10 +34,14 @@ indicate user-supplied values):
>
> - from:<name-or-address>
>
> +- from:/<regex>/
> +
> - to:<name-or-address>
>
> - subject:<word-or-quoted-phrase>
>
> +- subject:/<regex>/
> +
> - attachment:<word>
>
> - mimetype:<word>
> @@ -71,6 +75,15 @@ subject of an email. Searching for a phrase in the subject is supported
> by including quotation marks around the phrase, immediately following
> **subject:**.
>
> +If notmuch is built with **Xapian Field Processors** (see below) the
> +**from:** and **subject** prefix can be also used to restrict the
> +results to those whose from/subject value matches a regular expression
> +(see **regex(7)**) delimited with //.
> +
> +::
> +
> + notmuch search 'from:/bob at .*[.]example[.]com/'
> +
> The **attachment:** prefix can be used to search for specific filenames
> (or extensions) of attachments to email messages.
>
> @@ -220,13 +233,18 @@ Boolean and Probabilistic Prefixes
> ----------------------------------
>
> Xapian (and hence notmuch) prefixes are either **boolean**, supporting
> -exact matches like "tag:inbox" or **probabilistic**, supporting a more flexible **term** based searching. The prefixes currently supported by notmuch are as follows.
> -
> +exact matches like "tag:inbox" or **probabilistic**, supporting a more
> +flexible **term** based searching. Certain **special** prefixes are
> +processed by notmuch in a way not stricly fitting either of Xapian's
> +built in styles. The prefixes currently supported by notmuch are as
> +follows.
>
> Boolean
> **tag:**, **id:**, **thread:**, **folder:**, **path:**, **property:**
> Probabilistic
> - **from:**, **to:**, **subject:**, **attachment:**, **mimetype:**
> + **to:**, **attachment:**, **mimetype:**
> +Special
> + **from:**, **query:**, **subject:**
>
> Terms and phrases
> -----------------
> @@ -396,6 +414,7 @@ Currently the following features require field processor support:
>
> - non-range date queries, e.g. "date:today"
> - named queries e.g. "query:my_special_query"
> +- regular expression searches, e.g. "subject:/^\\[SPAM\\]/"
>
> SEE ALSO
> ========
> diff --git a/lib/Makefile.local b/lib/Makefile.local
> index b77e5780..cd92fc79 100644
> --- a/lib/Makefile.local
> +++ b/lib/Makefile.local
> @@ -52,6 +52,7 @@ libnotmuch_cxx_srcs = \
> $(dir)/query.cc \
> $(dir)/query-fp.cc \
> $(dir)/config.cc \
> + $(dir)/regexp-fields.cc \
> $(dir)/thread.cc
>
> libnotmuch_modules := $(libnotmuch_c_srcs:.c=.o) $(libnotmuch_cxx_srcs:.cc=.o)
> diff --git a/lib/database.cc b/lib/database.cc
> index 450ee295..ee971f32 100644
> --- a/lib/database.cc
> +++ b/lib/database.cc
> @@ -21,6 +21,7 @@
> #include "database-private.h"
> #include "parse-time-vrp.h"
> #include "query-fp.h"
> +#include "regexp-fields.h"
> #include "string-util.h"
>
> #include <iostream>
> @@ -277,7 +278,8 @@ prefix_t prefix_table[] = {
> NOTMUCH_FIELD_PROCESSOR },
> #endif
> { "from", "XFROM", NOTMUCH_FIELD_EXTERNAL |
> - NOTMUCH_FIELD_PROBABILISTIC },
> + NOTMUCH_FIELD_PROBABILISTIC |
> + NOTMUCH_FIELD_PROCESSOR },
> { "to", "XTO", NOTMUCH_FIELD_EXTERNAL |
> NOTMUCH_FIELD_PROBABILISTIC },
> { "attachment", "XATTACHMENT", NOTMUCH_FIELD_EXTERNAL |
> @@ -285,7 +287,8 @@ prefix_t prefix_table[] = {
> { "mimetype", "XMIMETYPE", NOTMUCH_FIELD_EXTERNAL |
> NOTMUCH_FIELD_PROBABILISTIC },
> { "subject", "XSUBJECT", NOTMUCH_FIELD_EXTERNAL |
> - NOTMUCH_FIELD_PROBABILISTIC },
> + NOTMUCH_FIELD_PROBABILISTIC |
> + NOTMUCH_FIELD_PROCESSOR},
> };
>
> #if HAVE_XAPIAN_FIELD_PROCESSOR
> @@ -295,8 +298,8 @@ _make_field_processor (const char *name, notmuch_database_t *notmuch) {
> return (new DateFieldProcessor())->release ();
> else if (STRNCMP_LITERAL(name, "query") == 0)
> return (new QueryFieldProcessor (*notmuch->query_parser, notmuch))->release ();
> -
> - INTERNAL_ERROR ("no field processor for prefix '%s'\n", name);
> + else
> + return (new RegexpFieldProcessor (name, *notmuch->query_parser, notmuch))->release ();
> }
> #else
> #define _make_field_processor(name, db) NULL
> diff --git a/lib/regexp-fields.cc b/lib/regexp-fields.cc
> new file mode 100644
> index 00000000..b2b39504
> --- /dev/null
> +++ b/lib/regexp-fields.cc
> @@ -0,0 +1,144 @@
> +/* regexp-fields.cc - field processor glue for regex supporting fields
> + *
> + * This file is part of notmuch.
> + *
> + * Copyright © 2015 Austin Clements
> + * Copyright © 2016 David Bremner
> + *
> + * This program is free software: you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation, either version 3 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program. If not, see https://www.gnu.org/licenses/ .
> + *
> + * Author: Austin Clements <aclements at csail.mit.edu>
> + * David Bremner <david at tethera.net>
> + */
> +
> +#include "regexp-fields.h"
> +#include "notmuch-private.h"
> +#include "database-private.h"
> +
> +#if HAVE_XAPIAN_FIELD_PROCESSOR
> +static void
> +compile_regex (regex_t &regexp, const char *str)
> +{
> + int err = regcomp (&regexp, str, REG_EXTENDED | REG_NOSUB);
> +
> + if (err != 0) {
> + size_t len = regerror (err, &regexp, NULL, 0);
> + char *buffer = new char[len];
> + std::string msg;
> + (void) regerror (err, &regexp, buffer, len);
> + msg.assign (buffer, len);
> + delete buffer;
> +
> + throw Xapian::QueryParserError (msg);
> + }
> +}
> +
> +RegexpPostingSource::RegexpPostingSource (Xapian::valueno slot, const std::string &regexp)
> + : slot_ (slot)
> +{
> + compile_regex (regexp_, regexp.c_str ());
> +}
> +
> +RegexpPostingSource::~RegexpPostingSource ()
> +{
> + regfree (&regexp_);
> +}
> +
> +void
> +RegexpPostingSource::init (const Xapian::Database &db)
> +{
> + db_ = db;
> + it_ = db_.valuestream_begin (slot_);
> + end_ = db.valuestream_end (slot_);
> + started_ = false;
> +}
> +
> +Xapian::doccount
> +RegexpPostingSource::get_termfreq_min () const
> +{
> + return 0;
> +}
> +
> +Xapian::doccount
> +RegexpPostingSource::get_termfreq_est () const
> +{
> + return get_termfreq_max () / 2;
> +}
> +
> +Xapian::doccount
> +RegexpPostingSource::get_termfreq_max () const
> +{
> + return db_.get_value_freq (slot_);
> +}
> +
> +Xapian::docid
> +RegexpPostingSource::get_docid () const
> +{
> + return it_.get_docid ();
> +}
> +
> +bool
> +RegexpPostingSource::at_end () const
> +{
> + return it_ == end_;
> +}
> +
> +void
> +RegexpPostingSource::next (unused (double min_wt))
> +{
> + if (started_ && ! at_end ())
> + ++it_;
> + started_ = true;
> +
> + for (; ! at_end (); ++it_) {
> + std::string value = *it_;
> + if (regexec (&regexp_, value.c_str (), 0, NULL, 0) == 0)
> + break;
> + }
> +}
> +
> +static inline Xapian::valueno _find_slot (std::string prefix)
> +{
> + if (prefix == "from")
> + return NOTMUCH_VALUE_FROM;
> + else if (prefix == "subject")
> + return NOTMUCH_VALUE_SUBJECT;
> + else
> + throw Xapian::QueryParserError ("unsupported regexp field '" + prefix + "'");
> +}
> +
> +RegexpFieldProcessor::RegexpFieldProcessor (std::string prefix, Xapian::QueryParser &parser_, notmuch_database_t *notmuch_)
> + : slot (_find_slot (prefix)), term_prefix (_find_prefix (prefix.c_str ())),
> + parser (parser_), notmuch (notmuch_)
> +{
> +};
> +
> +Xapian::Query
> +RegexpFieldProcessor::operator() (const std::string & str)
> +{
> + if (str.at (0) == '/') {
> + if (str.at (str.size () - 1) == '/'){
> + RegexpPostingSource *postings = new RegexpPostingSource (slot, str.substr(1,str.size () - 2));
> + return Xapian::Query (postings->release ());
> + } else {
> + throw Xapian::QueryParserError ("unmatch regex delimiter in '" + str + "'");
> + }
> + } else {
> + /* TODO replace this with a nicer API level triggering of
> + * phrase parsing, when possible */
> + std::string quoted='"' + str + '"';
> + return parser.parse_query (quoted, NOTMUCH_QUERY_PARSER_FLAGS, term_prefix);
> + }
> +}
> +#endif
> diff --git a/lib/regexp-fields.h b/lib/regexp-fields.h
> new file mode 100644
> index 00000000..bac11999
> --- /dev/null
> +++ b/lib/regexp-fields.h
> @@ -0,0 +1,77 @@
> +/* regex-fields.h - xapian glue for semi-bruteforce regexp search
> + *
> + * This file is part of notmuch.
> + *
> + * Copyright © 2015 Austin Clements
> + * Copyright © 2016 David Bremner
> + *
> + * This program is free software: you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation, either version 3 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program. If not, see https://www.gnu.org/licenses/ .
> + *
> + * Author: Austin Clements <aclements at csail.mit.edu>
> + * David Bremner <david at tethera.net>
> + */
> +
> +#ifndef NOTMUCH_REGEXP_FIELDS_H
> +#define NOTMUCH_REGEXP_FIELDS_H
> +#if HAVE_XAPIAN_FIELD_PROCESSOR
> +#include <sys/types.h>
> +#include <regex.h>
> +#include "database-private.h"
> +#include "notmuch-private.h"
> +
> +/* A posting source that returns documents where a value matches a
> + * regexp.
> + */
> +class RegexpPostingSource : public Xapian::PostingSource
> +{
> + protected:
> + const Xapian::valueno slot_;
> + regex_t regexp_;
> + Xapian::Database db_;
> + bool started_;
> + Xapian::ValueIterator it_, end_;
> +
> +/* No copying */
> + RegexpPostingSource (const RegexpPostingSource &);
> + RegexpPostingSource &operator= (const RegexpPostingSource &);
> +
> + public:
> + RegexpPostingSource (Xapian::valueno slot, const std::string &regexp);
> + ~RegexpPostingSource ();
> + void init (const Xapian::Database &db);
> + Xapian::doccount get_termfreq_min () const;
> + Xapian::doccount get_termfreq_est () const;
> + Xapian::doccount get_termfreq_max () const;
> + Xapian::docid get_docid () const;
> + bool at_end () const;
> + void next (unused (double min_wt));
> +};
> +
> +
> +class RegexpFieldProcessor : public Xapian::FieldProcessor {
> + protected:
> + Xapian::valueno slot;
> + std::string term_prefix;
> + Xapian::QueryParser &parser;
> + notmuch_database_t *notmuch;
> +
> + public:
> + RegexpFieldProcessor (std::string prefix, Xapian::QueryParser &parser_, notmuch_database_t *notmuch_);
> +
> + ~RegexpFieldProcessor () { };
> +
> + Xapian::Query operator()(const std::string & str);
> +};
> +#endif
> +#endif /* NOTMUCH_REGEXP_FIELDS_H */
> diff --git a/test/T630-regexp-query.sh b/test/T630-regexp-query.sh
> new file mode 100755
> index 00000000..96bd8746
> --- /dev/null
> +++ b/test/T630-regexp-query.sh
> @@ -0,0 +1,81 @@
> +#!/usr/bin/env bash
> +test_description='regular expression searches'
> +. ./test-lib.sh || exit 1
> +
> +add_email_corpus
> +
> +
> +if [ $NOTMUCH_HAVE_XAPIAN_FIELD_PROCESSOR -eq 1 ]; then
Nitpick: I guess you could check for -eq 0 instead and call test_done
inside the if block; see T020-compact.sh. That would avoid indenting
all of the subtests below.
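
A minimal sketch of that early-exit shape, to show what I mean. To keep
it runnable on its own, test_done here is a stand-in for the helper that
test-lib.sh provides, and run_regexp_tests is a hypothetical wrapper that
does not exist in the patch; in the real test the if block would sit at
the top level of the script.

```shell
#!/usr/bin/env bash
# Stand-in for the real test_done from test-lib.sh (assumption: the
# real helper finishes the test script and exits).
test_done () {
    echo "test_done"
    exit 0
}

# Hypothetical wrapper, used only so the sketch can be exercised twice.
run_regexp_tests () {
    # Bail out early when field processors are unavailable, so the
    # subtests below need no extra level of indentation.
    if [ "${NOTMUCH_HAVE_XAPIAN_FIELD_PROCESSOR:-0}" -eq 0 ]; then
        test_done
    fi

    # Placeholder for the subtests from the patch.
    echo "running regexp subtests"
    test_done
}
```

With this shape the subtests stay at the outermost indentation level,
and the unsupported case is handled in one place at the top.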
Otherwise, LGTM, with the caveat that I didn't really study how all the
Xapian field processor stuff is supposed to work...

BR,
Jani.
> +
> + notmuch search --output=messages from:cworth > cworth.msg-ids
> +
> + test_begin_subtest "regexp from search, case sensitive"
> + notmuch search --output=messages from:/carl/ > OUTPUT
> + test_expect_equal_file /dev/null OUTPUT
> +
> + test_begin_subtest "empty regexp or query"
> + notmuch search --output=messages from:/carl/ or from:/cworth/ > OUTPUT
> + test_expect_equal_file cworth.msg-ids OUTPUT
> +
> + test_begin_subtest "non-empty regexp and query"
> + notmuch search from:/cworth at cworth.org/ and subject:patch | notmuch_search_sanitize > OUTPUT
> + cat <<EOF > EXPECTED
> +thread:XXX 2009-11-18 [1/2] Carl Worth| Alex Botero-Lowry; [notmuch] [PATCH] Error out if no query is supplied to search instead of going into an infinite loop (attachment inbox unread)
> +thread:XXX 2009-11-18 [1/2] Carl Worth| Ingmar Vanhassel; [notmuch] [PATCH] Typsos (inbox unread)
> +thread:XXX 2009-11-18 [1/2] Carl Worth| Jan Janak; [notmuch] [PATCH] Older versions of install do not support -C. (inbox unread)
> +thread:XXX 2009-11-18 [1/2] Carl Worth| Keith Packard; [notmuch] [PATCH] Make notmuch-show 'X' (and 'x') commands remove inbox (and unread) tags (inbox unread)
> +thread:XXX 2009-11-18 [2/5] Carl Worth| Mikhail Gusarov, Keith Packard; [notmuch] [PATCH 1/2] Close message file after parsing message headers (inbox unread)
> +EOF
> + test_expect_equal_file EXPECTED OUTPUT
> +
> + test_begin_subtest "regexp from search, duplicate term search"
> + notmuch search --output=messages from:/cworth/ > OUTPUT
> + test_expect_equal_file cworth.msg-ids OUTPUT
> +
> + test_begin_subtest "long enough regexp matches only desired senders"
> + notmuch search --output=messages 'from:"/C.* Wo/"' > OUTPUT
> + test_expect_equal_file cworth.msg-ids OUTPUT
> +
> + test_begin_subtest "shorter regexp matches one more sender"
> + notmuch search --output=messages 'from:"/C.* W/"' > OUTPUT
> + { echo id:1258544095-16616-1-git-send-email-chris at chris-wilson.co.uk; cat cworth.msg-ids; } > EXPECTED
> + test_expect_equal_file EXPECTED OUTPUT
> +
> + test_begin_subtest "regexp subject search, non-ASCII"
> + notmuch search --output=messages subject:/accentué/ > OUTPUT
> + echo id:877h1wv7mg.fsf at inf-8657.int-evry.fr > EXPECTED
> + test_expect_equal_file EXPECTED OUTPUT
> +
> + test_begin_subtest "regexp subject search, punctuation"
> + notmuch search subject:/\'X\'/ | notmuch_search_sanitize > OUTPUT
> + cat <<EOF > EXPECTED
> +thread:XXX 2009-11-18 [2/2] Keith Packard, Carl Worth; [notmuch] [PATCH] Make notmuch-show 'X' (and 'x') commands remove inbox (and unread) tags (inbox unread)
> +EOF
> + test_expect_equal_file EXPECTED OUTPUT
> +
> + test_begin_subtest "regexp subject search, no punctuation"
> + notmuch search subject:/X/ | notmuch_search_sanitize > OUTPUT
> + cat <<EOF > EXPECTED
> +thread:XXX 2009-11-18 [2/2] Keith Packard, Carl Worth; [notmuch] [PATCH] Make notmuch-show 'X' (and 'x') commands remove inbox (and unread) tags (inbox unread)
> +thread:XXX 2009-11-18 [4/4] Jjgod Jiang, Alexander Botero-Lowry; [notmuch] Mac OS X/Darwin compatibility issues (inbox unread)
> +EOF
> + test_expect_equal_file EXPECTED OUTPUT
> +
> + test_begin_subtest "combine regexp from and subject"
> + notmuch search subject:/-C/ and from:/.an.k/ | notmuch_search_sanitize > OUTPUT
> + cat <<EOF > EXPECTED
> +thread:XXX 2009-11-17 [1/2] Jan Janak| Carl Worth; [notmuch] [PATCH] Older versions of install do not support -C. (inbox unread)
> +EOF
> + test_expect_equal_file EXPECTED OUTPUT
> +
> + test_begin_subtest "regexp error reporting"
> + notmuch search 'from:/unbalanced[/' 1>OUTPUT 2>&1
> + cat <<EOF > EXPECTED
> +notmuch search: A Xapian exception occurred
> +A Xapian exception occurred performing query: Invalid regular expression
> +Query string was: from:/unbalanced[/
> +EOF
> + test_expect_equal_file EXPECTED OUTPUT
> +fi
> +
> +test_done
> --
> 2.11.0
>