[RFC PATCH 0/2] Custom query parser

Wed Dec 22 22:00:22 PST 2010

Following is a patch series that implements a custom query parser.
This code is by no means ready for serious review or inclusion in the
tree, but I wanted to spark some discussion and get feedback while
it's still molten.

The query parser is basically compatible with Xapian's, but is
designed to be flexible enough to support (at least) date searches,
saved queries, folder searches (without a schema change), "from:me",
and implicit filters (like defaulting to "-tag:spam").  Such features
are implemented as transformations over an intermediate
representation, keeping the parser itself as simple as possible.  The
code already uses transformers for the usual prefixes and the
type:mail filter, and I have most of the implementation of folder
searches in another branch.

If you'd like to play with it, it's on the qparser branch at
http://awakening.csail.mit.edu/git/notmuch.git.  It's not complete,
but it's close (notably, stemming is missing).

The grammar is approximately compatible with Xapian's query parser.
To my knowledge, it differs in the following ways:

1) The way this parser separates terms is much simpler and, I believe,
   more predictable.  In Xapian, many things can separate terms, and
   exactly what is context-dependent.  For example, "x/y" is parsed as
   two terms in a phrase, "x#y" is parsed as two separate terms "x"
   and "y", and "x# y" is parsed as two separate terms "x#" and "y".
   This parser instead splits the query into "term phrases", which are
   separated only by whitespace, quote, left paren, or right paren.
   These term phrases are then broken into terms using the same
   Xapian::TermGenerator that tokenizes documents.  If this results in
   multiple terms, they're tied into a phrase in the final query.
   Thus, "x/y" and "x# y" are handled as in Xapian, but "x#y" is
   parsed as a phrase.

2) This parser handles errors differently.  Because queries are
   roughly natural language, I feel they should never fail to parse
   (has Google ever told you that your query contains a syntax
   error?), so this parser tries to do reasonable things in all cases.
   Xapian's parser has a strange approach.  For syntax errors
   involving boolean operators (for example "x AND"), it returns a
   parse error.  For other syntax errors (for example, "x) OR y"),
   Xapian will *retry* the parse from scratch with no parser flags set
   (no operators, parens, or quotes).

3) The handling of NEAR and ADJ is quite different and possibly wrong.
   I didn't realize how subtle these operators were until I'd already
   implemented something completely different from Xapian's logic.

The code is much simpler than Xapian's query parser because it elides
several features that notmuch doesn't use (such as multi-term
synonyms), it reuses Xapian's TermGenerator to parse terms (instead of
copy-pasting and tweaking the code), and because I placed the boundary
between the lexer and the parser differently.  This parser is under
1000 lines, while Xapain's is 2000 lines plus a 5000 line parser
generator.