[PATCH v2] emacs: bad regexp @ `notmuch-search-process-filter'

Sat Jul 16 08:07:12 PDT 2011

On Wed, 13 Jul 2011 14:57:21 -0400, Austin Clements <amdragon at MIT.EDU> wrote:
> Quoth Pieter Praet on Jul 13 at  4:16 pm:
> > On Mon, 11 Jul 2011 17:05:32 -0400, Austin Clements <amdragon at MIT.EDU> wrote:
> > > Quoth Pieter Praet on Jul 11 at 10:43 pm:
> > > > TL;DR: I can haz regex pl0x?
> > > 
> > > Oof, what a pain.  I'm happy to change the output format of search; I
> > > hadn't realized how difficult it would be to parse.  In fact, I'm not
> > > sure it's even parsable by regexp, because the message ID's themselves
> > > could contain parens.
> > > 
> > > So what would be a good format?  One possibility would be to
> > > NULL-delimit the query part; as distasteful as I find that, this part
> > > of the search output isn't meant for user consumption.  Though I fear
> > > this is endemic to the dual role the search output currently plays as
> > > both user and computer readable.
> > > 
> > > I've also got the code to do everything using document ID's instead of
> > > message ID's.  As a side-effect, it makes the search output clean and
> > > readily parsable since document ID's are just numbers.  Hence, there
> > > are no quoting or escaping issues (plus the output is much more
> > > compact).  I haven't sent this to the list yet because I haven't had a
> > > chance to benchmark it and determine if the performance benefits make
> > > exposing document ID's worthwhile.
> > 
> > Jamie Zawinski once said/wrote [1]:
> >   'Some people, when confronted with a problem, think "I know,
> >   I'll use regular expressions." Now they have two problems.'
> > 
> > With this in mind, I set out to get rid of this whole regex mess altogether,
> > by populating the search buffer using Notmuch's JSON output instead of doing
> > brittle text matching tricks.
> > 
> > Looking for some documentation, I stumbled upon a long-forgotten gem [2].
> > 
> > David's already done pretty much all of the work for us!
> 
> Yes, similar thoughts were running through my head as I futzed with
> the formatting for this.  My concern with moving to JSON for search
> buffers is that parsing it is about *30 times slower* than the current
> regexp-based approach (0.6 seconds versus 0.02 seconds for a mere 1413
> result search buffer).  I think JSON makes a lot of sense for show
> buffers because there's generally less data and it has a lot of
> complicated structure.  Search results, on the other hand, have a very
> simple, regular, and constrained structure, so JSON doesn't buy us
> nearly as much.

That seems about right. Run against the entire Notmuch mailing list
archive, parsing the JSON output takes 23x longer than the regexp
approach (see the attached test).

> JSON is hard to parse because, like the text search output, it's
> designed for human consumption (of course, unlike the text search
> output, it's also designed for computer consumption).  There's
> something to be said for the debuggability and generality of this and
> JSON is very good for exchanging small objects, but it's a remarkably
> inefficient way to exchange large amounts of data between two
> programs.
> 
> I guess what I'm getting at, though it pains me to say it, is perhaps
> search needs a fast, computer-readable interchange format.  The
> structure of the data is so simple and constrained that this could be
> altogether trivial.

I guess that's our only option then. Could you implement it for me?
I'll make sure to rebase my patch series in an acceptable time frame.
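
To make the idea concrete, here is a minimal sketch of how the Emacs side
could consume such a format. Everything in it is an assumption: the record
layout (one result per line, NUL between fields), the field order (taken
from the groups in the regexp quoted below) and the function name are made
up for illustration, and notmuch emits nothing like this today.

```elisp
;; Hypothetical format (an assumption, not anything notmuch emits):
;; one result per line, fields separated by NUL, in the order
;; thread-id, date, count, authors, subject, tags.
(defun my/parse-search-line (line)
  "Split LINE on NUL and return the fields as a plist."
  (let ((fields (split-string line (string 0))))
    (list :thread  (nth 0 fields)
          :date    (nth 1 fields)
          :count   (nth 2 fields)
          :authors (nth 3 fields)
          :subject (nth 4 fields)
          ;; Tags are assumed to be space-separated in the last field.
          :tags    (split-string (or (nth 5 fields) "") " " t))))
```

Since NUL cannot appear in the fields themselves, parens or semicolons in
a subject or author name need no quoting or escaping.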

An extra output format shouldn't be much of a problem though, if we
further compartmentalize the code. What are your thoughts on moving to
a plugin-based architecture in the long term? E.g., a layout something
like this:

  ./input/{Maildir, ...}
  ./output/{plain, JSON, ...}
  ./filters/{crypto, ...}
  ./backends/{Xapian, ...}
  ./uis/{Emacs, VIM, web, ...}

> Or maybe I need a faster computer.

That's what M$ Tech Support would want you to believe :)
What we need is slower computers, so devs are forced to count cycles again.
The rise of netbooks has thankfully done wonders in this respect.

> If anyone is curious, here's how I timed the parsing.
> 
> (defmacro time-it (code)
>   `(let ((start-time (get-internal-run-time)))
>      ,code
>      (float-time (time-subtract (get-internal-run-time) start-time))))
> 
> (with-current-buffer "json"
>   (goto-char (point-min))
>   (time-it (json-read)))
> 
> (with-current-buffer "text"
>   (goto-char (point-min))
>   (time-it
>    (while (re-search-forward "^\\(thread:[0-9A-Fa-f]*\\) \\([^][]*\\) \\(\\[[0-9/]*\\]\\) \\([^;]*\\); \\(.*\\) (\\([^()]*\\))$" nil t))))
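
For anyone who wants to repeat the comparison without the macro, the same
measurement can be sketched with the built-in `benchmark-run'. The helper
names are made up for illustration, and note one difference: `benchmark-run'
reports wall-clock time, whereas the macro above uses
`get-internal-run-time', so the numbers are not directly comparable.

```elisp
(require 'benchmark)
(require 'json)

(defun my/time-json-read (json-text)
  "Return the elapsed seconds taken to `json-read' JSON-TEXT once."
  (with-temp-buffer
    (insert json-text)
    (goto-char (point-min))
    ;; `benchmark-run' returns (ELAPSED GC-COUNT GC-ELAPSED).
    (car (benchmark-run 1 (json-read)))))

(defun my/time-regexp-scan (text regexp)
  "Return the elapsed seconds taken to match REGEXP across all of TEXT."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    (car (benchmark-run 1
           (while (re-search-forward regexp nil t))))))
```

Feed the first one the output of the JSON search format and the second one
the plain text output plus the regexp, then compare the two results.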

Peace

-- 
Pieter
-------------- next part --------------
A non-text attachment was scrubbed...
Name: regexp-vs-json.org
Type: application/octet-stream
Size: 1436 bytes
Desc: not available
URL: <http://notmuchmail.org/pipermail/notmuch/attachments/20110716/93813def/attachment.obj>