nbook: a notmuch based address book written in python

Justus Winter 4winter at informatik.uni-hamburg.de
Mon Oct 15 03:58:30 PDT 2012


Hi Suvayu :)

welcome to notmuch and python.

Quoting Patrick Totzke (2012-10-13 18:58:51)
> > > And If I look for my own name, this takes over a minute,
> > > eventually dying. This could be an issue with libnotmuch though.
> > > Possibly, your algorithm takes very long and then reads from an initially
> > > opened Database object again, which was invalidated by concurrent writes of other processes..

Hm no, see below.

> > > -------------------------------
> > > [~] time nbook Patrick                     
> > > 
> > > Error opening /home/pazz/mail/gmail/[Google Mail].All Mail/cur/1330682270_0.12958.megatron,U=8766,FMD5=66ff6a8bc18a8a3ac4b311daa93d358a:2,S: Too many open files
> > > Traceback (most recent call last):
> > >   File "/home/pazz/bin/nbook", line 167, in <module>
> > >   File "/home/pazz/bin/nbook", line 71, in __init__
> > >   File "/home/pazz/.local/lib/python2.7/site-packages/notmuch/message.py", line 233, in get_header
> > > notmuch.errors.NullPointerError
> > > Error in sys.excepthook:
> > > Traceback (most recent call last):
> > >   File "/usr/lib/python2.7/dist-packages/apport_python_hook.py", line 66, in apport_excepthook
> > > ImportError: No module named fileutils
> > > 
> > > Original exception was:
> > > Traceback (most recent call last):
> > >   File "/home/pazz/bin/nbook", line 167, in <module>
> > >   File "/home/pazz/bin/nbook", line 71, in __init__
> > >   File "/home/pazz/.local/lib/python2.7/site-packages/notmuch/message.py", line 233, in get_header
> > > notmuch.errors.NullPointerError
> > > nbook Patrick  3.20s user 5.47s system 12% cpu 1:11.65 total
> > > ------------------------------------
> > > 
> > 
> > Yes someone else pointed this out too.  Again I'm not sure how to
> > proceed here.  I had a quick look at this last week and it seemed to me
> > the limitation comes from within the python bindings for notmuch.  Do
> > you have any ideas?
> 
> As mentioned before, I think you invalidate the Database object concurrently
> while your long-running algorithm goes through all messages.
> Xapian doesn't handle concurrent access to the index like a normal™ database would.
> This means you are notified by this error that some changes were detected.
> Maybe the error message should be more telling here though. Teythoon?

The reason for this error is exactly what the error message says, you
are opening to many files. Check out this limit using ulimit -n:

% ulimit -n
4096

This problem is subtle. Here is a minimal test case:

~~~ snip ~~~
import notmuch

with notmuch.Database() as db:
    query = notmuch.Query(db, 'a').search_messages()
    for msg in query:
        msg.get_header('from')

with notmuch.Database() as db:
    query = notmuch.Query(db, 'a').search_messages()
    for msg in list(query):
        msg.get_header('from')
~~~ snap ~~~

% python test.py
Error opening /home/teythoon/Maildir/.lists.notmuch/cur/1323251462.M53044P18514.thinkbox,S=7306,W=7466:2,: Too many open files
Traceback (most recent call last):
  File "test.py", line 11, in <module>
    msg.get_header('from')
  File "/home/teythoon/.local/lib/python2.7/site-packages/notmuch/message.py", line 237, in get_header
    raise NullPointerError()
notmuch.errors.NullPointerError

Observe that it blows up in line 11, the first version works. The only
difference is that the second version creates a list from the notmuch
query. This prevents the garbage collector from collecting the message
objects and thus closing the file handles. So here's your fix:

~~~ snip ~~~
diff --git a/nbook b/nbook
index 387c71d..b3d4fd6 100755
--- a/nbook
+++ b/nbook
@@ -173,7 +173,7 @@ class AddressHeaders(object):
 # Search
 db = Database()
 query = Query(db, 'from:"{0}" or to:"{0}"'.format(querystr))
-msgs = list(query.search_messages())
+msgs = query.search_messages()
 
 addresses = AddressHeaders(msgs, querystr)
 print addresses
~~~ snap ~~~

A few more comments:

> from notmuch import *

Please avoid * imports, they prevent tools like pyflakes from checking
whether you accidentally misspelled any identifiers.

> pyversion = float('%d.%d' % (sys.version_info.major, sys.version_info.minor))
> if pyversion < 2.7:

Converting this to float feels wrong. Consider doing sth like

if sys.version_info.major > 2 or (sys.version_info.major == 2 and sys.version_info.minor >= 7):

>     print '`nbook\' needs Python 2.7 or higher for argparse'

Note that in py3k print is a function and not a statement, so you need
to use braces. Consider dropping this at the beginning of all your
python files to make py2.7 use the new features:

from __future__ import print_function, absolute_import, unicode_literals

>     exit(-1)

exit is not a builtin function. You have to use sys.exit. Tools like
pyflakes can spot this kind of mistakes. Also, sys.exit also accepts a
string as argument which it prints to stderr before exiting with an
error code.

>         self.__fromhdr__ += ',' + msg.get_header('from')

Hm, this is somewhat unpythonic. It used to be the case that building
strings this way was a lot slower than building a list and then
joining it on a delimiter of your choice
(i.e. ','.join(from_headers)). This is (was?) because strings are
immutable in python and constantly creating strings just to throw them
away in the next iteration puts a lot of pressure on the memory
management system. Somewhat recent discussion here:

http://stackoverflow.com/questions/1316887/what-is-the-most-efficient-string-concatenation-method-in-python

>     def print_addrs(self, fmtstr='', query=''):
>         if '' == fmtstr: fmtstr = '%s    %s\n'

Ok, several things here:

* The comparison looks weird, you are using the string constant as the
  first operand. While this is technically not wrong, it is somewhat
  unpythonic b/c if you read it out loud (''if the empty string is
  equal to fmtstr'') it somewhat bends the 1:1 mapping of the semantic
  of your program and the English sentence. It looks like this c hack
  that is actually unnecessary in python b/c you cannot use the
  assignment operator as a value (except for a=b=c=0 style
  assignments).

* Please don't put multiple statements in one line.

* This can be written shorter and more idiomatic (yay keyword
  arguments):

    def print_addrs(self, fmtstr='%s    %s\n', query=''):
        [...]

Happy hacking :)
Justus


More information about the notmuch mailing list