one-time-iterators

Fri May 27 11:04:27 PDT 2011

Excerpts from Austin Clements's message of Fri May 27 03:41:44 +0100 2011:
> >> > > Have you tried simply calling list() on your thread
> >> > > iterator to see how expensive it is?  My bet is that it's quite cheap,
> >> > > both memory-wise and CPU-wise.
> >> > Funny thing:
> >> >  q=Database().create_query('*')
> >> >  time tlist = list(q.search_threads())
> >> > raises a NotmuchError(STATUS.NOT_INITIALIZED) exception. For some reason
> >> > the list constructor must read more than once from the iterator.
> >> > So this is not an option, but even if it worked, it would show
> >> > the same behaviour as my test above.
> >>
> >> Interesting.  Looks like the Threads class implements __len__ and that
> >> its implementation exhausts the iterator.  Which isn't a great idea in
> >> itself, but it turns out that Python's implementation of list() calls
> >> __len__ if it's available (presumably to pre-size the list) before
> >> iterating over the object, so it exhausts the iterator before even
> >> using it.
> >>
> >> That said, if list(q.search_threads()) did work, it wouldn't give you
> >> better performance than your experiment above.
True. Nevertheless, I think that list(q.search_threads())
should be equivalent to [t for t in q.search_threads()], which is
something to be fixed in the bindings. Should I file an issue somehow?
Or is it enough to state this as a TODO here on the list?
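
To illustrate the difference, here is a minimal sketch that does not use
notmuch at all. OneShot is a made-up stand-in for the Threads class: a
one-time iterator whose __len__ consumes the underlying data. It assumes
CPython, where list() asks for the length before it starts iterating,
while a list comprehension never does.

```python
class OneShot:
    """Hypothetical one-time iterator whose __len__ exhausts it."""

    def __init__(self, items):
        self._it = iter(items)
        self.len_calls = 0

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._it)

    def __len__(self):
        # Counting the remaining items consumes the underlying iterator.
        self.len_calls += 1
        return sum(1 for _ in self._it)


# list() calls __len__ first (to pre-size the result), so nothing is left
# to iterate over afterwards.
shot = OneShot("abc")
via_list = list(shot)

# The comprehension only calls __iter__/__next__ and sees every item.
comp_shot = OneShot("abc")
via_comp = [x for x in comp_shot]

print(via_list, shot.len_calls)
print(via_comp, comp_shot.len_calls)
```

In this toy version the exhausted iterator simply yields nothing; in the
bindings the same situation surfaces as the NOT_INITIALIZED error quoted
above.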

> >> > would it be very hard to implement a Query.search_thread_ids() ?
> >> > This name is a bit off because it had to be done on a lower level.
> >>
> >> Lazily fetching the thread metadata on the C side would probably
> >> address your problem automatically.  But what are you doing that
> >> doesn't require any information about the threads you're manipulating?
> > Agreed. Unfortunately, there seems to be no way to get a list of thread
> > ids or a reliable iterator thereof by using the current python bindings.
> > It would be enough for me to have the ids because then I could
> > search for the few threads I actually need individually on demand.
> 
> There's no way to do that from the C API either, so don't feel left
> out.  ]:--8)  It seems to me that the right solution to your problem
> is to make thread information lazy (effectively, everything gathered
> in lib/thread.cc:_thread_add_message).  Then you could probably
> materialize that iterator cheaply. 
Alright. I'll put this on my mental notmuch wish list and
hope that someone will have addressed this before I run out of
ideas for how to improve my UI and have time to look at this myself.
For now, I'll go with the [t.get_thread_id() for t in q.search_threads()]
approach to cache the thread ids myself and live with the fact that
this takes time for large result sets.
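
Roughly, what I have in mind looks like the sketch below. The helper names
cache_thread_ids and fetch_thread are mine, not part of the bindings; db is
assumed to be an open notmuch Database (or anything with a compatible
create_query), and the lookup relies on the thread:<id> search term.

```python
def cache_thread_ids(db, querystring):
    """Walk the one-time thread iterator once and keep only the ids."""
    query = db.create_query(querystring)
    return [t.get_thread_id() for t in query.search_threads()]


def fetch_thread(db, thread_id):
    """Look up a single thread again on demand; None if it no longer matches."""
    query = db.create_query('thread:' + thread_id)
    for thread in query.search_threads():
        return thread
    return None
```

The first call still pays the full cost of walking the result set, but
afterwards the UI only needs to query for the few threads it displays.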

> In fact, it's probably worth
> trying a hack where you put dummy information in the thread object
> from _thread_add_message and see how long it takes just to walk the
> iterator (unfortunately I don't think profiling will help much here
> because much of your time is probably spent waiting for I/O).
I don't think I understand what you mean by dummy info in a thread
object.

> I don't think there would be any downside to doing this for eager
> consumers like the CLI.
One should think so, yes.
/p