Searching messages by size with notmuch

Fri Jan 22 09:02:35 PST 2016

On Fri 2016-01-22 10:13:18 -0500, Antoine Amarilli wrote:

> After chatting on #notmuch, I wanted to suggest a feature which would be
> useful, at least to me: searching for messages by size.
>
> My use case would be to look for long messages, but dkg on IRC mentioned
> that it could also be useful to clean up messages to save disk space.
>
> It is unclear whether the size of a message should be defined as that of
> a single copy of the message, or that of all copies; and it is unclear
> whether it should be the total size (for my purposes I would have been
> interested in the size of the plaintext part of the message only).

Note that "the plaintext part of the message" might mean the sum of
multiple plaintext parts too -- consider this message as an example:

-------------- next part --------------
A non-text attachment was scrubbed...
Name: stock_smiley-7.png
Type: image/png
Size: 4296 bytes
Desc: not available
URL: <http://notmuchmail.org/pipermail/notmuch/attachments/20160122/7e353d9a/attachment.png>
-------------- next part --------------

Would you count the total of all related text/plain parts?  or all
text/* parts?  or, if you have a multipart/alternative node in the MIME
tree, would you report it as the maximum of any of the text/*
alternatives? 
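
To make those choices concrete, here is a minimal sketch using Python's standard email package (not notmuch's own MIME handling). The names text_size, any_text, alt_max and example_message are made up for illustration; the two flags select between the policies asked about above.

```python
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def text_size(node, any_text=False, alt_max=True):
    """Size in octets of the decoded text parts below `node`.

    any_text: count every text/* part instead of only text/plain.
    alt_max:  count a multipart/alternative node as the largest of its
              children instead of their sum.
    """
    if node.is_multipart():
        # For a multipart node, get_payload() returns the list of sub-parts.
        sizes = [text_size(child, any_text, alt_max)
                 for child in node.get_payload()]
        if alt_max and node.get_content_type() == "multipart/alternative":
            return max(sizes, default=0)
        return sum(sizes)
    is_text = node.get_content_maintype() == "text"
    if node.get_content_type() == "text/plain" or (any_text and is_text):
        payload = node.get_payload(decode=True)  # bytes after transfer decoding
        return len(payload) if payload is not None else 0
    return 0


def example_message():
    """Hypothetical message: two alternatives, an image, a trailing text part."""
    alt = MIMEMultipart("alternative")
    alt.attach(MIMEText("hello", "plain"))        # 5 octets
    alt.attach(MIMEText("<p>hello</p>", "html"))  # 12 octets
    msg = MIMEMultipart("mixed")
    msg.attach(alt)
    msg.attach(MIMEImage(b"\x89PNG not a real image", _subtype="png"))
    msg.attach(MIMEText("bye", "plain"))          # 3 octets
    return msg
```

The same message yields a different "plaintext size" under each policy, which is why the metric has to be pinned down before anything gets indexed.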

If you're really interested in textual analysis, then the "size of
plaintext" might instead be better measured in words or paragraphs,
rather than octets.  Also, you might want to ignore quoted text and only
measure non-quoted text (this is particularly relevant for
conversations where people top-post and don't trim, or else you're
actually measuring just how deep in the thread a given message is).
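
As a rough illustration of a word-based metric, the sketch below counts words on lines that do not look quoted. It assumes that quoted lines start with ">" (after optional whitespace), which is only a convention; the function name unquoted_word_count is made up, and attribution lines such as "On ... wrote:" are still counted.

```python
def unquoted_word_count(body):
    """Count whitespace-separated words on lines that are not '>'-quoted."""
    count = 0
    for line in body.splitlines():
        if line.lstrip().startswith(">"):
            continue  # skip quoted text at any nesting depth
        count += len(line.split())
    return count
```

A message that top-posts one short sentence above an untrimmed thread would score low here, even though its size in octets is large.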

I'm not trying to say that these metrics are impossible, just pointing
out that the underlying data formats can be much more complicated than
most people think about with mail.  The decision about what to count and
how to count it greatly affects the possible use cases.

> Ideally I'd say that all of these could make sense.

They could indeed, but I think we could motivate this much better as
initial work by picking one particular use case, and implementing it.
The work would be something like:

 a) choose the metric we care about, and describe concretely how to
    calculate it for a given rfc822 file.

 b) assign and name a new notmuch_value_t to identify the metric

 c) update notmuch_database_add_message to insert that new value when a
    file is added

 d) consider what workflows are available to update the database for
    already-indexed documents that do not have this value.

 e) resolve what to do about documents associated with multiple filenames

 f) define how to include it in searches (this is probably a
    NumberValueRangeProcessor, see
    file:///usr/share/doc/xapian-doc/valueranges.html)

 g) update documentation for notmuch cli tools.

If you work through this process for one particular "message size" use
case and document the steps, then we could presumably handle the other
"message size" metrics in exactly the same way, modifying only steps (a)
and (e) depending on the metric.
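
To illustrate the decisions behind steps (e) and (f) without touching Xapian or notmuch, here is a plain-Python sketch. Everything in it is an assumption made for illustration: the function names, the set of reducers, and the "LOW..HIGH" range syntax are not existing notmuch search syntax or API.

```python
def combine_sizes(file_sizes, how="max"):
    """Step (e): reduce the sizes of all files for one message to one value."""
    reducers = {"max": max, "min": min, "sum": sum}
    return reducers[how](file_sizes)


def parse_range(text):
    """Step (f): parse a hypothetical 'LOW..HIGH' range; a bound may be omitted."""
    low, sep, high = text.partition("..")
    if not sep:
        raise ValueError("expected LOW..HIGH, got %r" % text)
    return (int(low) if low else None, int(high) if high else None)


def in_range(value, bounds):
    """True if value lies within the inclusive bounds; None means unbounded."""
    low, high = bounds
    return (low is None or value >= low) and (high is None or value <= high)
```

Whether a message with two copies of different sizes matches a given range depends entirely on the reducer chosen in step (e), so that choice needs to be documented along with the search syntax.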

> Would anyone else on the list be interested by such a feature?

I'm definitely interested, but don't have a lot of time to work on it.

    --dkg