Alternative (raw) message store (i.e. instead of maildir)

Ciprian Dorin Craciun ciprian.craciun at gmail.com
Tue Aug 14 09:38:22 PDT 2012


On Tue, Aug 14, 2012 at 7:04 PM, Vladimir Marek
<Vladimir.Marek at oracle.com> wrote:
>> >  - fuse zip stores all changes in memory until unmounted
>> >  - fuse zip (and libzip for that matter) creates new temporary file when
>> >    updating archive, which takes considerable time when the archive is
>> >    very big.
>>
>> This isn't much of a hassle if you have a maildir per time period and
>> archive off. Maybe if you sync flags it may be...
>
> That might be an interesting solution, a maildir per time period.


    Even so, using a zip file through FUSE as a maildir store is not
much better, in my opinion.

    This is because it still doesn't solve the syscall overhead. For
example, just walking the list of files to find those that have
changed requires the following syscalls per message:
    * reading the next directory entry (amortized, as entries are read
in batches, but the batch size is limited, so say one syscall per 10
files?);
    * stat-ing the file.

    Now by adding FUSE we add an extra context switch for each syscall...

    Granted, this issue would be problematic only for reindexing, but still...
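
    A minimal sketch of such a scan, assuming a hypothetical maildir
folder layout and a `last_mtimes` map kept by the indexer (both names
are illustrative, not anything notmuch actually does):
~~~~
#!/usr/bin/env python3
# Hypothetical sketch: detect changed messages in one maildir folder.
import os

def changed_messages(folder, last_mtimes):
    changed = []
    cur = os.path.join(folder, "cur")
    for name in os.listdir(cur):               # getdents(), amortized per batch
        st = os.stat(os.path.join(cur, name))  # one stat() per message
        if last_mtimes.get(name) != st.st_mtime:
            changed.append(name)
    return changed
~~~~
    Under FUSE each of those syscalls additionally round-trips through
the userspace filesystem process, which is the extra context switch
mentioned above.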


> But still
> fuse zip caches all the data until unmounted. So even with just reading
> it keeps growing (I hope I'm not accusing fuse zip here, but this is my
> understanding from the code). This could be simply alleviated by having
> it periodically unmounted and mounted again (perhaps from cron).

    I think there is a FUSE mount option (`kernel_cache` / `direct_io`)
to specify whether the data should be cached by the kernel or not; as
such this shouldn't be a problem for FUSE itself, unless the zip FUSE
handler does some extra caching of its own.


>> > Of course this solution would have some disadvantages too, but for me
>> > the advantages would win. At the moment I'm not sure if I want to
>> > continue working on that. Maybe if there would be more interested guys
>>
>> I'm *really* tempted to investigate making this work for archived
>> mail. Of course, the list of mounted file systems could get insane
>> depending on granularity I guess...
>
> Well, if your granularity will be one archive per year of mail, it
> should not be that bad ...


    On the other hand, I strongly support having a more optimized
backend for emails, especially for such cases. For example, BerkeleyDB
would fit such a use case perfectly, especially if we store the bodies
and the headers in separate databases.
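
    A minimal sketch of that split, assuming the third-party
`berkeleydb` Python bindings; the file names and the helper function
are illustrative, not an existing notmuch backend:
~~~~
# Hypothetical sketch: headers and bodies kept in separate B-Tree databases.
from berkeleydb import db   # successor of the older bsddb3 bindings

headers = db.DB()
headers.open("headers.db", None, db.DB_BTREE, db.DB_CREATE)
bodies = db.DB()
bodies.open("bodies.db", None, db.DB_BTREE, db.DB_CREATE)

def store_message(key, header_bytes, body_bytes):
    # Headers (small, read on every search) stay in their own small
    # database, so scanning them never pages in the large bodies.
    headers.put(key, header_bytes)
    bodies.put(key, body_bytes)
~~~~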

    As a small experiment, below is the R `summary(emails)` output for
the sizes of my 700k emails:
~~~~
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
       8     4364     5374    11510     7042 31090000
~~~~

    As can be seen, 75% of the emails are below 7 KiB, and this is
without any compression...
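
    For reference, a rough way to gather the same statistics over a
maildir tree (the `~/mail` path is only an assumed example; the
figures above were produced with R, not with this snippet):
~~~~
# Hypothetical sketch: quartiles of message sizes under a mail tree.
import os
import statistics

sizes = []
for root, _dirs, files in os.walk(os.path.expanduser("~/mail")):
    for name in files:
        sizes.append(os.path.getsize(os.path.join(root, name)))

q1, median, q3 = statistics.quantiles(sizes, n=4)   # Python 3.8+
print(min(sizes), q1, median, sum(sizes) / len(sizes), q3, max(sizes))
~~~~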

    Moreover, we could organize the keys so that, in a B-Tree
structure, the emails in the same thread end up next to each other...
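
    One possible key scheme, purely as an illustration (the separator
and the field order are assumptions, not an existing format):
~~~~
# Hypothetical sketch: thread-first keys cluster a thread's messages
# in B-Tree order, so fetching a whole thread becomes one range scan.
def message_key(thread_id, date_iso, message_id):
    return "\x00".join((thread_id, date_iso, message_id)).encode("utf-8")

keys = sorted([
    message_key("thread-a", "2012-08-14", "msg-3@example.org"),
    message_key("thread-b", "2012-08-13", "msg-2@example.org"),
    message_key("thread-a", "2012-08-12", "msg-1@example.org"),
])
# Both "thread-a" messages now sort adjacently, in date order.
~~~~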

    Ciprian.
