Alternative (raw) message store (i.e. instead of maildir)

Vladimir Marek Vladimir.Marek at Oracle.COM
Tue Aug 14 09:50:44 PDT 2012


> >> >  - fuse zip stores all changes in memory until unmounted
> >> >  - fuse zip (and libzip for that matter) creates new temporary file when
> >> >    updating archive, which takes considerable time when the archive is
> >> >    very big.
> >>
> >> This isn't much of a hassle if you have maildir per time period and
> >> archive off. Maybe if you sync flags it may be...
> >
> > That might be an interesting solution, maildir per time period.
> 
> 
>     Although using a zip file through FUSE as a maildir store is not
> much better in my opinion.
> 
>     This is because it still doesn't solve the syscall overhead. For
> example just going through the list of files to find those that
> changed requires the following syscalls:
>     * reading the next directory entry (which is amortized as it reads
> them in a batch, but the batch size is limited, should we say 1
> syscall per 10 files?);
>     * stat-ing the file;
> 
>     Now by adding FUSE we add an extra context switch for each syscall...
> 
>     This issue would only be a problem for reindexing, but still...

That's a price I would be willing to pay to have a single file instead
of many.
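
To make the cost under discussion concrete, here is a minimal sketch of
the scan described above: list a directory and stat each file to find
the ones that changed. It is only an illustration, not notmuch code; the
function name changed_since and the use of mtime as the change test are
assumptions.

```python
import os

def changed_since(maildir, last_scan):
    """Return paths of regular files modified after last_scan (epoch seconds)."""
    changed = []
    # os.scandir fetches directory entries in batches behind the scenes
    with os.scandir(maildir) as entries:
        for entry in entries:
            # is_file() can usually be answered from the directory entry,
            # but stat() needs its own syscall for every file on POSIX
            if not entry.is_file(follow_symlinks=False):
                continue
            if entry.stat(follow_symlinks=False).st_mtime > last_scan:
                changed.append(entry.path)
    return sorted(changed)
```

Behind a FUSE mount, each of those per-file stat calls would also go
through the userspace handler, which is the extra context switch
mentioned in the quoted text.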




> > But still
> > fuse zip caches all the data until unmounted. So even with just reading
> > it keeps growing (I hope I'm not accusing fuse zip here, but this is my
> > understanding from the code). This could be simply alleviated by having
> > it periodically unmounted and mounted again (perhaps from cron).
> 
>     I think there is an option for FUSE mount to specify if the data
> should be cached by the kernel or not, as such this shouldn't be a
> problem for FUSE itself, except if the Zip FUSE handler does some
> extra caching.

As far as I can tell, it is the fuse zip handler itself that does the
caching.




> >> > Of course this solution would have some disadvantages too, but for me
> >> > the advantages would win. At the moment I'm not sure if I want to
> >> > continue working on that. Maybe if there would be more interested guys
> >>
> >> I'm *really* tempted to investigate making this work for archived
> >> mail. Of course, the list of mounted file systems could get insane
> >> depending on granularity I guess...
> >
> > Well, if your granularity is one archive per year of mail, it
> > should not be that bad ...
> 
> 
>     On the other hand I strongly support having a more optimized
> backend for emails, especially for such cases. For example a
> BerkeleyDB would perfectly fit such a use case, especially if we store
> the body and the headers in separate databases.
> 
>     Just a small experiment, below are the R `summary(emails)` of the
> sizes of my 700k emails:
> ~~~~
>     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
>        8     4364     5374    11510     7042 31090000
> ~~~~
> 
>     As seen, 75% of the emails are below 7k, and this without any compression...
> 
>     Moreover we could organize the keys so that in a B-Tree structure
> the emails in the same thread are closer together...

I'm not sure whether you are talking about a Berkeley DB fuse filesystem
or direct support in notmuch. I don't have enough cycles to modify
notmuch, so I started to look at a simpler (codewise) solution ...
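
For reference, the key layout suggested in the quoted text (emails of
the same thread stored next to each other in a B-Tree) could look
roughly like the sketch below. It uses a sorted in-memory list as a
stand-in for an ordered store and is not Berkeley DB code; SortedStore,
make_key and the "thread id, NUL, message id" key format are all
assumptions made for illustration.

```python
import bisect

def make_key(thread_id, message_id):
    # NUL separator keeps "t1" from matching keys of thread "t10"
    return thread_id.encode() + b"\x00" + message_id.encode()

class SortedStore:
    """Tiny in-memory stand-in for an ordered (B-Tree style) key/value store."""

    def __init__(self):
        self._keys = []      # kept sorted, like keys in a B-Tree
        self._values = {}

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def thread(self, thread_id):
        """Return all values of one thread with a single range scan."""
        prefix = thread_id.encode() + b"\x00"
        start = bisect.bisect_left(self._keys, prefix)
        found = []
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break
            found.append(self._values[key])
        return found

# Headers and bodies go to separate stores, as proposed in the quoted text
headers = SortedStore()
bodies = SortedStore()
headers.put(make_key("t1", "m1"), "Subject: hello")
bodies.put(make_key("t1", "m1"), "body of m1")
```

With such keys, reading a whole thread touches one contiguous key range
instead of scattered entries.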

To summarize, this is what I personally want from the mail storage:

- ability to read and write mails
- should work with mutt (or mutt-kz)
- simple backup to a Windows drive (file names can't contain a colon ':')
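
The colon matters because maildir file names carry their flags after a
":2," suffix. A minimal sketch of one way to rename files for the backup
copy is below; the function names and the choice of ';' as the
substitute character are assumptions, and a mail client would not
recognize the renamed files until they are mapped back.

```python
def windows_safe(name, substitute=";"):
    """Replace every colon, which Windows forbids in file names."""
    return name.replace(":", substitute)

def maildir_name(name, substitute=";"):
    """Restore the maildir flag separator (the last ';2,' in the name)."""
    base, sep, flags = name.rpartition(substitute + "2,")
    if not sep:
        # no flag suffix present, nothing to restore
        return name
    return base + ":2," + flags
```

For example, windows_safe("1344962244.M1P2.host:2,S") gives
"1344962244.M1P2.host;2,S", and maildir_name turns it back.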

-- 
	Vlad

