Alternative (raw) message store (i.e. instead of maildir)
Vladimir Marek
Vladimir.Marek at Oracle.COM
Tue Aug 14 09:50:44 PDT 2012
> >> > - fuse zip stores all changes in memory until unmounted
> >> > - fuse zip (and libzip for that matter) creates new temporary file when
> >> > updating archive, which takes considerable time when the archive is
> >> > very big.
> >>
> >> This isn't much of a hassle if you have a maildir per time period and
> >> archive off. Maybe if you sync flags it may be...
> >
> > That might be an interesting solution, maildir per time period.
>
>
> Although using a zip file through FUSE as a maildir store is not
> much better in my opinion.
>
> This is because it still doesn't solve the syscall overhead. For
> example just going through the list of files to find those that
> changed requires the following syscalls:
> * reading the next directory entry (which is amortized as it reads
> them in a batch, but the batch size is limited, should we say 1
> syscall per 10 files?);
> * stat-ing the file;
>
> Now by adding FUSE we add an extra context switch for each syscall...
>
> This issue would be a problem only for reindexing, but still...
That's a price I would be willing to pay to have a single file instead of
many.
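
For reference, the scan described in the quoted text (read the directory
entries, then stat each file) can be sketched roughly as below. This is only
an illustration in Python, not notmuch code; the function name
changed_files and the last_seen dictionary are hypothetical. Run against a
FUSE mount, each of these calls would also pay the extra context switch
mentioned above.

```python
import os


def changed_files(directory, last_seen):
    """Return the names of files that are new or modified since the last scan.

    last_seen is a hypothetical cache: file name -> (mtime_ns, size).
    It is updated in place so the next call only reports newer changes.
    """
    changed = []
    # os.scandir fetches directory entries from the kernel in batches.
    with os.scandir(directory) as entries:
        for entry in entries:
            # Usually answered from the directory entry itself.
            if not entry.is_file(follow_symlinks=False):
                continue
            # One stat per file: the per-file cost discussed above.
            st = entry.stat(follow_symlinks=False)
            signature = (st.st_mtime_ns, st.st_size)
            if last_seen.get(entry.name) != signature:
                changed.append(entry.name)
                last_seen[entry.name] = signature
    return sorted(changed)
```

The cost grows with the number of files rather than with the number of
changes, which is why a full reindex is the painful case.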
> > But still
> > fuse zip caches all the data until unmounted. So even with just reading
> > it keeps growing (I hope I'm not accusing fuse zip here, but this is my
> > understanding from the code). This could be simply alleviated by having
> > it periodically unmounted and mounted again (perhaps from cron).
>
> I think there is an option for FUSE mount to specify if the data
> should be cached by the kernel or not, as such this shouldn't be a
> problem for FUSE itself, except if the Zip FUSE handler does some
> extra caching.
To my understanding it's the handler itself.
> >> > Of course this solution would have some disadvantages too, but for me
> >> > the advantages would win. At the moment I'm not sure if I want to
> >> > continue working on that. Maybe if there would be more interested guys
> >>
> >> I'm *really* tempted to investigate making this work for archived
> >> mail. Of course, the list of mounted file systems could get insane
> >> depending on granularity I guess...
> >
> > Well, if your granularity will be one archive per year of mail, it
> > should not be that bad ...
>
>
> On the other hand I strongly support having a more optimized
> backend for emails, especially for such cases. For example a
> BerkeleyDB would perfectly fit such a use case, especially if we store
> the body and the headers in separate databases.
>
> Just a small experiment, below are the R `summary(emails)` of the
> sizes of my 700k emails:
> ~~~~
>     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
>        8     4364     5374    11510     7042 31090000
> ~~~~
>
> As seen 75% of the emails are below 7k, and this without any compression...
>
> Moreover we could organize the keys so that in a B-Tree structure
> the emails in the same thread are closer together...
Now I'm not sure whether you are talking about some BerkeleyDB FUSE
filesystem or direct support in notmuch. I don't have enough cycles to
modify notmuch, so I started to look at a simpler (code-wise) solution ...
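
Just to make the quoted key-layout idea concrete, here is a rough sketch. It
uses Python's sqlite3 module as a stand-in for BerkeleyDB (both keep keys in
a B-tree), so it is not the proposed backend and not notmuch code. The table
names, thread_id, msg_id and the helper functions are all hypothetical, and
it assumes messages use a bare "\n\n" between headers and body.

```python
import sqlite3

# Keying on (thread_id, msg_id) in a WITHOUT ROWID table stores rows in
# primary-key order, so messages of one thread sit next to each other.
SCHEMA = """
CREATE TABLE IF NOT EXISTS {name} (
    thread_id TEXT,
    msg_id    TEXT,
    data      BLOB,
    PRIMARY KEY (thread_id, msg_id)
) WITHOUT ROWID
"""


def open_store(path=":memory:"):
    """Open the store with headers and bodies kept in separate tables."""
    db = sqlite3.connect(path)
    for name in ("headers", "bodies"):
        db.execute(SCHEMA.format(name=name))
    return db


def put(db, thread_id, msg_id, raw):
    """Split a raw message at the first blank line and store both parts."""
    head, _, body = raw.partition(b"\n\n")
    db.execute("INSERT OR REPLACE INTO headers VALUES (?, ?, ?)",
               (thread_id, msg_id, head))
    db.execute("INSERT OR REPLACE INTO bodies VALUES (?, ?, ?)",
               (thread_id, msg_id, body))
    db.commit()


def thread_headers(db, thread_id):
    """Return (msg_id, headers) for one thread without touching any body."""
    rows = db.execute(
        "SELECT msg_id, data FROM headers WHERE thread_id = ? ORDER BY msg_id",
        (thread_id,))
    return rows.fetchall()
```

Listing a thread then reads one contiguous key range from the headers table
only, which is the access pattern the separate-databases suggestion is
aiming at.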
To summarize, what I personally want from the mail storage:
- ability to read and write mails
- should work with mutt (or mutt-kz)
- simple backup to a Windows drive (file names can't contain a colon ':')
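
On the last point: maildir file names carry their flags after a colon (for
example ":2,S"), which is what breaks on Windows file systems. A minimal
sketch of one workaround, assuming the backup copy may use "!" in place of
the colon; the helper names and the choice of "!" are hypothetical.

```python
MAILDIR_SEP = ":"  # separator required by the maildir naming scheme
SAFE_SEP = "!"     # assumed replacement that Windows accepts in file names


def to_backup_name(name):
    """Map a maildir file name to one that is valid on a Windows drive."""
    if SAFE_SEP in name:
        # Refuse names that could not be mapped back unambiguously.
        raise ValueError("name already contains %r: %r" % (SAFE_SEP, name))
    return name.replace(MAILDIR_SEP, SAFE_SEP)


def from_backup_name(name):
    """Restore the original maildir file name from a backup name."""
    return name.replace(SAFE_SEP, MAILDIR_SEP)
```

The mapping is only reversible as long as the original names never contain
the replacement character, hence the check in to_backup_name.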
--
Vlad