Mail archives in Git using ssoma
W. Trevor King
wking at tremily.us
Fri Nov 7 11:03:21 PST 2014
Hello everyone :),
I like Git, so when folks suggest storing things in Git, I'm usually
excited ;). Eric Wong has been working on some tools to store email
in a Git repository, and his client-side code is ssoma [1]. I wanted
a bit more metadata than the stock ssoma-mda [2], and ended up just
writing a ssoma-mda in Python [3]. It needs Python ≥3.4 and pygit2.
I had pygit2 already installed for Python 3.3 (which gave me a local
libgit2), so I used pip to install it for 3.4:
$ python3.4 -m ensurepip --user
$ pip3.4 install --user pygit2
Then I grabbed the archives, and pulled them into Git:
$ wget http://notmuchmail.org/archives/notmuch.mbox
$ git init --bare notmuch-archives.git
$ cd notmuch-archives.git
$ python3.4
>>> import email.utils
>>> import mailbox
>>> import ssoma_mda
>>> mbox = mailbox.mbox('../notmuch.mbox', factory=None, create=False)
>>> messages = sorted(mbox, key=lambda m: email.utils.mktime_tz(email.utils.parsedate_tz(m['date'])))
>>> for message in messages:
...     if ((message['message-id'] == '<m2k4gmyjer.fsf at ecocode.net>' and
...          message['X-List-Received-Date'] == 'Sat, 26 Feb 2011 14:23:34 -0000') or
...         (message['message-id'] == '<4EDF728E.3050204 at gmail.com>' and
...          message['X-List-Received-Date'] == 'Wed, 07 Dec 2011 14:05:16 -0000') or
...         (message['message-id'] == '<4FE369F2.5080804 at gmail.com>' and
...          message['X-List-Received-Date'] == 'Thu, 21 Jun 2012 18:38:07 -0000') or
...         (message['message-id'] == '<5122353D.4060601 at gmail.com>' and
...          message['X-List-Received-Date'] == 'Mon, 18 Feb 2013 14:06:12 -0000') or
...         (message['message-id'] == '<CA+eQo_1hMsTD4+6ifqgEQXW0_qYXGOdfkO6tBuGQKV+W7OSaKA at mail.gmail.com>' and
...          message['X-List-Received-Date'] == 'Wed, 24 Apr 2013 18:09:55 -0000') or
...         (message['message-id'] == '<527B9E8C.5000001 at krugs.de>' and
...          message['X-List-Received-Date'] == 'Thu, 07 Nov 2013 14:07:32 -0000') or
...         (message['message-id'] == '<1399645162-8653-1-git-send-email-wael.nasreddine at gmail.com>' and
...          message['X-List-Received-Date'] == 'Fri, 09 May 2014 14:19:36 -0000') or
...         (message['message-id'] == '<m2mw9xkyvg.fsf at krugs.de>' and
...          message['X-List-Received-Date'] == 'Thu, 18 Sep 2014 10:27:35 -0000') or
...         (message['message-id'] == '<cover.1411379395.git.jani at nikula.org>' and
...          message['X-List-Received-Date'] != 'Mon, 22 Sep 2014 09:54:16 -0000')):
...         continue
...     ssoma_mda.deliver(message=message, once=True)
>>> ^D
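The same filter can also be written as data instead of one long
conditional. This is only a minimal sketch, not part of ssoma_mda:
DROP, KEEP_ONLY, and should_skip are hypothetical names, and only two
of the drop entries from the loop above are repeated.

```python
import email.message

# (Message-ID, X-List-Received-Date) pairs for the copies to drop.
# Hypothetical table; only two entries from the loop are repeated here.
DROP = {
    ('<m2k4gmyjer.fsf at ecocode.net>', 'Sat, 26 Feb 2011 14:23:34 -0000'),
    ('<5122353D.4060601 at gmail.com>', 'Mon, 18 Feb 2013 14:06:12 -0000'),
}

# Message-IDs that arrived more than twice: keep only the listed copy.
KEEP_ONLY = {
    '<cover.1411379395.git.jani at nikula.org>':
        'Mon, 22 Sep 2014 09:54:16 -0000',
}


def should_skip(message):
    """Return True if this copy of the message should not be delivered."""
    message_id = message['message-id']
    received = message['X-List-Received-Date']
    if (message_id, received) in DROP:
        return True
    keep = KEEP_ONLY.get(message_id)
    # Skip every copy of a KEEP_ONLY message except the one listed.
    return keep is not None and received != keep
```

With that in place, the loop body reduces to "if should_skip(message):
continue" followed by the ssoma_mda.deliver(message=message, once=True)
call.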
On my 1.1GHz Intel Celeron 847 Sandy Bridge netbook, that took about
half an hour. The initial repository was large:
$ du -hs .
394M .
But packing it up made it small:
$ git gc --aggressive
$ du -hs .
51M .
With a few fewer messages than the mbox:
$ git log --oneline | wc -l
19650
Compared with 19660 messages in the mbox at 107 MB (160 MB for the
associated Maildir).
The messages I dropped removed duplicate Message-IDs:
* id:m2k4gmyjer.fsf at ecocode.net had different received dates:
-X-List-Received-Date: Sat, 26 Feb 2011 14:12:20 -0000
+X-List-Received-Date: Sat, 26 Feb 2011 14:23:34 -0000
but no significant differences.
* id:4EDF728E.3050204 at gmail.com had a real address in the
first-to-arrive version:
-X-List-Received-Date: Wed, 07 Dec 2011 14:10:13 -0000
-> <4winter at informatik.uni-hamburg.de>
and an obfuscated one in the second-to-arrive version:
+X-List-Received-Date: Wed, 07 Dec 2011 14:05:16 -0000
+> <4winter-jNDFPZUTrfQBEfOqpokbeYV0Y/DQsy6Ps0AfqQuZ5sE at public.gmane.org>
* id:4FE369F2.5080804 at gmail.com had the same:
-X-List-Received-Date: Thu, 21 Jun 2012 18:37:54 -0000
-> <R.M.Krug at gmail.com
+X-List-Received-Date: Thu, 21 Jun 2012 18:38:07 -0000
+> <mailto:R.M.Krug at gmail.com>> wrote:
* id:5122353D.4060601 at gmail.com had different received dates:
-X-List-Received-Date: Mon, 18 Feb 2013 14:06:05 -0000
+X-List-Received-Date: Mon, 18 Feb 2013 14:06:12 -0000
but no significant differences.
* id:CA+eQo_1hMsTD4+6ifqgEQXW0_qYXGOdfkO6tBuGQKV+W7OSaKA at mail.gmail.com
had different MIME boundaries:
-Content-Type: multipart/alternative; boundary=f46d043be11ac45a0904db1f3428
-X-List-Received-Date: Wed, 24 Apr 2013 18:09:46 -0000
+Content-Type: multipart/alternative; boundary=e89a8f646ff3faa11d04db1f3294
+X-List-Received-Date: Wed, 24 Apr 2013 18:09:55 -0000
but no significant differences.
* id:527B9E8C.5000001 at krugs.de had obfuscated addresses:
-X-List-Received-Date: Thu, 07 Nov 2013 14:07:33 -0000
-> Rainer M Krug <Rainer at krugs.de> writes:
+X-List-Received-Date: Thu, 07 Nov 2013 14:07:32 -0000
+> Rainer M Krug <Rainer-vfylz/Ys1k4 at public.gmane.org> writes:
* id:1399645162-8653-1-git-send-email-wael.nasreddine at gmail.com had
additional content in the later submission:
-Subject: [PATCH] Add Travis-CI config file.
-Date: Fri, 9 May 2014 07:19:22 -0700
-X-List-Received-Date: Fri, 09 May 2014 14:19:36 -0000
- .travis.yml | 10 ++++++++++
- 1 file changed, 10 insertions(+)
+Subject: [PATCH v2] Enable Travis-CI as a backup continuous integration
+ service.
+Date: Fri, 9 May 2014 14:44:50 -0700
+X-List-Received-Date: Fri, 09 May 2014 21:45:16 -0000
+
+The v2 adds a notification section to send failure (or back to passing) notifications
+to the mailing list and to the IRC channel
+
+ .travis.yml | 13 +++++++++++++
+ 1 file changed, 13 insertions(+)
* id:m2mw9xkyvg.fsf at krugs.de had an obfuscated address and different signature:
-X-List-Received-Date: Thu, 18 Sep 2014 10:27:31 -0000
->> guyzmo <guyzmo at m0g.net> writes:
-----BEGIN PGP SIGNATURE-----
Version: GnuPG/MacGPG2 v2.0.22 (Darwin)
-iQEcBAEBAgAGBQJUGrN3AAoJENvXNx4PUvmC4J0IAN9Wf+0ArvirJCoewItnEZoo
-ySg4VRP7uWVqDxHVl5N9XFv4YE2bZ2E2eMGvbo6v7I82lhqeR5dauZhlgCMki+ZI
+X-List-Received-Date: Thu, 18 Sep 2014 10:27:35 -0000
+>> guyzmo <guyzmo-kMjww5mZloE at public.gmane.org> writes:
-----BEGIN PGP SIGNATURE-----
Version: GnuPG/MacGPG2 v2.0.22 (Darwin)
+iQEcBAEBAgAGBQJUGrN4AAoJENvXNx4PUvmC6LsIAIaFrd4MFnm8EixrAHPGfW6j
+L3KNG7Dv+hQuNRUN6qn+emZHI8wX4O74HOZOpZWkE09CmjkPJBmf7IuJwtz2ONbM
* id:cover.1411379395.git.jani at nikula.org came in three times, with
three dates, but no significant differences:
Date: Mon, 22 Sep 2014 11:54:20 +0200
X-List-Received-Date: Mon, 22 Sep 2014 09:54:16 -0000
Date: Mon, 22 Sep 2014 11:54:42 +0200
X-List-Received-Date: Mon, 22 Sep 2014 09:54:37 -0000
Date: Mon, 22 Sep 2014 11:54:51 +0200
X-List-Received-Date: Mon, 22 Sep 2014 09:54:49 -0000
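If each Message-ID above appears exactly as described, the list
accounts for ten dropped messages (eight single duplicates plus two
extra copies of the cover letter), which matches 19660 - 19650.
Comparisons like the ones above can be reproduced with the standard
library. The following is a minimal sketch under that assumption, not
the exact procedure used here; duplicates and diff are hypothetical
helper names.

```python
import collections
import difflib
import mailbox


def duplicates(mbox):
    """Map each Message-ID that appears more than once to its messages."""
    by_id = collections.defaultdict(list)
    for message in mbox:
        by_id[message['message-id']].append(message)
    return {mid: msgs for mid, msgs in by_id.items() if len(msgs) > 1}


def diff(a, b):
    """Return a unified diff (no context lines) between two messages."""
    return list(difflib.unified_diff(
        a.as_string().splitlines(), b.as_string().splitlines(),
        lineterm='', n=0))


if __name__ == '__main__':
    # Path taken from the wget step; adjust it to wherever the mbox lives.
    mbox = mailbox.mbox('../notmuch.mbox', factory=None, create=False)
    for message_id, copies in sorted(duplicates(mbox).items(), key=str):
        print('* id:{}'.format(message_id))
        for line in diff(copies[0], copies[1]):
            print(line)
```

Only the first two copies of each Message-ID are compared, so a
message that arrived three times needs a second look by hand.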
Anyhow, I've pushed the Git archive [4,5] if anyone wants to play
around with ssoma. I think this would be a nice backend for folks
building notmuch-based web archives, and pulling from Git is easier
than downloading a new mbox ;).
Cheers,
Trevor
[1]: http://ssoma.public-inbox.org/README
[2]: http://public-inbox.org/meta/m/ec8f54cf6451eef6e9f59eff691cd9002f4fdf65.html
[3]: http://git.tremily.us/?p=ssoma-mda.git;a=shortlog;h=refs/heads/python
I have an uncommitted patch to work around http://bugs.python.org/issue22684
[4]: http://git.tremily.us/?p=notmuch-archives.git
[5]: git://tremily.us/notmuch-archives.git
--
This email may be signed or encrypted with GnuPG (http://www.gnupg.org).
For more information, see http://en.wikipedia.org/wiki/Pretty_Good_Privacy