Mail archives in Git using ssoma

W. Trevor King wking at tremily.us
Fri Nov 7 11:03:21 PST 2014


Hello everyone :),

I like Git, so when folks suggest storing things in Git, I'm usually
excited ;).  Eric Wong has been working on some tools to store email
in a Git repository, and his client-side code is ssoma [1].  I wanted
a bit more metadata than the stock ssoma-mda [2], and ended up just
writing a ssoma-mda in Python [3].  It needs Python ≥3.4 and pygit2.
I had pygit2 already installed for Python 3.3 (which gave me a local
libgit2), so I used pip to install it for 3.4:

  $ python3.4 -m ensurepip --user
  $ pip3.4 install --user pygit2

Then I grabbed the archives, and pulled them into Git:

  $ wget http://notmuchmail.org/archives/notmuch.mbox
  $ git init --bare notmuch-archives.git
  $ cd notmuch-archives.git
  $ python3.4
  >>> import email.utils
  >>> import mailbox
  >>> import ssoma_mda
  >>> mbox = mailbox.mbox('../notmuch.mbox', factory=None, create=False)
  >>> messages = sorted(mbox, key=lambda m: email.utils.mktime_tz(email.utils.parsedate_tz(m['date'])))
  >>> for message in messages:
  ...     if ((message['message-id'] == '<m2k4gmyjer.fsf at ecocode.net>' and
  ...             message['X-List-Received-Date'] == 'Sat, 26 Feb 2011 14:23:34 -0000') or
  ...           (message['message-id'] == '<4EDF728E.3050204 at gmail.com>' and
  ...             message['X-List-Received-Date'] == 'Wed, 07 Dec 2011 14:05:16 -0000') or
  ...           (message['message-id'] == '<4FE369F2.5080804 at gmail.com>' and
  ...             message['X-List-Received-Date'] == 'Thu, 21 Jun 2012 18:38:07 -0000') or
  ...           (message['message-id'] == '<5122353D.4060601 at gmail.com>' and
  ...             message['X-List-Received-Date'] == 'Mon, 18 Feb 2013 14:06:12 -0000') or
  ...           (message['message-id'] == '<CA+eQo_1hMsTD4+6ifqgEQXW0_qYXGOdfkO6tBuGQKV+W7OSaKA at mail.gmail.com>' and
  ...             message['X-List-Received-Date'] == 'Wed, 24 Apr 2013 18:09:55 -0000') or
  ...           (message['message-id'] == '<527B9E8C.5000001 at krugs.de>' and
  ...             message['X-List-Received-Date'] == 'Thu, 07 Nov 2013 14:07:32 -0000') or
  ...           (message['message-id'] == '<1399645162-8653-1-git-send-email-wael.nasreddine at gmail.com>' and
  ...             message['X-List-Received-Date'] == 'Fri, 09 May 2014 14:19:36 -0000') or
  ...           (message['message-id'] == '<m2mw9xkyvg.fsf at krugs.de>' and
  ...             message['X-List-Received-Date'] == 'Thu, 18 Sep 2014 10:27:35 -0000') or
  ...           (message['message-id'] == '<cover.1411379395.git.jani at nikula.org>' and
  ...             message['X-List-Received-Date'] != 'Mon, 22 Sep 2014 09:54:16 -0000')):
  ...         continue
  ...     ssoma_mda.deliver(message=message, once=True)
  >>> ^D

On my 1.1GHz Intel Celeron 847 Sandy Bridge netbook, that took about
half an hour.  The initial repository was large:

  $ du -hs .
  394M    .

But packing it up made it small:

  $ git gc --aggressive
  $ du -hs .
  51M     .

With a few fewer messages than the mbox:

  $ git log --oneline | wc -l
  19650

Compared with 19660 messages in the mbox at 107 MB (160 MB for the
associated Maildir).

The messages I dropped were duplicates that shared a Message-ID with
a message I kept:

* id:m2k4gmyjer.fsf at ecocode.net had different received dates:

    -X-List-Received-Date: Sat, 26 Feb 2011 14:12:20 -0000
    +X-List-Received-Date: Sat, 26 Feb 2011 14:23:34 -0000

  but no significant differences.

* id:4EDF728E.3050204 at gmail.com had a real address in the
  first-to-arrive version:

    -X-List-Received-Date: Wed, 07 Dec 2011 14:10:13 -0000
    -> <4winter at informatik.uni-hamburg.de>

  and an obfuscated one in the second-to-arrive version:

    +X-List-Received-Date: Wed, 07 Dec 2011 14:05:16 -0000
    +> <4winter-jNDFPZUTrfQBEfOqpokbeYV0Y/DQsy6Ps0AfqQuZ5sE at public.gmane.org>

* id:4FE369F2.5080804 at gmail.com had the same:

    -X-List-Received-Date: Thu, 21 Jun 2012 18:37:54 -0000
    -> <R.M.Krug at gmail.com

    +X-List-Received-Date: Thu, 21 Jun 2012 18:38:07 -0000
    +> <mailto:R.M.Krug at gmail.com>> wrote:

* id:5122353D.4060601 at gmail.com had different received dates:

    -X-List-Received-Date: Mon, 18 Feb 2013 14:06:05 -0000
    +X-List-Received-Date: Mon, 18 Feb 2013 14:06:12 -0000

  but no significant differences.

* id:CA+eQo_1hMsTD4+6ifqgEQXW0_qYXGOdfkO6tBuGQKV+W7OSaKA at mail.gmail.com
  had different MIME boundaries:

    -Content-Type: multipart/alternative; boundary=f46d043be11ac45a0904db1f3428
    -X-List-Received-Date: Wed, 24 Apr 2013 18:09:46 -0000

    +Content-Type: multipart/alternative; boundary=e89a8f646ff3faa11d04db1f3294
    +X-List-Received-Date: Wed, 24 Apr 2013 18:09:55 -0000

  but no significant differences.

* id:527B9E8C.5000001 at krugs.de had obfuscated addresses:

    -X-List-Received-Date: Thu, 07 Nov 2013 14:07:33 -0000
    -> Rainer M Krug <Rainer at krugs.de> writes:

    +X-List-Received-Date: Thu, 07 Nov 2013 14:07:32 -0000
    +> Rainer M Krug <Rainer-vfylz/Ys1k4 at public.gmane.org> writes:

* id:1399645162-8653-1-git-send-email-wael.nasreddine at gmail.com had
  additional content in the later submission:

    -Subject: [PATCH] Add Travis-CI config file.
    -Date: Fri,  9 May 2014 07:19:22 -0700
    -X-List-Received-Date: Fri, 09 May 2014 14:19:36 -0000
    - .travis.yml | 10 ++++++++++
    - 1 file changed, 10 insertions(+)

    +Subject: [PATCH v2] Enable Travis-CI as a backup continuous integration
    +       service.
    +Date: Fri,  9 May 2014 14:44:50 -0700
    +X-List-Received-Date: Fri, 09 May 2014 21:45:16 -0000
    +
    +The v2 adds a notification section to send failure (or back to passing) notifications
    +to the mailing list and to the IRC channel
    +
    + .travis.yml | 13 +++++++++++++
    + 1 file changed, 13 insertions(+)

* id:m2mw9xkyvg.fsf at krugs.de had an obfuscated address and a
  different signature:

    -X-List-Received-Date: Thu, 18 Sep 2014 10:27:31 -0000
    ->> guyzmo <guyzmo at m0g.net> writes:
     -----BEGIN PGP SIGNATURE-----
     Version: GnuPG/MacGPG2 v2.0.22 (Darwin)
    -iQEcBAEBAgAGBQJUGrN3AAoJENvXNx4PUvmC4J0IAN9Wf+0ArvirJCoewItnEZoo
    -ySg4VRP7uWVqDxHVl5N9XFv4YE2bZ2E2eMGvbo6v7I82lhqeR5dauZhlgCMki+ZI

    +X-List-Received-Date: Thu, 18 Sep 2014 10:27:35 -0000
    +>> guyzmo <guyzmo-kMjww5mZloE at public.gmane.org> writes:
     -----BEGIN PGP SIGNATURE-----
     Version: GnuPG/MacGPG2 v2.0.22 (Darwin)
    +iQEcBAEBAgAGBQJUGrN4AAoJENvXNx4PUvmC6LsIAIaFrd4MFnm8EixrAHPGfW6j
    +L3KNG7Dv+hQuNRUN6qn+emZHI8wX4O74HOZOpZWkE09CmjkPJBmf7IuJwtz2ONbM

* id:cover.1411379395.git.jani at nikula.org came in three times, with
  three dates, but no significant differences:

    Date: Mon, 22 Sep 2014 11:54:20 +0200
    X-List-Received-Date: Mon, 22 Sep 2014 09:54:16 -0000

    Date: Mon, 22 Sep 2014 11:54:42 +0200
    X-List-Received-Date: Mon, 22 Sep 2014 09:54:37 -0000

    Date: Mon, 22 Sep 2014 11:54:51 +0200
    X-List-Received-Date: Mon, 22 Sep 2014 09:54:49 -0000
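
Duplicates like the ones listed above can be found by grouping the
mbox by Message-ID.  A minimal sketch, using only the standard
library; duplicate_message_ids is a hypothetical helper, and it only
reports candidates (choosing which copy to keep is still a manual
call):

```python
import collections
import mailbox


def duplicate_message_ids(mbox_path):
    """Map each repeated Message-ID to its X-List-Received-Date values.

    Hypothetical helper: Message-IDs that appear only once are left
    out of the result.
    """
    seen = collections.defaultdict(list)
    mbox = mailbox.mbox(mbox_path, factory=None, create=False)
    try:
        for message in mbox:
            seen[message['message-id']].append(
                message['x-list-received-date'])
    finally:
        mbox.close()
    return {mid: dates for mid, dates in seen.items() if len(dates) > 1}
```

Calling duplicate_message_ids('../notmuch.mbox') would give a
dictionary to compare against the skip conditions in the import
loop.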

Anyhow, I've pushed the Git archive [4,5] if anyone wants to play
around with ssoma.  I think this would be a nice backend for folks
building notmuch-based web archives, and pulling from Git is easier
than downloading a new mbox ;).

Cheers,
Trevor

[1]: http://ssoma.public-inbox.org/README
[2]: http://public-inbox.org/meta/m/ec8f54cf6451eef6e9f59eff691cd9002f4fdf65.html
[3]: http://git.tremily.us/?p=ssoma-mda.git;a=shortlog;h=refs/heads/python
     I have an uncommitted patch to work around http://bugs.python.org/issue22684
[4]: http://git.tremily.us/?p=notmuch-archives.git
[5]: git://tremily.us/notmuch-archives.git

-- 
This email may be signed or encrypted with GnuPG (http://www.gnupg.org).
For more information, see http://en.wikipedia.org/wiki/Pretty_Good_Privacy