Mailman produces those familiar archives, such as http://www.metzdowd.com/pipermail/cryptography/
I want to download the .gz files and analyse their textual content. I first wrote a script to quickly download and unzip the archive:
    #!/bin/bash
    # Import the cryptography list archives
    wget -r -l1 --no-parent --no-directories "http://www.metzdowd.com/pipermail/cryptography/" -P ./list-cryptography.metzdowd.com -A "*-*.txt.gz"
    # gunzip replaces each .gz with its uncompressed file, so no separate rm is needed
    gunzip ./list-cryptography.metzdowd.com/*-*.txt.gz
It’s still exploratory, so I’m not sure which method I’ll use to actually analyse the bodies of the emails, but I need to parse those archive files first. I mostly want a way to keep conversations together. I thought I could use the In-Reply-To header and the Message-ID, but apparently not every message has an In-Reply-To header, even when it is a reply. Should I only use the subject line?
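For the parsing step, here is a minimal sketch in Python using the standard-library `mailbox` module. It assumes the unzipped pipermail files are close enough to mbox format (each message starting with a `From ` line) to be read that way. The helper name `read_headers`, the dictionary keys, and the placeholder bodies are made up for illustration; the headers are two of the samples quoted further down. Body lines that happen to start with `From ` could be mistaken for message boundaries, so real files may need some cleaning first.

```python
import mailbox
import tempfile
from pathlib import Path

# Two sample headers from this post; the one-line bodies are placeholders.
SAMPLE = """\
From satoshi at vistomail.com  Fri Jan 16 11:03:14 2009
From: satoshi at vistomail.com (Satoshi Nakamoto)
Date: Sat, 17 Jan 2009 00:03:14 +0800
Subject: Bitcoin v0.1 released
Message-ID: <CHILKAT-MID-30c0e5a0-3435-5411-3f7b-3fe798efbe86@server123>

placeholder body

From jthorn at astro.indiana.edu  Sat Jan 17 11:49:45 2009
From: jthorn at astro.indiana.edu (Jonathan Thornburg)
Date: Sat, 17 Jan 2009 11:49:45 -0500 (EST)
Subject: Bitcoin v0.1 released
In-Reply-To: <CHILKAT-MID-30c0e5a0-3435-5411-3f7b-3fe798efbe86@server123>
References: <CHILKAT-MID-30c0e5a0-3435-5411-3f7b-3fe798efbe86@server123>
Message-ID: <alpine.BSO.email@example.com>

placeholder body
"""


def read_headers(path):
    """Return the threading-related headers of every message in an mbox-style file."""
    box = mailbox.mbox(str(path), create=False)
    try:
        return [
            {
                "id": msg["Message-ID"],
                "in_reply_to": msg["In-Reply-To"],  # None when the header is absent
                "references": (msg["References"] or "").split(),
                "subject": msg["Subject"],
            }
            for msg in box
        ]
    finally:
        box.close()


# Write the sample to a temporary file so the sketch runs without the real archive.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.txt"
    sample_path.write_text(SAMPLE, encoding="utf-8")
    rows = read_headers(sample_path)

for row in rows:
    print(row["id"], "in reply to", row["in_reply_to"])
```

On the real data, `read_headers` would be called once per unzipped file in `./list-cryptography.metzdowd.com`.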
Here are three message headers from the same thread on the list:
From satoshi at vistomail.com Fri Jan 16 11:03:14 2009
From: satoshi at vistomail.com (Satoshi Nakamoto)
Date: Sat, 17 Jan 2009 00:03:14 +0800
Subject: Bitcoin v0.1 released
Message-ID: <CHILKAT-MID-30c0e5a0-3435-5411-3f7b-3fe798efbe86@server123>

From hal at finney.org Sat Jan 10 21:22:01 2009
From: hal at finney.org (Hal Finney)
Date: Sat, 10 Jan 2009 18:22:01 -0800 (PST)
Subject: Bitcoin v0.1 released
Message-ID: <20090111022201.C084C14F6E1@finney.org>

From jthorn at astro.indiana.edu Sat Jan 17 11:49:45 2009
From: jthorn at astro.indiana.edu (Jonathan Thornburg)
Date: Sat, 17 Jan 2009 11:49:45 -0500 (EST)
Subject: Bitcoin v0.1 released
In-Reply-To: <CHILKAT-MID-30c0e5a0-3435-5411-3f7b-3fe798efbe86@server123>
References: <CHILKAT-MID-30c0e5a0-3435-5411-3f7b-3fe798efbe86@server123>
Message-ID: <alpine.BSO.email@example.com>
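One possible way to keep such a thread together despite the missing In-Reply-To is to combine both signals: link messages through In-Reply-To and References where present, then fall back to a normalized subject line. The sketch below does this with a small union-find structure. The function names, the dictionary layout, and the fourth "unrelated" message are assumptions made for illustration. Note that the subject fallback will also merge genuinely separate conversations that happen to share a subject.

```python
import re
from collections import defaultdict


def normalize_subject(subject):
    """Drop leading Re:/Fwd: prefixes and [list] tags, then lowercase."""
    s = re.sub(r"^\s*((re|fwd?)\s*:\s*|\[[^\]]*\]\s*)+", "", subject or "", flags=re.I)
    return " ".join(s.split()).lower()


def group_threads(messages):
    """Group message IDs into threads using reply headers plus a subject fallback."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for m in messages:
        find(m["id"])
        # Link to every message this one claims to answer.
        for ref in [m.get("in_reply_to"), *m.get("references", [])]:
            if ref:
                union(m["id"], ref)
        # Fallback: link messages that share the same normalized subject.
        subject_key = normalize_subject(m.get("subject"))
        if subject_key:
            union(m["id"], "subject:" + subject_key)

    threads = defaultdict(list)
    for m in messages:
        threads[find(m["id"])].append(m["id"])
    return list(threads.values())


SATOSHI = "<CHILKAT-MID-30c0e5a0-3435-5411-3f7b-3fe798efbe86@server123>"
messages = [
    {"id": SATOSHI, "subject": "Bitcoin v0.1 released"},
    # No In-Reply-To here, so only the subject ties it to the thread.
    {"id": "<20090111022201.C084C14F6E1@finney.org>", "subject": "Bitcoin v0.1 released"},
    {"id": "<alpine.BSO.email@example.com>", "subject": "Bitcoin v0.1 released",
     "in_reply_to": SATOSHI, "references": [SATOSHI]},
    # Hypothetical extra message, added only to show a second thread.
    {"id": "<other@example.invalid>", "subject": "Unrelated topic"},
]

threads = group_threads(messages)
for thread in threads:
    print(len(thread), thread)
```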
I’m also pretty sure there are already tools that group “conversations” or content before text analysis, no? One thing I want to do is search for a keyword and, if that keyword appears in any message of a conversation, analyse every message in that conversation (in other words, discard all conversations that do not contain the keyword).
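The keyword filter described above could look roughly like this, assuming the messages have already been grouped into threads and reduced to dictionaries with `subject` and `body` strings. That layout, the function name `threads_mentioning`, and the sample bodies are all hypothetical.

```python
def threads_mentioning(threads, keyword):
    """Keep only the threads in which at least one message contains the keyword."""
    needle = keyword.casefold()
    kept = []
    for thread in threads:
        if any(needle in f'{m["subject"]}\n{m["body"]}'.casefold() for m in thread):
            kept.append(thread)  # keep the whole conversation, not just the hit
    return kept


# Hypothetical threads with placeholder bodies, for illustration only.
threads = [
    [
        {"subject": "Bitcoin v0.1 released", "body": "placeholder announcement text"},
        {"subject": "Bitcoin v0.1 released", "body": "placeholder reply about proof-of-work"},
    ],
    [
        {"subject": "Unrelated topic", "body": "placeholder text about something else"},
    ],
]

selected = threads_mentioning(threads, "proof-of-work")
print(len(selected), "thread(s) kept")
```

The match is a plain case-insensitive substring test; a word-boundary regular expression would be stricter if partial-word hits turn out to be a problem.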
If you have any pointers, they would be greatly appreciated. I don’t mind using Python or R (I’m not really familiar with either).