Parsing mailman archive (content analysis)


#1

Hello,

Mailman produces those familiar archives: http://www.metzdowd.com/pipermail/cryptography/

I want to download the .gz files and analyse the textual content. I first wrote a script to quickly download and unzip the archives:

#!/bin/bash

# Import cryptography list

wget -r -l1 --no-parent --no-directories "http://www.metzdowd.com/pipermail/cryptography/" -P ./list-cryptography.metzdowd.com -A "*-*.txt.gz"

# gunzip deletes each .gz file once it has been decompressed, so no separate rm is needed
gunzip ./list-cryptography.metzdowd.com/*-*.txt.gz

It’s still exploratory, so I’m not sure which method I’m going to use to actually analyse the body of the emails. But I need to parse those archive files first. I mostly want a way to keep conversations together. I thought I could use the In-Reply-To header and the Message-ID, but apparently not every message has an In-Reply-To header, even if it’s a reply. Should I only use the subject line?

Headers of three messages from the same thread on the list:

From satoshi at vistomail.com  Fri Jan 16 11:03:14 2009
From: satoshi at vistomail.com (Satoshi Nakamoto)
Date: Sat, 17 Jan 2009 00:03:14 +0800
Subject: Bitcoin v0.1 released
Message-ID: <CHILKAT-MID-30c0e5a0-3435-5411-3f7b-3fe798efbe86@server123>

From hal at finney.org  Sat Jan 10 21:22:01 2009
From: hal at finney.org (Hal Finney)
Date: Sat, 10 Jan 2009 18:22:01 -0800 (PST)
Subject: Bitcoin v0.1 released
Message-ID: <20090111022201.C084C14F6E1@finney.org>

From jthorn at astro.indiana.edu  Sat Jan 17 11:49:45 2009
From: jthorn at astro.indiana.edu (Jonathan Thornburg)
Date: Sat, 17 Jan 2009 11:49:45 -0500 (EST)
Subject: Bitcoin v0.1 released
In-Reply-To: <CHILKAT-MID-30c0e5a0-3435-5411-3f7b-3fe798efbe86@server123>
References: <CHILKAT-MID-30c0e5a0-3435-5411-3f7b-3fe798efbe86@server123>
Message-ID: <alpine.BSO.1.10.0901171118280.17082@oxygen.astro.indiana.edu>

I’m also pretty sure there are already tools to group “conversations” or content before doing text analysis. No? One thing I want to do is search for a keyword and, if this keyword appears in one of the messages, analyse every message from that conversation (in other words, discard all conversations that do not contain the keyword).

If you have some pointers, that would be greatly appreciated. I don’t mind using Python or R (I’m not really familiar with either).

A+

Simon


#2

@gagarine while your idea makes sense, I would avoid tackling it directly. First of all, if you had access to the database, you could use Mailman’s core modules, Python’s standard library mailbox module, or the newer HyperKitty Django-based archiver.

Assuming that’s not an option and what you scrape is what you get, try this archive-to-mbox converter script, at which point you could use mailbox.mbox, or move it further into a more modern list. This RSS converter could also be helpful.
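
To make the mailbox.mbox suggestion concrete, here is a minimal sketch of reading a converted file back with the standard library. The function name list_headers is made up for illustration, and the choice of headers is only an assumption about what is useful for grouping conversations later.

```python
import mailbox


def list_headers(mbox_path):
    """Return (Message-ID, Subject, In-Reply-To) for every message in an mbox file."""
    # create=False makes a wrong path fail loudly instead of creating an empty mailbox
    box = mailbox.mbox(mbox_path, create=False)
    try:
        return [
            (msg.get("Message-ID"), msg.get("Subject"), msg.get("In-Reply-To"))
            for msg in box
        ]
    finally:
        box.close()
```

A header that is absent comes back as None, which makes it easy to spot replies that carry no In-Reply-To.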

Depending on what you’re comfortable with, once you’ve got the contents into structured form (i.e. separating subject, recipient, and other fields row-by-row) the analytics could be done with Pandas or Kibana, or anything in between. One other example: Indexing your Gmail
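
As a rough illustration of "structured form", the sketch below flattens an mbox file into one row per message and writes the rows to CSV using only the standard library. The column names in FIELDS and the function names are assumptions chosen for this example, not something the tools above prescribe.

```python
import csv
import mailbox
from email.utils import parsedate_to_datetime

# Hypothetical column layout: one row per message
FIELDS = ["message_id", "in_reply_to", "from", "date", "subject"]


def message_rows(mbox_path):
    """Yield one dict per message with the headers needed for later analysis."""
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for msg in box:
            raw_date = msg.get("Date")
            try:
                date = parsedate_to_datetime(str(raw_date)).isoformat() if raw_date else ""
            except (TypeError, ValueError):
                date = ""  # keep the row even when the Date header is malformed
            yield {
                "message_id": str(msg.get("Message-ID", "")),
                "in_reply_to": str(msg.get("In-Reply-To", "")),
                "from": str(msg.get("From", "")),
                "date": date,
                "subject": str(msg.get("Subject", "")),
            }
    finally:
        box.close()


def write_csv(mbox_path, csv_path):
    """Write all rows to csv_path and return the number of messages written."""
    count = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        for row in message_rows(mbox_path):
            writer.writerow(row)
            count += 1
    return count
```

The resulting file can then be loaded with pandas.read_csv or imported into whatever tool you prefer.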


#3

(I replied directly from my email a day ago, but apparently that doesn’t work. So here is my reply from a day ago.)

Thanks a lot @oleg!

This is a way better approach. I was quickly able to convert the archive to mbox format and import the result into a mail client to check that everything looked right.

I saw that Python has a built-in class for reading the mailbox format; from there, indexing the messages in any system seems super easy. I’m going to do that today. Then I will only have the really hard part… doing the analysis right ;).

I had to fix some minor problems in the importer and convert it to Python 3. I will post all my scripts when I’m done.


#4

So I have a working importer/converter system now:

#!/bin/bash

# Download and unzip the cryptography list
wget -r -l1 --no-parent --no-directories "http://www.metzdowd.com/pipermail/cryptography/" -P ./list-cryptography.metzdowd.com -A "*-*.txt.gz"
gzip -d list-cryptography.metzdowd.com/*.txt.gz

# convert to mbox
./mailmanToMBox.py list-cryptography.metzdowd.com

# Concatenate mbox
cat list-cryptography.metzdowd.com/*.mbox > list-cryptography.metzdowd.com/all.mbox

I updated the script to Python 3 and stopped handling the compressed files in it (I chose to unzip them beforehand with bash). I still have a problem with some special characters: some files are detected as UTF-8 but actually contain characters encoded in cp1252, which is why I ignore encoding errors… For a perfect conversion, line-by-line encoding detection should be used, I guess.
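
As an aside before the converter listing, the line-by-line idea could be sketched like this. It is not real encoding detection, only trial and error: each line is tried as UTF-8 first, then as cp1252, with latin-1 as a last resort. The order of the codecs is an assumption based on the mix described above, and decode_line and read_lines are names invented for this example.

```python
def decode_line(raw):
    """Decode one line of bytes: UTF-8 first, then cp1252, then latin-1."""
    for codec in ("utf-8", "cp1252"):
        try:
            return raw.decode(codec)
        except UnicodeDecodeError:
            continue
    # latin-1 maps every possible byte to a character, so this cannot fail
    return raw.decode("latin-1")


def read_lines(path):
    """Yield the lines of a file as text, decoding each line independently."""
    with open(path, "rb") as handle:
        for raw in handle:
            yield decode_line(raw)
```

A caveat: a cp1252 line that happens to be valid UTF-8 will be decoded as UTF-8, so the result is a best guess rather than a guarantee.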

#!/usr/bin/env python3
"""
mailmanToMBox.py:  Inserts line feeds to create mbox format from decompressed
Mailman Gzip'd Text archives
Usage:   ./mailmanToMBox.py  dir
Where dir is a directory containing .txt files pulled from Mailman Gzip'd Text archives and decompressed
"""
import sys
import os
import tokenize

def main():
    if len(sys.argv) !=2:
        print(__doc__)
        sys.exit()

    rootDir = sys.argv[1]
    numConv = 0
    for root, dirs, files in os.walk(rootDir):
        for fil in files:
            if fil.endswith('.txt'):
                # Join with root (the directory currently being walked), not
                # rootDir, so files in subdirectories resolve correctly
                inFile = os.path.join(root,fil)
                outFile = inFile[:-len('.txt')] + '.mbox'
                print('Converting',fil,'to mbox format')
                if not makeMBox(inFile,outFile):
                    print(outFile,'already exists, did not overwrite')
                else:
                    numConv +=1
    print('Converted',numConv,'archives to mbox format')
    

def makeMBox(fIn,fOut):
    '''
    from http://lists2.ssc.com/pipermail/linux-list/2006-February/026220.html
    '''
    if not os.path.exists(fIn):
        return False
    if os.path.exists(fOut):
        return False

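    lineNum = 0

    # Detect the encoding. tokenize.detect_encoding() is meant for Python
    # source files: it only looks for a BOM or a coding cookie in the first
    # two lines and otherwise falls back to UTF-8.
    with open(fIn,'rb') as raw:
        fInCodec = tokenize.detect_encoding(raw.readline)[0]

    # errors="replace" swaps undecodable bytes for U+FFFD instead of raising
    with open(fIn,'rt', encoding=fInCodec, errors="replace") as src, \
         open(fOut,'w', encoding='utf-8') as out:
        for line in src:
            # A line starting with "From " opens a new message
            if line.startswith("From "):
                if lineNum != 0:
                    out.write("\n")
                lineNum +=1
                # Restore the address that pipermail obfuscated as "user at host"
                line = line.replace(" at ", "@", 1)
            out.write(line)

    return True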

# INIT
if __name__ == '__main__':
    main()
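
Coming back to the grouping question from the first post, here is a rough sketch of one way to keep conversations together once everything is in all.mbox. It links messages through In-Reply-To and References, and falls back to the normalized subject for replies that lost those headers. The subject fallback is a heuristic and will merge unrelated threads that share a subject. All function names are invented for this example, and group_threads expects a list of messages, not a generator.

```python
import mailbox
import re
from collections import defaultdict

# Strips leading "Re:" / "Fwd:" markers; a heuristic, not a standard
REPLY_PREFIX = re.compile(r"^\s*((re|fwd?)\s*:\s*)+", re.IGNORECASE)
MSG_ID = re.compile(r"<[^<>\s]+>")


def normalize_subject(subject):
    """Lower-case the subject and drop reply/forward prefixes."""
    stripped = REPLY_PREFIX.sub("", subject or "")
    return " ".join(stripped.split()).lower()


def load_messages(mbox_path):
    """Read an mbox file into a list of dicts: id, subject, refs, body."""
    box = mailbox.mbox(mbox_path, create=False)
    try:
        messages = []
        for msg in box:
            linked = " ".join(
                str(msg.get(name, "")) for name in ("In-Reply-To", "References")
            )
            payload = msg.get_payload(decode=True) or b""  # None for multipart
            messages.append({
                "id": str(msg.get("Message-ID", "")).strip(),
                "subject": str(msg.get("Subject", "")),
                "refs": MSG_ID.findall(linked),
                "body": payload.decode("utf-8", errors="replace"),
            })
        return messages
    finally:
        box.close()


def group_threads(messages):
    """Group messages linked by Message-ID references or by normalized subject."""
    parent = {}

    def find(key):
        parent.setdefault(key, key)
        while parent[key] != key:
            parent[key] = parent[parent[key]]  # path halving
            key = parent[key]
        return key

    def union(a, b):
        parent[find(a)] = find(b)

    for index, msg in enumerate(messages):
        me = ("msg", index)
        find(me)  # register the message even if it links to nothing
        if msg["id"]:
            union(me, ("id", msg["id"]))
        for ref in msg["refs"]:
            union(me, ("id", ref))
        subject = normalize_subject(msg["subject"])
        if subject:  # fallback for replies without In-Reply-To
            union(me, ("subject", subject))

    threads = defaultdict(list)
    for index, msg in enumerate(messages):
        threads[find(("msg", index))].append(msg)
    return list(threads.values())


def threads_mentioning(threads, keyword):
    """Keep only the threads in which at least one message contains keyword."""
    needle = keyword.lower()
    return [
        thread for thread in threads
        if any(needle in m["subject"].lower() or needle in m["body"].lower()
               for m in thread)
    ]
```

Usage would be along the lines of threads_mentioning(group_threads(load_messages("list-cryptography.metzdowd.com/all.mbox")), "bitcoin"), which discards every conversation that never mentions the keyword and keeps all messages of the ones that do.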