Apache Log Analysis with Haskell and Python

#Python #Web #Haskell

I’ve been planning to learn the haskell programming language for quite some time, but I was waiting for the right opportunity: a small, relatively easy project where I need some speed.

Today I thought I had finally found it: for several years I’ve been making custom statistics from the apache logs of one of my servers with a (quick and dirty) python script, and as the logs grew bigger and bigger, my script went awfully slow and memory hungry - to the point where it would not run any more.

So today I considered rewriting it in haskell. But actually my first tests show very surprising results.

The haskell script

I found an article on the internet with a simple haskell script for counting the number of unique IPs in a set of Apache logs:

module Main where
import qualified Data.Set as S
import Control.Applicative

parseIP = head . words
calculate = S.size . S.fromList . map parseIP . lines
main = print =<< calculate <$> getContents

That might be a good starting point. I’ve got a bunch of test logs with almost 1.8 million lines:

$ time cat test-logs/* | wc -l
1773140
cat test-logs/*  0.03s user 0.30s system 84% cpu 0.391 total
wc -l  0.13s user 0.15s system 70% cpu 0.390 total

So let’s test the haskell unique IP counting:

$ time cat test-logs/* | ./uniqueip 
107408
cat test-logs/*  0.14s user 1.37s system 5% cpu 28.497 total
./uniqueip  27.81s user 0.65s system 99% cpu 28.528 total

Wow! more than 28 seconds… I must say I was expecting something quicker, but let’s try python. It should be much slower, shouldn’t it?

The python script

My first surprise was that my python version was even shorter than the haskell version:

import sys
ips = {l.split()[0] for l in sys.stdin}
print(len(ips))

My second surprise was…

$ time cat test-logs/* | python uniqueip.py 
107408
cat test-logs/*  0.07s user 0.67s system 13% cpu 5.618 total
python uniqueip.py  5.46s user 0.17s system 99% cpu 5.642 total

Less than 6 secs… which is about 5 times quicker than the haskell script!

By the way, even the good old UNIX way is somewhat slower than the python version:

$ time cat test-logs/* | cut -d ' ' -f 1 | sort | uniq | wc -l
107408
cat test-logs/*  0.05s user 0.42s system 7% cpu 6.460 total
cut -d ' ' -f 1  1.40s user 0.18s system 24% cpu 6.459 total
sort  5.96s user 0.06s system 80% cpu 7.507 total
uniq  0.28s user 0.01s system 3% cpu 7.506 total
wc -l  0.00s user 0.02s system 0% cpu 7.505 total

(Note that this timing is still very impressive as it includes sorting…)

Oh, and before I forget, this was with python 3. Let’s try with python 2.7:

$ time cat test-logs/* | python2 uniqueip.py
107408
cat test-logs/*  0.04s user 0.35s system 10% cpu 3.577 total
python2 uniqueip.py  3.39s user 0.19s system 99% cpu 3.600 total

… and with PyPy:

$ time cat test-logs/* | pypy uniqueip.py
107408
cat test-logs/*  0.03s user 0.40s system 13% cpu 3.142 total
pypy uniqueip.py  2.95s user 0.18s system 99% cpu 3.145 total

So by chosing the right python interpreter we end up with something about 9 times quicker than the haskell version (and this includes the compiling time, by the way)!

Conclusion

I’m not saying that python is quicker than haskell! That would of course be pure nonsense. Probably the haskell script could be rewritten to be much quicker.

But once again I must say I’m quite impressed by the speed of a pure python script when properly written.

And the lesson of the day is that rewriting a python script in haskell will not automatically improve the execution speed - this might require much deeper work. Anyway, this will not stop me from learning haskell - it’s such a great language!

UPDATE Sept 23, 2014, after quite a few comments on twitter

UPDATE 2, a few hours later:

Thanks to @lucasdicioccio, I’ve got a better haskell code:

module Main where

import qualified Data.HashSet as S
import qualified Data.ByteString.Char8 as B
import Control.Applicative

parseIP = head . B.words
calculate = S.size . S.fromList . map parseIP . B.lines
main = print =<< calculate <$> B.getContents

Wow! The code runs in about 1.6 seconds, which is much, much better than the previous version!

That’s actually almost the fastest version, although it is slightly slower than the optimized python version just above (about 1.2 seconds with python 2.7).

To be fair, the python optimization is a bit unnatural, though, as it is only meaningful if you don’t need the rest of the log line (after the IP number).

Thanks again @lucasdicioccio, for providing this corrected code and saving haskell’s reputation ;-)