Wednesday 12 March 2014

Small computer, big data

    This post comes as one of the longest-running scripts I've ever created has just finished its work. In the last week of January I set my Raspberry Pi to the task of processing five years of news stories into a 20GB tree of JSON files, and here in the second week of March it has completed the task.
    Given that a PC has done the same job in a couple of days, the first question anyone would ask is simply this: why?
    My Pi runs all day every day, 24/7. It collects the news stories from RSS feeds and stores them in a MySQL database. It uses somewhere under 2 watts, and it uses that whatever I ask of it, because it's plugged in all the time. I can touch its processor with my finger; it's not hot enough to hurt me. My laptop, by comparison, with its multi-core Intel processor, board full of support chips, and SATA hard disk, uses somewhere under a hundred watts. I can feel the hot air as its fan struggles to shift the heat from the heatsink. I wouldn't like to hold my finger on its processor, assuming I could get past its heat pipe.
    Since I'm in no hurry for the data processing, running the script on the Pi uses a lot less power and makes more sense for me. This isn't an exercise in using a Pi for the sake of it; the Pi was the most appropriate machine for the task.
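    The power argument can be put into rough numbers. What follows is a back-of-envelope sketch in Python, not a measurement. It treats the "under 2 watts" and "under a hundred watts" figures as upper bounds, and it assumes about six weeks (42 days) for the Pi run, from the last week of January to the second week of March, against two days for the PC.

```python
def energy_kwh(watts, days):
    """Energy in kWh for a device drawing `watts` continuously for `days`."""
    return watts * 24 * days / 1000

# Upper-bound power figures quoted in the post.
PI_WATTS = 2
LAPTOP_WATTS = 100

# Assumed run lengths: roughly six weeks on the Pi, a couple of days on a PC.
PI_DAYS = 42
LAPTOP_DAYS = 2

pi_total = energy_kwh(PI_WATTS, PI_DAYS)
laptop_total = energy_kwh(LAPTOP_WATTS, LAPTOP_DAYS)

print(f"Pi:     at most about {pi_total:.1f} kWh over {PI_DAYS} days")
print(f"Laptop: at most about {laptop_total:.1f} kWh over {LAPTOP_DAYS} days")
```

    On these assumptions the Pi comes out ahead even though it takes far longer, and since it would have been powered up anyway, the extra energy the job cost is smaller still.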
    So having run a mammoth script on a tiny computer for a couple of months, how did I do it and what did I learn?
    The first thing I'd like to say is that I'm newly impressed with the robustness of Linux. I've run Linux web servers since the 1990s but I've never hammered any of my Linux boxes in quite this way. Despite my script stealing most of the Pi's memory and processor power, it kept on with its everyday tasks, fetching news stories and storing them as always. I could use its web server (a little slowly, it's true), I could use its Samba share, and I could keep an eye on its MySQL server. Being impressed with this might seem odd, but I'm more used to hammering a Windows laptop in this way. I know from experience that the Windows box was not so forgiving when running earlier iterations of the same script.
    If anybody else fancies hammering their Pi with a couple of months of big data, here's how I did it. The script itself was written in PHP and called from a shell within an instance of screen. This way I could connect and disconnect at will via SSH without stopping the script running. The data came from the MySQL server and was processed to a 64GB USB Flash disk. The Flash is formatted as ext4 without journaling, which I judged to be the best combination of speed and size efficiency. An early test with a different, FAT-formatted drive provided a vivid demonstration of filesystem efficiency, as the FAT drive ended up using 80% of its space after only a short period of processing.
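    For anyone wondering what "a tree of JSON files" might look like in practice, here is a minimal sketch. It is in Python rather than the PHP I actually used, and the directory layout (year/month/day/id.json), the field names and the `write_story` helper are all hypothetical, chosen only for illustration. The stories are passed in as plain dictionaries, so the MySQL side is left out entirely.

```python
import json
import tempfile
from datetime import datetime
from pathlib import Path

def story_path(root, story):
    """Hypothetical layout: <root>/YYYY/MM/DD/<id>.json."""
    published = story["published"]
    return (Path(root) / f"{published:%Y}" / f"{published:%m}"
            / f"{published:%d}" / f"{story['id']}.json")

def write_story(root, story):
    """Write one story as a JSON file, creating directories as needed."""
    path = story_path(root, story)
    path.parent.mkdir(parents=True, exist_ok=True)
    # datetime objects are not JSON-serialisable, so store an ISO string.
    record = dict(story, published=story["published"].isoformat())
    with path.open("w", encoding="utf-8") as handle:
        json.dump(record, handle)
    return path

if __name__ == "__main__":
    # Illustrative record only; the real rows would come from the database.
    sample = {"id": 1, "published": datetime(2014, 3, 12, 9, 30),
              "title": "Small computer, big data"}
    with tempfile.TemporaryDirectory() as root:
        print(write_story(root, sample).relative_to(root))
```

    A layout like this produces a very large number of small files, and that sort of workload is where the choice of filesystem shows.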
    The bottleneck turned out to be the Flash drive, a Lexar JumpDrive. Reading and writing seemed to happen in bursts: the script would run quickly for about 30s and then very slowly for the next 30s, purely due to disk I/O. In future I might try the same task with a USB-to-SATA hard disk, though I'd lose my power advantage.
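    One way to check whether a drive really does behave in bursts is to time a series of small writes and force each one out to the device. The sketch below is an assumption about how that could be done, not something I ran at the time. The `timed_writes` helper and the test file name are made up for illustration, and the file should be pointed at the drive under test.

```python
import os
import time

def timed_writes(path, chunk_bytes=64 * 1024, chunks=50):
    """Write `chunks` blocks to `path`, syncing each one, and return
    the seconds taken per block."""
    payload = b"\0" * chunk_bytes
    timings = []
    with open(path, "wb") as handle:
        for _ in range(chunks):
            start = time.perf_counter()
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())  # push the data to the device
            timings.append(time.perf_counter() - start)
    return timings

if __name__ == "__main__":
    # Hypothetical file name; put it on the drive being tested.
    test_file = "burst_test.bin"
    timings = timed_writes(test_file)
    os.remove(test_file)
    print(f"fastest block: {min(timings) * 1000:.2f} ms")
    print(f"slowest block: {max(timings) * 1000:.2f} ms")
```

    A wide gap between the fastest and slowest blocks would be consistent with the stop-start behaviour described above, though a short run like this won't capture a 30-second cycle without many more blocks.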
    So would I do the same again, and how might I change my approach? I think the Pi was a success in terms of reliable unattended operation and in terms of low power usage on a machine I'd have had running anyway. But in terms of data processing efficiency it could have been a lot better. A faster disk and a faster computer (perhaps something with the Pi's power advantage but a bit more processor grunt, such as the CubieBoard) would have delivered the goods more quickly without a huge extra investment. And the operating system, though reliable, could probably have been improved. I used a stock Raspbian, albeit with the memory allocation for graphics reduced as low as it would go. Perhaps if I'd built an Arch image with a minimum of dross I would have seen a performance increase.
    I used a Raspberry Pi for this job because it was convenient to do so, it uses very little power, and I had one that would have been powered up anyway throughout the period the script was running. The Raspberry Pi performed as well as I expected, but I cannot conclude anything other than that it is not the ideal computer for this particular job. It is sometimes tempting when you are an enthusiast for a particular platform to see it as ideal for all applications. In this case that would be folly.
    The Pi will continue to crunch the data it collects, though on a day-to-day basis as part of the collection process. In that role it'll be much better suited to the task: as a cron job running in the middle of the night, the extra work of a day's keyword crunching won't be noticed. And there's the value in this exercise: something that used to require a PC, a while of my time and a little bit of code has been turned into an automated process running on a £25 computer using negligible power. I think I call that a result, don't you?