Friday, April 17, 2009

Distributed computing and Ruby

At work, I've been spending a lot of time working on porting our log management and analysis system over from a dying mysql implementation to a new system build upon an open source distributed database (Hypertable) using distributed Map/Reduce jobs for a variety of summarization and analysis tasks.

I've been amazed and gratified to be able to do the vast majority of this work within the comfort of Ruby, my favorite language. While Hypertable is written in C++, it uses Thrift to provide ruby and other language access, and since the primary developers of Hypertable are at Zvents (a rails shop), they've created a Rails plugin called HyperRecord that allows us to access Hypertable almost identically to how you would access mysql with ActiveRecord.

This has resulted in the ability to make the front end application for our stats & logging infrastructure a standard Rails app. Access restrictions are somewhat different in Hypertable than in a full relational database (its easiest to think of as an ordered hash... lookups by key or for a range of keys are supported, but conditions are expensive), but for most of our developers its just another rails app to work with.

The second place where I've been amazed by how much I've been able to stick to ruby is in designing and running our batched Map/Reduce jobs. We're using a framework called Cascading for designing our scalable batch workflows, built in Java and sitting on top of Hadoop. For those who aren't familiar, Hadoop is an open source implementation of Map/Reduce, written in Java, and Cascading allows for a higher-level conceptual model for parsing, analyzing, and modifying your data using Hadoop.

Cascading provides a number of built in filtering, text processing, and arithmetic map/reduce operations built in, and thanks to the wonder that is jruby, we're able to arrange our workflows entirely within ruby using Cascading.jruby. Only when we need a special operation that can't be constructed from the built-ins do we have to dip into Java.

So if you've been itching to dip your toes in the open source distributed computing revolution, but have been reluctant due to the Java heavy nature of Hadoop, take a look at HyperTable and Cascading!

No comments:

Post a Comment