Tuesday, April 28, 2009

Using inject for improved performance

I found myself wanting to know whether any item in an array produced a true value for some check, and wrote a method that I was pretty proud of:
module Enumerable
  def any_true?
    inject(false) do |truth, item|
      truth || (yield item)
    end
  end
end
So now I can write
arr.any_true? {|item| some_method(item)}
Then I realized that a simpler way of expressing this is just
arr.map {|item| some_method(item)}.any?

The latter seems a lot better: it's easier to understand, and it uses just Ruby primitives. But the first is actually still better for most cases. Why? Performance. Imagine some_method includes a database query or two. In the simple map.any? case we have to evaluate it for every item. In the any_true? case we only evaluate it until we find one true result; after that, inject still walks the remaining items, but the || short-circuits and some_method is never called again.
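To make the short-circuit concrete, here is a minimal sketch that counts how often the check actually runs in each version. The expensive_check lambda and the sample array are hypothetical stand-ins, not part of the original code; any_true? is repeated only so the sketch runs on its own.

```ruby
# Same helper as above, repeated so this sketch is self-contained.
module Enumerable
  def any_true?
    inject(false) do |truth, item|
      truth || (yield item)
    end
  end
end

# Hypothetical stand-in for an expensive check; it counts its own calls.
calls = 0
expensive_check = lambda do |item|
  calls += 1
  item == 3
end

arr = [1, 2, 3, 4, 5]

# map evaluates the check for every item before any? looks at the results.
calls = 0
map_result = arr.map { |item| expensive_check.call(item) }.any?
map_calls = calls

# any_true? stops calling the check once one result is true.
calls = 0
inject_result = arr.any_true? { |item| expensive_check.call(item) }
inject_calls = calls

puts "map.any?:  #{map_result} after #{map_calls} calls"
puts "any_true?: #{inject_result} after #{inject_calls} calls"
```

Both versions return true, but the map version runs the check five times and any_true? runs it three times, because the first match is the third item.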

The difference can be enormous. If I emulate the cost of a database query using a method that looks like:
def random_truth(likelihood)
  sleep 0.1
  rand < likelihood
end
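I can now run some tests (after require 'benchmark') that demonstrate the extreme difference in runtime. The map version always makes 100 calls per run, so 10 runs cost 10 x 100 x 0.1s = 100s, while any_true? stops calling after the first true result, which at a likelihood of 0.5 takes about two calls per run on average: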
>> Benchmark.realtime {10.times { (1..100).to_a.map {|i| random_truth(0.5)}.any? }}
=> 100.000391960144
>> Benchmark.realtime {10.times { (1..100).to_a.any_true? {random_truth(0.5)} }}
=> 2.00004100799561
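It is also worth knowing that Ruby's built-in Enumerable#any? accepts a block and returns as soon as the block gives a true result, so the same short-circuit is available without a custom helper. Unlike the inject version, it also stops iterating at that point. A minimal sketch, where slow_check is a hypothetical stand-in for some_method:

```ruby
# Hypothetical stand-in for an expensive check; it counts its own calls.
calls = 0
slow_check = lambda do |item|
  calls += 1
  item > 2
end

# any? with a block stops at the first item for which the block is true.
found = (1..100).any? { |item| slow_check.call(item) }

puts "found=#{found} after #{calls} calls"
```

Here the check runs only for 1, 2 and 3, even though the range has 100 items.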

Friday, April 17, 2009

Distributed computing and Ruby

At work, I've been spending a lot of time porting our log management and analysis system from a dying MySQL implementation to a new system built upon an open source distributed database (Hypertable). The new system uses distributed Map/Reduce jobs for a variety of summarization and analysis tasks.

I've been amazed and gratified to be able to do the vast majority of this work within the comfort of Ruby, my favorite language. While Hypertable is written in C++, it uses Thrift to provide access from Ruby and other languages. And since the primary developers of Hypertable are at Zvents (a Rails shop), they've created a Rails plugin called HyperRecord that allows us to access Hypertable almost identically to how you would access MySQL with ActiveRecord.

This has let us build the front end application for our stats & logging infrastructure as a standard Rails app. Access patterns are more restricted in Hypertable than in a full relational database (it's easiest to think of it as an ordered hash: lookups by key or for a range of keys are supported, but conditions are expensive), but for most of our developers it's just another Rails app to work with.
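The "ordered hash" picture can be sketched in plain Ruby. This is a toy in-memory model, not the Hypertable or HyperRecord API; the OrderedTable class, its method names, and the sample rows are all made up for illustration, and it assumes Ruby 2.3 or later for Array#bsearch_index.

```ruby
# Toy model of an ordered key/value table (hypothetical, not Hypertable's API).
class OrderedTable
  def initialize(rows)
    @rows = rows            # key => value
    @keys = rows.keys.sort  # keys kept in sorted order
  end

  # Lookup by exact key: a single hash access.
  def get(key)
    @rows[key]
  end

  # Range scan: binary-search for the first key, then walk adjacent keys.
  def range(first, last)
    start = @keys.bsearch_index { |k| k >= first } || @keys.size
    result = []
    @keys[start..-1].each do |k|
      break if k > last
      result << [k, @rows[k]]
    end
    result
  end

  # Condition on the value: nothing is indexed, so every row is examined.
  def where
    @keys.select { |k| yield @rows[k] }.map { |k| [k, @rows[k]] }
  end
end

logs = OrderedTable.new({
  "2009-04-15 host1" => "GET /index 200",
  "2009-04-16 host1" => "GET /missing 404",
  "2009-04-17 host2" => "GET /index 200",
  "2009-04-18 host1" => "POST /login 500"
})

one_row   = logs.get("2009-04-16 host1")
two_days  = logs.range("2009-04-16", "2009-04-17 z")
errors    = logs.where { |value| value.end_with?("500") }
```

The key lookup and the range scan only touch the rows they return, while the where call has to look at every row. That is the sense in which conditions are expensive in this kind of store.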

The second place where I've been amazed by how much I've been able to stick to Ruby is in designing and running our batched Map/Reduce jobs. We're using a framework called Cascading for designing our scalable batch workflows; it is built in Java and sits on top of Hadoop. For those who aren't familiar, Hadoop is an open source implementation of Map/Reduce, written in Java, and Cascading provides a higher-level conceptual model for parsing, analyzing, and modifying your data using Hadoop.
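For readers new to the model, here is a rough plain-Ruby illustration of the three Map/Reduce phases applied to a log summary. It does not use Hadoop or Cascading, and the log line format and variable names are assumptions made up for this sketch.

```ruby
# Hypothetical log records: method, path, HTTP status.
log_lines = [
  "GET /index 200",
  "GET /missing 404",
  "POST /login 200",
  "GET /index 200"
]

# Map: turn each record into a [key, value] pair (status => 1).
mapped = log_lines.map { |line| [line.split.last, 1] }

# Shuffle: group the pairs by key, as the framework does between phases.
grouped = mapped.group_by { |status, _count| status }

# Reduce: collapse each key's values into a single summary value.
counts = {}
grouped.each do |status, pairs|
  counts[status] = pairs.inject(0) { |sum, (_status, count)| sum + count }
end

counts.each { |status, total| puts "#{status}: #{total}" }
```

In a real cluster the map and reduce steps run in parallel on different machines over slices of the data; this sketch only shows the shape of the computation.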

Cascading provides a number of built-in filtering, text processing, and arithmetic map/reduce operations, and thanks to the wonder that is JRuby, we're able to arrange our workflows entirely within Ruby using Cascading.jruby. Only when we need a special operation that can't be constructed from the built-ins do we have to dip into Java.

So if you've been itching to dip your toes in the open source distributed computing revolution, but have been reluctant due to the Java-heavy nature of Hadoop, take a look at Hypertable and Cascading!