I've improved my Hadoop MapReduce development process

Over the last several days I had to design a fairly complicated workflow, and I hit upon a development approach that worked really well for getting things written and debugged on my laptop:

I started by hand-crafting small input data sets for all of the input sources. I then created a quick and dirty diagram in OmniGraffle (any other diagramming tool would do) showing how I thought my multiple MapReduce jobs would fit together, and marked it up with job names and the input/output directories for each job, including sample data. Each time a job produced new output, I added sample output to the diagram. With a complicated workflow it was tricky to keep everything on one page for reference, but the advantage of this overview diagram is that it made it much easier to keep track of what each MapReduce job in the workflow needed to do, and easier to hand-check each job.
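
To make that concrete, here is a minimal sketch (not taken from my actual project) of the kind of tiny hand-crafted input and per-job code I mean: a few tab-separated lines for a made-up "page views" source, and a mapper that parses them so the job can be hand-checked against its small input directory on a laptop. The field layout, class name, and directory are all invented for illustration.

    // Hypothetical hand-crafted input in data/page_views/part-00000, e.g.:
    //   user1<TAB>/index.html<TAB>3
    //   user2<TAB>/about.html<TAB>1
    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class PageViewMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        // Each hand-crafted line: user, page, view count (tab-separated)
        String[] fields = value.toString().split("\t");
        if (fields.length == 3) {
          context.write(new Text(fields[0]),
                        new IntWritable(Integer.parseInt(fields[2])));
        }
      }
    }

With inputs this small, it takes only a minute to eyeball the job's output directory and confirm it matches the sample output written on the diagram.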

As I refactored the workflow by adding or deleting jobs and changing code, I took a few minutes to keep the diagram up to date - well worth the effort. Another technique I find convenient is relying on good old-fashioned makefiles, both to run multiple jobs together on my laptop against a local Hadoop setup and to organize the Elastic MapReduce command lines for runs on AWS.
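
A makefile along these lines is the sort of thing I mean; the jar name, driver class names, and directories are all hypothetical, and I have left the actual Elastic MapReduce command line as a placeholder rather than reproducing one here:

    # Hypothetical makefile sketch; recipe lines must be indented with tabs.
    JAR = workflow-jobs.jar

    all: job2

    clean:
    	rm -rf out/job1 out/job2

    job1: clean
    	hadoop jar $(JAR) com.example.PageViewDriver data/page_views out/job1

    job2: job1
    	hadoop jar $(JAR) com.example.JoinDriver out/job1 out/job2

    # A similar target can hold the full Elastic MapReduce command line
    # (instance counts, S3 paths, credentials) so it never has to be retyped.

Typing "make all" then re-runs the local workflow in the right order, and the cloud invocation stays written down right next to the local one.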

I have been experimenting with higher-level tools like Cascading and Cascalog that help manage workflows, but for this project I decided to write my own data source joins and related plumbing, and to organize everything as a set of individual MapReduce jobs that are run in a specific order.
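
To give a concrete idea of what "individual jobs run in a specific order" looks like without Cascading, here is a minimal driver sketch that simply runs each job to completion before starting the next, with every step reading the previous step's output directory. The identity Mapper and Reducer are stand-ins for real job classes, and all names and paths are invented for the example.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WorkflowDriver {

      // Run one step of the workflow. The identity Mapper/Reducer just pass
      // (offset, line) pairs through; a real step would plug in its own
      // mapper, reducer, and key/value types.
      private static boolean runStep(Configuration conf, String name,
                                     String in, String out) throws Exception {
        Job job = Job.getInstance(conf, name);
        job.setJarByClass(WorkflowDriver.class);
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(in));
        FileOutputFormat.setOutputPath(job, new Path(out));
        return job.waitForCompletion(true);
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The whole "workflow engine": run the jobs in a fixed order, each
        // step reading the previous step's output directory.
        if (!runStep(conf, "parse-sources", "data/raw",  "out/step1")) System.exit(1);
        if (!runStep(conf, "join-sources",  "out/step1", "out/step2")) System.exit(1);
        System.exit(runStep(conf, "final-report", "out/step2", "out/final") ? 0 : 1);
      }
    }

This is less flexible than a real workflow tool, but the explicit ordering matches the overview diagram one-to-one, which is exactly what made the hand-checking easy.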
