Using Hadoop for analyzing social network data

September 11, 2010

At CompassLabs my colleague Vivek and I are using Hadoop and Amazon's Elastic MapReduce to process social network data. I can't talk about what we are doing except to say that it is cool.

I blogged last week about taking the time to create a one-page diagram showing all map-reduce steps and data flow (with examples showing data snippets): this really helps manage complexity. I have a few other techniques that I have found useful enough to share:

Take the time to set up a good development environment. Almost all of my map-reduce applications are written in either Ruby or Java (with a few experiments in Clojure and Python). I like to create Makefiles to quickly run multiple map-reduce jobs in a workflow on my laptop. For small development data sets, after editing source code, I can run a workflow and be looking at output in about 10 seconds for Ruby, a little longer for Java apps. Complex workflows are difficult to write and debug, so get comfortable with your development environment. My Makefiles build local JAR files (if I am using Java), copy map-reduce code and test data to my local Hadoop installation, remove the output directories, run the jobs in sequence, and optionally open the outputs for each job step in a text editor.
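To illustrate the fast edit-and-run loop described above, here is a minimal sketch in Python that chains map-reduce steps entirely in memory, with a sort standing in for Hadoop's shuffle phase. It is not my Makefile setup and it does not call Hadoop at all; the function names (`run_step`, `run_workflow`), the two example steps, and the sample data are all made up for illustration.

```python
from itertools import groupby
from operator import itemgetter


def run_step(mapper, reducer, records):
    """Run one map-reduce step in memory: map, sort by key, reduce per key."""
    mapped = [pair for record in records for pair in mapper(record)]
    mapped.sort(key=itemgetter(0))  # stands in for Hadoop's shuffle/sort
    output = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        output.extend(reducer(key, [value for _, value in group]))
    return output


def run_workflow(steps, records):
    """Feed the output of each (mapper, reducer) step into the next one."""
    for mapper, reducer in steps:
        records = run_step(mapper, reducer, records)
    return records


# Hypothetical step 1: count mentions per user from "author mentioned" lines.
def mention_mapper(line):
    author, mentioned = line.split()
    yield mentioned, 1


def sum_reducer(key, values):
    yield key, sum(values)


# Hypothetical step 2: group users by how many mentions they received.
def invert_mapper(record):
    user, count = record
    yield count, user


def collect_reducer(count, users):
    yield count, sorted(users)


if __name__ == "__main__":
    # Made-up sample data, small enough to inspect by eye.
    sample = ["alice bob", "carol bob", "bob alice"]
    steps = [(mention_mapper, sum_reducer), (invert_mapper, collect_reducer)]
    for row in run_workflow(steps, sample):
        print(row)
```

A harness like this is only useful for checking mapper and reducer logic on tiny inputs; the real jobs still need to be run against a local Hadoop installation before going to a cluster.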

Take advantage of Amazon's Elastic MapReduce. I have only limited experience setting up and using custom multi-server clusters because, for my own needs and so far for work for two customers, Elastic MapReduce has provided good value and saved a lot of setup and administration time. I think that you really need to reach a certain large scale of operations before it makes sense to maintain your own large Hadoop cluster.

Comments

Alex Ott, 11:43 AM
You could look at my fork of clojure-hadoop (http://github.com/alexott/clojure-hadoop): it supports more functions than the original clojure-hadoop, with more work in progress. If you have ideas about what could be added to it, please file an issue or just write to me.

Mark Watson, author and consultant, 12:07 PM
Hello Alex,

I experimented with the original clojure-hadoop project and I just did a git clone on your fork. I am super busy though so I may not be able to get to it for a while.

Thanks,
Mark