Using Hadoop for analyzing social network data

September 11, 2010

At CompassLabs my colleague Vivek and I are using Hadoop and Amazon's Elastic MapReduce to process social network data. I can't talk about what we are doing except to say that it is cool.

I blogged last week about taking the time to create a one-page diagram showing all map-reduce steps and data flow (with examples showing data snippets): this really helps manage complexity. I have a few other techniques that I have found useful enough to share:

Take the time to set up a good development environment. Almost all of my map-reduce applications are written in either Ruby or Java (with a few experiments in Clojure and Python). I like to create Makefiles to quickly run multiple map-reduce jobs in a workflow on my laptop. For small development data sets, after editing source code, I can run a workflow and be looking at output in about 10 seconds for Ruby, and a little longer for Java apps. Complex workflows are difficult to write and debug, so get comfortable with your development environment. My Makefiles build local JAR files (if I am using Java), copy map-reduce code and test data to my local Hadoop installation, remove the output directories, run the jobs in sequence, and optionally open the output of each job step in a text editor.
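To illustrate the kind of fast edit-and-run loop described above, here is a minimal Ruby sketch that chains two map-reduce steps over a tiny data set. It is an assumption-laden stand-in, not the Makefile-driven setup itself: it simulates the map, shuffle, and reduce phases in-process rather than calling a local Hadoop installation, and the edge data, job logic, and every name in it (run_job, run_workflow, COUNT_MAP, and so on) are hypothetical.

```ruby
# Local, in-process stand-in for a multi-step map-reduce workflow (sketch only).
# Records are tab-separated "key\tvalue" lines, as in Hadoop streaming.

# Run one step: map every line to [key, value] pairs, group and sort by key
# (the "shuffle"), then reduce each key's list of values to output lines.
def run_job(lines, mapper, reducer)
  pairs = lines.flat_map { |line| mapper.call(line) }
  grouped = pairs.group_by { |key, _value| key }
  grouped.sort_by { |key, _kvs| key }.flat_map do |key, kvs|
    reducer.call(key, kvs.map { |_key, value| value })
  end
end

# Run the steps in sequence, feeding each step's output into the next.
def run_workflow(lines, steps)
  steps.reduce(lines) { |data, (mapper, reducer)| run_job(data, mapper, reducer) }
end

# Step 1 (hypothetical): count followers per user from "follower\tfollowed" edges.
COUNT_MAP = lambda do |line|
  _follower, followed = line.chomp.split("\t")
  [[followed, 1]]
end
COUNT_REDUCE = ->(user, ones) { ["#{user}\t#{ones.sum}"] }

# Step 2 (hypothetical): group users by their follower count.
INVERT_MAP = lambda do |line|
  user, count = line.chomp.split("\t")
  [[count, user]]
end
INVERT_REDUCE = ->(count, users) { ["#{count}\t#{users.sort.join(',')}"] }

# Made-up development data: each line is "follower\tfollowed".
EDGES = ["alice\tbob", "carol\tbob", "alice\tcarol", "bob\tdave"].freeze
STEPS = [[COUNT_MAP, COUNT_REDUCE], [INVERT_MAP, INVERT_REDUCE]].freeze

puts run_workflow(EDGES, STEPS) if __FILE__ == $PROGRAM_NAME
```

Keeping each step as a mapper/reducer pair makes it easy to inspect intermediate output by calling run_job on a single step, which plays the same role as opening each job step's output in a text editor.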

Take advantage of Amazon's Elastic MapReduce. I have only limited experience setting up and using custom multi-server clusters because, for my own needs and so far for work for two customers, Elastic MapReduce has provided good value and saved a lot of setup and administration time. I think that you really need to reach a certain large scale of operations before it makes sense to maintain your own large Hadoop cluster.

Comments

Alex Ott, 11:43 AM
You can look at the fork of clojure-hadoop (http://github.com/alexott/clojure-hadoop): it supports more functions than the original clojure-hadoop, and more work is in progress. If you have ideas about what could be added to it, please file an issue or just write to me.
Mark Watson, author and consultant, 12:07 PM
Hello Alex,

I experimented with the original clojure-hadoop project and I just did a git clone of your fork. I am super busy, though, so I may not be able to get to it for a while.

Thanks,
Mark