Reading two good books on using MapReduce algorithms for large scale text processing

I have a fair amount of experience with Hadoop, but little experience with associated tools like Pig and Mahout. I can spend more time with Pig in my local sandbox but I wanted more formal help getting up to speed with Mahout and general MapReduce application programming. I purchased the MEAP for Mahout In Action, reading new chapters as they are available. The authors (especially Robin Anil) have been very helpful on the online forum for the book, and I have found the material to be useful and interesting.

Another book I bought was just delivered yesterday morning: Data-Intensive Text Processing with MapReduce. I have only read the first few chapters but the book has been very interesting and informative.

I have done some work based on Hadoop for about half the customers I have had in the last year and a half, and I believe that knowing how to horizontally scale out machine learning and text analytics applications has become a must-have skill.


  1. last beta of Data-Intensive Text Processing with MapReduce is also available online at

  2. Mark, have you heard about Cascalog - ? It built with Clojure on top of Cascading and allows easily write queries against data in Hadoop

    P.S. I'm currently experimenting with same technologies - mahout, etc. ;-)

  3. I have looked at Cascalog, but not actually tried it.


Post a Comment

Popular posts from this blog

Custom built SBCL and using spaCy and TensorFlow in Common Lisp

I have tried to take advantage of extra time during the COVID-19 pandemic

GANs and other deep learning models for cooking recipes