Showing posts from July, 2010

Big Data

Since I have been working for CompassLabs I have been gaining an even greater appreciation for just how much value there is in data. This article in the New York Times also makes the business case for data mining. My first real taste of the power of data came about 10 years ago when I worked for WebMind. We looked at data from online financial discussion groups, SEC EDGAR filings, etc. to try to value stocks based on sentiment analysis (text mining) and raw data mining.
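To give a flavor of the kind of text mining involved, here is a minimal lexicon-based sentiment scorer in Python. This is only a sketch of the general technique; the word lists are illustrative and have nothing to do with the actual WebMind system.

```python
# Minimal lexicon-based sentiment scorer (illustrative word lists only).
POSITIVE = {"gain", "growth", "beat", "upgrade", "strong"}
NEGATIVE = {"loss", "miss", "downgrade", "weak", "lawsuit"}

def sentiment_score(text):
    """Return (#positive - #negative tokens) / #tokens for a message."""
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    if not tokens:
        return 0.0
    score = sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)
    return score / len(tokens)

score = sentiment_score("Strong earnings beat estimates, analysts upgrade the stock")
```

Real systems layer a lot on top of this (negation handling, per-ticker aggregation, time decay), but averaging many such noisy per-message scores over a discussion group is the basic idea.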

Haskell is much easier after coding in Clojure and Scala for a while

I got very excited by Haskell about 18 months ago and spent a fair amount of time kicking the tires and reading the Haskell book. That said, my interest waned and I moved on to other things (mostly Ruby development with some Clojure, less Scala). When I noticed the recent Haskell release for July 2010 I installed it and started working through Miran Lipovača's online book Learn You a Haskell for Great Good!. This time, things seem to "just click" and Haskell's functional style seems very natural. I do regret that I probably won't be using Haskell much because I mostly code in what people pay me to use, which in the last 5 years has been Lisp, Ruby, Java, and Clojure.

Interesting new Google Buzz API: PubSubHubbub firehose

I spent some time experimenting with the Buzz APIs this morning; they are well documented and simple to use. The firehose data will be useful for tracking public social media posts. I set up Google's example app on my AppEngine account and had fun playing with it. Unfortunately, because of the volume of incoming data, it would only run for about 4 or 5 hours each day before hitting the free resource quota limits. Since this was just for fun, I didn't feel like paying for additional resources.
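The firehose is delivered via PubSubHubbub, where the hub first verifies your callback URL before pushing posts to it. Here is a sketch in Python of the subscriber-side verification handshake (the `hub.mode` / `hub.topic` / `hub.challenge` parameter names come from the PubSubHubbub spec; the handler function itself is my own illustration, not code from Google's example app):

```python
# Sketch of PubSubHubbub subscriber verification: when you subscribe,
# the hub sends a GET request to your callback URL with hub.mode,
# hub.topic, and hub.challenge; the subscriber confirms the
# subscription by echoing the challenge back with a 200 status.

def handle_hub_verification(params, expected_topic):
    """Return (status, body) for a hub verification GET request."""
    mode = params.get("hub.mode")
    topic = params.get("hub.topic")
    challenge = params.get("hub.challenge")
    if mode in ("subscribe", "unsubscribe") and topic == expected_topic and challenge:
        return 200, challenge   # echo the challenge to confirm
    return 404, ""              # refuse unexpected verification requests

status, body = handle_hub_verification(
    {"hub.mode": "subscribe",
     "hub.topic": "http://example.com/feed",
     "hub.challenge": "abc123"},
    "http://example.com/feed")
```

After verification succeeds, the hub POSTs new feed entries to the same callback URL, which is where the quota-burning volume comes from.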

Good news: Google buying Freebase

That is very cool, I think. Lately I have been waist deep in using Freebase for customer work. While there is a lot of cruft data in Freebase, with some manual effort and some automation it is a good source for a wide variety of information. Depending on the application, DBpedia and GeoNames are other good resources for structured data. I have a fair amount of example code for Freebase, DBpedia, and GeoNames in my latest book (there is a free PDF download on my open content web page, or you can buy a copy at

Scala 2.8 final released. I updated my latest book's Scala examples

Good news: Scala 2.8 has been released. I updated the github repo of code examples for my book Practical Semantic Web and Linked Data Applications: Java, Scala, Clojure, and JRuby Edition (links for free PDF download and print book purchase). I haven't had the opportunity to do much coding in Scala for several months because the company I have been working for (CompassLabs) is mostly a Clojure and Java shop. That said, Scala is a great language, and it is good to see the final release of 2.8 with the new collections library and other changes.

Good job: CouchDB version 1.0

I usually use PostgreSQL and MongoDB (and sometimes RDF data stores) for my data store needs, but I have spent a lot of time in the last couple of years experimenting with CouchDB and always keep it handy on my laptop and one of my servers. I was happy to upgrade to version 1.0 today!

Monetizing social graphs

Interesting news this morning: Google has invested in Zynga, the 800 pound gorilla of online games, in order to gain access to social graph data from people logging into Google accounts to play games. There has been a lot of buzz about Facebook's effective use of social graph data, and games like those provided by Zynga have helped them. That said, I would still bet on Google having a better chance of making the most money off of social graphs because they get to combine data from at least five sources to build accurate user profiles: statistical NLP analysis of GMail, search terms used by people who are logged in to any Google services, friends and business connections from GMail address books, social connections from Google Buzz (which often includes data from other social graphs like Twitter), and in the near future online multi-player gaming. There is another issue: infrastructure. While I am willing to roughly equate the capabilities for non-realtime analytics of very large

Using Open Graph

The Open Graph Protocol is a reasonable new ad-hoc standard for adding semantic content to web sites. Open Graph got a large boost when Facebook started using it to encourage easy linking of preferences, etc. to better model users and increase advertising revenues (documentation). Freebase also has an Open Graph interface. For example, you can look me up on Freebase (leave off the "?html=1" to get a JSON response): { "id" : "0b6_g82", "username" : "mark_louis_watson", "name" : "Mark Louis Watson", "link" : "", "embed" : "", "picture" : "", "date_of_birth" : "1951", "nationality" : "
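Once you have a JSON response like the one above, pulling fields out of it takes only a few lines. A sketch in Python, using a trimmed sample string modeled on the excerpt above rather than a live API call:

```python
import json

# Trimmed sample modeled on the Freebase Open Graph JSON response
# shown above (not fetched live; values copied from the excerpt).
sample = '''
{
  "id": "0b6_g82",
  "username": "mark_louis_watson",
  "name": "Mark Louis Watson",
  "date_of_birth": "1951"
}
'''

profile = json.loads(sample)
summary = "%s (born %s)" % (profile["name"], profile["date_of_birth"])
```

In a real application you would fetch the JSON over HTTP from the Open Graph endpoint and handle missing keys, but the parsing step is this simple.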

Reading two good books on using MapReduce algorithms for large scale text processing

I have a fair amount of experience with Hadoop, but little experience with associated tools like Pig and Mahout. I can spend more time with Pig in my local sandbox, but I wanted more formal help getting up to speed with Mahout and general MapReduce application programming. I purchased the MEAP for Mahout In Action and am reading new chapters as they become available. The authors (especially Robin Anil) have been very helpful on the online forum for the book, and I have found the material to be useful and interesting. Another book I bought was delivered yesterday morning: Data-Intensive Text Processing with MapReduce. I have only read the first few chapters, but the book has been very interesting and informative. I have done Hadoop-based work for about half of the customers I have had in the last year and a half, and I believe that knowing how to horizontally scale out machine learning and text analytics applications has become a must-have skill.
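To make the MapReduce programming model concrete, here is the canonical example, word counting, simulated in plain Python. There is no Hadoop cluster here; the map, shuffle, and reduce phases are just local functions standing in for what the framework does across machines:

```python
from collections import defaultdict

def map_phase(doc):
    """Mapper: emit a (word, 1) pair for each token in a document."""
    return [(word.lower(), 1) for word in doc.split()]

def shuffle(pairs):
    """Group intermediate pairs by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["hadoop scales text processing",
        "text processing with MapReduce"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
```

The reason this pattern scales horizontally is that mappers run independently per document and reducers independently per key, so both phases can be spread across as many machines as you have.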