Wednesday, July 28, 2010

Big Data

Since I started working for CompassLabs, I have gained even more appreciation for just how much value there is in data. This article in the New York Times also makes the business case for data mining.

My first real taste of the power of data came about 10 years ago when I worked for WebMind. We looked at data from online financial discussion groups, SEC Edgar data, etc. to try to value stocks based on sentiment analysis (text mining) and raw data mining.

Thursday, July 22, 2010

Haskell is much easier after coding in Clojure and Scala for a while

I got very excited by Haskell about 18 months ago and spent a fair amount of time kicking the tires and reading the Haskell book. That said, my interest waned and I moved on to other things (mostly Ruby development with some Clojure, less Scala).

When I noticed the new Haskell release for July 2010, I installed it and started working through Miran Lipovača's online book Learn You a Haskell for Great Good! This time, things seem to "just click" and Haskell's functional style seems very natural. I have real regrets that I probably won't be using Haskell much because I mostly code in what people pay me to use, which in the last 5 years has been Lisp, Ruby, Java, and Clojure.

Tuesday, July 20, 2010

Interesting new Google Buzz API: PubSubHubbub firehose

I spent some time experimenting with the Buzz APIs this morning - well documented and simple to use. The firehose data will be useful for tracking public social media posts.

I set up Google's example app on my AppEngine account and had fun playing with it. Unfortunately, because of the amount of incoming data, it would only run each day for about 4 or 5 hours before hitting the free resource quota limits. Since this was just for fun, I didn't feel like paying for additional resources.

Friday, July 16, 2010

Good news: Google buying Freebase

That is very cool, I think.

I have lately been waist deep in using Freebase for customer work.

While there is a lot of cruft data in Freebase, with some manual effort and some automation it is a good source of a wide variety of information. Depending on the application, DBpedia and GeoNames are other good resources for structured data.

I have a fair amount of example code for Freebase, DBpedia, and GeoNames in my latest book (there is a free PDF download on my open content web page, or you can buy a copy at

Wednesday, July 14, 2010

Scala 2.8 final released. I updated my latest book's Scala examples

Good news, Scala 2.8 has been released.

I updated the github repo for the code examples for my book Practical Semantic Web and Linked Data Applications. Java, Scala, Clojure, and JRuby Edition (links for free PDF download and print book purchase).

I haven't had the opportunity to do very much coding in Scala for several months because the company I have been working for (CompassLabs) is mostly a Clojure and Java shop. That said, Scala is a great language, and it is good to see the final release of 2.8 with the new collections library and other changes.

Good job: CouchDB version 1.0

I usually use PostgreSQL and MongoDB (and sometimes RDF data stores) for my data store needs, but I have spent a lot of time in the last couple of years experimenting with CouchDB and always keep it handy on my laptop and one of my servers. I was happy to upgrade to version 1.0 today!

Sunday, July 11, 2010

Monetizing social graphs

Interesting news this morning of Google's investment in Zynga, the 800-pound gorilla of online games, in order to have access to social graph data from people logging into Google accounts to play games.

There has been a lot of buzz about Facebook's effective social graph data and games like those provided by Zynga have helped them. That said, I would still bet on Google having a better chance of making the most money off of social graphs because they get to effectively combine data from at least five sources to build accurate user profiles: statistical NLP analysis of GMail, search terms used by people who are logged in to any Google services, friends and business connections from GMail address books, social connections from Google Buzz (which often includes data from other social graphs like Twitter), and in the near future online multi-player gaming.

There is another issue: infrastructure. While I am willing to roughly equate the capabilities for non-realtime analytics of very large Hadoop clusters and Google's internal (original) MapReduce infrastructure, I would bet that Facebook will have problems with their mixture of highly sharded MySQL, massive use of memcached, and some use of Cassandra for their live systems. At least to me, Google's infrastructure is the most interesting aspect of the company. Facebook has awesome infrastructure, but Google's is even more so.

Saturday, July 10, 2010

Using Open Graph

The Open Graph Protocol is a reasonable new ad-hoc standard for adding semantic content to web sites. Open Graph got a large boost when Facebook started using it to encourage easy linking of preferences, etc. to better model users to increase advertising revenues (documentation). Freebase also has an Open Graph interface. For example, you can look me up on Freebase using (leave off the "?html=1" to get a JSON response):
    {
      "id" : "0b6_g82",
      "username" : "mark_louis_watson",
      "name" : "Mark Louis Watson",
      "link" : "",
      "embed" : "",
      "picture" : "",
      "date_of_birth" : "1951",
      "nationality" : "United States of America",
      "gender" : "Male",
      "profession" : "Author",
      "metadata" : {
        "connections" : {
          "spouses" : "",
          "books" : ""
        }
      }
    }
You can get my Facebook information using my user name with the graph API:

Adding required Open Graph markup to your web pages is simple enough; for example, I added the following to my consulting web page:
    <meta property="og:title" content="Mark Watson: Ruby and Java Consultant" />
    <meta property="og:type" content="consulting" />
    <meta property="og:url" content="" />
    <meta property="og:image" content="" />
There are other per-page attributes that you can optionally set.

There is a nice web app on Heroku that lets you check your meta data; for example: for my consulting page.
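
If you would rather check a page's Open Graph meta data locally, here is a minimal sketch in Python using only the standard library. The class and function names (OpenGraphParser, extract_open_graph) and the sample page are my own illustrations, not part of any Open Graph tooling, and the sketch only looks at meta tags whose property attribute starts with "og:".

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> pairs from a page."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing <meta ... /> tags are also routed here by HTMLParser.
        if tag != "meta":
            return
        attr_map = dict(attrs)
        prop = attr_map.get("property") or ""
        if prop.startswith("og:"):
            self.properties[prop] = attr_map.get("content") or ""


def extract_open_graph(html_text):
    """Return a dict of Open Graph properties found in html_text."""
    parser = OpenGraphParser()
    parser.feed(html_text)
    return parser.properties


# Hypothetical sample page, reusing two of the tags shown above.
sample_page = """
<html><head>
<meta property="og:title" content="Mark Watson: Ruby and Java Consultant" />
<meta property="og:type" content="consulting" />
</head><body></body></html>
"""

og = extract_open_graph(sample_page)
for name, value in og.items():
    print(name, "=", value)
```

To check a live page you would fetch its HTML first (for example with urllib.request) and pass the decoded text to extract_open_graph.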

How does Open Graph compare to RDFa? RDFa is a much richer notation for at least two reasons: semantic markup can refer to individual elements on a web page rather than only the entire page, and you can use any ontology in RDFa. Still, in my opinion, any semantic markup is a good thing.

For most web applications you really want to add Open Graph and RDFa markup automatically for data driven web pages. You might as well use both because even though it takes some effort, providing semantic markup may have a long tail of benefits.
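
As a rough sketch of what "automatically" could look like for the Open Graph half (RDFa generation is not covered here), the following Python function renders meta tags from a plain dict. The function name open_graph_tags and the page_record dict are assumptions for illustration; in a real application the record would come from your database or page model, and the tags would be emitted by your template layer.

```python
from html import escape


def open_graph_tags(record):
    """Render one Open Graph <meta> tag per field of a data record (a dict)."""
    lines = []
    for name, value in record.items():
        # Escape both the property name and the content so that quotes or
        # angle brackets in the data cannot break the generated markup.
        lines.append(
            '<meta property="og:{}" content="{}" />'.format(
                escape(name, quote=True), escape(str(value), quote=True)
            )
        )
    return "\n".join(lines)


# Hypothetical record for a data driven page.
page_record = {
    "title": "Mark Watson: Ruby and Java Consultant",
    "type": "consulting",
}

markup = open_graph_tags(page_record)
print(markup)
```

Keeping the generation in one small function means every data driven page gets the same escaping rules, which matters once titles start containing quotes or ampersands.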

Sunday, July 04, 2010

Reading two good books on using MapReduce algorithms for large scale text processing

I have a fair amount of experience with Hadoop, but little experience with associated tools like Pig and Mahout. I can spend more time with Pig in my local sandbox, but I wanted more formal help getting up to speed with Mahout and general MapReduce application programming. I purchased the MEAP for Mahout In Action and am reading new chapters as they become available. The authors (especially Robin Anil) have been very helpful on the online forum for the book, and I have found the material to be useful and interesting.

Another book I bought was just delivered yesterday morning: Data-Intensive Text Processing with MapReduce. I have only read the first few chapters but the book has been very interesting and informative.

I have done some work based on Hadoop for about half the customers I have had in the last year and a half, and I believe that knowing how to horizontally scale out machine learning and text analytics applications has become a must-have skill.
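
For readers who have not seen the MapReduce pattern applied to text, here is a toy word count in Python that runs in a single process. It is only a sketch of the map, shuffle, and reduce phases; it does not use the Hadoop or Mahout APIs, it is not taken from either book, and the function names are my own. On a real cluster the framework performs the shuffle step and distributes the map and reduce calls across machines.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every whitespace-separated token."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce over a list of document strings."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data is big", "data mining"])
print(counts)
```

Because each map call sees only one document and each reduce call sees only one key, both phases can be spread over many machines, which is what makes the pattern scale horizontally.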