Google Research's wiki-links data set

March 09, 2013

wiki-links was created using Google's web crawl and looking for back links to Wikipedia articles. The complete data set less than 2 gigabytes in size, so this playing with the data is "laptop friendly."

The data looks like:

MENTION vacuum tubes 10838 http://en.wikipedia.org/wiki/Vacuum_tube
MENTION electron gun 598  http://en.wikipedia.org/wiki/Electron_gun
MENTION oscilloscope 1307 http://en.wikipedia.org/wiki/Oscilloscope
MENTION radar        1657 http://en.wikipedia.org/wiki/Radar

One possible use for this data might be to compare two (possibly multiple word) terms by looking up their Wikipedia pages, remove the stop (noise words) from both pages, and calculate a similarity based on "bag of words", etc. Looks like a great resource!

Another great data set from Google for people interested in NLP (natural language processing) is the Google ngram data set that has ngram sets for "n" in the range [1,5]. This data set is huge and not "laptop friendly" so last year I leased very large memory server from hetzner.de for a few months while I used the ngram data sets. I wish that I still had this data online but the cost of the server eventually became greater than the value of ready access to the data. The next time I need it I am planning on configuring a large memory EC2 instance with enough EBS storage for the data, indices, and application specific stuff - then I can stop the large memory instance when I don't need the data online which is probably 99% of the time: most of the costs will just be for the EBS storage itself, and not the (approximately) $0.50/hour when I keep the instance running.

Edit: I just did the math: renting a Hetzner server turns out to be much less expensive than using an EC2 instance that is usually spun down because 1 terabyte of EBS storage is $100/month (almost double what a Hetzner server costs).

Search This Blog

Google Research's wiki-links data set

Comments

Post a Comment

Popular posts from this blog

I am moving back to the Google platform, less excited by what Apple is offering

My Dad's work with Robert Oppenheimer and Edward Teller

Time and Attention Fragmentation in Our Digital Lives

Clojure vs. Scala smackdown

Nice: OpenCyc version 4.0 has been released

Ruby Sinatra web apps with background work threads

Small example app using Ember.js and Node.js

Writing a simple SQL data source for the free LGPL version of SmartGWT

Using the Datomic free edition in a lein based project

Comparing Clojure + Clojurescript with Scala + Scala.js

Happy New Year

And the best JVM replacement language for Java is: Java?

History in the making: first Lee Sedol vs. AlphaGo match game