I am very pleased to be helping the Common Crawl Organization

July 07, 2014

Originally published January 13, 2014

I am setting aside some of my time to volunteer with CommonCrawl.org.

Much of the information in the world is now digitized and on the web. Search engines give people only a tiny view of the web, sort of like shining a low-powered flashlight around in the forest at night. The Common Crawl provides the data from billions of web pages as compressed web archive files in Amazon S3 storage, and thus allows individuals and organizations to inexpensively access much of the web for whatever information they need - like turning the lights on :-)

The crawl is now in a different file format. My first project is working on programming examples and how-to material for using this new format.
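To give a feel for what such how-to material might look like, here is a minimal sketch in Python. It assumes the new format is WARC (this post does not name it) and that each record is a version line, a block of headers, a blank line, and then a body of Content-Length bytes. The helper names iter_warc_records and make_record are my own, made up for illustration, and the sample data is built in memory rather than fetched from S3.

```python
import gzip
import io


def iter_warc_records(stream):
    """Yield (headers, body) pairs from an uncompressed WARC byte stream.

    Assumes the simple layout described above; this is an illustrative
    parser, not a full implementation of the WARC specification.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"unexpected line: {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        length = int(headers.get("Content-Length", 0))
        yield headers, stream.read(length)


def make_record(rec_type, uri, payload):
    """Build one WARC-style record (hypothetical helper for the demo)."""
    head = (
        "WARC/1.0\r\n"
        f"WARC-Type: {rec_type}\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + payload + b"\r\n\r\n"


if __name__ == "__main__":
    # Each record is compressed as its own gzip member, then concatenated.
    sample = gzip.compress(
        make_record("response", "http://example.com/", b"<html>hi</html>")
    ) + gzip.compress(
        make_record("response", "http://example.org/", b"<html>bye</html>")
    )
    with gzip.GzipFile(fileobj=io.BytesIO(sample)) as stream:
        for headers, body in iter_warc_records(stream):
            print(headers["WARC-Target-URI"], len(body))
```

For real crawl files, a maintained WARC library is likely a better choice than a hand-rolled parser like this one; the sketch is only meant to show the record structure.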
