Tuesday, July 29, 2008

New cuil.com search site and other alternative search engines

As someone who has spent a lot of my own time experimenting with Nutch, I have long wanted to create my own "niche" search site that indexes only technology sites and clusters results into categories. So, I am a little envious of the ex-Google employees and their family/friends who (reportedly) had $30 million of venture capital to start the cuil.com search site. Although a lot of the images don't seem to match the search results, cuil.com looks pretty good - I especially like the "Explore by Category" tab, which works similarly to another favorite search site, clusty.com. "Explore by Category" is both cool and useful!

It is interesting that new search engines can attract a lot of venture capital. With Google, Microsoft, and Yahoo all making very large investments in search, investors must be nervous, but there is the upside of large financial gains if a search startup captures a good fraction of the market.

Monday, July 28, 2008

Open data sources like Metaweb, Wikipedia, and SEC Edgar database

I just read a blog post from a few months ago by Toby Segaran (author of the very useful book Programming Collective Intelligence) on link data for board of directors members shared between large corporations. Many years ago I did something similar using combined CIA Factbook and SEC Edgar data, and I still have a SQL dump file on my Open Source web page.

Since Toby works at Metaweb, he fetched the corporate director link data from Metaweb (Freebase). Freebase sets a high standard for the ease of finding and extracting information. Other sources like Wikipedia (via custom web scraping or fetching their entire database) or the RDF extraction of Wikipedia (DBpedia) are not as simple to use, but are still useful.

I have a long history of organizing and cataloging information, starting in the 1980s at SAIC. Back in the pre-gopher days, I used to maintain lists (as plain text files) of where to find useful tools and information on FTP sites on the Internet. When someone asked me where to find something, I would grep my own lists. Things have improved a lot since then :-)

I just finished the rough draft for an article on the Semantic Web this morning. Although standards like RDF/RDFS/OWL/SPARQL are very useful, I expect the Semantic Web to also have a strong ad hoc component. However, ad hoc information sources may have standard interfaces built for them (e.g., SPARQL endpoints).

Thursday, July 24, 2008

Dynamic language 'goodness': comparing JRuby and Java Semantic Web example programs

Although there are several Semantic Web libraries or frameworks that I like to use, I had to choose just one for a DevX article that I am finishing up. I chose to use Sesame. After covering what I think are some "big wins" of using RDF/RDFS/OWL (for some applications), I present some example programs that I hope readers will have lots of fun with. The "wrapper" library that I wrote for Sesame works fine for both Java (which Sesame is written in) and JRuby. I must say that for experimenting with Sesame, JRuby is a lot nicer because the example programs are much shorter and, with Ruby duck typing, it is easier to write callback handlers, etc. for my wrapper library. Being able to work interactively in a JRuby jirb shell is also a big win for experimenting with code, different SPARQL queries, etc.

Thursday, July 17, 2008

Programming for small devices

Several years ago I did a few projects for the "Java cell phone" (J2ME) platform, and had a lot of fun with that.

After recently setting up NetBeans with the Java ME CDC tools and Eclipse with the most recent Android platform tools, late last night and early this morning I installed Apple's latest developer tools, which include the iPhone SDK and Dashcode. Since I very much like my Nokia N800, I am also interested in medium resolution devices (the N800 has a good 800x480 screen).

My interest is in writing web portals that support both browsers and small devices. One option is just creating special CSS for different web browser screen sizes, and another option is rendering page view data as XML or JSON and letting rich clients provide the display and handling of forms, etc. (an option I used several years ago on a customer project).

Ideally, I would like to be able to support a wide variety of small devices without a very large investment of my time getting (back) up to speed. I have just a little experience with Objective-C and Cocoa, so for the iPhone, just using Dashcode looks like a good option (for me).

Sunday, July 13, 2008

I am evaluating Google's Protocol Buffers for my knowledgebooks.com KB_bundle product

I am working on a new Java version of my knowledgebooks.com KB_bundle product (see home page for an overview) that implements an all-in-one toolbox for Natural Language Processing (NLP), entity extraction from text, text summarizing, text clustering, knowledge extraction to RDF/RDFS, support for document management (file management, index/search), and SPARQL queries of either embedded or external RDF data stores. KB_bundle will be free for non-commercial use and evaluation, and available for a fee for commercial use.

While I designed KB_bundle as an embedded Java library, I have always planned for both RESTful and SOAP web service support. I have been looking at Google's Protocol Buffers documentation and examples this weekend, and I think that I will also supply a third wrapper for Protocol Buffers RPC support.

Earlier this year, a project that I was working on had performance problems due to the overhead of serializing data to XML and then parsing it in a REST-based system. When the project started, relatively little data was transferred between the back end processes and a front end Rails application, so the overhead of using XML was OK. As the project requirements changed, we passed much more data encoded in XML. I am looking at Protocol Buffers in general as a way to avoid this kind of performance problem in the future.
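
As a rough illustration of why a binary encoding can be much more compact than XML, here is a minimal Java sketch of base-128 "varint" encoding, the scheme that the Protocol Buffers wire format uses for integers. It does not use the Protocol Buffers library itself: the class name, the method names, and the `count` XML element are all made up for this example, and a real Protocol Buffers message would also carry a small field tag in front of each value.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the Protocol Buffers library.
final class VarintSketch {

    // Encode an int as a base-128 varint: 7 payload bits per byte,
    // least significant group first, high bit set on all but the last byte.
    static byte[] encodeVarint(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    // Decode a single varint from the start of the byte array.
    static int decodeVarint(byte[] bytes) {
        int result = 0;
        int shift = 0;
        for (byte b : bytes) {
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break; // high bit clear marks the last byte
            }
            shift += 7;
        }
        return result;
    }

    // The same value wrapped in a made-up XML element, for a size comparison.
    static byte[] encodeAsXml(int value) {
        return ("<count>" + value + "</count>").getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        int value = 300;
        System.out.println("varint bytes: " + encodeVarint(value).length);
        System.out.println("XML bytes:    " + encodeAsXml(value).length);
    }
}
```

For the value 300 the varint form is 2 bytes and the XML form is 18 bytes. This only compares encoded size; the parsing cost that hurt us on the project above is a separate (and probably larger) factor that this sketch does not measure.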

Saturday, July 12, 2008

OpenDS 1.0 LDAPv3 server

OpenDS 1.0 LDAP server has just been released and was easy to install, configure, and run. One thing that I especially like is that it is set up by default to run nicely in a development environment (including test data to play with) with directions for reconfiguring for production use with replication.

I used the JNLP setup file (hitting this link) and accepted the standard install options (installed in my home directory in ~/OpenDS). There are command line clients for testing the installation and configuration; for example:
markw$ bin/ldapsearch --hostname localhost --port 1389 --baseDN "dc=example,dc=com" --searchScope base "(objectClass=*)"
dn: dc=example,dc=com
objectClass: domain
objectClass: top
dc: example
and then you can use JNDI APIs for Java client LDAP enabled applications. I think that Sun is going to offer good support for Glassfish + OpenDS (if they don't already). BTW, I have many years of good experiences developing on the Tomcat platform (and a little less use of JBoss), but I am becoming more enthusiastic about Glassfish, integration with NetBeans, etc. The days of consultants developing their own private sets of infrastructure tools are just about over: for me, I look to either a subset of J2EE or Ruby on Rails to save development time on projects. Except for developing my own tools that are very application domain specific (usually AI, text and data mining, NLP, etc.), I prefer spending time studying and using standard frameworks, plugins, and components.
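
Coming back to the JNDI client access mentioned above, here is a minimal sketch of what the same base-scope search might look like from Java. It assumes the default development install from this post (localhost, port 1389, anonymous access, base DN dc=example,dc=com) and uses only the standard JNDI classes in the JDK; the class and method names are made up for this example, and I have not tuned it for production use (no authentication, SSL, or connection pooling).

```java
import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.NamingEnumeration;
import javax.naming.NamingException;
import javax.naming.directory.DirContext;
import javax.naming.directory.InitialDirContext;
import javax.naming.directory.SearchControls;
import javax.naming.directory.SearchResult;

// Hypothetical demo class; host, port, and base DN are assumptions
// taken from the ldapsearch example in this post.
final class LdapSearchSketch {

    // Build the JNDI environment for an anonymous LDAP connection.
    static Hashtable<String, Object> ldapEnvironment(String host, int port) {
        Hashtable<String, Object> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL, "ldap://" + host + ":" + port);
        return env;
    }

    // Rough equivalent of: ldapsearch --searchScope base "(objectClass=*)"
    static void printBaseEntry(String host, int port, String baseDn) throws NamingException {
        DirContext ctx = new InitialDirContext(ldapEnvironment(host, port));
        try {
            SearchControls controls = new SearchControls();
            controls.setSearchScope(SearchControls.OBJECT_SCOPE); // base entry only
            NamingEnumeration<SearchResult> results =
                ctx.search(baseDn, "(objectClass=*)", controls);
            while (results.hasMore()) {
                SearchResult entry = results.next();
                System.out.println("dn: " + entry.getNameInNamespace());
                System.out.println(entry.getAttributes());
            }
        } finally {
            ctx.close();
        }
    }

    public static void main(String[] args) {
        try {
            printBaseEntry("localhost", 1389, "dc=example,dc=com");
        } catch (NamingException e) {
            // Most likely cause: no directory server listening on that port.
            System.out.println("LDAP lookup failed: " + e);
        }
    }
}
```

With the OpenDS test server running, this should print the dc=example,dc=com entry and its attributes; without a server it just reports the connection failure.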