Monday, February 25, 2013

Building custom data stores

Creating a custom datastore may seem like a bad idea when great open source tools like Postgres, MongoDB, and CouchDB are available, along with good commercial products such as Datomic, AllegroGraph, and Stardog. Still, the frustration of not having just what I needed for a project (more on requirements later) convinced me to spend some time building my own datastore on top of some available open source libraries.

Much of the motivation for this work is to make possible the development of a larger turnkey information appliance. I have been using MongoDB for this, but even with an application-specific wrapper MongoDB has been a little awkward for my requirements, which are:

  • I want a reasonably efficient document store that supports the usual CRUD operations on arbitrary Clojure maps (which can be nested to any depth). Clojure maps are basically what I use to contain and use data so I wanted a datastore that supports this, simply.
  • I want all text in documents (embedded at any depth in the document) to be searchable.
  • I need to be able to annotate data stored in documents, and sometimes relationships between documents.
  • My preferred notation for annotating data is RDF.
  • I need to be able to efficiently perform SPARQL queries on the RDF annotations.
  • Coupling between documents and RDF: auto delete of any triples referencing a document ID, if the referenced document is deleted.
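
The last requirement, the coupling between documents and RDF, can be illustrated with a minimal in-memory sketch. This is not the code from my project: the two atoms stand in for the document store and the RDF store, and the names docs, triples, add-doc!, add-triple!, references?, and delete-doc! are all hypothetical. It assumes that a triple references a document when the document ID appears as the triple's subject or object.

```clojure
;; In-memory sketch only; the atoms stand in for the real stores.
(def docs (atom {}))       ; doc-id -> Clojure map
(def triples (atom #{}))   ; set of [subject predicate object] vectors

(defn add-doc! [id doc]
  (swap! docs assoc id doc))

(defn add-triple! [s p o]
  (swap! triples conj [s p o]))

(defn references? [id [s _ o]]
  ;; Assumption: a triple references a document if the ID is its
  ;; subject or its object.
  (or (= s id) (= o id)))

(defn delete-doc! [id]
  ;; Remove the document, then every triple that mentions its ID.
  (swap! docs dissoc id)
  (swap! triples (fn [ts] (set (remove #(references? id %) ts)))))

;; Usage
(add-doc! "doc:1" {:name "sue jones"})
(add-doc! "doc:2" {:name "bob smith"})
(add-triple! "doc:1" "ex:knows" "doc:2")
(add-triple! "doc:2" "ex:topic" "clojure")
(delete-doc! "doc:1")
```

After the delete, only the "ex:topic" triple is left, because the "ex:knows" triple had the deleted document as its subject. In a real store the two updates would need to happen in one transaction, or the cleanup would need to be retried on failure.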

Initially I was going to write a wrapper library using two datastores offered as SaaS products: Cloudant (for CouchDB with Lucene indexing) and (for an RDF datastore, with extras). A small wrapper API would have made this all work, but since a lot of what I am doing is in the experimenting phase I decided that I didn't want to use remote web services for coding experiments. Using these services with a wrapper would be nice for production, but not for hacking.

Anyway, I have built a small project that uses HSQLDB (relational database) and Sesame (RDF :

EDIT: Patrick Logan asked about my use of HSQLDB; not specific to HSQLDB really, but here is the important code (hand edited to try to get it to fit on this web page) for adding documents that are nested maps, indexing them, and searching (note: I usually use Clucy/Lucene for search in Clojure code, but for what I am doing right now, this suffices):

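;; Assumed requires, not shown in the original post; hsql-db (the HSQLDB
;; connection spec) is defined elsewhere in the project.
(require '[clojure.java.jdbc :as sql] '[clojure.data.json :as json]
         '[clojure.string] '[clojure.walk :refer [postwalk]])

(defn index-if-str [x id]
  ;; Index only string values; any other value is skipped.
  (when (string? x)
    (sql/with-connection hsql-db
      (doseq [token (map (fn [s] (.toLowerCase s))
                         (clojure.string/split x #"[ ;.,]+"))]
        ;; Skip the empty token that split yields for a leading delimiter.
        (when (seq token)
          (sql/insert-record "search" {:doc_id id :word token}))))))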

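(defn insert-doc [doc]
  ;; Store the whole map as JSON; insert-record returns the generated keys.
  (let [id
        (:id (sql/with-connection hsql-db
               (sql/insert-record
                 "docs" {:json (json/write-str doc)})))]
    ;; Walk every nested value and index each string under the new id.
    (postwalk (fn [x] (index-if-str x id)) doc)))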

;; (insert-doc {:foo "bar" :i 101 :name "sue jones"})

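(defn search [s]
  ;; Lowercase and tokenize the query the same way documents are indexed.
  (let [tokens (remove empty?
                 (map #(.toLowerCase %) (clojure.string/split s #"[ ;.,]+")))
        ;; One "?" placeholder per token, so tokens are bound as parameters
        ;; instead of being spliced into the SQL text.
        placeholders (apply str (interpose ", " (repeat (count tokens) "?")))
        rows (if (empty? tokens)
               []
               (sql/with-connection hsql-db
                 (sql/with-query-results results
                   (into [(str "select * from search where word in ("
                               placeholders ")")]
                         tokens)
                   (into [] results))))]
    ;; Count identical rows and sort by count, highest first.
    (sort (fn [a b] (compare (second b) (second a)))
          (into [] (frequencies rows)))))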
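
To make the ranking idea easier to play with, here is a small in-memory variation that needs no database. It is a sketch under my own assumptions, not the project code: tokenize, doc-words, and rank are hypothetical names, and instead of counting rows it scores each document by how many distinct query tokens it contains.

```clojure
(require '[clojure.string :as str])

(defn tokenize [s]
  ;; Same delimiters as the indexing code above; empty tokens are dropped.
  (remove empty? (map str/lower-case (str/split s #"[ ;.,]+"))))

(defn doc-words [doc]
  ;; Distinct tokens from every string nested at any depth in the map.
  (set (mapcat tokenize (filter string? (tree-seq coll? seq doc)))))

(defn rank [docs query]
  ;; docs is a map of id -> document.
  ;; Returns [id match-count] pairs, best match first; non-matches are dropped.
  (let [q (set (tokenize query))]
    (->> docs
         (map (fn [[id doc]] [id (count (filter q (doc-words doc)))]))
         (remove (fn [[_ n]] (zero? n)))
         (sort-by second >))))

;; Usage
(def ranked
  (rank {1 {:name "sue jones" :address {:city "Sedona"}}
         2 {:name "bob jones"}}
        "sue jones sedona"))
```

Here document 1 matches all three query tokens and document 2 matches only "jones", so document 1 sorts first.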

Friday, February 15, 2013

Using the Microsoft Translation APIs from Java, Clojure, and JRuby

I wrote last July about my small bit of code on GitHub that wraps the Microsoft Bing Search APIs. I recently extended this to also wrap the Translation APIs using the open source microsoft-translator-java-api project on Google Code. I only provide a thin wrapper around microsoft-translator-java-api, so if you are working in Java you should just use their library directly.

Hopefully this will save you some time if you need to use the translation services. The free tier for the translation services is currently 2 million characters translated per month.

Saturday, February 02, 2013

Goodness of micro frameworks and libraries

I spent 10+ years using large frameworks, mainly J2EE and Ruby on Rails. A large framework is a community and set of tools that really frames our working lives. I have received lots of value and success from J2EE and Rails, but in the last few years I have grown to prefer micro frameworks like Sinatra (Ruby) and Compojure + Noir + Hiccup (Clojure).

Practitioners who have mastered one of the larger frameworks like Rails, J2EE, Spring, etc. can sometimes impressively and quickly prototype and then build large functioning systems. I had an odd thought this morning, and the more I mull it over, the more it makes sense to me: large frameworks seem to be optimized for consultants and consulting companies for the quick kill: get in, build the most impressive system possible with the minimum resources, and leave after finishing a successful project. This is an oversimplification, but seems to be true in many cases.

The flip side to the initial productivity of large frameworks is a very real "tax" in long term maintenance, because there are so many interrelated components in a system, including components that might be unused or only lightly used.

Micro frameworks are designed to do one or just a few things well, so other third party libraries and plugins need to be chosen and integrated. This does take extra time, but then the overall codebase with dependencies is smaller and focussed (mostly) on just what is required. I view using only micro frameworks and libraries, and composing systems from distinct services, as a strategy for reducing the long term resource costs of systems.

I still invest a fair amount of time tracking larger frameworks, recently being especially interested in what is available in Rails 4 and the latest SmartGWT (a nice framework for writing both web client and server side code in Java - lots of functionality, but not as great for quick agile development in my opinion).