Don't repeat yourself: for code, sure, but how about for data?

November 09, 2008

In individual applications we want to make sure that we don't have replicated code that is identical or very similar. What about replication across all projects, both active and in "freeze mode"?

I periodically like to consolidate source code: keep a single latest svn trunk version on my system and organize the code that I have written and frequently reuse into libraries. I am in the process of packaging up most of my Ruby code in local gems.

I also have issues with many copies of textual data files. For data used in Java libraries and applications the solution is simple: I keep data with the code that needs it in JAR files that are kept in a single library directory on my development system. I have been doing this for over 10 years and this is a really nice way to keep data assets and code together.

Sometimes I simply link data statically into compiled applications that I use. For example, in the last year I have reimplemented many of my statistical NLP tools in Gambit-C Scheme, and I generate a single command line utility program with all the required data statically linked.

For data assets used in programs developed in multiple programming languages, a "separation of concerns" between code and data assets makes more sense.

I need to better organize other data assets like tagged training data, raw text organized into a hierarchy of categories, data that I have culled from the web and stored in XML files, etc. I am starting the process of putting the most up-to-date versions into a single directory and tweaking my code to check the value of the DATA environment variable and then load data assets as needed. I will probably not import this data directory into svn or git: most of the data seldom changes and some of the assets are huge.
