Don't repeat yourself: for code, sure, but how about for data?
I periodically like to consolidate source code: keep a single latest svn trunk version on my system and organize the code that I have written and frequently reuse into libraries. I am in the process of packaging up most of my Ruby code in local gems.
I also have issues with many copies of textual data files. For data used in Java libraries and applications the solution is simple: I keep data with the code that needs it in JAR files that are kept in a single library directory on my development system. I have been doing this for over 10 years and this is a really nice way to keep data assets and code together.
Sometimes I simply link data statically into compiled applications that I use. For example, in the last year I have reimplemented many of my statistical NLP tools in Gambit-C Scheme, and I generate a single command line utility program with all the required data statically linked.
For data assets used in programs developed in multiple programming languages, a "separation of concerns" between code and data assets makes more sense.
I need to better organize other data assets like tagged training data, raw text organized into a hierarchy of categories, data that I have culled from the web and stored in XML files, etc. I am starting the process of putting the most up-to-date versions into a single directory and tweaking my code to check the value of the DATA environment variable and then load data assets as needed. I will probably not import this data directory into svn or git: most of the data seldom changes and some of the assets are huge.
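As a rough illustration of the DATA environment variable idea, here is a minimal sketch in Python (chosen only for the example; the same pattern carries over to Ruby, Java, or Scheme). The function names data_root and load_text_asset and the file path in the usage note are hypothetical, not names from my actual code, and the sketch assumes the assets are UTF-8 text files.

```python
import os
from functools import lru_cache
from pathlib import Path


def data_root(env=None):
    """Return the shared data directory named by the DATA environment variable.

    `env` defaults to os.environ; passing a dict makes the lookup easy to test.
    """
    env = os.environ if env is None else env
    root = env.get("DATA")
    if not root:
        raise RuntimeError("Set DATA to the directory that holds the data assets")
    path = Path(root)
    if not path.is_dir():
        raise RuntimeError(f"DATA points to a missing directory: {path}")
    return path


@lru_cache(maxsize=None)
def load_text_asset(relative_path):
    """Read a text asset on first use, then serve it from an in-memory cache.

    Hypothetical helper: the cache is keyed on the relative path only, so it
    assumes DATA does not change while the program is running.
    """
    return (data_root() / relative_path).read_text(encoding="utf-8")
```

A program would then call something like load_text_asset("categories/news.txt") (a made-up path) the first time it needs that file. Nothing is read at startup, so huge assets that a given run never touches cost nothing.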