Java trick for reading OpenOffice.org files

January 08, 2005

I have read several times about people having problems reading the zipped XML files inside OpenOffice.org documents. The problem is that SAX parsers cannot locate a local copy of the office.dtd file. I have been using a kludge to get around this for a long time without any problems: when reading the input stream from the ZIP file entry named "content.xml", skip past the second ">" character, which drops the XML declaration and the DOCTYPE line that references office.dtd:


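// zf is the opened ZIP file; zipEntry is its "content.xml" entry
InputSource is =
    new InputSource(zf.getInputStream(zipEntry));
InputStream r = is.getByteStream();
// Read up to and including the second '>' (give up after 500 bytes)
for (int i=0, count = 0; i<500; i++) {
    if ((char)r.read() == '>') count++;
    if (count > 1)  break;
}
// The parser now starts at the root element and never sees the DOCTYPE
SAXParser p = saxFactory.newSAXParser();
p.parse(r, new OpenOffice.OpenOfficeSaxHandler());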
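
The fragment above relies on variables (zf, zipEntry, saxFactory) and a handler class that are defined elsewhere. As a rough, self-contained sketch of the same trick, the version below works on any InputStream. The class and method names (SkipProlog, skipPast, rootElement) are made up for illustration, and the handler only records the name of the root element:

```java
import java.io.IOException;
import java.io.InputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of the original post.
class SkipProlog {

    /** Consumes bytes up to and including the n-th '>', stopping early at EOF or after limit bytes. */
    static void skipPast(InputStream in, int n, int limit) throws IOException {
        int count = 0;
        for (int i = 0; i < limit && count < n; i++) {
            int b = in.read();
            if (b == -1) break;      // end of stream
            if (b == '>') count++;
        }
    }

    /** Skips the XML declaration and DOCTYPE, then parses and returns the root element name. */
    static String rootElement(InputStream in) throws Exception {
        skipPast(in, 2, 500);
        final String[] root = new String[1];
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(in, new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (root[0] == null) root[0] = qName;   // remember only the first element
            }
        });
        return root[0];
    }
}
```

With a real document you would pass in zf.getInputStream(zipEntry) for the "content.xml" entry. Note that this assumes the file really does start with an XML declaration followed by a DOCTYPE; if either is missing, skipping two ">" characters would cut into the document itself.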

Hopefully in the future people having this problem will find this post when doing a web search and save themselves a little time. Another good alternative is to make office.dtd available on your system and put it on your classpath.
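
For the classpath alternative, the parser still has to be told where to look, which is normally done with a SAX EntityResolver. The following is only a sketch under my own assumptions: the class name ClasspathDtdResolver is hypothetical, it looks up each external entity by its file name at the root of the classpath, and when nothing is found it falls back to an empty DTD so that parsing can continue without office.dtd at all:

```java
import java.io.InputStream;
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical resolver, not part of the original post.
class ClasspathDtdResolver implements EntityResolver {

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        if (systemId == null) return null;               // let the parser decide
        String name = systemId.substring(systemId.lastIndexOf('/') + 1);
        if (name.isEmpty()) return null;
        // Assumption: the DTD (e.g. office.dtd) sits at the classpath root.
        InputStream dtd = ClasspathDtdResolver.class.getResourceAsStream("/" + name);
        InputSource src = (dtd != null)
                ? new InputSource(dtd)
                : new InputSource(new StringReader("")); // empty DTD fallback
        src.setSystemId(systemId);
        return src;
    }

    /** Parses xml with this resolver installed, sending events to handler. */
    static void parse(InputStream xml, DefaultHandler handler) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setEntityResolver(new ClasspathDtdResolver());
        reader.setContentHandler(handler);
        reader.parse(new InputSource(xml));
    }
}
```

Unlike the skipping trick, this leaves the input stream untouched, at the cost of a little more setup.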

Comments

Dipti 10:59 PM
Hi Mark,

I'm working on a Lucene project for searching in OpenOffice documents. I followed the instructions in your article at http://www.devx.com/java/Article/27728/1954, but I didn't understand what the MyDTD file is.

The problem is that my application searches fine in pdf, doc, rtf, xml and txt files, but it is not searching in .odt, .sxw, .ppt etc., basically all OpenOffice documents.

please help

Regards,

Dipti