Using Lucene with JRuby

I use the Ruby Ferret indexing and search library a lot. Ferret is a port (some Ruby, mostly C) of Lucene. I have recently been getting into JRuby. A few days ago, I discovered that it was reasonably easy to run a simple Rails web application on the JBoss Java application server using JRuby (this took me an hour; next time will be easy). Today, I spent a short while getting Lucene and JRuby working together:
require "java"
require "lib/lucene-core-2.1.0.jar"

# Thin wrapper around a Lucene 2.1 index stored on disk.
class Lucene
  def initialize(an_index_path = "data/")
    @index_path = an_index_path
  end

  # Add or replace documents, e.g., [[1, "test1"], [2, "test2"]]
  def add_documents(id_text_pair_array)
    index_available = org.apache.lucene.index.IndexReader.index_exists(@index_path)
    index_writer = org.apache.lucene.index.IndexWriter.new(
      @index_path,
      org.apache.lucene.analysis.standard.StandardAnalyzer.new,
      !index_available) # create the index only if it does not exist yet
    id_text_pair_array.each {|id_text_pair|
      term_to_delete = org.apache.lucene.index.Term.new("id", id_text_pair[0].to_s) # if it exists
      a_document = org.apache.lucene.document.Document.new
      a_document.add(org.apache.lucene.document.Field.new('text', id_text_pair[1],
        org.apache.lucene.document.Field::Store::YES,
        org.apache.lucene.document.Field::Index::TOKENIZED))
      # The id is indexed untokenized so that the exact Term lookup used for
      # updates and deletes always matches the stored value.
      a_document.add(org.apache.lucene.document.Field.new('id', id_text_pair[0].to_s,
        org.apache.lucene.document.Field::Store::YES,
        org.apache.lucene.document.Field::Index::UN_TOKENIZED))
      index_writer.updateDocument(term_to_delete, a_document) # delete any old docs with same id
    }
    index_writer.close
  end

  # Returns an array of [score, id, text] triples for the matching documents.
  def search(query)
    query_parser = org.apache.lucene.queryParser.QueryParser.new(
      'text',
      org.apache.lucene.analysis.standard.StandardAnalyzer.new)
    parsed_query = query_parser.parse(query)
    engine = org.apache.lucene.search.IndexSearcher.new(@index_path)
    hits = engine.search(parsed_query).iterator
    results = []
    while hits.hasNext
      hit = hits.next
      id = hit.getDocument.getField("id").stringValue.to_i
      text = hit.getDocument.getField("text").stringValue
      results << [hit.getScore, id, text]
    end
    engine.close
    results
  end

  # Delete documents by id, e.g., [1, 5, 88]
  def delete_documents(id_array)
    index_available = org.apache.lucene.index.IndexReader.index_exists(@index_path)
    index_writer = org.apache.lucene.index.IndexWriter.new(
      @index_path,
      org.apache.lucene.analysis.standard.StandardAnalyzer.new,
      !index_available)
    id_array.each {|id|
      index_writer.deleteDocuments(org.apache.lucene.index.Term.new("id", id.to_s))
    }
    index_writer.close
  end
end
This code assumes that the Java Lucene JAR file lucene-core-2.1.0.jar is in the subdirectory lib. A short test program is:
require "lucene"
require 'pp'

ls = Lucene.new
ls.add_documents([[1,"test one two"],[2,'testing 1 2 3'], [3,'this is a longer test string']])
ls.delete_documents([1]) # optional: test document delete from index
pp ls.search("test")
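Since search returns plain Ruby arrays of the form [score, id, text], the results are easy to post-process without touching the Java API again. Here is a minimal sketch that runs in plain Ruby on made-up sample data. The helper name hits_above, the sample scores, and the 0.5 threshold are my own assumptions for illustration and are not part of Lucene:

```ruby
# Sample data in the same [score, id, text] shape that Lucene#search returns.
# The scores are invented for illustration; real scores come from Lucene.
sample_results = [
  [0.62, 1, "test one two"],
  [0.31, 3, "this is a longer test string"]
]

# Hypothetical helper: keep only hits whose score is at least min_score and
# convert each triple into a hash with named keys.
def hits_above(results, min_score)
  kept = results.select { |score, _id, _text| score >= min_score }
  kept.map do |score, id, text|
    { :score => score, :id => id, :text => text }
  end
end

# Print one line per remaining hit.
hits_above(sample_results, 0.5).each do |hit|
  puts format("%.2f  doc %d  %s", hit[:score], hit[:id], hit[:text])
end
```

Lucene normally returns hits ordered by relevance, so the sketch keeps the incoming order and only filters and relabels the triples.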
I had some hesitations about JRuby: I was concerned that it would lack the lightweight feel of hacking in native Ruby. No worries though: JRuby is easy and quick to work with.

Comments

  1. Very nice...I think there's potential here. Perhaps there's a way to make something that joins ferret and lucene syntaxes, but uses Lucene where appropriate under the covers?

  2. Hello Charles,

    I thought of that, letting people switch between:

    require 'ferret'

    or:

    require 'lucene'

    My example here was just a quick hack. BTW, I was pleased at how easy it was to run a Rails application on JBoss - that was cool!
