Text search in SimpleDB: a Ruby example
You might want to use SimpleDB both for storage and for text indexing and search if you do not want to run and administer Solr yourself. Here is a little snippet that shows how to store searchable documents in SimpleDB:
    require 'rubygems'
    require 'aws_sdb'

    SERVICE = AwsSdb::Service.new

    # assuming that this domain is already created
    DOMAIN = "some_test_domain_7854854"

    class Document
      # Stores the document under its name, along with the
      # unique lowercased words from the name and the text.
      def initialize name, text
        words = (name + ' ' + text).downcase.split.uniq
        attributes = {:words => words, :text => text}
        SERVICE.put_attributes(DOMAIN, name, attributes)
      end

      def Document.search query
        # One SimpleDB query per search term. The last
        # inject takes the intersection and ensures
        # that all search terms are present:
        keys = query.downcase.split.collect {|x|
          SERVICE.query(DOMAIN,
            "['words' starts-with '#{x}']")[0]
        }.inject {|x, y| x & y }
        keys.collect {|key|
          SERVICE.get_attributes(DOMAIN, key)}
      end
    end

    Document.new('title1', 'The bird flew to the lake for water')
    Document.new('title2', 'The dog chased the cat')
    p Document.search 'flew lake'

The formatting of this code snippet is odd because I was trying to get short lines to fit the page width. The snippet is not terribly efficient: it makes one SimpleDB query per search term, plus one request per matching document. However, since the first 25 Amazon SimpleDB Machine Hours consumed per month are free for your Amazon AWS account, using this code in your applications can end up being almost free (there are small data storage and bandwidth charges), and you get the advantage of no administration hassles. The output for the above code snippet is:
    [{"text"=>["The bird flew to the lake for water"],
      "words"=>["bird", "flew", "for", "lake", "the",
                "title1", "to", "water"]}]

There are two improvements that you can implement: remove noise/stop words from the words attribute, and make the code multithreaded so that the individual SimpleDB queries run in parallel where possible. I left both out to keep this example concise. For simple or moderately used applications these improvements aren't necessary.
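Here is a rough sketch of how those two improvements might look. It is not part of the original snippet: the STOP_WORDS list and the names index_words and parallel_search are hypothetical, and the block passed to parallel_search stands in for the SimpleDB query call, so the sketch runs against an in-memory hash without an AWS account.

```ruby
require 'set'

# Hypothetical stop-word list; a real one would be longer.
STOP_WORDS = Set.new(%w[a an and for of the to])

# Lowercase, split on whitespace, then drop duplicates and stop words.
def index_words(name, text)
  (name + ' ' + text).downcase.split.uniq.reject { |w| STOP_WORDS.include?(w) }
end

# Runs the given block once per search term, each in its own thread,
# then intersects the per-term key lists so that every term must match.
# The block is an assumption standing in for the per-term SimpleDB query.
def parallel_search(query)
  terms = query.downcase.split.uniq.reject { |w| STOP_WORDS.include?(w) }
  threads = terms.map { |term| Thread.new { yield term } }
  # Thread#value waits for the thread and returns the block's result.
  threads.map(&:value).inject { |x, y| x & y } || []
end

# In-memory stand-in for the SimpleDB domain: item name => indexed words.
INDEX = {
  'title1' => index_words('title1', 'The bird flew to the lake for water'),
  'title2' => index_words('title2', 'The dog chased the cat')
}

# Mimics a starts-with match on the words attribute.
lookup = lambda do |term|
  INDEX.select { |_, words| words.any? { |w| w.start_with?(term) } }.keys
end

p parallel_search('flew lake', &lookup)   # prints ["title1"]
```

In real use the block would wrap the SimpleDB query for a single term. Threads help here because the time is spent waiting on network requests, which Ruby threads can overlap. The trailing || [] returns an empty result when a query contains nothing but stop words.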
If you run this example remotely from your laptop, notice that remote SimpleDB access is a little slow. When run on a small EC2 instance, it takes about 0.05 seconds to add a "document" to SimpleDB and about 0.1 seconds to search using two search terms.