Agile Web Development

Build it. Launch it. Love it.

xapian_db

What’s in the box?

XapianDb is a ruby gem that combines features of nosql databases and fulltext indexing into one piece. The result: Rich documents and very fast queries. It is based on {Xapian}[http://xapian.org/], an efficient and powerful indexing library.

XapianDb is inspired by {xapian-fu}[https://github.com/johnl/xapian-fu] and {xapit}[https://github.com/ryanb/xapit]. Thank you John and Ryan for your great work. It helped me learning to understand the xapian library and I borrowed an idea or two from you ;-)

Why yet another indexing gem?

In the good old days I used {ferret}[https://github.com/dbalmain/ferret] and {acts_as_ferret}[https://github.com/jkraemer/acts_as_ferret] as my fulltext indexing solution and everything was fine. But time moved on and Ferret didn’t.

So I started to rethink fulltext indexing again. I looked for something that

  • is under active development
  • is fast
  • is lightweight and easy to install / deploy
  • is framework and database agnostic and works with pure POROS (plain old ruby objects)
  • is configurable anywhere, not just inside the model classes; I think that index configurations should not be part of the domain model
  • supports document configuration at the class level, not the database level; each class has its own document structure
  • integrates with popular Ruby / Rails ORMs like ActiveRecord or Datamapper through a plugin architecture
  • returns rich document objects that do not necessarily need a database roundtrip to render the search results (but know how to get the underlying object, if needed)
  • updates the index realtime (no scheduled reindexing jobs)
  • supports all major features of a full text indexer, namely wildcards!!

I tried hard but I couldn’t find such a thing so I decided to write it, based on the Xapian library.

If you found a bug or are looking for a missing feature, please post to the {Google Group}[http://groups.google.com/group/xapian_db]

Requirements

  • ruby 1.9.2 or newer
  • rails 3.0 or newer (if you want to use it with rails)

Getting started

If you want to use xapian_db in a Rails app, you need Rails 3 or newer.

For a first look, look at the examples in the examples folder. There’s the simple ruby script basic.rb that shows the basic usage of XapianDB without rails. In the basic_rails folder you’ll find a very simple Rails app unsing XapianDb.

The following steps assume that you are using xapian_db within a Rails app.

Configure your databases

Without a config file, xapian_db creates the database in the db folder for development and production environments. If you are in the test environment, xapian_db creates an in memory database. It assumes you are using ActiveRecord.

You can override these defaults by placing a config file named ‘xapian_db.yml’ into your config folder. Here’s an example:

  # XapianDb configuration
  defaults: &defaults
    adapter: datamapper # Avaliable adapters: :active_record, :datamapper
    language: de        # Global language; can be overridden for specific blueprints

  development:
    database: db/xapian_db/development
    <<: *defaults

  test:
    database: ":memory:" # Use an in memory database for tests
    <<: *defaults

  production:
    database: db/xapian_db/production
    <<: *defaults

If you do not configure settings for an environment in this file, xapian_db applies the defaults.

Configure an index blueprint

In order to get your models indexed, you must configure a document blueprint for each class you want to index:

  XapianDb::DocumentBlueprint.setup(Person) do |blueprint|
    blueprint.attribute :name, :weight => 10
    blueprint.attribute :first_name
  end

The example above assumes that you have a class Person with the methods name and first_name. Attributes will get indexed and are stored in the documents. You will be able to access the name and the first name in your search results.

If you want to index additional data but do not need access to it from a search result, use the index method:

  blueprint.index :remarks, :weight => 5

If you want to declare multiple attributes or indexes with default options, you can do this in one statement:

  XapianDb::DocumentBlueprint.setup(Person) do |blueprint|
    blueprint.attributes :name, :first_name, :profession
    blueprint.index      :notes, :remarks, :cv
  end

Note that you cannot add options using this mass declaration syntax (e.g. blueprint.attributes :name, :weight => 10, :first_name is not valid).

Use blocks for complex evaluations of attributes or indexed values:

  XapianDb::DocumentBlueprint.setup(IndexedObject) do |blueprint|
    blueprint.attribute :complex do
      if @id == 1
        "One"
      else
        "Not one"
      end
    end
  end

You may add a filter expression to exclude objects from the index. This is handy to skip objects that are not active, for example:

    XapianDb::DocumentBlueprint.setup(Person) do |blueprint|
      blueprint.attributes :name, :first_name, :profession
      blueprint.index      :notes, :remarks, :cv
      blueprint.ignore_if {active == false}
    end

You can override the global adapter configuration in a specific blueprint. Let’s say you use ActiveRecord, but you have one more class that is not stored in the database, but you want it to be indexed:

XapianDb::DocumentBlueprint.setup(SpecialClass) do |blueprint|

  blueprint.adapter :generic
  blueprint.index   :some_stuff

end

place these configurations either into the corresponding class or - I prefer to have the index configurations outside the models - into the file config/xapian_blueprints.rb.

Update the index

xapian_db injects some helper methods into your configured model classes that update the index automatically for you when you create, save or destroy models. If you already have models that should now go into the index, use the method rebuild_xapian_index:

  Person.rebuild_xapian_index

To get info about the reindex process, use the verbose option:

  Person.rebuild_xapian_index :verbose => true

In verbose mode, XapianDb will use the progressbar gem if available.

To rebuild the index for all blueprints, use

  XapianDb.rebuild_xapian_index

Query the index

A simple query looks like this:

  results = XapianDb.search "Foo"

You can use wildcards and boolean operators:

  results = XapianDb.search "fo* or baz"

You can query attributes:

  results = XapianDb.search "name:Foo"

You can query objects of a specific class:

  results = Person.search "name:Foo"

You can search for exact phrases:

  results = XapianDb.search('"this exact sentence"')

If you want to paginate the result, pass the :per_page argument:

  results = Person.search "name:Foo", :per_page => 20

If you want to limit the number of results, pass the :limit argument (handy if you use the query for autocompletion):

    results = Person.search "name:Foo", :limit => 10

On class queries you can specifiy order options:

  results = Person.search "name:Foo", :order => :first_name
  results = Person.search "Fo*", :order => [:name, :first_name], :sort_decending => true

Please note that the order option is not available for global searches (XapianDb.search…)

Process the results

XapianDb.search returns a resultset object. You can access the number of hits directly:

  results.hits # Very fast, does not load the resulting documents; always returns the actual hit count
               # even if a limit option was set in the query

If you use a persistent database, the resultset may contain a spelling correction:

  # Assuming you have at least one document containing "mouse"
  results = XapianDb.search("moose")
  results.spelling_suggestion # "mouse"

The results behave like an array:

  doc = results.first
  puts doc.score.to_s         # Get the relevance of the document
  puts doc.indexed_class      # Get the type of the indexed object as a string, e.g. "Person"
  puts doc.name               # We can access the configured attributes
  person = doc.indexed_object # Access the object behind this doc (lazy loaded)

Use a search result with will_paginate in a view:

  <%= will_paginate @results %>

Or with kaminari:

  <%= kaminari @results %>

Facets

If you want to implement a simple drilldown for your searches, you can use a global facets query:

  search_expression = "Foo"
  facets = XapianDb.facets(search_expression)
  facets.each do |klass, count|
    puts "#{klass.name}: #{count} hits"

    # This is how you would get all documents for the facet
    # doc = klass.search search_expression
  end

A global facet search always groups the results by the class of the indexed objects. There is a class level facet query syntax available, too:

  search_expression = "Foo"
  facets = Person.facets(:name, search_expression)
  facets.each do |name, count|
    puts "#{name}: #{count} hits"
  end

At the class level, any attribute can be used for a facet query.

Transactions

You can execute a block of code inside a XapianDb transaction. This ensures that the changed objects in your block will get reindexed only if the block does not raise an exception.

  XapianDb.transaction do
    object1.save
    object2.save
  end

Bulk inserts / updates / deletes

When you change a lot of models, it is not very efficient to update the xapian index on each insert / update / delete. Instead, you can use the auto_indexing_disabled method with a block and rebuild the whole index afterwards:

  XapianDb.auto_indexing_disabled do
    Person.each do |person|
      # change person
      person.save
    end
  end
  Person.rebuild_xapian_index

Production setup

Since Xapian allows only one database instance to write to the index, the default setup of XapianDb will not work with multiple app instances trying to write to the same database (you will get lock errors). Therefore, XapianDb provides a solution based on beanstalk to overcome this.

1. Install beanstalkd

Make sure you have the {beanstalk daemon}[http://kr.github.com/beanstalkd/] installed

OSX

The easiest way is to use macports or homebrew:

  port install beanstalkd
  brew install beanstalkd

Debian (Lenny)

  # Add backports to /etc/apt/sources.list:
  deb http://ftp.de.debian.org/debian-backports lenny-backports main contrib non-free
  deb-src http://ftp.de.debian.org/debian-backports lenny-backports main contrib non-free

  sudo apt-get update
  sudo apt-get -t lenny-backports install libevent-1.4-2
  sudo apt-get -t lenny-backports install libevent-dev
  cd /tmp
  wget --no-check-certificate https://github.com/downloads/kr/beanstalkd/beanstalkd-1.4.6.tar.gz
  tar xvf beanstalkd-1.4.6.tar.gz
  cd beanstalkd-1.4.6/
  ./configure
  make
  sudo make install

2. Add the beanstalk-client gem to your config

  gem 'beanstalk-client' # Add this to your Gemfile
  bundle install

3. Install the beanstalk worker script

  rails generate xapian_db:install

4. Configure your production environment in config/xapian_db.yml

  production:
    database: db/xapian_db/production
    writer:   beanstalk
    beanstalk_daemon: localhost:11300

5. start the beanstalk daemon

  beanstalkd -d

6. start the beanstalk worker from within your Rails app root directory

  RAILS_ENV=production script/beanstalk_worker start

If everything is fine, you should find a file namend beanstalk_worker.pid in tmp/pids. If something goes wrong, you’ll find beanstalk_worker.log instead showing the stack trace.

Important: Do not start multiple instances of this daemon!

Vitals

Home https://github.com/garaio/xapian_db
Repository git://github.com/garaio/xapian_db.git
License Ruby's
Rating (4 votes)
Owner Kogler Gernot
Created 2 May 2011

Comments

Add a comment