Implement a Full Text Search Engine with Xapian

by Amit Mathur

Issue: Winter Jam

published in December 2009

Amit Mathur is a freelance programmer and creates web applications with Ruby on Rails.

In his nine-year professional career, Amit has worked with EDA, embedded software, distributed computing, web services and web applications. He also consults and provides training on web applications and Ruby on Rails. See more about Amit at: http://www.magnionlabs.com.

In his free time, Amit dabbles in home automation, robotics and sometimes sketching and painting. He lives with his wife and son in Bangalore, India. You can reach Amit at akmathur /at/ gmail /dot/ com or follow him on Twitter (akmags).

Introduction

Today’s Google-conditioned web users expect a search box on every site.

Having an accurate and fast search could be your app’s killer feature. In this article we will cover a way of adding one for your web application.

Definition

So what does a full text search engine mean in the context of a database-driven web application? Let us say you have a users table (and a User model) where you store all users in your system, with columns like name, email, hashed_password, brief_bio etc. You could allow visitors to your site to search the name and brief_bio columns and get back a list of users. Now suppose you also had a table called hobbies (and a model Hobby) and each user could have one or more hobbies (User has_many :hobbies). You could then allow visitors to search the name, brief_bio and hobby fields together and get back a list of users.

In summary, a full text search engine allows you to provide keywords and get back a list of models.
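To make the idea concrete, here is a minimal sketch in plain Ruby of the most naive way to do it: scan every record for every keyword. The SampleUser and SampleHobby structs are hypothetical in-memory stand-ins for the User and Hobby models, not ActiveRecord classes.

```ruby
# Hypothetical in-memory stand-ins for the User and Hobby models.
SampleHobby = Struct.new(:name)
SampleUser  = Struct.new(:name, :brief_bio, :hobbies)

# Return every user whose name, bio or hobby names contain all the keywords.
# Every record is scanned on every search, which is what an index avoids.
def naive_search(users, query)
  keywords = query.downcase.split
  users.select do |user|
    haystack = [user.name, user.brief_bio, *user.hobbies.map(&:name)].join(" ").downcase
    keywords.all? { |word| haystack.include?(word) }
  end
end

SAMPLE_USERS = [
  SampleUser.new("Asha Rao", "Stanford graduate", [SampleHobby.new("gardening")]),
  SampleUser.new("Ben Smith", "Freelance designer", [SampleHobby.new("stamp collecting")])
].freeze

puts naive_search(SAMPLE_USERS, "stanford gardening").map(&:name).inspect
```

This works for a handful of records, but the cost grows with the size of the table, which is the problem the next section addresses.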

Indexing

Searching the whole database every time can become inefficient if the database is large. To solve this problem, indexing is used. An indexer runs over the database periodically and incrementally updates a secondary database called an index or concordance. Whenever a search is requested, this index is searched instead of the original database. An indexer is also useful for more than just speeding up the search. We can configure it to ignore certain common words (called stop words) like “the” and “in”, or other application-specific words, and to treat variations of a word such as “stamp collector” and “stamp collection” as the same (called stemming). Treating a different word such as “philately” as equivalent goes beyond stemming and needs a synonym list.
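The following is a toy sketch of these ideas in plain Ruby, assuming a tiny hard-coded stop word list and a deliberately crude suffix-stripping stemmer. It is not how Xapian works internally; real engines use proper stemming algorithms and far more compact index structures. All names here are made up for illustration.

```ruby
STOP_WORDS = %w[the in a an of and].freeze

# Very crude stemmer for illustration only: strips one common suffix.
def crude_stem(word)
  word.sub(/(ion|or|ing|s)\z/, "")
end

# Turn text into index terms: lowercase, drop stop words, stem the rest.
def index_terms(text)
  text.downcase.scan(/[a-z0-9]+/)
      .reject { |word| STOP_WORDS.include?(word) }
      .map { |word| crude_stem(word) }
      .reject(&:empty?)
end

# Build a term => [document ids] map (an inverted index).
def build_index(documents)
  index = Hash.new { |hash, term| hash[term] = [] }
  documents.each do |id, text|
    index_terms(text).uniq.each { |term| index[term] << id }
  end
  index
end

# Find documents containing every query term, consulting only the index.
def search_index(index, query)
  terms = index_terms(query)
  return [] if terms.empty?
  terms.map { |term| index.fetch(term, []) }.reduce(:&)
end

DOCUMENTS = {
  1 => "The stamp collector in Bangalore",
  2 => "A stamp collection of rare issues",
  3 => "Gardening and painting"
}.freeze

INDEX = build_index(DOCUMENTS)
puts search_index(INDEX, "stamp collecting").inspect
```

Because “collector”, “collection” and “collecting” all reduce to the same stem, the query matches the first two documents even though neither contains the word “collecting”.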

Options for a Rails project

Firstly, most databases provide some sort of full text index and search. However, if you are using a MySQL database, you are most likely going to need an external search engine. MySQL’s built-in full text indexer works only for MyISAM tables and may not be fast or customizable enough for your application’s needs. If you are using some other database like Oracle or MS-Access, you may be able to get away with your DB’s built-in indexer. However, read on for what a cutting edge full text search engine can do.

For a full text search engine there are several possible options for a Rails project:

  • acts_as_indexed: A very simple to set up Rails plugin. Suitable for small applications.

  • acts_as_tsearch: A plugin which wraps the built-in full text search engine of PostgreSQL.

  • Ferret with acts_as_ferret plugin: Ferret is a Ruby rewrite of Apache’s legendary Lucene search engine.

  • Solr with acts_as_solr plugin: Solr is a search server based on Lucene which runs inside a Java Servlet container like Tomcat.

  • Sphinx with any of Ultrasphinx, Thinking Sphinx, acts_as_sphinx, or Sphincter plugins: Sphinx is a very powerful search engine with fast indexing.

  • Hyper Estraier with search_do plugin: Hyper Estraier is another good open source option. It has a P2P architecture but lacks some bells and whistles like spelling correction and stemming.

  • Xapian with acts_as_xapian plugin: A quite powerful search engine. We will consider this in detail here.

All the above options are mature and are used in production by multiple sites. Like everything else, they each have their upsides and downsides, and the debates about them sometimes get polarized. However, let me explain Xapian with acts_as_xapian, which is my favorite of the lot. Xapian (http://xapian.org) is a GPL search engine library. It is written in C++ (read: very fast), is highly scalable and comes with lots of killer features, some of which we will explore here. For example, you can use Google-like query syntax (aka “search commands”) and do searches like site:www.example.com, allinurl:energy or last_name:smith. Acts_as_xapian (http://groups.google.com/group/acts_as_xapian) is a Rails plugin with a pretty straightforward interface. It integrates Xapian into ActiveRecord. It was started at http://github.com/frabcus/acts_as_xapian/tree/master and then moved to http://github.com/Overbryd/acts_as_xapian/tree/master (the cutting edge now seems to be at http://github.com/xspond/acts_as_xapian).

Let us see how we can integrate Xapian in a Rails project.

Firstly, you will have to install Xapian. Depending on your Linux distribution, you may be able to install a pre-built package. For example, if you are running Ubuntu:

$ sudo apt-get install libxapian15 libxapian-ruby1.8

Otherwise, download, compile and install xapian-core and xapian-bindings from http://xapian.org/download.

Install the acts_as_xapian plugin:

$ ruby script/plugin install http://github.com/Overbryd/acts_as_xapian/tree/master

Generate the migration:

$ ruby script/generate acts_as_xapian
$ rake db:migrate   # this creates a table called acts_as_xapian_jobs

Create file config/xapian.yml specifying the path to store the Xapian DB:

$ cat config/xapian.yml
development:
  base_db_path: 'tmp/xapian'
production:
  base_db_path: '../../shared/xapian'
test:
  base_db_path: 'tmp/xapian'
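As a quick sanity check of such a file, the sketch below parses the same YAML and expands the path for one environment. The xapian_db_path helper and the release directory name are hypothetical, and resolving the path against the application root is an assumption made here for illustration, not a statement of how the plugin resolves it.

```ruby
require "yaml"

XAPIAN_YML = <<~YAML
  development:
    base_db_path: 'tmp/xapian'
  production:
    base_db_path: '../../shared/xapian'
  test:
    base_db_path: 'tmp/xapian'
YAML

# Hypothetical helper: look up base_db_path for an environment and expand it
# against an application root (the resolution rule is an assumption).
def xapian_db_path(yaml_text, environment, app_root)
  config = YAML.safe_load(yaml_text)
  File.expand_path(config.fetch(environment).fetch("base_db_path"), app_root)
end

# The release directory below is a made-up example of a versioned deploy layout.
puts xapian_db_path(XAPIAN_YML, "production", "/var/www/rails/apps/myproject/releases/20091201")
```

Under that assumption, a path like ../../shared/xapian keeps the production index in a shared directory outside any single release, so it survives redeploys.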

Now you are all set to start indexing your models. Continuing with our users table example, this is how you would tell Xapian to index the name and brief_bio columns:

class User < ActiveRecord::Base
  acts_as_xapian :texts => [:name, :brief_bio]
end

Then, to build the index for the first time, do this:

$ RAILS_ENV=development rake xapian:rebuild_index models="User"

and to update the index anytime later:

$ RAILS_ENV=development rake xapian:update_index

and all your database changes will be reflected in the index. From within your controller, wherever you want to return the search results to the user:

@search = ActsAsXapian::Search.new([User], params[:q])

This assumes the string the user wants to search for is in params[:q]. The search object returns the results along with some additional meta information:

  • A count of approximate number of matches: @search.matches_estimated

  • A “Did you mean” type suggested query: @search.spelling_correction

  • A list of words you can highlight using Rails’ highlight helper: @search.words_to_highlight

To collect the models returned as search results:

@found_users = @search.results.collect { |result| result[:model] }

Xapian also returns a relevancy percentage with each result which you can print in your view:

<% @search.results.each do |result| %>
  <%# print the model: result[:model] %>
  <%# print relevancy: result[:weight] %>
  <%# or print relevancy as a percentage: result[:percentage] %>
<% end %>

If you want to run a “similar search”, do this:

@similar_users = ActsAsXapian::Similar.new([User], @found_users).results.collect { |result| result[:model] }

Specifying what to index

The full syntax of the acts_as_xapian declaration is:

acts_as_xapian :texts => [ :name, :brief_bio ],
               :values => [ [ :created_at, 0, "created_at", :date ],
                            [ :age, 1, "age", :number ] ],
               :terms => [ [ :hobbies, "H", "allinhobby" ] ],
               :if => :available_for_search

Every model attribute which needs indexing must be specified under one of the :texts, :values or :terms arguments. Those specified under :values and :terms get a search command, like created_at, age and allinhobby above. For example, you could search for stanford allinhobby:gardening to find users with stanford in the name or brief_bio columns and gardening as a hobby. The attributes specified under :values can also be range searched, as in created_at:01/10/2009..01/11/2009.
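To show the shape of such a query, here is a small plain Ruby sketch that splits a query string into free text words, prefixed commands and ranges. It only mimics the surface syntax for illustration. It is not Xapian's query parser, and split_query is a made-up name.

```ruby
# Split a query into free-text words, prefix commands and range commands.
# Illustration of the syntax only; Xapian does the real parsing itself.
def split_query(query)
  parts = { :words => [], :commands => {}, :ranges => {} }
  query.split.each do |token|
    prefix, value = token.split(":", 2)
    if value.nil?
      parts[:words] << token                 # plain keyword
    elsif value.include?("..")
      parts[:ranges][prefix] = value.split("..", 2)   # prefix:from..to
    else
      parts[:commands][prefix] = value       # prefix:value
    end
  end
  parts
end

p split_query("stanford allinhobby:gardening created_at:01/10/2009..01/11/2009")
```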

You can use the :if parameter to keep records which you don’t want to show up in search results, e.g. admin users, out of the index. Just return false from the available_for_search method for those records.
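A minimal sketch of such a method, assuming the model has a boolean admin attribute (an assumption, since the users table described earlier has no such column). SearchableUser is a hypothetical plain Ruby stand-in for the User model.

```ruby
# Hypothetical stand-in for the User model with an assumed admin flag.
SearchableUser = Struct.new(:name, :admin) do
  # The method named by :if => :available_for_search.
  # Returning false is meant to keep the record out of the index.
  def available_for_search
    !admin
  end
end

puts SearchableUser.new("Site Admin", true).available_for_search
puts SearchableUser.new("Regular User", false).available_for_search
```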

Indexing more than just database columns

The elements of the arrays specified with :texts, :values and :terms need not be attributes of the model. You can specify any method in the model. It is perfectly fine to do:

:texts => [:last_name]

where last_name is a method in the User model which just returns the last name string from the name attribute.
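One way such a method might look, sketched on a hypothetical NamedUser struct rather than the real ActiveRecord model, and assuming the last whitespace-separated word of name is the last name:

```ruby
# Hypothetical stand-in for the User model.
NamedUser = Struct.new(:name) do
  # Derive the last name from the full name; returns "" when name is blank.
  def last_name
    name.to_s.split.last.to_s
  end
end

puts NamedUser.new("Amit Mathur").last_name
```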

Grouping with :values

The attributes specified under :values can be range searched or used for collapsing, like the GROUP BY clause in SQL. For example:

@search = ActsAsXapian::Search.new([User], params[:q], :collapse_by_prefix => "age")

will return only one result for each value of age. Also, result[:collapse_count] will give you the number of records collapsed into that result.
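The effect of collapsing can be pictured with this plain Ruby sketch over an already ranked list of hashes. The data and the collapse_results helper are made up for illustration. Xapian does the collapsing inside the engine, and its count may be an estimate.

```ruby
# Made-up results, already ordered by relevance (best first).
RANKED_RESULTS = [
  { :name => "Asha",   :age => 30, :weight => 0.9 },
  { :name => "Ben",    :age => 25, :weight => 0.8 },
  { :name => "Chitra", :age => 30, :weight => 0.5 }
].freeze

# Keep the best-ranked result per key and count how many others were folded into it.
def collapse_results(results, key)
  results.group_by { |result| result[key] }.map do |_value, group|
    group.first.merge(:collapse_count => group.size - 1)
  end
end

collapse_results(RANKED_RESULTS, :age).each do |result|
  puts "#{result[:name]} (age #{result[:age]}), #{result[:collapse_count]} more collapsed"
end
```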

Search Commands

The syntax for the :terms option may first appear a bit cryptic. Here’s an explanation:

:terms => [ [ :hobbies, "H", "allinhobby" ], [ :lastname, "L", "lastname" ] ]

The argument must be an array of arrays. Each sub-array must have three elements: an attribute or method name which returns a string, an arbitrary single upper case letter which must be unique, and a prefix which acts as the search command.

If you have multiple models with different attributes all of which you want to search with the same command, you can use the same character and prefix in both models to do that.

You can either directly document the search commands or use them to build an advanced search form and internally convert the fields from there to the search commands.
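For the second approach, a sketch along these lines could turn the fields of an advanced search form into a query string. The form field names (keywords, hobby, last_name) and the build_query helper are hypothetical. The prefixes are the ones from the :terms example above.

```ruby
# Hypothetical mapping from advanced-search form fields to search commands.
FIELD_PREFIXES = { "hobby" => "allinhobby", "last_name" => "lastname" }.freeze

# Build a query string from a hash of form fields, skipping blank ones.
def build_query(form)
  parts = []
  keywords = form["keywords"].to_s.strip
  parts << keywords unless keywords.empty?
  FIELD_PREFIXES.each do |field, prefix|
    value = form[field].to_s.strip
    next if value.empty?
    # One command per word, e.g. allinhobby:gardening
    value.split.each { |word| parts << "#{prefix}:#{word}" }
  end
  parts.join(" ")
end

puts build_query({ "keywords" => "stanford", "hobby" => "gardening", "last_name" => "" })
```

The resulting string can then be passed to ActsAsXapian::Search.new in place of params[:q].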

Index updates

Since the index is not updated automatically as new data comes into your database, you will have to trigger the update yourself. The most common way is to set up a cron job to run the rake task periodically. Remember, the search results will not include new data coming into the database until indexing has run. Xapian has a fast indexer, but experiment to find the optimal interval for you. For one project, I have this in my crontab:

$ crontab -l
#
# MIN(0-59) HOUR(0-23) Day-of-month(1-31) MONTH(1-12) DOW(0-6) COMMAND
#
30 * * * * (cd /var/www/rails/apps/myproject/current; rake xapian:update_index RAILS_ENV=production) >>/var/log/rails/apps/myproject/shared/log/xapian_update_index.log

Tips and tricks

There are several other nifty things you can do with Xapian. Here are some notable ones:

  • Boolean and wildcard queries: You can use “and”, “or”, “+” and “-” and the wildcard “*” in search queries, as you have come to expect from Google. For details about the syntax see here: http://xapian.org/docs/queryparser.html.

  • Pagination: While calling ActsAsXapian::Search.new, you can specify :offset to specify offset of the first result and :limit to get desired number of results per page.

  • Multiple models: If you need to search multiple models, acts_as_xapian does that out of the box and returns the results mixed together by relevancy, e.g. @search = ActsAsXapian::Search.new([User, Vehicle], params[:q])
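For the pagination tip, a small helper like the following could translate a page number into the :offset and :limit options. The helper name, the page parameter and the default page size are assumptions for illustration.

```ruby
# Translate a 1-based page number into the :offset/:limit pair.
# Invalid or missing page numbers fall back to the first page.
def pagination_options(page, per_page = 20)
  page = [page.to_i, 1].max
  { :offset => (page - 1) * per_page, :limit => per_page }
end

p pagination_options("3", 10)
```

In a controller this might be used as ActsAsXapian::Search.new([User], params[:q], pagination_options(params[:page])).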

Licensing

Xapian and its bindings are licensed under GPL v2. So, depending on whether you are planning to use it with a commercial web application (software as a service model) or as part of commercial installed software, you should understand your obligations under it. However, the Xapian team is working towards releasing it under a more permissive license.

Conclusion

Xapian makes search easy, reliable and quite zappy. It presently supports only offline indexing, but that is a limitation of the acts_as_xapian plugin rather than of Xapian itself. If you are feeling motivated, you can always go ahead and add real time indexing.

Xapian is in use in several large installations like debian.org, gmane.org, delicious.com and OLPC’s desktop search (full list here: http://xapian.org/users.php) and, with acts_as_xapian, is deployed on whatdotheyknow.com, mindbites.com, shipx.in and many more sites. For any questions, head to the Google group: http://groups.google.com/acts_as_xapian.