This will by no means be an exhaustive list of differences between using Sphinx and using Ferret, but we’ll look at a few major differences between the way these two search engines are implemented via acts_as_ferret (AAF) and Thinking Sphinx (TS).
First things first: TS is under much more active development. At the time of writing, the last commit to the official TS repo was 13 days ago, compared with three months ago for AAF. And it shows. Online discussion of acts_as_ferret is minimal these days, and most of the tutorials are rather old. Meanwhile, Thinking Sphinx has a very active Google group and several more recent tutorials, including a fairly recent Railscast.
So Thinking Sphinx wins in terms of active development.
Another place TS blows AAF out of the water is in speed and resource usage. Sphinx uses kilobytes of memory where a ferret daemon will sit on megabytes, having to load your entire Rails app into memory. For example, on my machine, the Sphinx daemon sat at 376 KB while my ferret process ate 57.69 MB. Not kidding.
Ferret is unstable in production. Segfaults, corrupted indexes galore. We’ve switched around 40 clients from ferret to sphinx and solved their problems this way. I will never use ferret again after all the problems I have seen it cause people’s production apps. Plus sphinx can reindex many many times faster than ferret and uses less cpu and memory as well.
Anecdotally, that’s my experience as well. Thinking Sphinx can index my database in less than a minute, while acts_as_ferret can take up to 30 minutes or more.
Simply put, acts_as_ferret obeys ActiveRecord, while Thinking Sphinx goes low-level.
In his Thinking Sphinx Peepcode PDF, Pat Allan writes, “For those familiar with Ferret, Sphinx is quite similar, except that Sphinx talks directly to database servers – both MySQL and PostgreSQL – to obtain the data to index.”
This is largely what gives Sphinx its speed advantage, but it also makes Thinking Sphinx dumb as far as your ActiveRecord models are concerned.
For instance, this means that TS isn’t aware of your acts_as_paranoid models until you add the deleted_at conditional to your define_index block.
define_index do
  indexes [:body, :title]
  where ['deleted_at is NULL']
end
This also means that TS can’t index computed values as easily. In AAF you can index methods on your object, so you could index a method like the following:
def ordinalized_names_of_children
  # Pair each child's first name with its birth-order ordinal ("1st", "2nd", ...).
  # ordinalize comes from ActiveSupport; the index is shifted so counting starts at 1.
  children.sort_by(&:birth_date).each_with_index.map do |child, i|
    [child.first_name, (i + 1).ordinalize]
  end
end
This is a silly example, but to accomplish the same with TS you need to use db-specific string transformations and add all your conditional logic to the query as well. And you can easily imagine more complicated examples. Where with AAF you have the entire landscape of Ruby to use and abuse and you’re instantly inheriting the constraints of ActiveRecord, with TS you’re limited to what can be done solely on the database level. Luckily this is usually enough.
As far as ease of handling updates goes, acts_as_ferret has a big initial advantage. From Gregg Pollack:
the index gets modified every time you add/edit/remove the ActiveRecord model it’s associated with. You never have to worry about doing this yourself, it happens automatically, so your search index is always 100% accurate. No rebuilding needed.
With Thinking Sphinx, you need to specify something called delta indexes on the models you want to keep up-to-date for searches between index rebuilds. This is a little more intrusive than AAF’s approach, since you also have to add a field called “delta” to your table to track what has been updated… but a single boolean field doesn’t incur much overhead. You’ll still need to rebuild your indexes periodically, as the delta indexes can slow things down over time.
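To make the delta idea a bit more concrete, here is a toy model in plain Ruby. It is only a sketch of the concept, not Thinking Sphinx code: the class `ToyDeltaIndex` and its methods `rebuild`, `touch`, `delta_size` and `search` are all made up for illustration. The assumption is that there is a big core index built at rebuild time, plus a small delta index holding whatever has been flagged since.

```ruby
# Toy model of the delta-index idea. Everything here is hypothetical and
# only illustrates the concept; it is not how Thinking Sphinx is implemented.
class ToyDeltaIndex
  def initialize(records)
    @records = records # array of hashes with :id and :title
    rebuild
  end

  # Full rebuild: snapshot every record into the core index and
  # clear all delta flags, which empties the delta index.
  def rebuild
    @core = {}
    @records.each do |record|
      record[:delta] = false
      @core[record[:id]] = record[:title]
    end
    @delta = {}
  end

  # Called when a record changes: flag it, then re-index only flagged rows.
  def touch(record)
    record[:delta] = true
    @delta = {}
    @records.each do |r|
      @delta[r[:id]] = r[:title] if r[:delta]
    end
  end

  # The delta index keeps growing until the next full rebuild.
  def delta_size
    @delta.size
  end

  # Delta entries override stale core entries with the same id.
  def search(term)
    @core.merge(@delta).select { |_id, title| title.include?(term) }.keys.sort
  end
end

robots = [
  { :id => 1, :title => "red robot" },
  { :id => 2, :title => "blue robot" }
]
index = ToyDeltaIndex.new(robots)

robots[0][:title] = "green robot"
stale = index.search("green") # the change has not been flagged yet
index.touch(robots[0])
fresh = index.search("green") # now served from the delta index
puts "before flagging: #{stale.inspect}, after flagging: #{fresh.inspect}"
```

In this toy version, every flagged record stays in the delta index until `rebuild` runs, which is the same reason the real delta indexes get slower over time and need a periodic full reindex.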
In both AAF and TS, deleted models are immediately removed from the index.
To sum up the differences:
From my experience, very few model updates need to be instantly available for search, and both approaches have their pros and cons. Though it requires slightly more work on your part, I feel TS puts more control in your hands.
The winner here is obviously Thinking Sphinx. You use fewer resources and get better speed and reliability, and the future of its support looks a lot more certain. Sure, you may have to get your hands a little dirtier with some SQL, but the benefits more than make up for it.
Also (and I get nothing out of this), you should buy the Peepcode PDF as it will give you a huge head start on Thinking Sphinx.
There’s another thing Ferret can do that Sphinx cannot. As Section 3.5 of the Sphinx documentation states, “ALL DOCUMENT IDS MUST BE UNIQUE UNSIGNED NON-ZERO INTEGER NUMBERS (32-BIT OR 64-BIT, DEPENDING ON BUILD TIME SETTINGS).” And Thinking Sphinx enforces this in its config file by performing math on your table’s id column to help create unique sphinx index ids.
99 times out of 100 this is fine, since most tables just have auto-incremented integer ids anyway, but what if you have tables with ids of significant value? That’s the situation I found myself in when adopting Thinking Sphinx on my current project. We have a ton of external data coming in, and much of that data already has a GUID. So we decided early on to use the GUID as both primary key and foreign key. That would allow us to later recreate any table without having to worry about the foreign key integrity issues that can sometimes be a taxing side-effect of using auto-incremented ids.
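The id math mentioned above can be pictured roughly like this. This is a sketch under my own assumptions, not code lifted from Thinking Sphinx: the helper names `sphinx_document_id` and `decode_document_id`, the model list, and the exact formula (id times the number of indexed models, plus a per-model offset) are illustrative. It also shows why a GUID string has nowhere to go in this scheme.

```ruby
# Toy illustration only: the method names and the exact formula are
# assumptions, not code copied from Thinking Sphinx.

# Combine a record's integer primary key with a per-model offset so that
# rows from different tables never share a Sphinx document id.
def sphinx_document_id(record_id, model_offset, model_count)
  unless record_id.is_a?(Integer) && record_id > 0
    raise ArgumentError, "record_id must be a positive integer"
  end
  record_id * model_count + model_offset
end

# Reverse the arithmetic: recover [record_id, model_offset] from a document id.
def decode_document_id(document_id, model_count)
  document_id.divmod(model_count)
end

INDEXED_MODELS = %w[Article Comment Robot] # hypothetical models

INDEXED_MODELS.each_with_index do |model, offset|
  doc_id = sphinx_document_id(42, offset, INDEXED_MODELS.size)
  puts "#{model} row 42 -> Sphinx document id #{doc_id}"
end

# A GUID primary key cannot go through this arithmetic at all.
begin
  sphinx_document_id("3f2504e0-4f89-11d3-9a0c-0305e82c3301", 2, INDEXED_MODELS.size)
rescue ArgumentError => e
  puts "GUID key rejected: #{e.message}"
end
```

Because the offset is always smaller than the number of models, `divmod` can split a document id back into the original row id and the model it came from.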
My first approach to overcoming this limitation was to add an auto-incremented column named “id” to the table and then make use of set_primary_key in Rails. Unfortunately, once you do that, Thinking Sphinx tries to use the primary key you specified, which is the GUID rather than the integer column. So Thinking Sphinx had to be patched. Essentially, I added a method, set_sphinx_primary_key, that lets you specify a primary key for TS to use regardless of what the ActiveRecord model specifies as its primary key.
So in the example:
class Robot < ActiveRecord::Base
  # The key ActiveRecord will use on joins, map to id, etc.
  # Setting the primary key isn't necessary for set_sphinx_primary_key to work
  set_primary_key :internal_id

  # The key sphinx will use for indexing, must be a unique integer
  set_sphinx_primary_key :alternate_primary_key

  define_index do
    indexes :name
  end
end
ActiveRecord will use the field internal_id on the “robots” table (the set_primary_key could just as well be left out and ActiveRecord would use the default “id”). But while ActiveRecord uses internal_id, Thinking Sphinx will instead use alternate_primary_key. So our robots can internally use a GUID string while Sphinx is still provided with the integer column it needs to index the robots.
You can find these updates in my github branch of Thinking Sphinx. I have no idea if Pat will ever merge these into the main repo, as it is admittedly a niche need. But if you find yourself in the situation I found myself in, it can really help you overcome this limitation of Sphinx.
Update: Due to popular (or at least some) demand, my changes are merged into the Thinking Sphinx master branch. Enjoy.