Why Full Text’s CONTAINS Queries Are So Slow

SQL Server’s full text search is amazing. Well, it amazes me at least – it has so many cool capabilities: looking for prefixes, words near each other, different verb tenses, and even thesaurus searches. However, that’s not how I see most people using it: I’ve seen so many shops using it to match specific strings, thinking it’ll be faster than LIKE '%mysearch%'. That works at small scale, but as your data grows, you run into a query plan performance problem.

When your query uses CONTAINS, SQL Server has a nasty habit of doing a full text search across all of the rows in the table rather than using the rest of your WHERE clause to reduce the result set first.

This isn’t a big deal for small text data sets, say under 10GB, but your query time grows linearly right along with your data. By the time your data reaches real production sizes, full text can get uncomfortably painful.

To illustrate it, let’s say I’m using the Stack Overflow database, and I want to optimize this query:
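The query itself was a screenshot in the original, so here’s a hedged reconstruction of its shape – the dates, TOP count, and column list are my placeholders, assuming the standard Stack Overflow dbo.Posts table:

```sql
/* Hypothetical reconstruction: recent posts whose title mentions SQL.
   The dates and TOP count are placeholders, not the post's originals. */
SELECT TOP 100 p.Id, p.Title, p.CreationDate
FROM dbo.Posts AS p
WHERE p.CreationDate >= '2018-06-01'
  AND p.CreationDate <  '2018-06-08'
  AND p.Title LIKE '%SQL%'
ORDER BY p.CreationDate;
```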

The query runs in milliseconds if I have an index on CreationDate and Title:
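That index was an image too; a minimal version (the index name is my invention) would look like:

```sql
/* CreationDate first so SQL Server can seek straight to the date range,
   with Title in the key so the LIKE filter runs before any key lookups. */
CREATE INDEX IX_CreationDate_Title
    ON dbo.Posts (CreationDate, Title);
```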

Because SQL Server chooses to seek to my specific date first, then check the posts’ titles, as the actual query plan shows:


That’s awesome! Nobody’s gonna complain about that speedy query. If you hover your mouse over the Index Seek, it had to read 4,510 rows to find 100 rows that had “SQL” in the title – but that’s not terrible. It only did 100 key lookups, too. It’s super-quick:

However, try the same thing with full text.

I’m going to set up full text search and create a full text index on the Title column:
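The setup script didn’t survive the copy; here’s a hedged sketch, assuming dbo.Posts already has a unique clustered primary key (the catalog and KEY INDEX names below are guesses):

```sql
/* Full text indexes require a full text catalog plus a unique,
   non-nullable key index on the table. */
CREATE FULLTEXT CATALOG PostsCatalog AS DEFAULT;

CREATE FULLTEXT INDEX ON dbo.Posts (Title)
    KEY INDEX PK_Posts_Id
    ON PostsCatalog;
```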

If you try that yourself, be aware that the above query will finish instantly, but it’s kicking off a full text index population in the background. That takes way, way longer to complete than the nonclustered index did, so while you wait, here’s how to check the status of your full text index populations.
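If you’d rather eyeball the population yourself, one way is the sys.dm_fts_index_population DMV:

```sql
/* Shows in-flight full text populations: what kind of crawl is running,
   its current status, and roughly how much work remains. */
SELECT OBJECT_NAME(table_id) AS table_name,
       population_type_description,
       status_description,
       completed_range_count,
       outstanding_batch_count
FROM sys.dm_fts_index_population;
```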

When the full text population finishes, try the full text version of the query with CONTAINS:
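That query was also shown as an image; since the plan reports a search for java, the CONTAINS version presumably looked something like this sketch (dates are placeholders):

```sql
/* Same shape as the LIKE query, but with CONTAINS doing the title match. */
SELECT TOP 100 p.Id, p.Title, p.CreationDate
FROM dbo.Posts AS p
WHERE p.CreationDate >= '2018-06-01'
  AND p.CreationDate <  '2018-06-08'
  AND CONTAINS(p.Title, 'java')
ORDER BY p.CreationDate;
```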

And the query is slower because it does more work:

Because of the different plan:

The plan operators, top to bottom:

  1. SQL Server does an index seek on CreationDate, but it reads more rows
  2. It does 8,874 key lookups – up from just 100
  3. The full text search finds all posts with java in the title, regardless of date – producing 440,615 rows

Complex searches are even worse.

Let’s throw in a simple “and”, looking for both java and sql:
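A sketch of that version – note that CONTAINS takes both keywords inside one search condition string, rather than two separate predicates:

```sql
/* Placeholder dates; the AND lives inside the CONTAINS search condition. */
SELECT TOP 100 p.Id, p.Title, p.CreationDate
FROM dbo.Posts AS p
WHERE p.CreationDate >= '2018-06-01'
  AND p.CreationDate <  '2018-06-08'
  AND CONTAINS(p.Title, 'java AND sql')
ORDER BY p.CreationDate;
```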

The query takes over 20 seconds thanks to an imbalance of work across cores:

And the actual plan shows just how much wasted work that is when we’re only trying to get 100 rows:

Why does SQL Server do it this way? I dunno – it just always has. It finds all the data that matches, and then joins to the rest of the data you’re searching for. This isn’t a new problem – it’s been the same way for over a decade – and it’s one of the reasons Stack Overflow had to move their search out of SQL Server as well.

Just for laughs, if I use LIKEs with leading % signs:
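Meaning a sketch along these lines, with each keyword as a plain wildcard string match:

```sql
/* Placeholder dates; two separate LIKE predicates replace CONTAINS. */
SELECT TOP 100 p.Id, p.Title, p.CreationDate
FROM dbo.Posts AS p
WHERE p.CreationDate >= '2018-06-01'
  AND p.CreationDate <  '2018-06-08'
  AND p.Title LIKE '%java%'
  AND p.Title LIKE '%sql%'
ORDER BY p.CreationDate;
```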

That finishes in less than a second, with 2,426 logical reads. Here’s the actual plan.

Full text search doesn’t integrate well in query plans.

If you have a really simple query and you’re filtering for very unusual keywords, then full text search is pretty dang good. It brings powerful capabilities to SQL Server built right into the box.

However, the more complex your query plan becomes, like the more filtering that’s being done on other tables, AND the more common your search keywords are, the angrier you’ll become with full text performance. SQL Server will choose to perform the full text search across all of the rows regardless of your other filters, then drag a ton of data back into the execution plan and force the rest of the operators to deal with it.

In this particular example, a LIKE with a leading % sign – normally a DBA’s enemy – is actually the performant solution. I wanna stress, though, that it isn’t a solution that works in the majority of shops.

In other situations, I’ve had to point folks toward:

  • Building a terms table – having people check boxes to pick from common terms, or having your app examine their search string and redirect over to the terms table rather than using full text search
  • Paul White’s trigram search solution – which requires you to build a separate table, keep it in sync, and rewrite your app’s searches to point at it
  • Elasticsearch – but then it’s left up to you to keep SQL Server and Elasticsearch’s contents in sync, and to rewrite your app’s searches to point at Elastic rather than SQL Server
  • Scaling out with Availability Group readable replicas – which costs a fortune, since they have to be Enterprise Edition, but at least you can scale the search workloads out without changing your application, or it can buy you time while you rewrite the search part of your app

None of those solutions are easy or cheap.

Unlike my Black Friday sale…

If you like tricks like this, you’ll love my Mastering Query Tuning class. I’ve got two upcoming sessions: December 11-13 (Fri/Sat/Sun, iCal) and January 12-14 (Tues/Weds/Thurs, iCal). Folks with a Live Class Season Pass are welcome to drop in anytime, or just watch the Instant Replays on a schedule that works for them.

If you’ve got a Live Class Season Pass, you can drop in on this class or any of my live online classes at any time. Just head to my current training page, check out the schedule at the bottom of the page, and grab the calendar files for the classes you’re interested in. You don’t have to register ahead of time – just drop in anytime I’m streaming.

If you don’t have one yet, I’m running a Black Friday Sale this month. Get access to all of my live classes for an entire year straight for less than what it costs to go to a one-time conference.

See you in class!


7 Comments

  • Thanks for this, I’ve been waiting for it since you said you were going to blog about it.
    We are upgrading IBM Maximo to 7.6 switching from Oracle to MS SQL Server and the DB has been configured using the full-text search.
    I only noticed it because they were leaving the indexes in place while transferring the data on the ‘workorder’ table, which is very wide and has lots of rows 🙂 (the table is stupid wide with text fields and is around 20 million rows). I created a couple of scripts to delete and recreate the indexes, but they failed due to the full text indexes. So I rewrote them to disable and then rebuild, but I think they altered the transfer process to create the indexes after the inserts (the inserts seem like they are encapsulated in a transaction, one row at a time).
    Thanks again but it also makes me feel depressed about the performance.

  • Great link to Paul White’s trigrams: now we know why so many autocomplete UIs only start suggesting after 3 characters…

  • Another vote for Elasticsearch here 🙂 These days it’s really about using the right tool for the job — while not dying maintaining too many tools.

    • I agree – Enterprise Search Software. Unfortunately that means not SQL Server. I was musing back in 2016, after learning about Azure DataWarehouse and its hybrid integration with Hadoop, whether that meant some version of Lucidworks was coming to SQL Server. I guess not.

  • Anyone got any best practice guides / examples for syncing SQL Server tables with Elasticsearch? I’ll shortly be tackling that problem and would love to know how people went about it and what problems they faced.

