Why Full Text’s CONTAINS Queries Are So Slow

T-SQL
14 Comments

SQL Server’s full text search is amazing. Well, it amazes me at least – it has so many cool capabilities: looking for prefixes, words near each other, different verb tenses, and even thesaurus searches. However, that’s not how I see most people using it: I’ve seen so many shops using it for matching specific strings, thinking it’s going to be faster than LIKE ‘%mysearch%’. That works at small scale, but as your data grows, you run into a query plan performance problem.

When your query uses CONTAINS, SQL Server has a nasty habit of doing a full text search across all of the rows in the table rather than using the rest of your WHERE clause to reduce the result set first.

This isn’t a big deal for small text data sets, like under 10GB, but as your data grows linearly, your query time goes up linearly too. By the time your data grows to real production sizes, full text can get uncomfortably painful.

To illustrate it, let’s say I’m using the Stack Overflow database, and I want to optimize this query:

The query runs in milliseconds if I have an index on CreationDate and Title:

Because SQL Server chooses to seek to my specific date first, then check the posts’ titles, as the actual query plan shows:

 

That’s awesome! Nobody’s gonna complain about that speedy query. If you hover your mouse over the Index Seek, it had to read 4,510 rows to find 100 rows that had “SQL” in the title – but that’s not terrible. It only did 100 key lookups, too. It’s super-quick:

However, try the same thing with full text.

I’m going to set up full text search and create a full text index on the Title column:

If you try that yourself, be aware that the above query will finish instantly, but it’s kicking off a full text index population in the background. That takes way, way longer to complete than the nonclustered index did, so while you wait, here’s how to check the status of your full text index populations.

When the full text population finishes, try the full text version of the query with CONTAINS:

And the query is slower because it does more work:

Because of the different plan:

The plan operators, top to bottom:

  1. SQL Server does an index seek on CreationDate, but it reads more rows
  2. It does 8,874 key lookups – up from just 100
  3. The full text search finds all posts with java in the title, regardless of date – producing 440,615 rows

Complex searches are even worse.

Let’s throw in a simple “and”, looking for both java and sql:

The query takes over 20 seconds thanks to an imbalance of work across cores:

And the actual plan shows the inefficiencies of work when we’re only trying to get 100 rows:

Why does SQL Server do it this way? I dunno – it just always has. It finds all the data that matches, and then joins to the rest of the data you’re searching for. This isn’t a new problem – it’s been the same way for over a decade – and it’s one of the reasons Stack Overflow had to move their search out of SQL Server as well.

Just for laughs, if I use LIKEs with leading % signs:

That finishes in less than a second, with 2,426 logical reads. Here’s the actual plan.

Full text search doesn’t integrate well in query plans.

If you have a really simple query and you’re filtering for very unusual keywords, then full text search is pretty dang good. It brings powerful capabilities to SQL Server built right into the box.

However, the more complex your query plan becomes, like the more filtering that’s being done on other tables, AND the more common your search keywords are, the angrier you’ll become with full text performance. SQL Server will choose to perform the full text search across all of the rows regardless of your other filters, then drag a ton of data back into the execution plan and force the rest of the operators to deal with it.

In this particular example, a like with a leading % sign – normally a DBA’s enemy – is actually a performant solution. I wanna stress, though, that that isn’t a solution that works in the majority of shops.

In other situations, I’ve had to point folks toward:

  • Building a terms table – and have people check boxes to pick from common terms, or have your app examine their search string to redirect over to the terms table rather than use full text search
  • Paul White’s trigram search solution – which requires you to build a separate table, keep it in sync, and rewrite your app’s searches to point at it
  • Elasticsearch – but then it’s left up to you to keep SQL Server and Elasticsearch’s contents in sync, and to rewrite your app’s searches to point at Elastic rather than SQL Server
  • Scaling out with Availability Group readable replicas – which costs a fortune, since they have to be Enterprise Edition, but at least you can scale the search workloads out without changing your application, or it can buy you time while you rewrite the search part of your app

None of those solutions are easy or cheap.

Previous Post
Free Webcast: Why is the Same Query Sometimes Slow?
Next Post
Announcing a New Live Online Class: Fundamentals of TempDB

14 Comments. Leave new

  • Thanks for this, I’ve been waiting for it since you said you were going to blog about it.
    We are upgrading IBM Maximo to 7.6 switching from Oracle to MS SQL Server and the DB has been configured using the full-text search.
    I only noticed it as they were leaving the indexes in place while they are transferring the data on the ‘workorder’ table which is very wide and has lots of rows 🙂 (the table is stupid wide with text fields and is around 20 million rows). I created a couple of scripts to delete and recreate the indexes but they failed due to the full-text indexes. So I rewrote them to disable and then rebuild but I think they altered the transfer process to create the indexes after the inserts (the inserts seems like the are encapsulated in a transaction with one row at a time).
    Thanks again but it also makes me feel depressed about the performance.

    Reply
  • Great link to Paul White trigrams: now we know why in many autocomplete UI starts to suggest only from 3 chars….

    Reply
  • Alex Friedman
    November 8, 2020 9:13 am

    Another vote for Elasticsearch here 🙂 These days it’s really about using the right tool for the job — while not dying maintaining too many tools.

    Reply
    • I agree, Enterprise Search Software. Unfortunately that means not SQL Server. I was musing back in 2016 after learning about Azure DataWarehouse and it’s hybrid integration with Hadoop whether or not that meant some version of Lucidworks was coming to SQL Server, I guess not.

      Reply
  • Anyone got any best practice guides / examples for sync’ing SQL Server tables with ElasticSearch? I’ll be shortly tacking that problem and would love to know how people went about it and what problems they faced.

    Reply
  • Would Azure Cognitive Search be a good alternative?

    Reply
    • It might be – how would you propose to keep that data in sync with the SQL Server’s contents?

      Reply
      • Kevin Martin
        March 24, 2021 3:47 am

        We use Azure Cognitive Search (Azure Search Services) and setup the indexes to refresh between 5 minutes – 24 hours via the Azure portal. The data sources can be Azure SQL Databases & Managed Instances tables or views (+many more). It is super important to use rowversion/timestamp as the high-water mark or you will get stale results.

        We don’t use any of the AI enrichment that comes with Cognitive Search. Our software devs use APIs on web app search pages to return results.

        We used to provision a small Azure virtual machine just for ElasticSearch but Azure Cognitive Search (Azure Search Services) does the trick for our clients.

        Mostly everything we develop (web app and database) for our clients are deployed in Azure. https://www.emergentsoftware.net/work has a representative list of client web apps that use Azure Search (some, not all).

        Reply
        • Gotcha: so just to rephrase, you need to change your tables to include a rowversion or timestamp, and then it’s automatic, but the data’s out of date by 5 minutes or more. That’s a fairly significant dealbreaker for a lot of companies, but that’s a neat fit for some!

          Reply
  • Greg Menounos
    May 10, 2021 10:15 pm

    If you know that the full text portion of the query is going to be much more selective than the non-full-text portion, you can use CONTAINSTABLE and a FORCE ORDER hint, resulting in a query that runs a lot faster. In the version of the StackOverflow DB that I downloaded, the CONTAINSTABLE portion of the join returns 1896 rows and the date-filtered portion of the Posts table is 5.4 million rows. The query executes in about a tenth of a second on my machine with an empty cache.

    SELECT TOP 100 p.*
    FROM CONTAINSTABLE(dbo.Posts, Title, ‘java and sql’) ct
    INNER JOIN dbo.Posts p ON ct.[KEY] = p.id
    WHERE p.CreationDate BETWEEN ‘2013-01-01’ AND ‘2013-12-31’
    ORDER BY p.CreationDate
    OPTION (FORCE ORDER);

    Reply
    • Greg – neat idea, but that’s pretty tough to recommend for production code for so many reasons:

      * You have to know the selectivity of what you’re looking for
      * You have to be able to build different queries (with/without the hint) when your parameters change
      * You have to be able to dramatically change the query (doubling the number of references to the Posts table) – this doesn’t work well with tools like EF or NHibernate
      * You have to be positive that FORCE ORDER will work – its performance often falls apart catastrophically on queries with multiple joins

      Neat idea though!

      Reply
  • Adam Everington
    May 18, 2021 8:25 am

    We are looking for a full-text based solution for a new app, usually we’re a SQL Server house but stacked up against other products it just doesn’t seem to cut the mustard.
    Currently looking at:
    – ElasticSearch (based on your recommendation above)
    – Postgres: full-text searches in postgres seem way faster on a like-for-like search across the same data
    – Combination of the above using ZomboDb

    Reply

Leave a Reply

Your email address will not be published. Required fields are marked *

Fill out this field
Fill out this field
Please enter a valid email address.

Menu