SQL Server’s full text search is amazing. Well, it amazes me at least – it has so many cool capabilities: looking for prefixes, words near each other, different verb tenses, and even thesaurus searches. However, that’s not how I see most people using it: I’ve seen so many shops using it for matching specific strings, thinking it’s going to be faster than LIKE '%mysearch%'. That works at small scale, but as your data grows, you run into a query plan performance problem.
When your query uses CONTAINS, SQL Server has a nasty habit of doing a full text search across all of the rows in the table rather than using the rest of your WHERE clause to reduce the result set first.
This isn’t a big deal for small text data sets, say under 10GB, but your query time grows linearly along with your data. By the time your data reaches real production sizes, full text can get uncomfortably painful.
To illustrate it, let’s say I’m using the Stack Overflow database, and I want to optimize this query:
SELECT TOP 100 *
FROM dbo.Posts
WHERE CreationDate BETWEEN '2013-01-01' AND '2013-12-31'
  AND Title LIKE '%java%'
ORDER BY CreationDate;
The query runs in milliseconds if I have an index on CreationDate and Title:
CREATE INDEX CreationDate ON dbo.Posts(CreationDate) INCLUDE (Title);
Because SQL Server chooses to seek to my specific date first, then check the posts’ titles, as the actual query plan shows:
That’s awesome! Nobody’s gonna complain about that speedy query. If you hover your mouse over the Index Seek, it had to read 4,510 rows to find 100 rows that had "java" in the title – but that’s not terrible. It only did 100 key lookups, too. It’s super-quick:
Table 'Posts'. Scan count 1, logical reads 465
SQL Server Execution Times:
  CPU time = 0 ms, elapsed time = 181 ms.
However, try the same thing with full text.
I’m going to set up full text search and create a full text index on the Title column:
CREATE FULLTEXT CATALOG ft AS DEFAULT;
GO
CREATE FULLTEXT INDEX ON dbo.Posts(Title)
  KEY INDEX PK_Posts__Id
  WITH STOPLIST = SYSTEM;
GO
If you try that yourself, be aware that the above query will finish instantly, but it’s kicking off a full text index population in the background. That takes way, way longer to complete than the nonclustered index did, so while you wait, here’s how to check the status of your full text index populations.
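If you just want a quick check, here’s a minimal sketch using SQL Server’s built-in metadata properties – TableFulltextPopulateStatus returns 0 once the population is idle:

-- Check whether the full text population on dbo.Posts has finished.
-- PopulateStatus: 0 = idle (done), 1 = full population in progress.
SELECT OBJECTPROPERTYEX(OBJECT_ID('dbo.Posts'), 'TableFulltextPopulateStatus') AS PopulateStatus,
       OBJECTPROPERTYEX(OBJECT_ID('dbo.Posts'), 'TableFulltextItemCount') AS ItemsIndexed;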
When the full text population finishes, try the full text version of the query with CONTAINS:
SELECT TOP 100 *
FROM dbo.Posts
WHERE CreationDate BETWEEN '2013-01-01' AND '2013-12-31'
  AND CONTAINS(Title, '"java"')
ORDER BY CreationDate;
And the query is slower because it does more work:
Table 'Worktable'. Scan count 8874, logical reads 1037897
Table 'Posts'. Scan count 1, logical reads 36139
SQL Server Execution Times:
  CPU time = 609 ms, elapsed time = 814 ms.
Because of the different plan:
The plan operators, top to bottom:
- SQL Server does an index seek on CreationDate, but it reads more rows
- It does 8,874 key lookups – up from just 100
- The full text search finds all posts with java in the title, regardless of date – producing 440,615 rows
Complex searches are even worse.
Let’s throw in a simple “and”, looking for both java and sql:
SELECT TOP 100 *
FROM dbo.Posts
WHERE CreationDate BETWEEN '2013-01-01' AND '2013-12-31'
  AND CONTAINS(Title, 'java and sql')
ORDER BY CreationDate;
The query takes over 20 seconds thanks to an imbalance of work across cores:
Table 'Worktable'. Scan count 778381, logical reads 1573682
Table 'Posts'. Scan count 5, logical reads 3168118
SQL Server Execution Times:
  CPU time = 6938 ms, elapsed time = 21032 ms.
And the actual plan shows just how much wasted work there is when we’re only trying to get 100 rows:
Why does SQL Server do it this way? I dunno – it just always has. It finds all the data that matches, and then joins to the rest of the data you’re searching for. This isn’t a new problem – it’s been the same way for over a decade – and it’s one of the reasons Stack Overflow had to move their search out of SQL Server as well.
Just for laughs, if I use LIKEs with leading % signs:
SELECT TOP 100 *
FROM dbo.Posts
WHERE CreationDate BETWEEN '2013-01-01' AND '2013-12-31'
  AND Title LIKE '%java%'
  AND Title LIKE '%sql%'
ORDER BY CreationDate;
That finishes in less than a second, with 2,426 logical reads. Here’s the actual plan.
Full text search doesn’t integrate well in query plans.
If you have a really simple query and you’re filtering for very unusual keywords, then full text search is pretty dang good. It brings powerful capabilities to SQL Server built right into the box.
However, the more complex your query plan becomes, like the more filtering that’s being done on other tables, AND the more common your search keywords are, the angrier you’ll become with full text performance. SQL Server will choose to perform the full text search across all of the rows regardless of your other filters, then drag a ton of data back into the execution plan and force the rest of the operators to deal with it.
In this particular example, a LIKE with a leading % sign – normally a DBA’s enemy – is actually a performant solution. I wanna stress, though, that that isn’t a solution that works in the majority of shops.
In other situations, I’ve had to point folks toward:
- Building a terms table – having people check boxes to pick from common terms, or having your app examine their search string and redirect it to the terms table rather than using full text search (see the sketch after this list)
- Paul White’s trigram search solution – which requires you to build a separate table, keep it in sync, and rewrite your app’s searches to point at it
- Elasticsearch – but then it’s left up to you to keep SQL Server and Elasticsearch’s contents in sync, and to rewrite your app’s searches to point at Elastic rather than SQL Server
- Scaling out with Availability Group readable replicas – which costs a fortune, since they have to be Enterprise Edition, but at least you can scale the search workloads out without changing your application, or it can buy you time while you rewrite the search part of your app
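To make that terms table idea concrete, here’s a minimal sketch. The table and column names (dbo.PostSearchTerms, Term, PostId) are made up for illustration, and you’d have to populate the table and keep it in sync with your titles yourself:

-- Hypothetical terms table: one row per (term, post) pair.
CREATE TABLE dbo.PostSearchTerms (
    Term nvarchar(100) NOT NULL,
    PostId int NOT NULL,
    CONSTRAINT PK_PostSearchTerms PRIMARY KEY CLUSTERED (Term, PostId)
);

-- A search for "java" becomes an equality seek plus a join,
-- so the date filter and TOP can do their jobs normally:
SELECT TOP 100 p.*
FROM dbo.PostSearchTerms t
INNER JOIN dbo.Posts p ON p.Id = t.PostId
WHERE t.Term = N'java'
  AND p.CreationDate BETWEEN '2013-01-01' AND '2013-12-31'
ORDER BY p.CreationDate;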
None of those solutions are easy or cheap.
21 Comments
Thanks for this, I’ve been waiting for it since you said you were going to blog about it.
We are upgrading IBM Maximo to 7.6, switching from Oracle to MS SQL Server, and the DB has been configured to use full-text search.
I only noticed it because they were leaving the indexes in place while transferring the data on the 'workorder' table, which is very wide and has lots of rows 🙂 (the table is stupid wide with text fields and is around 20 million rows). I created a couple of scripts to delete and recreate the indexes, but they failed due to the full-text indexes. So I rewrote them to disable and then rebuild, but I think they altered the transfer process to create the indexes after the inserts (the inserts seem like they are encapsulated in a transaction, one row at a time).
Thanks again but it also makes me feel depressed about the performance.
You’re welcome!
Great link to Paul White’s trigrams: now we know why so many autocomplete UIs only start suggesting after 3 characters…
Another vote for Elasticsearch here 🙂 These days it’s really about using the right tool for the job — while not dying maintaining too many tools.
I agree, Enterprise Search Software. Unfortunately that means not SQL Server. I was musing back in 2016, after learning about Azure Data Warehouse and its hybrid integration with Hadoop, whether that meant some version of Lucidworks was coming to SQL Server. I guess not.
Anyone got any best practice guides / examples for syncing SQL Server tables with ElasticSearch? I’ll shortly be tackling that problem and would love to know how people went about it and what problems they faced.
Your best bet there would be to ask at a Q&A site like https://dba.stackexchange.com.
Would Azure Cognitive Search be a good alternative?
It might be – how would you propose to keep that data in sync with the SQL Server’s contents?
We use Azure Cognitive Search (Azure Search Services) and set up the indexes to refresh on a schedule between 5 minutes and 24 hours via the Azure portal. The data sources can be Azure SQL Database & Managed Instance tables or views (+many more). It is super important to use rowversion/timestamp as the high-water mark or you will get stale results.
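For context, that high-water-mark pattern just needs a rowversion column on each table or view being indexed; a minimal sketch (the column name RowVer is an assumption):

-- Hypothetical: add a rowversion column so the indexer can detect changed rows.
ALTER TABLE dbo.Posts ADD RowVer rowversion;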
We don’t use any of the AI enrichment that comes with Cognitive Search. Our software devs use APIs on web app search pages to return results.
We used to provision a small Azure virtual machine just for ElasticSearch but Azure Cognitive Search (Azure Search Services) does the trick for our clients.
Mostly everything we develop (web app and database) for our clients is deployed in Azure. https://www.emergentsoftware.net/work has a representative list of client web apps that use Azure Search (some, not all).
Gotcha: so just to rephrase, you need to change your tables to include a rowversion or timestamp, and then it’s automatic, but the data’s out of date by 5 minutes or more. That’s a fairly significant dealbreaker for a lot of companies, but that’s a neat fit for some!
If you know that the full text portion of the query is going to be much more selective than the non-full-text portion, you can use CONTAINSTABLE and a FORCE ORDER hint, resulting in a query that runs a lot faster. In the version of the StackOverflow DB that I downloaded, the CONTAINSTABLE portion of the join returns 1896 rows and the date-filtered portion of the Posts table is 5.4 million rows. The query executes in about a tenth of a second on my machine with an empty cache.
SELECT TOP 100 p.*
FROM CONTAINSTABLE(dbo.Posts, Title, 'java and sql') ct
INNER JOIN dbo.Posts p ON ct.[KEY] = p.Id
WHERE p.CreationDate BETWEEN '2013-01-01' AND '2013-12-31'
ORDER BY p.CreationDate
OPTION (FORCE ORDER);
Greg – neat idea, but that’s pretty tough to recommend for production code for so many reasons:
* You have to know the selectivity of what you’re looking for
* You have to be able to build different queries (with/without the hint) when your parameters change
* You have to be able to dramatically change the query (doubling the number of references to the Posts table) – this doesn’t work well with tools like EF or NHibernate
* You have to be positive that FORCE ORDER will work – its performance often falls apart catastrophically on queries with multiple joins
Neat idea though!
We are looking for a full-text based solution for a new app. Usually we’re a SQL Server house, but stacked up against other products, SQL Server just doesn’t seem to cut the mustard.
Currently looking at:
– ElasticSearch (based on your recommendation above)
– Postgres: full-text searches in Postgres seem way faster on a like-for-like search across the same data
– A combination of the above using ZomboDB
Hate to be nitpicky about typos (That’s not true, I actually love and enjoy pointing out these mistakes), but in the first script you filtered by titles containing “java”
SELECT TOP 100 *
FROM dbo.Posts
WHERE CreationDate BETWEEN '2013-01-01' AND '2013-12-31'
AND Title LIKE '%java%'
ORDER BY CreationDate;
But in the next passage you wrote
“That’s awesome! […] to find 100 rows that had “SQL” in the title […]”
And the pasted query plan also filtered for ‘%SQL%’
That’s all, that’s my comment 🙂
Love what you’re doing here btw, your blog posts and videos/streams are super informative and entertaining at the same time.
Thanks, glad you like my work!
Thanks for the post. In my application the issue gets even worse when you have an ORDER BY in the query as well. I switched from CONTAINS to CONTAINSTABLE and the issue got resolved.
I’ve found on SQL Server 2016 Enterprise, full text searches on secondary replicas cause a LOT of redo thread contention. Is this something you have experienced yourself, Brent?
If the searches are seldom, fine. If you get hundreds or more per minute, the schema is locked for the runtime of each of those queries, and that has led to secondaries falling way behind.
That’s not just full text queries – any queries on a read-only replica will lock the schema while they’re running, just like they do on a primary. The table isn’t allowed to change underneath you.
My bigger question is, “What’s trying to constantly change the schema on the primary?” That question will lead you to the real villain in your environment. Hope that helps!
Running SQL 2019 in 2014 compat mode, we see the slowness as well. However, hinting compat level 150, the execution time gets much better: OPTION (USE HINT('QUERY_OPTIMIZER_COMPATIBILITY_LEVEL_150'))
Now, if we could only safely switch to 2019 compat mode. LOL
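For reference, that hint tacks onto the end of the query; a sketch applying it to the earlier CONTAINS example:

SELECT TOP 100 *
FROM dbo.Posts
WHERE CreationDate BETWEEN '2013-01-01' AND '2013-12-31'
  AND CONTAINS(Title, 'java and sql')
ORDER BY CreationDate
OPTION (USE HINT('QUERY_OPTIMIZER_COMPATIBILITY_LEVEL_150'));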
Searching was killing us! We didn’t want anything too fancy at the time, but wanted it to be fast. The search is for company names, e.g. 'Ralphie Ralph LLC DBA Big as Texas Burgers, Buns, and Beer'. Substring searches were too slow and required the user to know the order of the words. Our solution was to parse each company name in our database into its component words and store them with a reference to the company. Not only did this speed things up, it made searching much more flexible. A user could enter 'Big Bee bur' and get the company in the returned list. This was 12 years ago. Things are still working. We are working to beef it up, no pun intended.
8 years ago, we added an intermediary table to presort our searches by selectivity. We search for the most selective term and use the result as the criteria for the next level of searching. Each pass increments a counter on the company list generated by the first search. It is all based on the beginning of the words, so each search is a LIKE '___%'. Time has caught up with us. I still have users using single letters to search for names like the above, but I also have customers with names like 'b &b LLC'. 2/3 of the companies have LLC in their names, so that term is useless; the rest of the query just brings back every company that has a word starting with 'B'. I’m rewriting it because we did some very un-smart things. We had table vars and table-valued functions; I fixed all of that. Even the single-letter names work now: if we are saving a single letter, we pad it out to the letter repeated 5 times, and same with two-letter words, we just double them. This has eliminated most of the massive return sets. I haven’t released what I’ve done, so now, while there’s still time, does anyone have a better idea? The concept is pretty similar to Paul’s trigrams, but a little easier for this mere earthling to understand.