Searching Strings in SQL Server is Expensive

Last Updated June 25, 2018

In the classic spirit of my How to Think Like the Engine class, let’s go take a look at the StackOverflow.Users table and find all the users named Brent. There’s a DisplayName field, and I’m going to be querying that a lot in this blog post, so I’ll go ahead and create an index on it, then look for the Brents:

There’s 858 of us, and I can find them instantly. To get even more exact, we can turn on SET STATISTICS TIME, IO ON and then check out the Messages tab:

SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 1 ms.

(1 row(s) affected)
Table 'Users'. Scan count 1, logical reads 8, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.

SQL Server parse and compile time:

CPU time = 0 ms, elapsed time = 1 ms.

(1 row(s) affected)

Table 'Users'. Scan count 1, logical reads 8, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:

CPU time = 0 ms, elapsed time = 0 ms.

Woohoo! Nice and fast. Don’t get misled by “scan count 1” – it’s actually an index seek, seeking directly to the Brents in the table:

Beautiful!

Except – it’s wrong.

Those aren’t all the Brents – these are only the people whose DisplayName BEGINS with Brent. There are others, like Sun Brent, DesignerBrent, Dan Brentley, Ben Brenton, and more.

So I’m not happy about this, but I’m going to have to change my query from DisplayName LIKE ‘Brent%’, and I’m going to have to use a leading wildcard.

How bad can it be?

It's only a second or two — It’s only a second or two

It’s actually not that bad – only takes about a second or two to find the 914 %Brent%’s. Sure, our beautiful, quick index seek has become an ugly scan of the entire index. No surprise there – we have to find people with Brent anywhere in their name, which means they could be anywhere in the index.

We already know that we’re going to be reading more data, and STATISTICS IO shows it:

SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 3 ms.

(1 row(s) affected)
Table 'Users'. Scan count 5, logical reads 22548, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 5235 ms,  elapsed time = 1672 ms.

SQL Server parse and compile time:

CPU time = 0 ms, elapsed time = 3 ms.

(1 row(s) affected)

Table 'Users'. Scan count 5, logical reads 22548, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:

CPU time = 5235 ms, elapsed time = 1672 ms.

Logical reads – the number of 8K pages we looked at in order to build our query results – went from 8, all the way up to 22,548. So that’s why the query’s taking 1.7 seconds now, right?

Not so fast.

Try the query without a where clause.

Literally, SELECT COUNT(*) FROM dbo.Users;

It takes no time at all – even though it’s returning 5.3mm users!

What do the output statistics say?

SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.

(1 row(s) affected)
Table 'Users'. Scan count 5, logical reads 22637, physical reads 0, read-ahead reads 17, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 344 ms,  elapsed time = 92 ms.

SQL Server parse and compile time:

CPU time = 0 ms, elapsed time = 0 ms.

(1 row(s) affected)

Table 'Users'. Scan count 5, logical reads 22637, physical reads 0, read-ahead reads 17, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:

CPU time = 344 ms, elapsed time = 92 ms.

We still read over 22,000 pages – but why is it taking just 92 milliseconds to happen now?

It all comes down to the CPU work required to execute this line:

WHERE DisplayName LIKE '%Brent%'

1	WHERE DisplayName LIKE '%Brent%'

In order to do that, SQL Server burns 5.2 seconds of CPU time cracking open each DisplayName string, and moving through it looking for the pattern Brent anywhere in the string. You can see the effect of it really clearly anytime you execute this query:

CPU goes straight to 100% across all four of my cores for 1-2 seconds each time this query runs – and this is one of the tiniest tables in the StackOverflow database.

Leading wildcard searches – or anything to do with parsing strings – just aren’t SQL Server’s strong point, and will burn up a ton of CPU power. When you’re only parsing a few records, like when we looked for Brent%, it’s not that big of a deal. But if I have to parse the whole table, it’s a hot, expensive mess.

How do we know if our code has this problem?

Run the query with SET STATISTICS TIME ON, and examine the output. Look for the execution times line:

SQL Server Execution Times:
   CPU time = 5235 ms,  elapsed time = 1672 ms.

1 2	SQL Server Execution Times: CPU time = 5235 ms, elapsed time = 1672 ms.

If CPU time is higher than elapsed time, that means your query went parallel – now, that alone isn’t a problem. Even if it didn’t, though – say you’ve got 10 seconds of elapsed time, and 9.5 seconds of that time is spent tearing up CPU. That’s bad.

When CPU time is unacceptably high, and you’re reading a relatively low number of pages (like in our case), you might have this problem. To figure it out, go through the functions and string comparisons in your query, strip them out, and see if the query suddenly runs fast.

So how do we fix it?

Option 1: live with it, and spend money on licensing. As your workload and your volume scales, performance will get worse, linearly. When you double the number of records you have to scan, CPU use will exactly double. The good news here is that as long as your queries don’t have parallelism inhibitors, then they’ll go parallel across multiple cores, and you can divide and conquer. It’s gonna be expensive:

SQL 2014 & prior: Standard Edition maxes out at 16 cores
SQL 2016: Standard Edition can do up to 24 cores
Beyond 16/24, you’re looking at SQL Server Enterprise Edition

Option B: fix the code. Stop letting users do leading wildcard searches.

But Brent, we have to let them do leading wildcards.

Sometimes the table design requires leading wildcards. For example, at StackOverflow.com, questions can be tagged with up to 5 tags to describe the question’s topic. Check it out in the database:

Yes, that’s how tags were initially stored in the StackOverflow database design. This meant that if you needed to search for questions tagged sql-server, you had to write:

SELECT * FROM dbo.Posts
WHERE Tags LIKE '%<sql-server>%'

1 2	SELECT * FROM dbo.Posts WHERE Tags LIKE '%<sql-server>%'

That didn’t scale – especially for a crew whose motto is performance is a feature – so eventually Stack ended up moving on to…

Option III: create a child table. If you’re storing lists in a single field, break them apart into a separate child table. In Stack’s example, this might be a PostsTags table where:

Id – INT IDENTITY
PostId – INT, relates up to Posts
Tag – NVARCHAR(50) or whatever, the string we’re searching for

So now our sql-server search looks like this:

SELECT *
FROM dbo.Posts p
INNER JOIN dbo.PostsTags pt ON p.Id = pt.PostId
WHERE pt.Tag = 'sql-server'

SELECT *

FROM dbo.Posts p

INNER JOIN dbo.PostsTags pt ON p.Id = pt.PostId

WHERE pt.Tag = 'sql-server'

Presto, no string parsing, and the query can do an index seek (assuming that we properly index the PostsTags table.)

At first, this looks like an impossible task: we couldn’t possibly change all of the queries in our database to point to an entirely new table structure. Good news: you don’t have to! Here’s what the implementation process looks like:

Build the new tables
Write a trigger to keep them up to date (in this case, when Posts are inserted/updated/deleted, maintain the related records in the PostsTags tables)
Change only your most resource-intensive select queries over to use the new table structure (PostsTags)

Yes, this means data is stored twice, and yes, you have the overhead of a trigger. This solution only makes sense when the cost of the select queries has grown too CPU-intensive to maintain. We’re making the choice to rack up a little technical debt in order to pay down short term performance issues. Over time, we should probably resolve to change all queries that touch the Posts.Tags field, and have the app itself update the PostsTags table whenever it inserts/updates/deletes Posts. That’s a project management issue, not a DBA issue.

Why didn’t someone tell me before we designed this database?

Louis Davidson and Jessica Moss were trying to, but you haven’t been answering their Facebook friend requests. Instead, pick up their book, Pro SQL Server Relational Database Design.

I reference this exact book in the first module of every 3-day class that I teach. It’s in the intro module, The DBA Skills Quiz, where this book is the example I use for people who score a 4 on the “design tables” section. If you say you score 4 points on that question, you damn well better own this book and know what’s in it.

[Video] Office Hours 2016/10/12 (With Transcriptions)

PasteThePlan Update: See the Most Recently Pasted Plans

17 Comments. Leave new

Isaac Koss
October 18, 2016 9:37 am

Curious what your thoughts are on using a Full Text index for this. Thanks!

Reply
- Brent Ozar
  October 18, 2016 11:20 am
  
  Isaac – typically when I see leading wildcards used like we discuss in the post, it’s because someone is searching for a specific string. Examples include numbers in a list or part numbers. Full text indexing doesn’t help us much when we’re looking for, say, C# or .NET 3.5.
  
  Reply
  - Jonathan Allen
    October 18, 2016 11:53 pm
    
    I’ve used full text search for partial address lookups to good effect. I even indexed the numeric parts of the address. The database included most addresses in the US.
    
    Reply
Granger
October 18, 2016 10:48 am

It seems that if you don’t need the extra functionality of LIKE—you only need “begins-with” or “ends-with”, CHARINDEX is your friend: http://cc.davelozinski.com/sql/like-vs-substring-vs-leftright-vs-charindex Had I the time, I think it would be nice to see how “contains” could be thrown into the mix. I like optimizing things, but the reality is that the “Option III” solution allows Sql Server more flexibility in solving the problem for me. I can’t always write the fastest, coolest statements on the planet, and it’s not a one-man-show.

Reply
- Brent Ozar
  October 18, 2016 11:20 am
  
  Granger – right, but like this post says, the problem isn’t when we do begins-with. It’s when we do %LIKE%.
  
  Reply
Matt Salsbury
October 18, 2016 1:28 pm

If you want a really egregious use of Like, consider this….

A vendor brought in a new developer. His first assignment was coding a function to determine if a row, that he already had the surrogate key for, still existed in a table.

His solution was to load the surrogate key values for all rows into a SQL variable with a cursor. The values were concatenated with nothing separating them. This was a table with about 750,000 records.

Naturally he searched it with a Like and couldn’t understand it wouldn’t return in less than 30 minutes. He also couldn’t understand why it sometimes returned incorrect results.

Reply
- Brent Ozar
  October 18, 2016 1:30 pm
  
  Matt – HAHAHA, awesome, I was laughing out loud before I even finished the comment. Nice.
  
  Reply
Steve hoban
October 18, 2016 3:39 pm

Brent – another great and informative post. I’ll be buying that book for sure.
Steve H

Reply
Stephen Falken
October 18, 2016 5:20 pm

Calling all Carlos… https://www.brentozar.com/archive/2010/06/sargable-why-string-is-slow/

Reply
Kristen
October 19, 2016 2:22 am

I think in this example you aren’t really wanting to look for “%Brent%” – i.e. you could skip DesignerBrent; maybe you would also skip “Brenton” and “Brentley” – you just want to search words, within the data. Some years ago we introduced a Keywords table to our APPs DB, as an alternative to Full Text. Our table has columns for SourceColumnID (e.g. Address-1, Address-2 as well as FirsrtName … but also ProductName, ProductDescription etc.), RowID (the PKey of the row, in the source table) and Keyword. We split the original text on spaces, hyphens and so on. We also introduced a table of Synonyms – to STEM words, and also for competitors product codes/names and the like. Operator can mark a row in the table as disabled (e.g. “This product doesn’t not work with WIDGET” could be set not to disable the entry for “WIDGET”), and so on. Yeah, fair bit of work to set up initially, but it gives us good control, and performance, over predictive-text type AJAX searches and looking for “Brent” or even “Brent%” is a breeze – and will find all the embedded “words” that match/start-with. Easy also so search multiple columns – WHERE SourceID IN (Address-1, Address-2, …) AND Keyword = ‘Brent’. Also works for “c#”, and even “the” if you need to!

Reply
- Brent Ozar
  October 19, 2016 9:50 am
  
  Kristen – nope, as I said, I did indeed want to look for %Brent%. I like how you try to revise my requirements though. 😉
  
  Reply
Alex
October 19, 2016 4:23 am

Would it be better to store the [Tag] field as typed XML column instead of just concatenating strings?

Reply
- Andrew Miller
  October 19, 2016 8:08 am
  
  Brent, like Alex, I’d be interested in your reply too. From a DBA’s perspective, I’ve leaned towards XML for exactly this sort of thing instead of horrible things like NVARCHAR(MAX) comma separated values – which are often supported by dreaded Scalar Valued UDFs… my thinking is at least there is “some structure” which can be backed by a XML schema…
  
  Reply
  - Brent Ozar
    October 19, 2016 9:50 am
    
    Andrew & Alex – one of the things I really love about the open source StackOverflow database is that instead of asking for opinions, you can go download the database and try your ideas yourself. Running experiments is a great way to learn about the performance overhead of various SQL Server features. That advice is one of the best things I ever heard from Paul Randal, too – he basically said, if you can figure it out yourself, you should. You’ll learn so much on that journey instead of just taking someone else’s word.
    
    Go grab the database and give it a shot! Then share your findings with the world.
    
    Reply
Graham
May 28, 2019 8:00 am

You never gave a solution to the original problem of searching for Brent – how does a child table help in that scenario?

Reply
- Brent Ozar
  May 28, 2019 8:02 am
  
  Graham – there’s only so far I can go in a blog post. (I’m already well over 1,000 words.) Your best bet there will be to download the Stack Overflow database yourself and give ‘er a shot. Thanks!
  
  Reply
Finding Tagged Questions Faster: Answers & Discussion - Brent Ozar Unlimited®
February 2, 2024 11:38 am

[…] That works really well for common tags like sql-server, but the rarer of a tag we’re looking for, the slower the query is going to get. It takes longer to scan through the entire table, hoping to find 100 rows that match, checking the contents of each Tags, because searching strings in SQL Server is expensive. […]

Reply