Searching Strings in SQL Server is Expensive

In the classic spirit of my How to Think Like the Engine class, let’s go take a look at the StackOverflow.Users table and find all the users named Brent. There’s a DisplayName field, and I’m going to be querying that a lot in this blog post, so I’ll go ahead and create an index on it, then look for the Brents:

Part 1: looking for Brent

Part 1: looking for Brent

There’s 858 of us, and I can find them instantly. To get even more exact, we can turn on SET STATISTICS TIME, IO ON and then check out the Messages tab:

Woohoo! Nice and fast. Don’t get misled by “scan count 1” – it’s actually an index seek, seeking directly to the Brents in the table:

Beautiful!

Except – it’s wrong.

We're everywhere

We’re everywhere

Those aren’t all the Brents – these are only the people whose DisplayName BEGINS with Brent. There are others, like Sun Brent, DesignerBrent, Dan Brentley, Ben Brenton, and more.

So I’m not happy about this, but I’m going to have to change my query from DisplayName LIKE ‘Brent%’, and I’m going to have to use a leading wildcard.

How bad can it be?

It's only a second or two

It’s only a second or two

It’s actually not that bad – only takes about a second or two to find the 914 %Brent%’s. Sure, our beautiful, quick index seek has become an ugly scan of the entire index. No surprise there – we have to find people with Brent anywhere in their name, which means they could be anywhere in the index.

We already know that we’re going to be reading more data, and STATISTICS IO shows it:

Logical reads – the number of 8K pages we looked at in order to build our query results – went from 8, all the way up to 22,548. So that’s why the query’s taking 1.7 seconds now, right?

Not so fast.

Try the query without a where clause.

Literally, SELECT COUNT(*) FROM dbo.Users;

Wait, that was really fast

Wait, that was really fast

It takes no time at all – even though it’s returning 5.3mm users!

What do the output statistics say?

We still read over 22,000 pages – but why is it taking just 92 milliseconds to happen now?

It all comes down to the CPU work required to execute this line:

In order to do that, SQL Server burns 5.2 seconds of CPU time cracking open each DisplayName string, and moving through it looking for the pattern Brent anywhere in the string. You can see the effect of it really clearly anytime you execute this query:

CPU spikes to 100%

CPU spikes to 100%

CPU goes straight to 100% across all four of my cores for 1-2 seconds each time this query runs – and this is one of the tiniest tables in the StackOverflow database.

Leading wildcard searches – or anything to do with parsing strings – just aren’t SQL Server’s strong point, and will burn up a ton of CPU power. When you’re only parsing a few records, like when we looked for Brent%, it’s not that big of a deal. But if I have to parse the whole table, it’s a hot, expensive mess.

How do we know if our code has this problem?

Run the query with SET STATISTICS TIME ON, and examine the output. Look for the execution times line:

If CPU time is higher than elapsed time, that means your query went parallel – now, that alone isn’t a problem. Even if it didn’t, though – say you’ve got 10 seconds of elapsed time, and 9.5 seconds of that time is spent tearing up CPU. That’s bad.

When CPU time is unacceptably high, and you’re reading a relatively low number of pages (like in our case), you might have this problem. To figure it out, go through the functions and string comparisons in your query, strip them out, and see if the query suddenly runs fast.

So how do we fix it?

Option 1: live with it, and spend money on licensing. As your workload and your volume scales, performance will get worse, linearly. When you double the number of records you have to scan, CPU use will exactly double. The good news here is that as long as your queries don’t have parallelism inhibitors, then they’ll go parallel across multiple cores, and you can divide and conquer. It’s gonna be expensive:

  • SQL 2014 & prior: Standard Edition maxes out at 16 cores
  • SQL 2016: Standard Edition can do up to 24 cores
  • Beyond 16/24, you’re looking at SQL Server Enterprise Edition

Option B: fix the code. Stop letting users do leading wildcard searches.

But Brent, we have to let them do leading wildcards.

Sometimes the table design requires leading wildcards. For example, at StackOverflow.com, questions can be tagged with up to 5 tags to describe the question’s topic. Check it out in the database:

Tags concatenated into a single field

Tags concatenated into a single field

Yes, that’s how tags were initially stored in the StackOverflow database design. This meant that if you needed to search for questions tagged sql-server, you had to write:

That didn’t scale – especially for a crew whose motto is performance is a feature – so eventually Stack ended up moving on to…

Option III: create a child table. If you’re storing lists in a single field, break them apart into a separate child table. In Stack’s example, this might be a PostsTags table where:

  • Id – INT IDENTITY
  • PostId – INT, relates up to Posts
  • Tag – NVARCHAR(50) or whatever, the string we’re searching for

So now our sql-server search looks like this:

Presto, no string parsing, and the query can do an index seek (assuming that we properly index the PostsTags table.)

At first, this looks like an impossible task: we couldn’t possibly change all of the queries in our database to point to an entirely new table structure. Good news: you don’t have to! Here’s what the implementation process looks like:

  • Build the new tables
  • Write a trigger to keep them up to date (in this case, when Posts are inserted/updated/deleted, maintain the related records in the PostsTags tables)
  • Change only your most resource-intensive select queries over to use the new table structure (PostsTags)

Yes, this means data is stored twice, and yes, you have the overhead of a trigger. This solution only makes sense when the cost of the select queries has grown too CPU-intensive to maintain. We’re making the choice to rack up a little technical debt in order to pay down short term performance issues. Over time, we should probably resolve to change all queries that touch the Posts.Tags field, and have the app itself update the PostsTags table whenever it inserts/updates/deletes Posts. That’s a project management issue, not a DBA issue.

Why didn’t someone tell me before we designed this database?

Louis Davidson and Jessica Moss were trying to, but you haven’t been answering their Facebook friend requests. Instead, pick up their book, Pro SQL Server Relational Database Design.

I reference this exact book in the first module of every 3-day class that I teach. It’s in the intro module, The DBA Skills Quiz, where this book is the example I use for people who score a 4 on the “design tables” section. If you say you score 4 points on that question, you damn well better own this book and know what’s in it.

Previous Post
[Video] Office Hours 2016/10/12 (With Transcriptions)
Next Post
PasteThePlan Update: See the Most Recently Pasted Plans

16 Comments. Leave new

  • Curious what your thoughts are on using a Full Text index for this. Thanks!

    • Isaac – typically when I see leading wildcards used like we discuss in the post, it’s because someone is searching for a specific string. Examples include numbers in a list or part numbers. Full text indexing doesn’t help us much when we’re looking for, say, C# or .NET 3.5.

      • Jonathan Allen
        October 18, 2016 11:53 pm

        I’ve used full text search for partial address lookups to good effect. I even indexed the numeric parts of the address. The database included most addresses in the US.

  • It seems that if you don’t need the extra functionality of LIKE—you only need “begins-with” or “ends-with”, CHARINDEX is your friend: http://cc.davelozinski.com/sql/like-vs-substring-vs-leftright-vs-charindex Had I the time, I think it would be nice to see how “contains” could be thrown into the mix. I like optimizing things, but the reality is that the “Option III” solution allows Sql Server more flexibility in solving the problem for me. I can’t always write the fastest, coolest statements on the planet, and it’s not a one-man-show.

  • Matt Salsbury
    October 18, 2016 1:28 pm

    If you want a really egregious use of Like, consider this….

    A vendor brought in a new developer. His first assignment was coding a function to determine if a row, that he already had the surrogate key for, still existed in a table.

    His solution was to load the surrogate key values for all rows into a SQL variable with a cursor. The values were concatenated with nothing separating them. This was a table with about 750,000 records.

    Naturally he searched it with a Like and couldn’t understand it wouldn’t return in less than 30 minutes. He also couldn’t understand why it sometimes returned incorrect results.

  • Brent – another great and informative post. I’ll be buying that book for sure.
    Steve H

  • Stephen Falken
    October 18, 2016 5:20 pm
  • I think in this example you aren’t really wanting to look for “%Brent%” – i.e. you could skip DesignerBrent; maybe you would also skip “Brenton” and “Brentley” – you just want to search words, within the data. Some years ago we introduced a Keywords table to our APPs DB, as an alternative to Full Text. Our table has columns for SourceColumnID (e.g. Address-1, Address-2 as well as FirsrtName … but also ProductName, ProductDescription etc.), RowID (the PKey of the row, in the source table) and Keyword. We split the original text on spaces, hyphens and so on. We also introduced a table of Synonyms – to STEM words, and also for competitors product codes/names and the like. Operator can mark a row in the table as disabled (e.g. “This product doesn’t not work with WIDGET” could be set not to disable the entry for “WIDGET”), and so on. Yeah, fair bit of work to set up initially, but it gives us good control, and performance, over predictive-text type AJAX searches and looking for “Brent” or even “Brent%” is a breeze – and will find all the embedded “words” that match/start-with. Easy also so search multiple columns – WHERE SourceID IN (Address-1, Address-2, …) AND Keyword = ‘Brent’. Also works for “c#”, and even “the” if you need to!

  • Would it be better to store the [Tag] field as typed XML column instead of just concatenating strings?

    • Andrew Miller
      October 19, 2016 8:08 am

      Brent, like Alex, I’d be interested in your reply too. From a DBA’s perspective, I’ve leaned towards XML for exactly this sort of thing instead of horrible things like NVARCHAR(MAX) comma separated values – which are often supported by dreaded Scalar Valued UDFs… my thinking is at least there is “some structure” which can be backed by a XML schema…

      • Andrew & Alex – one of the things I really love about the open source StackOverflow database is that instead of asking for opinions, you can go download the database and try your ideas yourself. Running experiments is a great way to learn about the performance overhead of various SQL Server features. That advice is one of the best things I ever heard from Paul Randal, too – he basically said, if you can figure it out yourself, you should. You’ll learn so much on that journey instead of just taking someone else’s word.

        Go grab the database and give it a shot! Then share your findings with the world.

  • You never gave a solution to the original problem of searching for Brent – how does a child table help in that scenario?

    • Graham – there’s only so far I can go in a blog post. (I’m already well over 1,000 words.) Your best bet there will be to download the Stack Overflow database yourself and give ‘er a shot. Thanks!

Menu
{"cart_token":"","hash":"","cart_data":""}