In the classic spirit of my How to Think Like the Engine class, let’s go take a look at the StackOverflow.Users table and find all the users named Brent. There’s a DisplayName field, and I’m going to be querying that a lot in this blog post, so I’ll go ahead and create an index on it, then look for the Brents:
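A sketch of that setup might look like this (the index name is my choice, and I'm selecting just DisplayName so the narrow index can satisfy the query on its own):

```sql
-- Index on the column we'll be searching all post long:
CREATE INDEX IX_Users_DisplayName ON dbo.Users (DisplayName);

-- Find the users whose name starts with Brent:
SELECT DisplayName
FROM dbo.Users
WHERE DisplayName LIKE 'Brent%';
```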
There’s 858 of us, and I can find them instantly. To get even more exact, we can turn on SET STATISTICS TIME, IO ON and then check out the Messages tab:
SQL Server parse and compile time:
   CPU time = 0 ms, elapsed time = 1 ms.

(1 row(s) affected)
Table 'Users'. Scan count 1, logical reads 8, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
   CPU time = 0 ms, elapsed time = 0 ms.
Woohoo! Nice and fast. Don’t get misled by “scan count 1” – it’s actually an index seek, seeking directly to the Brents in the table:
Except – it’s wrong.
Those aren’t all the Brents – these are only the people whose DisplayName BEGINS with Brent. There are others, like Sun Brent, DesignerBrent, Dan Brentley, Ben Brenton, and more.
So I’m not happy about this, but I’m going to have to change my query from DisplayName LIKE 'Brent%' to a leading wildcard: DisplayName LIKE '%Brent%'.
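Spelled out, the revised query might look like this:

```sql
-- The leading % means the match can start anywhere in the string:
SELECT DisplayName
FROM dbo.Users
WHERE DisplayName LIKE '%Brent%';
```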
How bad can it be?
It’s actually not that bad – only takes about a second or two to find the 914 %Brent%’s. Sure, our beautiful, quick index seek has become an ugly scan of the entire index. No surprise there – we have to find people with Brent anywhere in their name, which means they could be anywhere in the index.
We already know that we’re going to be reading more data, and STATISTICS IO shows it:
SQL Server parse and compile time:
   CPU time = 0 ms, elapsed time = 3 ms.

(1 row(s) affected)
Table 'Users'. Scan count 5, logical reads 22548, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
   CPU time = 5235 ms, elapsed time = 1672 ms.
Logical reads – the number of 8K pages we looked at in order to build our query results – went from 8, all the way up to 22,548. So that’s why the query’s taking 1.7 seconds now, right?
Not so fast.
Try the query without a where clause.
Literally, SELECT COUNT(*) FROM dbo.Users;
It takes no time at all – even though it’s counting all 5.3 million users!
What do the output statistics say?
SQL Server parse and compile time:
   CPU time = 0 ms, elapsed time = 0 ms.

(1 row(s) affected)
Table 'Users'. Scan count 5, logical reads 22637, physical reads 0, read-ahead reads 17, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
   CPU time = 344 ms, elapsed time = 92 ms.
We still read over 22,000 pages – but why is it taking just 92 milliseconds to happen now?
It all comes down to the CPU work required to execute this line:
WHERE DisplayName LIKE '%Brent%'
In order to do that, SQL Server burns 5.2 seconds of CPU time cracking open each DisplayName string, and moving through it looking for the pattern Brent anywhere in the string. You can see the effect of it really clearly anytime you execute this query:
CPU goes straight to 100% across all four of my cores for 1-2 seconds each time this query runs – and this is one of the tiniest tables in the StackOverflow database.
Leading wildcard searches – or anything to do with parsing strings – just aren’t SQL Server’s strong point, and will burn up a ton of CPU power. When you’re only parsing a few records, like when we looked for Brent%, it’s not that big of a deal. But if I have to parse the whole table, it’s a hot, expensive mess.
How do we know if our code has this problem?
Run the query with SET STATISTICS TIME ON, and examine the output. Look for the execution times line:
SQL Server Execution Times:
   CPU time = 5235 ms, elapsed time = 1672 ms.
If CPU time is higher than elapsed time, your query went parallel – and that alone isn’t a problem. The red flag is when most of your elapsed time is spent burning CPU, whether the query goes parallel or not – say you’ve got 10 seconds of elapsed time, and 9.5 seconds of it is CPU work. That’s bad.
When CPU time is unacceptably high, and you’re reading a relatively low number of pages (like in our case), you might have this problem. To figure it out, go through the functions and string comparisons in your query, strip them out, and see if the query suddenly runs fast.
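One quick way to test it, mirroring the experiment above – run the query with and without the string comparison, and compare the CPU time in the Messages tab:

```sql
SET STATISTICS TIME ON;

-- With the string parsing:
SELECT COUNT(*) FROM dbo.Users WHERE DisplayName LIKE '%Brent%';

-- Without it – roughly the same pages read, far less CPU:
SELECT COUNT(*) FROM dbo.Users;
```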
So how do we fix it?
Option 1: live with it, and spend money on licensing. As your workload and your volume scales, performance will get worse, linearly. When you double the number of records you have to scan, CPU use will exactly double. The good news here is that as long as your queries don’t have parallelism inhibitors, then they’ll go parallel across multiple cores, and you can divide and conquer. It’s gonna be expensive:
- SQL 2014 & prior: Standard Edition maxes out at 16 cores
- SQL 2016: Standard Edition can do up to 24 cores
- Beyond 16/24, you’re looking at SQL Server Enterprise Edition
Option B: fix the code. Stop letting users do leading wildcard searches.
But Brent, we have to let them do leading wildcards.
Sometimes the table design requires leading wildcards. For example, at StackOverflow.com, questions can be tagged with up to 5 tags to describe the question’s topic. Check it out in the database:
Yes, that’s how tags were initially stored in the StackOverflow database design. This meant that if you needed to search for questions tagged sql-server, you had to write:
SELECT * FROM dbo.Posts WHERE Tags LIKE '%<sql-server>%'
That didn’t scale – especially for a crew whose motto is performance is a feature – so eventually Stack ended up moving on to…
Option III: create a child table. If you’re storing lists in a single field, break them apart into a separate child table. In Stack’s example, this might be a PostsTags table where:
- Id – INT IDENTITY
- PostId – INT, relates up to Posts
- Tag – NVARCHAR(50) or whatever, the string we’re searching for
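A rough sketch of that child table – the object and index names here are my own choices, not Stack’s actual schema:

```sql
CREATE TABLE dbo.PostsTags (
    Id     INT IDENTITY(1,1) PRIMARY KEY,
    PostId INT NOT NULL,           -- relates up to dbo.Posts.Id
    Tag    NVARCHAR(50) NOT NULL   -- the string we're searching for
);

-- The index that turns the tag search into a seek:
CREATE INDEX IX_PostsTags_Tag ON dbo.PostsTags (Tag, PostId);
```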
So now our sql-server search looks like this:
SELECT * FROM dbo.Posts p INNER JOIN dbo.PostsTags pt ON p.Id = pt.PostId WHERE pt.Tag = 'sql-server'
Presto, no string parsing, and the query can do an index seek (assuming we properly index the PostsTags table).
At first, this looks like an impossible task: we couldn’t possibly change all of the queries in our database to point to an entirely new table structure. Good news: you don’t have to! Here’s what the implementation process looks like:
- Build the new tables
- Write a trigger to keep them up to date (in this case, when Posts are inserted/updated/deleted, maintain the related records in the PostsTags table)
- Change only your most resource-intensive select queries over to use the new table structure (PostsTags)
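The trigger in step 2 might be sketched like this – assuming SQL Server 2016+ for STRING_SPLIT, the hypothetical PostsTags table described above, and Tags stored in the '<tag1><tag2>' format shown earlier:

```sql
CREATE TRIGGER trg_Posts_SyncTags ON dbo.Posts
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;

    -- Remove tag rows for any post that changed or was deleted:
    DELETE pt
    FROM dbo.PostsTags pt
    WHERE pt.PostId IN (SELECT Id FROM deleted);

    -- Re-add tag rows from the current Tags string. Tags look like
    -- '<sql-server><t-sql>', so strip the angle brackets and split
    -- on a single-character delimiter:
    INSERT dbo.PostsTags (PostId, Tag)
    SELECT i.Id, s.value
    FROM inserted i
    CROSS APPLY STRING_SPLIT(
        REPLACE(REPLACE(REPLACE(ISNULL(i.Tags, ''), '><', '|'),
                '<', ''), '>', ''), '|') s
    WHERE s.value <> '';
END;
```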
Yes, this means data is stored twice, and yes, you have the overhead of a trigger. This solution only makes sense when the cost of the select queries has grown too CPU-intensive to maintain. We’re choosing to take on a little technical debt in order to solve a short-term performance problem. Over time, we should probably resolve to change all queries that touch the Posts.Tags field, and have the app itself update the PostsTags table whenever it inserts/updates/deletes Posts. That’s a project management issue, not a DBA issue.
Why didn’t someone tell me before we designed this database?
Louis Davidson and Jessica Moss were trying to, but you haven’t been answering their Facebook friend requests. Instead, pick up their book:
- Pro SQL Server Relational Database Design – new version of their classic book, available for preorder now and releases in December
- If you just can’t wait, here’s the SQL Server 2012 version
I reference this exact book in the first module of every 4-day class that I teach. I give my attendees an abbreviated version of my quiz, How to Become a Senior DBA, and this book is the example I use for people who score a 4 on the “design tables” section. If you say you score 4 points on that question, you damn well better own this book and know what’s in it.