You’re a developer or a DBA, and you’re comfortable writing queries to get the data you need. You’re much less comfortable trying to design the right indexes for your database server.
In the next several posts, you’ll learn:
- The differences between clustered and nonclustered indexes
- How (and when) to make a covering index
- The basics of execution plans
If you prefer watching videos rather than reading, you can watch my How to Think Like the Engine training videos free too. I just decided to write these out as a series of blog posts because I get so many questions about how this stuff works – from folks who don’t like sitting around watching videos. Let’s get started!
I’m using the Users table in the Stack Overflow database.
StackOverflow.com open sources your question & answer data, and I import that into a few free SQL Server databases you can download. In this series, I’m going to use the small 10GB StackOverflow2010 database (1GB zip file) – if you want to follow along and get similar results & query plans, use that one.
The Users table holds exactly what you think it holds: a list of everybody who’s created an account at StackOverflow.com. The schema is pretty simple:
- Id – an identity field, starts at 1 and goes up to a bajillion
- DisplayName – not unique, just the name you go by, like “Brent Ozar” or “Alex”
- LastAccessDate – the last time you opened a page at StackOverflow.com
In SQL Server, most objects are saved in 8KB pages, and each 8KB page is dedicated entirely to one object. (We’ll save the nuances of columnstore indexes, Hekaton, and other niche objects for other blog posts.) You can think of these 8KB pages just like printed-out pages of an Excel spreadsheet, like this:
When I teach this class in person, I actually give out printed-out pieces of paper with this stuff. You can download that 5-page PDF and print it out to help follow along and visualize the data as we go through demos. Just stick with the white piece of paper at first.
That white grid above is the clustered index of the dbo.Users table. In the Stack Overflow databases that I publish, the Id column is set up as the clustered primary key of the table.
This clustered index on Id *is* the table.
Look down the left side of the sheet: the rows are sorted by Id, and the sheet has all of the columns in the table. The clustered index IS the table, and these 8KB pages are the same whether they’re on disk or in memory.
Each database’s data files (MDF/NDF) are just a string of 8KB pages from start to finish. Some of the pages are used by SQL Server to save metadata about the database, but most of ’em are your indexes.
You can see how many pages a table contains by running a query against it like this:
SET STATISTICS IO ON;
SELECT Id FROM dbo.Users;
The first line there, SET STATISTICS IO ON, adds info in SQL Server Management Studio’s “Messages” tab that tell you the number of 8KB pages SQL Server had to read to execute your query:
The “logical reads 7405” means SQL Server read 7,405 8KB pages. Generally speaking, the less pages SQL Server has to read to execute your query, the faster your query will go.
7,405 pages is about 15 reams of paper.
You know those 500-page packs of paper that you put into the copier or the printer? (No? Do you remember copiers and printers? Honestly, me neither.) The Users table is one of the smallest tables in the Stack Overflow database export, but it’s still 15 of those packs.
As we work through demos in the upcoming posts, I want you to visualize a stack of 15 reams of paper over in the corner of your room. When I ask you to query the table, I want you to think about how you’d execute that as a human being facing data spread across 15 reams of paper. It’d be a hell of a lot of work, and you wouldn’t be so eager to go grab the first piece of paper to start work. You’d wanna build a really good plan before you go tackle that stack of paper.
That’s why SQL Server tries to build good execution plans.
In SSMS, click Query, Include Actual Plan (or hit Control-M), run the query again, and you’ll get an “Execution plan” tab:
Read the plan from right to left: SQL Server scanned the clustered index (aka the white pages, the table), pushing 299,398 things out to the SELECT statement. It had to scan the whole table because we asked for the whole thing – well, we only asked for one column (Id), but the Ids were scattered across all the pages, so we had to read ’em all.
You wouldn’t normally write a query without a WHERE clause though, so let’s add one.
Let’s be good and add a WHERE clause.
Let’s only get the people who accessed the system after July 1, 2014:
SELECT Id FROM dbo.Users WHERE LastAccessDate > '2014/07/01';
And my execution plan looks the same:
(If you’re following along on your own computer, your query might look a little different – I’ll explain why in the next post.)
Both of my queries – with and without a where clause – read the same number of pages. Here, I’m running both queries back to back, showing their stats in a single Messages tab:
And if you look at the 8KB page again, it kinda makes sense why we have to scan the clustered index in order to find rows with a LastAccessDate > 2014/07/01:
Even though our query returns less rows, we still have to do the same amount of work because the data simply isn’t sorted in an order that helps our query. If we’re going to frequently query by LastAccessDate, and if we want that query to run faster, we’re going to need to help SQL Server out by designing an additional copy of our table, pre-sorted in a better way: a non-clustered index.
Disclaimer: I’m simplifying a lot of stuff here.
In this blog post series, I’m going to try to cover a lot of the most important ground, fast. To do that, I’m going to leave out a lot of details that you would probably find interesting – but look, this is a blog post, not a book. It’s a starting point for your learning journey.
However, there are a few disclaimers I need to make or else the armchair bloggers are going to complain:
- Note that the AboutMe column is clipped off. The AboutMe column is an NVARCHAR(MAX) because Stack Overflow lets you store a huge amount of stuff inside your profile. In SQL Server, large AboutMe data can be moved off-row: saved on separate 8KB pages. When you design tables with big columns like VARCHAR(MAX), NVARCHAR(MAX), XML, JSON, etc., you can end up paying a performance price as SQL Server jumps around to read all that data. Ideally, design columns just wide enough to store the data you need, and no more.
- The 8KB pages don’t use Excel-style grids. SQL Server needs to cram as much data in per page as it can, so no, it doesn’t organize rows and columns on the 8KB page like a spreadsheet. Similarly, the column headers aren’t stored in plain text on each page, nor are nulls aren’t stored with all capital NULL, hahaha. I’m just using a spreadsheet as a visualization tool because it helps you understand how things work.
- The 8KB page I’m showing is called a leaf page. Indexes have more kinds of pages too, like B-tree structures that help the engine rapidly find the right page when it’s seeking for a specific row. However, even if you just focus on the leaf pages and don’t learn anything else, you can still do a phenomenal job of improving your database performance – and that’s the goal of this series, getting you up to speed on the most important stuff quickly.
And there’s one more complex thing I need to tackle, but it needs its own blog post, which is the next one in the series: why your query plan might not have exactly matched mine.