DBA Training Plan 7: How SQL Server Stores Stuff in MDF Files

Last Updated 5 years ago

So far in the DBA Training Plan, we’ve been working hard to make sure the contents of our MDF files are backed up, corruption-free, and secured. Now, uh…what’s actually inside them?

Inside each of your database data files (MDFs, although technically you can use any extension you want, even PDF or MP3), SQL Server stores your data in 8KB pages. That’s kilobytes – not megabytes, not gigabytes, but just mere kilobytes.

Say that we create a table:

CREATE TABLE dbo.Readers
(ReaderID INT IDENTITY(1,1),
ReaderName VARCHAR(200));

CREATE TABLE dbo.Readers

(ReaderID INT IDENTITY(1,1),

ReaderName VARCHAR(200));

First off, yes, I understand, I shouldn’t do ReaderName in one field – I should have FirstName, LastName, MiddleName, Suffix, Prefix, yadda yadda yadda, but I’m trying to keep this short. Now see what you did? It’s long again. Doggone it, it’s hard teaching this stuff quickly.

Anyhoo, in this table, each record takes up just a little bit of space. ReaderID is an INTeger, which takes 4 bytes. It’s the same 4 bytes whether that number is 1 or 1,000,000. ReaderName is a VARCHAR(200), which means we can store up to 200 characters in here, and each character takes a byte. If we insert ‘BRENT OZAR’, that’s 10 characters (and boy, am I a character), so we need 10 bytes to store it.

If all of our readers average about 10 characters in their name, that means we could fit about 500-600 records per 8KB database page. (In reality, there’s some overhead because SQL Server also needs to use some parts of the page to store metadata, and we’ll talk about that in another post.)

This blog isn’t very popular, so we can keep all of our readers on a single 8KB page. As we insert, update, and delete readers, SQL Server fetches that 8KB page off disk, brings it into memory, makes the necessary changes, and then writes that data page back to disk. The 8KB page itself is the smallest unit that SQL Server will cache – it doesn’t cache individual rows/records – and each page belongs exclusively to just one object.

A Word About Objects

You’ll notice that I avoid using the word “table”. Tables are cool, but as we start to dig into what SQL Server’s doing under the hood, we want to start thinking about these three object types:

Heap – a table with no clustered index. In my dbo.Readers table, I didn’t specify in what order SQL Server should store my data, so it’s just going to slap the data down on my 8KB page in any old order.

Clustered Index – what we normally think of as a table. If I’d have created my table like this:

CREATE TABLE dbo.Readers
(ReaderID INT IDENTITY(1,1) PRIMARY KEY CLUSTERED,
ReaderName VARCHAR(200));

CREATE TABLE dbo.Readers

(ReaderID INT IDENTITY(1,1) PRIMARY KEY CLUSTERED,

ReaderName VARCHAR(200));

Then SQL Server would store my data in order of ReaderID. That way, when I search for ReaderID #42, it can look at a little directory (B-tree) that tells it where ReaderID #42 is, and it can jump there without scanning through all three of my readers. The clustered index is sorted in the order of the ReaderID field, but it’s actually the full copy of our table, including all of our fields – in this case, just ReaderName.

Nonclustered index – If I tell SQL Server to:

CREATE NONCLUSTERED INDEX IX_ReaderName
ON dbo.Readers(ReaderName)

1 2	CREATE NONCLUSTERED INDEX IX_ReaderName ON dbo.Readers(ReaderName)

Then SQL Server will create a second copy of my table sorted by ReaderName. This copy of my table will only include the fields specified in my index (ReaderName), plus whatever fields it needs to get back to the clustered index or heap.

All three of these objects – heaps, clustered indexes, and nonclustered indexes – will be stored on separate sets of pages. We won’t have the clustered index and nonclustered index for the same table on the same page – they’re split. That’s why when we’re doing space analysis, we have to think in terms of indexes, not tables.

Pages & Rows on Disk

The sys.dm_db_index_physical stats Dynamic Management Function (DMF) returns the number of rows and pages stored in each database object. It takes parameters for database ID and object ID, or you can pass in NULLs to get information across all of your database objects. Scroll down to the examples link in that Books Online page, and you’ll get queries to examine a single table – I’d strongly recommend starting with a small table, because some of the parameters for this DMF will cause SQL Server to actually look at every page in the object. That means if all of the pages for that object aren’t cached in memory, SQL Server will go pull those pages from disk, and that can slow down your running SQL Server.

This DMF also includes average record size and max record size. This makes for fun spelunking: how big is each record, really? Just because we make everything a VARCHAR(8000) doesn’t mean we’re actually storing 8,000 characters in each field. Now, don’t go changing your database structure just yet – you can easily break applications when datatypes change. Let’s leave that for later.

You can get similar metadata much faster by using sp_spaceused, but it doesn’t get the cool info about average record size, and I wanted to encourage you to go spelunking here.

Your Homework: Watch a Free One-Hour Class

In my free class How to Think Like the Engine, I explain pages, indexes, joins, SARGability, and more. I use real database pages from the StackOverflow.com database for demos, and you actually get PDFs to print out and follow along as we go.

Print out this 5-page PDF to follow along – trust me, you’ll need these pages – and then here’s the first module of the class:

When you’re ready, move on to the next module.

As we start to move up the Hierarchy of Needs from capacity into performance, you’ll start to see how these foundational items are so important.

So far, we’ve been talking about the general structure of databases. In the next episode, we’ll get specific about what’s in your database.

DBA Training Plan 6: How to Monitor Drive Space

What The Consultant Toolkit Does

6 Comments. Leave new

Josh
July 31, 2019 1:13 pm

Hey this blog is popular with me, it’s been my saving grace for picking up stuff as a developer-made-DBA. Awesome content as always Brent, I appreciate the informative posts!

Reply
- Brent Ozar
  July 31, 2019 3:00 pm
  
  Awww, thanks!
  
  Reply
Jay Jadagu
July 31, 2019 6:01 pm

You are a genius. I want to learn more about clustered and nonclustred indexes.

Reply
Ajith Bhojani
July 31, 2019 6:44 pm

Thanks a highly informative post. Plus yesterday’s about the CU helped me out with a work issue. Cheers mate.

Reply
Phillip Putzback
August 1, 2019 8:05 am

Based on this statement : “The clustered index is sorted in the order of the ReaderID field, but it’s actually the full copy of our table, including all of our fields”.
Does that mean that the full heap exists and a sorted copy of the full heap? So if I am to copy a table one to one from a source system to be used in an enterprise warehouse, should I forgo the Primary key if it isn’t used to filter the data into the integration layer? Should I stick with Non clustered indexes? That process of building a Primary Key everyday on some of our large tables that can’t be run as incremental loads, is a costly process (Time cost).

Reply
- Brent Ozar
  August 1, 2019 11:11 am
  
  Phillip – you either have a clustered index or a heap, but not both. Go ahead and watch the video in the post, plus the rest of the free videos in the How to Think Like the Engine class, to learn more.
  
  Reply