Let’s step back and take a look at the big picture. (Today, I’m writing for beginners, so you advanced gurus can go ahead and close the browser now. I’m going to simplify things and leave a lot out in order to get some main points across. Don’t well-actually me.)
Microsoft SQL Server databases are stored on disk in two files: a data file and a log file.
What’s Stored in the Data File (MDF)
Let’s start with a simple table. If you want to follow along with my code, this will work on SQL Server 2005 & newer, but please do it in a brand new database rather than reusing one of your existing ones. We’ll be looking at the log file later, and you won’t be able to quickly find the relevant entries in a sea of unrelated ones. Off we go:
CREATE TABLE dbo.Friends (id INT IDENTITY(1,1), FriendName VARCHAR(30)); GO INSERT dbo.Friends (FriendName) VALUES ('Brent Ozar'); INSERT dbo.Friends (FriendName) VALUES ('Jeremiah Peschka'); INSERT dbo.Friends (FriendName) VALUES ('Jes Schultz Borland'); INSERT dbo.Friends (FriendName) VALUES ('Kendra Little'); GO
We now have a table. I was going to say you’ve got four friends, but we’re not your friends. Let’s take this slow, alright? We just met. You can start by buying us a drink first. Let’s see how the table is stored in SQL Server behind the scenes – look under the table, as it were:
DBCC IND('MyDatabaseName', 'Friends', -1);
This command is totally safe to run – it just lists out where SQL Server is storing your data. Replace ‘MyDatabaseName’ with your database’s name. The result is a list of pages where SQL Server stored the Friends table:
Data Files Are Broken Up Into 8KB Pages
These pages are the smallest unit of storage both in memory and on disk. When we write the very first row into a table, SQL Server allocates an 8KB page to store that row – and maybe a few more rows, depending on the size of our data. In our Friends example, each of our rows is small, so we can cram a bunch of ‘em onto a page. If we had bigger rows, they might take up multiple pages even just to store one row. For example, if you added a VARCHAR(MAX) field and stuffed it with data, it would span multiple pages.
Each page is dedicated to just one table. If we add several different small tables, they’ll each be stored on their own pages, even if they’re really small tables.
If we shut down the SQL Server, started it back up again, and then issued the following query:
SELECT * FROM dbo.Friends WHERE FriendName = 'Brent Ozar'
SQL Server would check to see what page the dbo.Friends table is on, then read our entire 8KB page from disk, and cache that 8KB page in memory. I say “entire” as if it’s a big deal, but I want to make a point here: pages are stored identically both in memory and on disk, and they’re the smallest unit of caching. If you use SQL Server’s data compression, the data isn’t uncompressed from the page until it needs to be read again to satisfy another query – you get the benefit of compression in memory as well as on disk.
What happens if we change a data page? For example, if we issue the following command, what happens:
INSERT dbo.Friends (FriendName) VALUES ('Lady Gaga');
That’s where the log file comes in.
What’s Stored in the Log File (LDF)
The log file is a sequential record of what we did to the data. SQL Server writes down, start to finish, what we’re trying to do to those helpless, innocent data pages.
Your first reaction is probably, “Wow, I never want to look in there because my users do horrible, unspeakable things to my database server.” Good news – SQL Server doesn’t need to log the SELECT statements because we’re not affecting the data, and that’s usually where the worst nastiness happens. Bad news – even if you did want to look in the log file, SQL Server doesn’t give you an easy way to do it. The log file exists for SQL Server, not for you.
When we insert, update, or delete rows in our table, SQL Server first writes that activity into the log file (LDF). The log file must get hardened to disk before SQL Server says the transaction is committed.
But not the change to the data page – that part doesn’t have to hit the disk right away. See, SQL Server knows you’re the kind of person who makes lots of changes to the same data, over and over. You’re a busy person with things to do and data to trash. SQL Server can keep the same data page in memory for a while, and then flush it out to disk later – as long as the log file was written.
When Windows crashes hard or somebody pulls the power cables out from under your SQL Server, SQL Server will use the database’s log file on startup. SQL uses the log file to reconcile the state of the data file, deciding which transactions should be applied to the data file and which ones should be rolled back.
How the Data File and Log File are Accessed
This starts to point to a significant storage difference between these two files.
Log files are written to sequentially, start to finish. SQL Server doesn’t jump around – it just makes a little to-do list and keeps right on going. Eventually when it reaches the end of the log file, it’ll either circle back around to the beginning and start again, or it’ll add additional space at the end and keep on writing. Either way, though, we’re talking about sequential writes. It’s not that we never read the log file – we do, like when we perform transaction log backups. However, these are the exception rather than the norm.
Data files, on the other hand, are a jumbled mess of stuff. You’ve got tables and pages all over the place, and your users are making unpredictable changes all over the place. SQL Server’s access for data files tends to be random, and it’s a combination of both reads and writes. The more memory your server has, the less data file reads happen – SQL Server will cache the data pages in memory and just work off that cache rather than reading over and over. This is why we often suggest stuffing your SQL Server with as much memory as you can afford; it’s cheaper than buying good storage.
Learning More About SQL Server Storage
Here’s a 30-minute webcast where I discuss data and log file basics:
And if I were you, here’s the questions I’d be asking next:
- If I add a clustered index, does that change how the data is stored?
- Do I need to worry about putting the pages in order, aka defragmenting my indexes?
- How do these concepts map up to storage best practices for SQL Server?
And here’s resources to play around with:
- Merrill Aldrich’s SQL Server File Layout Viewer plus Part 2 and Part 3
- Book: SQL 2008 Internals – if you really do want to go spelunking in page structures, this is your book. This isn’t an instructional book for how to be a better developer or DBA – it focuses purely on internals. If you’re looking for howto books, check out my SQL Server book recommendations.
- Our SQL Server Training for Developers – we’re doing a 2-day course in Atlanta this May focusing on SQL Server indexes, queries, and performance.