When the database administrator turns in her notice, what questions should you ask her in her last couple of weeks?
I’m assuming, of course, that it was a friendly departure and you’ve got the full two weeks to have good conversations.
Or maybe the DBA is about to go on vacation, and you need to quickly get up to speed about things that might break while they’re gone.
One of my clients’ DBAs turned in their resignation, so I was presented with this very task. I haven’t done it in a while, so I asked Twitter, and here’s what I got:
@BrentO overview of topology, security model, key contacts, hotspot servers, script library walk over, sqlpowerdoc all servers, …
— Johan Bijnens (@alzdba) January 13, 2014
@BrentO Any SA Passwords for which systems. Any processes using HIS(or her) credentials. Systems list, location of any documentation.
— Will (@SirWill) January 13, 2014
@BrentO ok ok list of power users, critical reports, critical jobs, problem areas, on-call rotations (of SEs), and where backups are stored.
— Jamie Bright (@BrightByNature) January 13, 2014
@BrentO 1) PWs and Backup Info 2) any custom scripts/sprocs written defined 3) documentation (if any) 4) contacts for any vendors used 5)
— Ryan McKnight (@Accidental_DBA_) January 13, 2014
I kind of expected answers like that – a lay of the land, and Buck Woody’s runbook is a good example – but there were so many good creative ideas:
@BrentO anything that's non standard and why it's like that. There's usually a good reason but the new person won't know that.
— Kent Chenery (@sqlconcepts) January 13, 2014
— Allan Hirt (@SQLHA) January 13, 2014
When you look at someone else’s server, it’s so easy to think, “This guy couldn’t have known what he was doing – nobody should ever set this trace flag or this database option or use this cumulative update.” Maybe not, though – maybe it’s critical to making the application work.
@BrentO Walk through of custom scripts/code that the company uses, i.e. roll-your-own index maintenance.
— Mike Fal (@Mike_Fal) January 13, 2014
How many of us left little snippets of code around that ended up becoming mission-critical? Often we don’t have source code control on these, either.
@BrentO "Do you have a copy of your DR plan?"
— Christopher Baigent (@BBassic) January 13, 2014
Does the company even have one?
@BrentO Another one (maybe not as crucial) when was the last time the backups were verified (IE: restored).
— John Morehouse (@SqlrUs) January 13, 2014
Have you tested a restore lately?
@BrentO "What's the current monitoring solution?" <- breaks out into SLA conversations, What's the most frequently encountered issue etc.
— John Sansom (@SqlBrit) January 13, 2014
And a followup – are you getting value out of it, or have you just set up email rules to bozo-bin all of the monitoring emails? It helps to know what to expect when the emails start coming your way.
— Chuck Rummel (@crummel4) January 13, 2014
Wow – that’s a fantastic answer. Often we just didn’t have the time to automate all of our problems, and we know that the flux capacitor will leak if we don’t clean it out weekly. Followed up with:
@BrentO a list of all the things they wanted to do but never got a chance to, due to being in constant firefighting mode. (aka time bombs)
— Chuck Rummel (@crummel4) January 13, 2014
Now might be a good time to get that project approved.
@BrentO A list of who contacted them the most and why, so you can learn undocumented hot spots of trouble. It may not all be in emails.
— Chuck Rummel (@crummel4) January 13, 2014
Being a DBA takes soft skills. Getting this inside info on the squeaky wheels helps me prepare for the incoming support requests. Which brings me to:
— John Sansom (@SqlBrit) January 13, 2014
And I’d follow that up with, “Why are you leaving?” Sometimes that helps reveal some of the biggest land mines.
I first knew something was up when I looked at the job history for a simple maintenance plan. It had two steps:
- Rebuild all the indexes in the database – this took 10 minutes each night.
- Update statistics – this took 2-3 hours each night.
Maintenance Plans light the fuse
I love the concept of maintenance plans, but I don’t love the way all the tasks are set up.
In the case I was looking at, the Update Statistics task was being used with two values that are set by default:
- Run against all statistics
- Update them with fullscan
“All” statistics means that both “column” and “index” statistics will be updated. There may be quite a lot of statistics — most people leave the “auto create statistics” option enabled on their databases, which means that queries will dynamically cause the creation of more and more statistics over time.
Combined with “fullscan”, updating all statistics can become a significant amount of work. “Fullscan” means that to update a statistic, SQL Server will scan 100% of the values in the index or column. That adds up to a lot of IO.
Why ‘SELECT StatMan’ repeatedly scans tables
If SQL Server needs to update column level statistics for the same table, it could potentially use a single scan and update multiple stats, right?
Because of the runtimes I was seeing, I was pretty sure that wasn’t happening. But we can take a closer look and see for ourselves.
In our maintenance plan task, if we hit “View TSQL”, a window pops up showing us the comamnds that the plan will run. (I love this feature, by the way!) We will use one of these commands to test things out in a bit.
First, let’s make sure we have some column level statistics on our database. It already has indexes and their associated stats. To create some column level stats, I run these queries:
--create two column stats using 'auto create statistics' select * from Person.Person where MiddleName like 'M%'; select * from Person.Person where Title is not null; GO --Create two filtered stats on Title create statistics kl_statstest1 on Person.Person (Title) where Title = 'Mr.' GO create statistics kl_statstest2 on Person.Person (Title) where Title = 'Ms.' GO
That will create two “auto” stats what start with “_WA_Sys”, and two stats that I named myself. To check ‘em out and see ALL the index and column stats on the table, we run:
exec sp_helpstats 'Person.Person', 'All'; GO
Sure enough, this shows us that we have seven stats total– three are related to indexes.
Alright, time to run that sample command excerpted from our maintenance plan. I start up an Extended Events trace to capture IO from sp_statements completed, then run the command the maintenance plan was going to use to update every statistic on this table with fullscan:
UPDATE STATISTICS [Person].[Person] WITH FULLSCAN GO
Here’s the trace output –click to view it in a larger image:
Looking at the Extended Events trace output, I can see the commands that were run as well as their logical reads. The commands look like this:
SELECT StatMan([SC0]) FROM (SELECT TOP 100 PERCENT [Title] AS [SC0] FROM [Person].[Person] WITH (READUNCOMMITTED) WHERE ([title]='Mr.') ORDER BY [SC0] ) AS _MS_UPDSTATS_TBL OPTION (MAXDOP 16)
The “logical_reads” column lets me know that updating four of these statistics had to do four separate scans of my table– and three of them are all on the Title column! (Doing a SELECT * FROM Person.Person shows 5,664 logical reads by comparison.)
IO was lower for statistics related to nonclustered indexes because those NC indexes have fewer pages than the clustered index.
A better way to update statistics: Let SQL Server pick the TABLESAMPLE
If you just run the TSQL command ‘UPDATE STATISTICS Person.Person’ (without telling it to scan all the rows), it has the option to do something like this:
SELECT StatMan([SC0], [SB0000]) FROM (SELECT TOP 100 PERCENT [SC0], step_direction([SC0]) over (order by NULL) AS [SB0000] FROM (SELECT [Title] AS [SC0] FROM [Person].[Person] TABLESAMPLE SYSTEM (3.547531e+001 PERCENT) WITH (READUNCOMMITTED) ) AS _MS_UPDSTATS_TBL_HELPER ORDER BY [SC0], [SB0000] ) AS _MS_UPDSTATS_TBL OPTION (MAXDOP 1)
It dynamically figures out a sample size by which to calculate results! (It can pick a variety of options– including scanning the whole thing.)
How to configure faster, better statistics maintenance
Avoid falling for the pre-populated settings in the “Update Statistics” task in the maintenance plan. It’s rare to truly need to use FULLSCAN to update stats in SQL Server, and even when cases where it’s justified you want to implement that with statements targeting the individual statistics to update. The basic “UPDATE STATISTICS Schema.TableName” command is pretty clever– the issue is simply that Maintenance Plans don’t make it easy for you to run that!
Unfortunately, if you use maintenance plans there’s no super simple solution– it forces you to specify either fullscan or a specific sample. There’s no way to just use the basic “You compute the minimum sample” with that task.
You’ve still got good options, they’re just a few more steps:
- You could use a t-sql related task or a custom SQL Agent job to run sp_updatestats
- You could use a free index and statistics maintenance script. The example I’ve linked to is super clever, and avoids updating statistics where it has just rebuilt an index!
- You could also let auto update stats take care of the issue– that’s often just fine on small databases or where there aren’t major data fluctuations
And each of those options should chew up less IO than updating all index and column statistics with FULLSCAN.
Everyone wants reports but nobody wants to build out a separate reporting server. What options do you have short of throwing up your hands in defeat? Join Jeremiah to learn about four SQL Server technologies that help with reporting in production. This session is for DBAs and developers looking for a place to get started with reporting against SQL Server.
For the links and scripts, check out the Reporting in Production: SQL Server page.
The plan cache holds a lot of secrets about what’s going on inside SQL Server. In the First Responder Kit we shared one of our plan cache scripts to find the top resource consuming queries. That query works well, but over time we’ve added some additional functionality to the query. I figured it was time to share the new query that we’re using to analyze SQL Server performance.
Our existing query looks at individual query stats, but it doesn’t take into account stored procedure or trigger execution stats. During our SQL Critical Care® checks, we’ve found it helpful to look at both procedures and triggers to figure out if they were causing problems for overall server health.
What Are We Looking At?
The original query just looked at
sys.dm_exec_query_stats and looked at average and total metrics across CPU, duration, and logical reads.
This was a helpful approach, but over time it’s become apparent that we needed to look at more than just individual statements – what if a single procedure was causing problems that only show up in aggregate?
The Bigger Picture
To get a view of the bigger picture, we added in two more DMVs -
sys.dm_exec_procedure_stats. This gives us a big picture view of what’s going on inside SQL Server – the only thing that would make this better would be a DMV for function execution stats.
To avoid skewing results, data is aggregated by the
query_hash - unfortunately this means the queries won’t work against SQL Server 2005. Not only do we rank queries by CPU, IO, duration, and execution count, but a second level of ranking is provided that ranks queries within their logical grouping – by statement, procedure, and trigger. If you want to see what your most expensive trigger is doing, it’s easy enough to make a quick change to the query.
Check it out:
You can download the script in the usual way – by agreeing to our crazy terms of service, selling your email address to a foreign government, and clicking “download”.
Using the Query
This query is easy enough to start using right way – just download the file and run it. Once you’re familiar with it, scroll to the bottom and you’ll see two separate queries you can run. The first is suitable for pasting into Excel – it has no query plan and the SQL text is shortened to easily paste into a single cell. The second query has everything that you want.
Since data is dumped into a temporary table during analysis, it’s easy to keep re-querying the temporary table as you re-sort data or refine what you’re looking for.
Ever wonder how someone else does it? There’s no right way or wrong way, but in this 20-minute session, you can peer over Brent’s shoulder (virtually) while he takes a few Stack Overflow queries, tries various techniques to make them faster, and shows how he measures the before-and-after results.
For the links and scripts, check out the Watch Brent Tune Queries page.
Check out SQLServerBuilds.blogspot.com, a site that lists cumulative updates and service packs. Here’s a remixed version:
The first service pack seems to come out fairly quickly, but after the first one, it’s a year or more. Makes sense – you find a lot of bugs quickly, right?
Microsoft released at least one service pack per year, but … not last year. 2013 saw no service pack releases, and it’s been 442 days since the last service pack (SQL 2012 SP1) shipped.
Is Microsoft taking too long? Let’s look at each current version missing a service pack:
- SQL 2012 has been waiting 442 days for SP2 – but that’s actually right in the range for normal SP2 releases. SQL Server 2008 went 540 days before SP2.
- SQL 2008R2 has been waiting 545 days for SP3 – but that’s also less time than it took for SQL 2005 to get its SP3, so no cause for panic here yet.
- SQL 2008 has been waiting 839 days for SP4 – here we actually do see some cause for alarm, because no supported SQL Server version has gone that long without a service pack.
When you step back and take the long view, we’re not really in that bad of shape yet – but I can see why people would be disturbed that no service packs have been released in over a year. It might just be a timing coincidence, or it might be something more.
But it does beg the question – what if Microsoft just stopped releasing SQL Server service packs altogether, and the only updates from here on out were hotfixes and cumulative updates? How would that affect your patching strategy? Most shops I know don’t apply cumulative updates that often, preferring to wait for service packs. There’s an impression – correct or not – that service packs are better-tested than CUs.
When is a “Full Scan” not a Full Scan?
The performance counter “Full scans/sec” sounds pretty scary. If you see spikes in this counter, you’re likely to think that your queries are madly scanning all the data in your tables– there are no stylish, well performant seeks, just drunken queries spewing IO everywhere.
If you look up the counter in SQL Server Books Online, it seems to confirm that. Full Scans/sec is defined as the “Number of unrestricted full scans per second. These can be either base-table or full-index scans.” It’s not just a FULL scan, it’s “unrestricted.” That’s sounds really bad, right?
Let’s take a look. You can warm up your test SQL Server instance and restore your favorite sample database to play along with the sample queries below using the AdventureWorks2012 database and perfmon.
Test 1: A full scan
Here’s our first test query. This query technically does an index scan, but it’s a very efficient partial scan– it uses just 3 logical reads to hop to the end of the index and grab the last value.
--This will run 500 times declare @val1 int; select @val1=max(BusinessEntityID) from Person.Person; GO 500
500 of these finish in less than one second. Here’s what the plan looks like. (If you look deep into the properties of the scan operator, you’ll find that it says that it’s a special “Backwards” scan for speed.)
In perfmon, I can see that this query did the following:
- 500 Full Scans/sec (SQL Server:Access Methods)
- 1,908 Page lookups/sec (SQL Server: Buffer Manager – this is logical reads)
- 0 Page reads/sec (SQL Server: Buffer Manager — this is physical reads and the data was already in memory)
Hmm, so what we’ve seen here is that we can make the full scans/sec counter spike with a scan that isn’t really “unrestricted” or “full” in the way that we might think.
Full scan does NOT really mean that it read the entire index! (I’m not the first to discover this, of course. Michael J Swart gives a nice example of a TOP query doing something similar in this post on different types of scans.)
Test 2: a “worse” full scan
Let’s make our test query a bit nastier. This new version of the query is not such a great citizen — it reads 3,820 pages on each run (as opposed to 3 reads from the query before):
--The variable here is just to keep SSMS from choking on the output declare @val1 xml; select @val1 = AdditionalContactInfo from Person.Person; GO 500
This is also a scan but it does more IO. 500 of these take five seconds or so– much longer. Here’s the plan:
When we run this, we see that:
- Full scans/sec only spikes up to a max value of 97 this time– it’s much LOWER, even though we hit the SQL Server with a nastier query. We just ran this from a single SQL Server session, so it didn’t spike up as much and took longer– but it certainly isn’t better than the first query.
- Page lookups/sec spikes up to a max value of 362,161. Each run of the query is doing more logical IO– and the fact that this query does more work shows up much more clearly here.
- Page reads/sec just has a slight bump and at one point 8 reads/sec are registered. The data was already in memory, so this wasn’t really impacted at all.
The “worse” query didn’t spike the full scans/sec counter, but it did more IO. If we were designing monitoring around the full scans/sec counter then we’d probably just throw an alert on the incident that actually had less impact.
Pulling it all together
Monitoring is a tricky business. Performance counters tell you a lot about SQL Server– but if you monitor on individual thresholds on many performance counters, you’ll end up with false alarms and more questions than answers. The counters shown in this post can be useful for baselining, trending, and interpreting in a larger context of information, but by themselves they don’t tell you a full story.
If you’re getting frustrated with perf counters and want to learn what’s really dragging down your SQL Server, check out our free webcasts, our free scripts, or our performance tuning training videos.
There will come a time when your database backups will not be as fast as you would like them to be. Over time, the volume of data you’re storing will grow, and backups will take longer – but your weekly or daily maintenance windows don’t grow as well. Backups can be resource-intensive, using I/O and network resources – and if they’re running when users are trying to work, the users may notice the performance impact.
Here are two ways to make your backups go faster.
Option A: Full and Differential backups
A typical backup scenario that I see is a full backup daily and, if the database is in Full recovery model, log backups throughout the day. But what happens when that full backup begins to take hours, instead of minutes? A differential backup may help.
A full backup contains all the data in the database, and enough of the log to recover it. A differential backup contains only the extents (groups of 8K data pages) that have changed since the last full backup. Because a differential is usually smaller than a full backup, it can take less time to complete.
The commands to run the backups are very similar. A full backup:
BACKUP DATABASE AdventureWorks2012 TO DISK=N'E:\SQL Backups\AdventureWorks2012\AdventureWorks2012_Full.bak';
To perform a differential backup, you add the WITH DIFFERENTIAL option:
BACKUP DATABASE AdventureWorks2012 TO DISK=N'E:\SQL Backups\AdventureWorks2012\AdventureWorks2012_Diff.bak' WITH DIFFERENTIAL;
Combining full and differential backups throughout the day or week can reduce the time you spend waiting for backups to complete. There are two scenarios I usually see. The first is a full backup one or two days per week, with differential backups on the other days, and regular transaction log backups. This makes sense when a full backup every day would be too time-consuming. The second is a full backup daily, differentials every few hours, and regular log backups. This usually makes sense when fast recovery is necessary, and it would be better to be able to restore a full backup, a differential backup, and a subset of log backups – rather than the full backup and all log backups for the day.
This can add a layer of complexity to a restore strategy – instead of restoring the last full backup and ensuing logs, the most recent differential must be accounted for as well. You should practice this restore sequence so you understand the steps involved. You also need to monitor the size of the differential backups closely – depending on the rate of change in the database, you may not be saving as much space as hoped.
However, when your time to do backups and have them impact your users is shrinking, yet database size is growing, the full + differential strategy can be used effectively.
Option B: Backing up to multiple files
The standard database or log backup is done to one backup file:
BACKUP DATABASE AdventureWorks2012 TO DISK=N'E:\SQL Backups\AdventureWorks2012\AdventureWorks2012_Full_20140109.bak' WITH Name = 'AdventureWorks2012 Full One File';
When a backup is written to one file, one thread is assigned. You can back up to multiple files – in one or more locations – and multiple threads will be used. This can make a backup take less time – sometimes significantly. Here’s the command to create a full backup across four disks:
BACKUP DATABASE AdventureWorks2012 TO DISK=N'E:\SQL Backups\AdventureWorks2012\AdventureWorks2012_Full_1__20140109.bak', DISK=N'F:\SQL Backups\AdventureWorks2012\AdventureWorks2012_Full_2__20140109.bak', DISK=N'G:\SQL Backups\AdventureWorks2012\AdventureWorks2012_Full_3__20140109.bak', DISK=N'H:\SQL Backups\AdventureWorks2012\AdventureWorks2012_Full_4__20140109.bak' WITH Name = 'AdventureWorks2012 Full 4 Files';
Having multiple backup files can make a restore more complicated. All of the files must be available, and all must be referenced in a restore session. To restore the backup files shown below, use the following command:
RESTORE DATABASE AdventureWorks2012 FROM DISK=N'E:\SQL Backups\AdventureWorks2012\AdventureWorks2012_Full_1__20140109.bak', DISK=N'F:\SQL Backups\AdventureWorks2012\AdventureWorks2012_Full_2__20140109.bak', DISK=N'G:\SQL Backups\AdventureWorks2012\AdventureWorks2012_Full_3__20140109.bak', DISK=N'H:\SQL Backups\AdventureWorks2012\AdventureWorks2012_Full_4__20140109.bak' WITH NORECOVERY; RESTORE DATABASE AdventureWorks2012 WITH RECOVERY;
However, the benefits can be significant. I performed five tests – one file on one disk, two files on one disk, four files on one disk, two files on two disks, and four files on four disks. The results show that adding more disks decreases the backup time.
Note: these tests don’t cover all scenarios. If you have a group of disks in a RAID 5 or RAID 10 array, multiple threads writing to multiple files could be faster than one thread and one file. Lesson learned: test your backup files in your environment to determine what is best for you!
Backups Are Your Best Friend
And just like your best friend, you want to spend some time with them on a regular basis, getting to know them better. Learning tricks like the two I’ve given you here can help you when you face a situation where you need to improve performance. Having experience with these options – even if you do them in a sandbox environment or test it on AdventureWorks – is a huge asset.
Want to know even more about backups, restores, and cool things you can do with them? Check out my Backup & Recovery Step by Step training!
The latest version of our SQL Server health check stored procedure is out today. Here’s what we’ve added in the last couple of versions – big thanks to the folks who keep making this even better for the community:
Changes in v33 – January 20, 2014:
Bob Klimes fixed a bug that Russell Hart introduced in v32, hahaha. Check 59 was false-alarming on Agent jobs that actually had notifications.
Changes in v32 – January 19, 2014:
- Russell Hart fixed a bug in check 59 (Agent jobs without notifications).
- Added @EmailRecipients and @EmailProfile parameters to send the results via Database Mail. Assumes that database mail is already configured correctly. Only sends the main results table, and it will not work well if you also try to use @CheckProcedureCache. Execution plans will not render in email.
- Fixed a bug in checks 108 and 109 that showed poison waits even if they had 0ms of wait time since restart.
- Removed check 120 which warned about backups not using WITH CHECKSUM. We fell out of love with WITH CHECKSUM – turns out nobody uses it.
- Added check 121 – Poison Wait Detected: Serializable Locking – looking for waits with %LCK%R%. Happens when a query uses a combination of lock hints that make the query serializable.
- Added check 122 – User-Created Statistics In Place. There is nothing wrong with creating your own statistics, but it can cause an IO explosion when statistics are updated.
- Added check 123 – Multiple Agent Jobs Starting Simultaneously. Ran into an issue where dozens of jobs started at the exact same time every hour.
Changes in v31 – December 1, 2013:
- Dick Baker, Ambrosetti Ltd (UK) fixed typos in checks 107-109 that looked for the wrong CheckID when skipping checks, plus improved performance while he was in there.
- Dick also improved check 106 (default trace file) so that it will not error out if the user does not have permissions on sys.traces.
- Christoph Muller-Spengler @cms4j added check 118 looking at the top queries in the plan cache for key lookups.
- Philip Dietrich added check 119 for TDE certificates that have not been backed up recently.
- Ricky Lively added @Help to print inline help. I love his approach to it.
- Added check 120 looking for databases that have not had a full backup using the WITH CHECKSUM option in the last 30 days.
During next week’s Watch Brent Tune Queries webcast, I’m using my favorite demo database: Stack Overflow. The Stack Exchange folks are kind enough to make all of their data available via BitTorrent for Creative Commons usage as long as you properly attribute the source.
There’s two ways you can get started writing queries against Stack’s databases – the easy way and the hard way.
The Easy Way to Query StackOverflow.com
Point your browser over to Data.StackExchange.com and the available database list shows the number of questions and answers, plus the date of the database you’ll be querying:
At the time of this writing, the databases are updated every Monday. If you want even more recent data, you can use the Stack Exchange API, but that’s a story for another day.
Click on the site you’d like to query, and you’ll get a list of queries you can start with, or click Compose Query at the top right. As an example, let’s look at a query that compares the popularity of tags:
Yes, this is a lot like SQL Server Management Studio in the browser. At the top, we’ve got our query, plus space for a couple of parameters. One of the fun parts about Data Explorer is that you can design queries to take parameters to show information for different users, date ranges, tags, etc.
At the bottom, notice the tabs for Results, Messages, and Graph. If your results look graph-friendly, Data Explorer is smart enough to figure that out:
And yes, Data Professional, that last tab does indeed say Execution Plan, and it renders in your browser right down to the ability to hover your mouse over parts of the plan and see more details:
Some system commands (like SET STATISTICS IO ON) are allowed, but you can’t create indexes, and there aren’t many indexes to begin with. You can also shoot yourself in the foot by writing an extraordinarily ugly query, and the system won’t stop you – for example, SELECT * FROM Posts will start running, but then may crash your browser as they start returning data. Jeremiah and I managed to repeatedly kill our Chrome browsers while tuning queries for fun.
I like using this to go poking around for unusual questions or answers. For example, I like to find questions that are viewed a lot, but don’t have any upvoted answers yet. (That’s prime territory for a SQL Server geek like me that wants to find tough questions to solve.)
The Hard Way to Query StackOverflow.COM
First, you’ll need to download a copy of the most recent XML data dump. These files are pretty big – around 15GB total – so there’s no direct download for the entire repository. There’s two ways you can get the September 2013 export:
- Download them from Mega - which lets you pick the specific sites you’d like to download.
- Download them all at once via BitTorrent with the StackExchange torrent link at Archive.org
I strongly recommend working with a smaller site’s data first like DBA.StackExchange. If you decide to work with the monster StackOverflow.com’s data, you’re going to temporarily need:
- ~15GB of space for the download
- ~60GB after the StackOverflow.com exports are expanded with 7zip. They’re XML, so they compress extremely well for download, but holy cow, XML is wordy.
- ~50GB for the SQL Server database (and this will stick around)
Next, you need a tool to load that XML into the database platform of your choosing. For Microsoft SQL Server, I use Jeremiah’s improved version of the old Sky Sanders’ SODDI. Sky stopped updating his version a few years ago, and it’s no longer compatible with the current Stack dumps. Jeremiah’s current download is here, and it works with the January 2014 data dump. (The previous version works on the September 2013 data dump.)
The SODDI user interface expects the XML files to be stored in a very specific folder name: MMYYYY SITEINITIALS, like 092013 SO. SODDI will import multiple sites, and it creates a different schema for each site. I just want to import Stack Overflow by itself, all in its own database, so I like naming my folder “092013 dbo”. That way, it creates the tables in the dbo schema, which is just a little friendlier for demos.
When you run SODDI.exe without parameters, this GUI pops up (assuming that you named your Stack Overflow demo folder 092013 dbo):
Source is the folder where you saved the data dumps. It expects to see subfolders in there for 092013 dbo.
Target is the connection string for your database server. I’m using a local SQL Server (note that I picked SqlClient in the Provider dropdown) with a database named StackOverflow, so my connection string is:
Data Source=(local); Initial Catalog=StackOverflow; Integrated Security=True
If you want to use a remote SQL Server, you’d put its name in there instead of (local). You’ll also need to pre-create the database you want to use.
Click Import, and after a lot of disk churn, you’re rewarded with a StackOverflow database with tables for Badges, Comments, Posts, PostTypes, Users, Votes, and VoteTypes.
The resulting database is about 50GB. SQL Server’s data compression doesn’t work too well here because most of the data is off-row LOBs. Backup compression works well, though, with the resulting backup coming in at around 13GB.
Why Go to All This Work?
When I’m teaching performance tuning of queries and indexes, there’s no substitute for a local copy of the database. I want to show the impact of new indexes, analyze execution plans with SQL Sentry Plan Explorer, and run load tests with HammerDB.
That’s what we do in our SQL Server Performance Troubleshooting class – specifically, in my modules on How to Think Like the Engine, What Queries are Killing My Server, T-SQL Anti-patterns, and My T-SQL Tuning Process. Forget AdventureWorks – it’s so much more fun to use real StackOverflow.com data to discover tag patterns, interesting questions, and helpful users.