How to Download the Stack Overflow Database via BitTorrent

Time to make SQL Server demos a little more fun. Now you can download a torrent of the SQL Server database version of the Stack Overflow data dump.

As with the original data dump, this is provided under the cc-by-sa 3.0 license. That means you are free to share this database and adapt it for any purpose, even commercially, but you must attribute it to the original authors (not us).


How to Get the Database

  1. Install a BitTorrent client – I recommend Deluge, a fast, free, easy open source one.
  2. Download & open this .torrent file – it’s a small metadata file that tells your BitTorrent client where to connect and start downloading the files.
  3. Wait. The 9GB file may take a few hours to download depending on your internet connection and how many other people are seeding the torrent.
  4. Extract the .7z file with 7-Zip – it will create the database MDF, LDF, and a Readme.txt file, about 70GB in total. Don’t extract the files directly into your SQL Server’s database directories – instead, extract them somewhere else first, and then move or copy them into the SQL Server’s database directories. (This just avoids permissions hassles.)
  5. Attach the database – it’s in Microsoft SQL Server 2005 format, so you can attach it to any 2005 or newer instance. It doesn’t use any Enterprise Edition features like partitioning or compression, so you can attach it to Developer, Standard, or Enterprise Edition. (If your SSMS crashes or throws permissions errors, you likely tried extracting the archive directly into the database directory, and you’ve got permissions problems on the data/log files.)
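
If you’d rather attach with a script than the SSMS attach dialog, a minimal attach statement looks something like this. The C:\StackOverflow folder and the file names are assumptions – use whatever location you copied the files to, and check the Readme.txt for the actual MDF/LDF names.

CREATE DATABASE [StackOverflow] ON
    ( FILENAME = N'C:\StackOverflow\StackOverflow.mdf' ),     -- data file; path is an assumption
    ( FILENAME = N'C:\StackOverflow\StackOverflow_log.ldf' )  -- log file; path is an assumption
FOR ATTACH;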

Please leave the torrent up and running – seeding the torrent helps other folks get it faster.

Why I’m Using BitTorrent

BitTorrent is a peer-to-peer file distribution system. When you download a torrent, you also become a host for that torrent, sharing your own bandwidth to help distribute the file. It’s a free way to get a big file shared amongst friends.

The 9GB download is relatively large, so it would be expensive for me to host on a server. For example, if I hosted it in Amazon S3, I’d have to pay around $1 USD every time somebody downloaded the file. I like you people, but not quite enough to go around handing you dollar bills. (As it is, I’m paying for a seedbox to get this thing started.)

Some corporate firewalls understandably block BitTorrent because it can use a lot of bandwidth, and it can also be used to share pirated movies/music/software/whatever. If you have difficulty running BitTorrent from work, you’ll need to download it from home instead.

What’s Inside the StackOverflow Database

I want you to get started quickly while still keeping the database size small, so:

  • All tables have a clustered index
  • No other indexes are included (nonclustered or full text)
  • The log file is small, and you should grow it out if you plan to build indexes or modify data (a sample ALTER DATABASE follows this list)
  • It’s distributed as an mdf/ldf so you don’t need space to restore a backup – just attach it
  • It only includes Stack Overflow data, not data for the other Stack Exchange sites
  • The PostHistory table is not included
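
If you do plan to build indexes or change data, pre-growing the log file looks roughly like this. The logical file name StackOverflow_log and the 10GB target are assumptions – check sys.database_files for the real name, and size it for your workload.

-- Logical file name and target size below are assumptions; adjust for your environment.
ALTER DATABASE [StackOverflow]
    MODIFY FILE ( NAME = N'StackOverflow_log', SIZE = 10GB );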

To get started, here are a few helpful links:

Forgotten Maintenance – Cycling the SQL Server Error Log

Most of us get caught up in fragmentation, finding the slowest queries, and looking at new features. We forget the little things that make managing a SQL Server easier – like cycling the SQL Server error logs.

What’s the Error Log?

The SQL Server error log is a file that is full of messages generated by SQL Server. By default this tells you when log backups occurred, other informational events, and even contains pieces and parts of stack dumps. In short, it’s a treasure trove of information. When SQL Server is in trouble, it’s nice to have this available as a source of information during troubleshooting.

Unfortunately, if the SQL Server error log gets huge, it can take a long time to read the error log – it’s just a file, after all, and the GUI has to read that file into memory.

Keep the SQL Server Error Log Under Control

It’s possible to cycle the SQL Server error log. Cycling the error log starts a new file, and there are only two times when this happens.

  1. When SQL Server is restarted.
  2. When you execute sp_cycle_errorlog.

Change everything!

When SQL Server cycles the error log, the current log file is closed and a new one is opened. By default, these files are in your SQL Server executables directory in the MSSQL\LOG folder. Admittedly, you don’t really need to know where these are unless you want to see how much room they take up.

SQL Server keeps up to 6 error log files around by default. You can easily change this. Open up your copy of SSMS and:

  1. Expand the “Management” folder.
  2. Right click on “SQL Server Logs”
  3. Select “Configure”
  4. Check the box “Limit the number of error log files before they are recycled”
  5. Pick some value to put in the “Maximum number of error log files” box
  6. Click “OK”

It’s just that easy! Admittedly, you have to do this on every SQL Server that you have, so you might just want to click the “Script” button so you can push the script to multiple SQL Servers.
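
For reference, the script that the “Script” button generates is just a registry write – something along these lines, where 12 is an arbitrary example value:

USE [master];
GO
-- Sets the maximum number of error log files kept before they’re recycled (12 is an example).
EXEC xp_instance_regwrite N'HKEY_LOCAL_MACHINE',
    N'Software\Microsoft\MSSQLServer\MSSQLServer',
    N'NumErrorLogs', REG_DWORD, 12;
GO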

Automatically Rotating the SQL Server Error Log

You can set up SQL Server to automatically rotate your error logs. This is the easiest part of this blog post, apart from closing the window.

To cycle error logs on a regular basis, restart your SQL Server nightly.

Only joking.

You can set up a SQL Agent job with a T-SQL step. All it has to do is EXEC sp_cycle_errorlog. Schedule the SQL Agent job to run as frequently as you’d like and you’re good to go. The upside of this approach is that it’s automatic and the SQL Server error logs will be more granular, making it easier to find the error messages you’re looking for.
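
The job step itself is a one-liner – a minimal T-SQL step looks like this; schedule it weekly, monthly, or however granular you want the logs to be:

-- T-SQL job step: close the current error log and start a new one.
EXEC sp_cycle_errorlog;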

It’s Just That Easy!

Cycling the SQL Server error log is easy – you just need a regularly scheduled agent job. Rotating the logs makes it easier to find error messages. Let’s face it – you’re only looking for error messages when there’s a problem. That’s all there is to rotating the error logs.

Window Functions and Cruel Defaults

My First Post Here…

Well, my first technical post was about how the default index creation method is OFFLINE. If you want that sweet, sweet Enterpri$e Edition ONLINE goodness, you need to specify it. It’s been a while since that one; almost six months to the day. So here’s another one!

But Window Functions Are Awesome

Heck yeah they are. And how. Boy howdy. Etc. You get the point. I’m enthusiastic. What can be cruel about them? Glad you asked!

Window Functions, according to the almighty ANSI Standard, have two ways of framing data: RANGE and ROWS. Without getting into the implementation differences between the ANSI Standard and Microsoft’s versions, or any performance differences between the two, there’s a funny difference in how they handle aggregations when ordered by non-unique values. A simple example using the Stack Overflow database follows.

SELECT  [OwnerUserId] ,
        CAST([CreationDate] AS DATE) AS [DumbedDownDate] ,
        [Score] ,
        SUM([Score]) OVER ( ORDER BY CAST([CreationDate] AS DATE) ) AS [Not_Specified] ,
        SUM([Score]) OVER ( ORDER BY CAST([CreationDate] AS DATE) RANGE UNBOUNDED PRECEDING ) AS [Range_Specified] ,
        SUM([Score]) OVER ( ORDER BY CAST([CreationDate] AS DATE) ROWS UNBOUNDED PRECEDING ) AS [Rows_Specified]
FROM    [dbo].[Posts]
WHERE   [OwnerUserId] = 1
        AND CAST([CreationDate] AS DATE) BETWEEN '2008-08-01'
                                         AND     '2008-08-31'
ORDER BY CAST([CreationDate] AS DATE);

For the month of August, Year of Our Codd 2008, we’re getting a running total of the score for posts by UserId 1. Who is UserId 1? I’ll never tell. But back to the syntax! In the first SUM, we’re not specifying anything; for the next two, we’re specifying RANGE and then ROWS. Why? REASONS! And why am I casting the CreationDate column as a date? MORE REASONS!

Before you scroll down, think for a second:

If I don’t specify RANGE or ROWS, which will SQL Server use?
If I left the CreationDate column as DATETIME, what difference would it make to the output?

Do you see a pattern forming here?



When we don’t specify RANGE or ROWS, well, SQL Server is nice enough to pick RANGE for us. “Nice”.

Whose fault? Default!

Deep breaths, Erik. Deep breaths.

You should also notice the difference in how each different method aggregates data. When the ordering column has duplicates, RANGE, and by extension, the default method, will SUM all the values for that group at once. When ROWS is specified as the framing method, you see the running total that most people are after.
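
Here’s a tiny self-contained illustration of that difference, using made-up values rather than the Stack Overflow data – note how the two rows that share a date get the same RANGE total, while ROWS produces a true running total:

SELECT  d ,
        Score ,
        SUM(Score) OVER ( ORDER BY d RANGE UNBOUNDED PRECEDING ) AS [Range_Total] , -- 8, 8, 15
        SUM(Score) OVER ( ORDER BY d ROWS UNBOUNDED PRECEDING )  AS [Rows_Total]    -- e.g. 5, 8, 15 (tie order isn’t guaranteed)
FROM    ( VALUES ( CAST('2008-08-01' AS DATE), 5 ) ,
                 ( CAST('2008-08-01' AS DATE), 3 ) ,
                 ( CAST('2008-08-02' AS DATE), 7 ) ) AS v ( d, Score )
ORDER BY d;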

Make project managers happy!

And, of course, if all the values were unique, they’d do the same thing.

SELECT  [OwnerUserId] ,
        [CreationDate] ,
        [Score] ,
        SUM([Score]) OVER ( ORDER BY [CreationDate] ) AS [Not_Specified] ,
        SUM([Score]) OVER ( ORDER BY [CreationDate]  RANGE UNBOUNDED PRECEDING ) AS [Range_Specified] ,
        SUM([Score]) OVER ( ORDER BY [CreationDate]  ROWS UNBOUNDED PRECEDING ) AS [Rows_Specified]
FROM    [dbo].[Posts]
WHERE   [OwnerUserId] = 1
        AND CAST([CreationDate] AS DATE) BETWEEN '2008-08-01'
                                         AND     '2008-08-31'
ORDER BY [CreationDate];

Back for a day

Wrap. It. Up.

This one is pretty self explanatory. If you’re lucky enough to be on SQL Server 2012 or greater, and you’re using Window Functions to their full T-SQL potential, it’s way easier to calculate running totals. Just be careful how you write your code.

If you like this sort of stuff, check out Doug’s new video series, T-SQL Level Up. There are next to zero fart jokes in it.

Announcing Our 2016 In-Person Training Class Schedule

Jeremiah teaching in Portland

We just finished up the last of our 2015 training classes, and the reviews have been glowing:

“I have attended many training classes in my 19 years as a DBA, including Oracle and SQL Server. This is the first class where 75+% of the information I learned is applicable to my current work environment.” – Christina Z.

“I thought the class was much more useful than the Microsoft Official Curriculum. Those classes seem to teach to the book and instructors often can’t answer questions that aren’t included in the curriculum. You guys have real-world experience that you can share with the students. Very valuable!” – Kris C.

“They’re really best of the best on the market. And always open to QA sessions. And they asked do you have any questions? And ALWAYS have answers!!!!” – Vlad B.

And now, registration is open for next year’s training class lineup. Drum roll please:

Advanced Querying and Indexing – 5 days, $3,995

  • Jan 11-15, Newark
  • August 8-12, Portland

SQL Server Performance Troubleshooting – 4 days now expanded to 5 days, $3,995:

  • February 22-26, San Diego
  • April 11-15, Newark
  • June 6-10, Chicago
  • August 15-19, Portland
  • December 5-9, Philadelphia

The Senior DBA Class of 2016 – 5 days, $3,995

  • March 7-11, Chicago
  • May 2-6, Denver
  • August 1-5, Philadelphia

We’ve been updating each class to reflect new technologies and techniques, and we’re excited to share what we’ve been teaching our clients. And of course, if you’d like to bring one of us onsite to teach a custom class lineup, check out our training catalog and email us.

Announcing: Level Up Your T-SQL

You spend at least an hour or two each day writing T-SQL code. You have a pretty good hunch that it’s not as fast or crisp as it should be, and you wonder if you’re missing a few techniques. You’ve heard about date tables and windowing functions, but you’re just not comfortable with ’em yet.

I’m Doug Lane of Brent Ozar Unlimited, and my new online training class is for you. It’s loaded with T-SQL goodies like using computed columns, how to replace cursors, and more. Here’s a sneak preview:

It’s unlike any T-SQL training you’ve ever seen (seriously!), and it’s just $29. I’d love to hear what you think.

Our customers, who got free access to it as a thank-you, have already started leaving reviews:

“Great job on delivering information with a wink and a nod. It really held my interest.” – Debby

“Doug is an exceptional speaker, and uses humour in effective ways to convey difficult concepts. I learnt a thing or two from this course.” – Randolph

“If only l had seen this course when l started out my SQL journey! I love the teaching style it works well for me. Then there are those simple time saving tips slipped in just at the right time. The analogy about banking the coins, genius wish l had thought of it! Teaching pace it timed perfectly. Overall looking forward to seeing more :-)” – Robert

“This is perhaps the nerdiest thing I’ve seen all year.” – Wulfgar

“Great starter for thinking set-based!” – Gustavo

“I loved the Set Based sections with the examples of the cursors and how you could change the code to make them set based operations. But I think my favorite part was the windowing functions. Now I have good examples of how they work and when they are a good fit.” – Colin

“This course is funny and filled with a lot of good information. A great recap for any DBA. And a gold mine for anybody less experimented with coding T-SQL. The Windowing functions part is especially helpful. I suggest this course to anybody writing T-SQL.” – Benoit

What’s More Popular: SQL Server 2014, or SQL Server 2005?

Sure, you’d much rather work with SQL Server 2014, but what’s more prevalent out in the real world? At Dell DBA Days, I saw a chart that shocked me:

SQL Server version popularity as of Sept 2015

That chart shows the number of servers running Dell Spotlight Essentials, a free SQL Server monitoring tool. Note that SQL Server 2014 has 4% of the population, about 1/4 as much as SQL Server 2005.

But the terrifying part: for every 2 SQL Server 2014 instances, there’s one SQL Server 2000 instance still kicking around! That’s not exactly a rapid adoption rate.

I wonder why we aren’t upgrading our SQL Server 2000 instances? (And I’m not pointing the finger at you – it’s likely the business that’s making you keep it around.)

Getting Started with Diskspd

Diskspeed, or diskspd.exe, is Microsoft’s replacement for SQLIO. While I’m not going to replace our SQLIO tutorial, you can use the information in here to replicate the same type of tests on your systems to get the information you need. During the Dell DBA Days, Doug and I used diskspd as one of our techniques for getting a baseline of raw performance. We wanted to get an idea of how fast the servers and storage were before running SQL Server specific tests.

How do I get diskspd?

You can download diskspd directly from Microsoft – Diskspd, a Robust Storage Testing Tool, Now Publically Available. That page has a download link as well as a sample command.

The upside is that diskspd is a fully self-contained download. You don’t need Java, .NET, or anything else installed to run it. Apart from Windows – you’ll still need Windows.

How do I use diskspd?

With the command line, of course!

In all seriousness, although diskspd is the engine behind Crystal Disk Mark, it stands on its own as a separate tool. Download the executable and unzip it to an appropriate folder. There are going to be three sub-folders:

  • amd64fre – this is what you need if you have a 64-bit SQL Server
  • armfre
  • x86fre

I took the diskspd.exe file from the appropriate folder and dumped it in C:\diskspd so I could easily re-run the command. Let’s fire up a command prompt and try it out.

Here’s a sample that we ran: diskspd.exe -b2M -d60 -o32 -h -L -t56 -W -w0 O:\temp\test.dat > output.txt

Breaking it down:

  • -b2M – Use a 2 MB I/O size. For this test, we wanted to simulate SQL Server read ahead performance.
  • -d60 – Run for 60 seconds. I’m lazy and don’t like to wait.
  • -o32 – 32 outstanding I/O requests. This is a queue depth of 32.
  • -h – This disables both hardware and software buffering. SQL Server does this, so we want to be the same.
  • -L – Grab disk latency numbers. You know, just because.
  • -t56 – Use 56 threads per file. We only have one file, but we have 56 cores.
  • -W – Warm up the workload for 5 seconds.
  • -w0 – No writes, just reads. We’re pretending this is a data warehouse.
  • O:\temp\test.dat – our sample file. You could create a sample file (or files) by running diskspd with the -c<size> flag.
  • > output.txt – I used output redirection to send the output to a file instead of my screen.

How do I read diskspd results?

You’re going to get a lot of information back from this command. You’re going to want to close the window and back away quickly. Don’t. This is good stuff, I promise.

The first thing you’ll see is a recap of the command line you used. Then you’ll immediately see a summary of the commands:

timespan:   1
duration: 60s
warm up time: 0s
cool down time: 0s
measuring latency
random seed: 0
path: 'O:\temp\test.dat'
    think time: 0ms
    burst size: 0
    software and hardware write cache disabled
    performing read test
    block size: 2097152
    using interlocked sequential I/O (stride: 2097152)
    number of outstanding I/O operations: 32
    thread stride size: 0
    threads per file: 56
    using I/O Completion Ports
    IO priority: normal

That’s a lot easier than trying to read a set of command line flags. Six months from now, I can review older runs of diskspd and understand the options that I used. diskspd is already winning over sqlio.

Next up, you’ll see a summary of CPU information. This information will help you understand if your storage test is CPU bottlenecked – if you know the storage has more throughput or IOPS capability, but your tests won’t go faster, you should check for CPU bottlenecks. The last line of this section (and every section) will provide an average across all CPUs/threads/whatevers.

actual test time:   60.01s
thread count:       56
proc count:     56

CPU |  Usage |  User  |  Kernel |  Idle
   0|  23.17%|  10.36%|   12.81%|  76.83%
   1|   4.24%|   2.40%|    1.85%|  95.76%
   2|   9.71%|   7.37%|    2.34%|  90.29%
   3|   1.48%|   0.70%|    0.78%|  98.52%
avg.|  14.11%|  12.48%|    1.63%|  85.89%

After the CPU round up, you’ll see a total I/O round up – this includes both reads and writes.

Total IO
thread |       bytes     |     I/Os     |     MB/s   |  I/O per s |  AvgLat  | LatStdDev |  file
     0 |      2950692864 |         1407 |      46.89 |      23.44 | 1323.427 |   107.985 | O:\temp\test.dat (50GB)
     1 |      3013607424 |         1437 |      47.89 |      23.94 | 1310.516 |   141.360 | O:\temp\test.dat (50GB)
     2 |      2950692864 |         1407 |      46.89 |      23.44 | 1319.540 |   113.993 | O:\temp\test.dat (50GB)
     3 |      2950692864 |         1407 |      46.89 |      23.44 | 1315.959 |   122.280 | O:\temp\test.dat (50GB)
total:      167975583744 |        80097 |    2669.28 |    1334.64 | 1307.112 |   162.013

Look at all of those bytes!

If the I/O numbers initially seem small, remember that the data is split up per worker thread. Scroll down to the bottom of each section (total, reads, writes) and look at the total line. This rounds up the overall volume of data you’ve collected. The I/Os are recorded in whatever unit of measure you supplied. In our case, the I/Os are 2MB I/Os.

Important sidebar: Your storage vendor probably records their I/O numbers in a smaller I/O measurement, so make sure you do some rough translation if you want to compare your numbers to the advertised numbers. For more discussion, visit IOPS are a scam.

Finally, latency! Everybody wants to know about latency – this is part of what the end users are complaining about when they say “SQL Server is slow, fix it!”

  %-ile |  Read (ms) | Write (ms) | Total (ms)
    min |     13.468 |        N/A |     13.468
    max |   1773.534 |        N/A |   1773.534

This table keeps the min, max, and a variety of percentiles about how the storage performed while you were beating on it. This information is just as helpful as the raw throughput data – under load your storage may have increased latencies. It’s important to know how the storage will behave and respond under load.

How often should I use diskspd?

Ideally, you should use diskspd whenever you’re setting up new storage or a new server. In addition, you should take the time to use diskspd when you make big changes to storage – use diskspd to verify that your changes are actually an improvement. No, diskspd doesn’t include the work that SQL Server does, but it does show you how your storage can perform. Use it to make sure you’re getting the performance you’ve been promised by your storage vendor.

Performing Your Own SQL Server Triage

Our new Triage Specialist will be using tools, forms, and methodologies that we build for our SQL Critical Care® clients, but we also make a lot of those tools available to the public for free in our First Responder Kit.

Here’s how to do your own quick health assessment for free:

First, start your health check with sp_Blitz®. It’s a stored procedure that gives you a prioritized list of problems on your SQL Server. If the databases are all third party apps that you can’t control, use the @CheckUserDatabaseObjects = 0 parameter to focus the results on the things you can change. sp_Blitz® works on SQL Server 2005 and newer, but it doesn’t work on databases in SQL Server 2000 compatibility mode.
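
For example, on a server full of vendor databases, that call looks like this:

-- Skip per-database object checks when the databases are third party apps you can’t change.
EXEC sp_Blitz @CheckUserDatabaseObjects = 0;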

If you don’t think one of the resulting rows is important, or if you don’t understand the warning, copy the contents of the URL column into your browser and spend a few minutes reading the details. For example, at least once a month, we run across a server having performance problems, and it turns out not all of the CPU and memory is actually available to SQL Server. Sure enough, when we look at sp_Blitz®, it warns about CPU schedulers being offline, and the admin says something to the effect of, “Oh, I saw that in sp_Blitz®’s output, but I wasn’t really sure what it meant because I didn’t read the URL.”

Next, check your server’s bottlenecks with sp_AskBrent®. This stored proc was designed for live real-time performance troubleshooting, but it has another cool use. Use the @Seconds = 0, @ExpertMode = 1 parameters, and the second result set will show you your server’s primary wait stats since SQL Server was restarted (or since someone cleared the wait stats.) This helps guide you toward the type of performance bottleneck you’re troubleshooting – is it a CPU bottleneck, storage issue, or locking?
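
That call looks like this – the second result set is the one with the cumulative wait stats:

-- Skip the live sample; show wait stats since startup in expert mode.
EXEC sp_AskBrent @Seconds = 0, @ExpertMode = 1;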

Armed with that knowledge, check your top resource-intensive queries with sp_BlitzCache®. Once you know the kind of bottleneck you’re looking for, use the @sort_order parameter to get to the right queries (sample calls follow the list):

  • Looking for high CPU consumers? @sort_order = ‘cpu’
  • Tracking down the source of PAGEIOLATCH waits due to queries dragging a lot of data back from disk? @sort_order = ‘reads’
  • Wondering who’s getting blocked for long periods of time? Try @sort_order = ‘duration’
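
For example:

-- Sort the plan cache by whichever bottleneck the wait stats pointed at.
EXEC sp_BlitzCache @sort_order = 'cpu';      -- high CPU consumers
EXEC sp_BlitzCache @sort_order = 'reads';    -- heavy readers (PAGEIOLATCH waits)
EXEC sp_BlitzCache @sort_order = 'duration'; -- long-running or blocked queries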

Then make the queries more efficient with sp_BlitzIndex®. Now that you know the worst queries and the database they’re in, run sp_BlitzIndex® in that database and get a psychological profile of your indexes (a sample call follows this list).

  • Dastardly deadlocks? Look for the Aggressive Indexes warnings about high lock waits on an index.
  • Slow selects? Check out the high-value missing index warnings.
  • Intolerable inserts? Read the Hoarder warnings about tables that have a high number of nonclustered indexes that aren’t getting used.
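
The call itself is simple – the @DatabaseName parameter and the StackOverflow database name here are assumptions for illustration; you can also just run the procedure from inside the database in question:

-- Point sp_BlitzIndex at the database that owns the worst queries (names here are assumptions).
EXEC sp_BlitzIndex @DatabaseName = 'StackOverflow';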

We slather trademarks all over these tools because we’re really proud of ’em. We use them every single day in our consulting work, and we want you to use them in your triage work too. You might even wanna just practice using ’em in case we ever ask you to use ’em during our job interviews. (That’s a hint.)

Why RPO and RTO Are Actually Performance Metrics Too

Most companies come to us saying, “The SQL Server isn’t fast enough. Help us make it go faster.”

They’re kinda surprised when one of the first things we fill out together is a variation of our High Availability and Disaster Recovery Planning Worksheet.

They say things like, “Wait, I’m having a performance problem, not an availability problem.”

But as we start to look at the SQL Server, we often find a few disturbing truths:

  • The admins have stopped doing transaction log backups and DBCC CHECKDB because the server can’t handle the performance hit
  • Management thinks the database can’t lose any data because communication wasn’t clear between admins and managers
  • The server isn’t even remotely fast enough to start handling data protection, let alone the end user query requirements

I know you think I’m making this up. I know you find this hard to believe, dear reader, but not everyone is a diligent DBA like you. Not everyone has their RPO and RTO goals in writing, tests their restores, and patches their SQL Server to prevent known corruption bugs. I hope you’re sitting down when you read this, but there are some database administrators out there who, when given the choice between index rebuilds and transaction log backups, will choose the former on a mission-critical 24/7 system.

I’m sure that would never be you, dear reader, but these database administrators are out there, and they’re exactly the kinds of shops who end up calling us for help.

Then the fun part is that once we establish what the business really wants in terms of data safety, it often dictates a new server – say, moving from a single standalone VM with no HA/DR up to a full-blown failover cluster. And in the process of sketching out that new cluster, we can solve the performance problems at the same time without any changes to their application. (Throwing hardware at the problem isn’t the right answer all the time – but when you need to add automatic failover protection, it’s the only answer.)

That’s why we’re hiring a Triage Specialist: a person who can quickly assess the difference between what the business needs and what the system is really delivering, and get everyone on the same page as to what needs to happen next. And while you, dear reader, may not be the right person to apply for this position, I strongly encourage you to make sure your RPO/RTO knowledge is shored up. After all, you don’t want to have that awkward discussion with our Triage Specialist.

Scaling Up Our SQL Critical Care®

Since we’re hiring again, it’s a good time to explain what we do and how it’s evolved over the last four years.

When your SQL Server’s performance or reliability is causing you business pain, and you can’t get the relief you want, you call us for a SQL Critical Care®. We use specialized tools we’ve built to quickly get to the root cause. Then we explain it to you in clear, friendly terms, and train you how to get permanent pain relief.

Notice how we keep using medical terms? That’s no coincidence – we’re very much like a doctor’s office. Over the course of performing hundreds of these, and because the waiting room keeps filling up, we’ve scaled this to where we treat it like a hospital.

You can tell the drugs kicked in because I’m smiling.

Remember that time I broke my arm saving a nun from a car wreck carrying a flat panel monitor? I went into the hospital, the triage nurse made sure I wasn’t having a heart attack, then got me into radiology for scans, then a doctor reviewed the situation, and he brought in an orthopedic surgeon for a consult on whether they should operate. The hospital had all of these experts available, and they rotated in and out of the room based on my needs.

Because of our scale, we’re able to bring that same approach to SQL Server pain relief.

First, a Triage Specialist reviews your RPO and RTO objectives, learns how you use SQL Server, collects diagnostic information, and prescribes changes to avoid data loss. Then, based on what the Triage Specialist finds, they bring in different specialists. The situation calls for plan guides? You’ll get a T-SQL surgeon who’s used that unusual trick before. Need pain relief for an AG in EC2 that randomly fails over? You’ll be able to work with an AlwaysOnologist who’s built up specialized tools to diagnose it.

We couldn’t use this approach when we only had a few consultants, but now that we’re doing so many of these, fun options open up.