You present on SQL Server topics at user groups and conferences, and you’ve been wondering how to get started with the Stack Overflow public database. Here’s a quick list of things to know.
For stable demos, use StackOverflow2010. This smaller 10GB database has data from the first years of Stack Overflow’s history. It doesn’t change, so you don’t have to worry about updating your screenshots and metrics every time Stack releases a new data dump.
Your attendees can download it from BrentOzar.com/go/querystack as a 1GB zip file with a SQL 2008 database, and then extract it to the full database. No registration is required. You’re also welcome to distribute that database yourself.
Every table has Id as a clustering key. That by itself isn’t a big deal, but here’s where it gets awesome: Stack Overflow’s URLs are all driven by that key. Look at your URL bar in your browser as you’re surfing StackOverflow.com, and you’ll start to recognize what’s going on:
Simply drop off the strings at the end of the URLs, and these work as well, making it really fun to show the pages of the data you’re looking at:
Speaking of 22656, Jon Skeet’s data is unusual. Jon Skeet is a legendary user at Stack Overflow with over a million reputation points. He’s user id #22656, and that’s one number you’ll probably end up memorizing. If you want to do a parameter sniffing demo or show wild swings in data distribution, 22656 is your man.
dbo.Posts contains both questions and answers. Most of Stack’s tables are fairly intuitive, but this one is a bit of a gotcha. To join them, use the ParentId field:
FROM dbo.Posts q
LEFT OUTER JOIN dbo.Posts a ON q.Id = a.ParentId;
Note that I’m using a left outer join because not all questions have answers.
Posts are more than just questions and answers, too – the dbo.PostTypes table lists other kinds of posts, like Wiki, TagWiki, TagWikiExcerpt. Again, note the lumpy data distribution – this database is absolutely fantastic for lumpy distribution by date, time, scores, lengths of strings, you name it. It’s real data from real humans – just like your day job – and it’s fantastically unreliable and fun.
Data.StackExchange.com has lots of useful queries. When I wanna find real-world queries to show for tuning examples, it’s a great place to start. Just make sure you properly credit the query’s author and link back to the query’s page. Note that when you click on a query link, you’ll see the results instantly – that doesn’t mean the query is fast. Stack caches those query results.
The database schema isn’t exactly Stack’s current live schema. The database reflects the public data dump, not a backup of Stack Overflow’s database. For example, on dbo.Posts, the Tags column stores the tags for a particular question. If you want to find queries for a given tag, you have to do a string search for ‘%<sql-server>%’ – but that isn’t necessarily indicative of how the live site searches for tags today. I love it, though, because it shows how a lot of real-world databases work.
For questions about it, hit Meta. Meta.StackExchange.com is the Q&A site that asks questions about Stack Overflow itself. There’s a good starter post for the database documentation. When you see the term SEDE, it’s referring to Stack Exchange Data Explorer, aka data.stackexchange.com.