A Presenter’s Guide to the Stack Overflow Database

Last Updated August 8, 2018

You present on SQL Server topics at user groups and conferences, and you’ve been wondering how to get started with the Stack Overflow public database. Here’s a quick list of things to know.

For stable demos, use StackOverflow2010. This smaller 10GB database has data from the first years of Stack Overflow’s history. It doesn’t change, so you don’t have to worry about updating your screenshots and metrics every time Stack releases a new data dump.

Your attendees can download it from BrentOzar.com/go/querystack as a 1GB zip file with a SQL 2008 database, and then extract it to the full database. No registration is required. You’re also welcome to distribute that database yourself.

Every table has Id as a clustering key. That by itself isn’t a big deal, but here’s where it gets awesome: Stack Overflow’s URLs are all driven by that key. Look at your URL bar in your browser as you’re surfing StackOverflow.com, and you’ll start to recognize what’s going on:

https://stackoverflow.com/users/22656/jon-skeet

Simply drop off the strings at the end of the URLs, and these work as well, making it really fun to show the pages of the data you’re looking at:

https://stackoverflow.com/users/22656

Speaking of 22656, Jon Skeet’s data is unusual. Jon Skeet is a legendary user at Stack Overflow with over a million reputation points. He’s user id #22656, and that’s one number you’ll probably end up memorizing. If you want to do a parameter sniffing demo or show wild swings in data distribution, 22656 is your man.

dbo.Posts contains both questions and answers. Most of Stack’s tables are fairly intuitive, but this one is a bit of a gotcha. To join them, use the ParentId field:

SELECT *
  FROM dbo.Posts q
  LEFT OUTER JOIN dbo.Posts a ON q.Id = a.ParentId;

1

2

3

SELECT *

FROM dbo.Posts q

LEFT OUTER JOIN dbo.Posts a ON q.Id = a.ParentId;

Note that I’m using a left outer join because not all questions have answers.

Posts are more than just questions and answers, too – the dbo.PostTypes table lists other kinds of posts, like Wiki, TagWiki, TagWikiExcerpt. Again, note the lumpy data distribution – this database is absolutely fantastic for lumpy distribution by date, time, scores, lengths of strings, you name it. It’s real data from real humans – just like your day job – and it’s fantastically unreliable and fun.

Data.StackExchange.com has lots of useful queries. When I wanna find real-world queries to show for tuning examples, it’s a great place to start. Just make sure you properly credit the query’s author and link back to the query’s page. Note that when you click on a query link, you’ll see the results instantly – that doesn’t mean the query is fast. Stack caches those query results.

The database schema isn’t exactly Stack’s current live schema. The database reflects the public data dump, not a backup of Stack Overflow’s database. For example, on dbo.Posts, the Tags column stores the tags for a particular question. If you want to find queries for a given tag, you have to do a string search for ‘%<sql-server>%’ – but that isn’t necessarily indicative of how the live site searches for tags today. I love it, though, because it shows how a lot of real-world databases work.

For questions about it, hit Meta. Meta.StackExchange.com is the Q&A site that asks questions about Stack Overflow itself. There’s a good starter post for the database documentation. When you see the term SEDE, it’s referring to Stack Exchange Data Explorer, aka data.stackexchange.com.

Joe Obbish
August 17, 2018 5:47 pm

Great post, Brent!

I use the bigger database for one of my presentations just because some of my demos need lots of data in the database. I work around the demo stability problem by only taking a new cut of the data when I’m ready to invest time in verifying that all of the demos still work. I agree with you that you should use the 10 GB database if you can, if for no other reason that it makes it easier for your audience to go home and play around with your demos if they wish.

Reply
- Brent Ozar
  August 18, 2018 7:50 am
  
  Whoa, that’s odd – I could have sworn you were a dedicated AdventureWorks guy. I really liked your epic “Trillion Row Sales.SalesOrderDetail” post.
  
  Reply
David Wood
September 21, 2018 9:38 am

Hey Brent,
I’m just getting started with the StackOverflow database (the small one). So far it doesn’t look too bad. I wondered if you have any insert query samples already prepared to manually load data into it? I trying to evaluate the impact of CDC on 2017 and wanted something a little more weighty than simply performing insert/update/deletes to the PostTypes table.

Thanks for all you do for the SQL community! I’m always learning something new from you and your team.

Reply
- Brent Ozar
  September 21, 2018 9:46 am
  
  David – thanks, glad you like our work! Yes, we do have workloads as part of our Mastering classes, but I don’t share those here. You’re welcome to join the classes though! 😀
  
  Reply