A Gentle Introduction To the Stack Overflow Schema

Stack: Dumped

Wanna learn the schema of the free Stack Overflow public database? If you want the Full Nelson, it’s all documented here. That’s a great resource if you’re looking to learn more, but generally when you’re looking at a data dump and building up demo queries, you wanna get an idea of where the data skews, and how the tables relate.

What took me forever to break out of was always using the Id column in the Users table — it’s the PK/CX, and it’s just too easy for the optimizer to make good query plans from that. To help you get the optimizer making some bad choices, here are some other ways to think of the table relationships.

This isn’t exhaustive, it’s just for the four main tables that we use when writing demo queries. (Note: I’m using > to denote relationships, not that one side is greater than the other. They’re both great. They both get a gold star on the fridge.)

Badges
Comments
Users
Posts
Votes

Toolin’ Around

To get an idea about where interesting data distributions live, I generally use these queries. Feel free to change or remove the TOP N parts according to your needs.

All the dates are stored as DATETIME, so I tend to flatten them to DATEs. You can go further and group them by year, but it generally trends smallest to largest from 2008 to current. If the data dump you’re working with is from a partial year (like 2016-03), then the current year will obviously have less in it.

Badges
Comments
Users
Votes
Posts
Anyway, I hope you find these useful!

Thanks for reading!

Brent says: if you’re still stumbling along with that crappy ol’ AdventureWorks because the Stack database is too big, good news! We’ve released a new StackOverflow2010 database that’s only a 1GB download, no torrenting required, and extracts to a comfy 10GB database with data from the first few years of Stack Overflow’s existence, 2008-2010. That’s also kinda nice as a presenter because you don’t have to worry about the data changing and breaking your demos.

Previous Post
Functions Can Still Use Indexes, Kinda.
Next Post
[Video] Office Hours 2018/1/31 (With Transcriptions)

11 Comments. Leave new

Leave a Reply

Your email address will not be published. Required fields are marked *

Fill out this field
Fill out this field
Please enter a valid email address.