Stack: Dumped
Wanna learn the schema of the free Stack Overflow public database? If you want the Full Nelson, it’s all documented here. That’s a great resource if you’re looking to learn more, but generally when you’re looking at a data dump and building up demo queries, you wanna get an idea of where the data skews, and how the tables relate.
What took me forever to break out of was always using the Id column in the Users table — it’s the PK/CX, and it’s just too easy for the optimizer to make good query plans from that. To help you get the optimizer making some bad choices, here are some other ways to think of the table relationships.
This isn’t exhaustive, it’s just for the four main tables that we use when writing demo queries. (Note: I’m using > to denote relationships, not that one side is greater than the other. They’re both great. They both get a gold star on the fridge.)
Badges
1 2 3 4 |
Badges.UserId > Users.Id Badges.UserId > Posts.OwnerUserId Badges.UserId > Comments.UserId Badges.UserId > Votes.UserId (but this is largely anonymized) |
Comments
1 2 3 4 5 |
Comments.UserId > Users.Id Comments.UserId > Comments.UserId Comments.UserId > Votes.UserId (but this is largely anonymized) Comments.PostId > Posts.Id Comments.PostId > Votes.PostId |
Users
1 2 3 4 |
Users.Id > Posts.OwnerUserId Users.Id > Comments.UserId Users.Id > Badges.UserId Users.Id > Votes.UserId (but this is largely anonymized) |
Posts
1 2 3 4 5 6 |
Posts.Id > Comments.PostId Posts.Id > Votes.PostId Posts.OwnerUserId > Users.Id Posts.OwnerUserId > Badges.UserId Posts.OwnerUserId > Comments.UserId Posts.OwnerUserId > Votes.UserId (but this is largely anonymized) |
Votes
1 2 3 4 5 |
Votes.UserId > Users.Id Votes.UserId > Badges.UserId Votes.UserId > Comments.UserId Votes.UserId > Posts.OwnerUserId Votes.PostId > Comments.PostId |
Toolin’ Around
To get an idea about where interesting data distributions live, I generally use these queries. Feel free to change or remove the TOP N parts according to your needs.
All the dates are stored as DATETIME, so I tend to flatten them to DATEs. You can go further and group them by year, but it generally trends smallest to largest from 2008 to current. If the data dump you’re working with is from a partial year (like 2016-03), then the current year will obviously have less in it.
Badges
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 |
/*Badges*/ SELECT TOP 10 * FROM dbo.Badges; /*count by user id*/ SELECT UserId, COUNT(*) AS records FROM dbo.Badges AS b GROUP BY b.UserId ORDER BY records DESC /*count by date*/ SELECT CONVERT(DATE, [Date]) AS [Date], COUNT(*) AS records FROM dbo.Badges AS b GROUP BY CONVERT(DATE, [Date]) ORDER BY records DESC /*count by badge name*/ SELECT b.Name, COUNT(*) AS records FROM dbo.Badges AS b GROUP BY b.Name ORDER BY records DESC |
Comments
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 |
/*Comments*/ SELECT TOP 10 * FROM dbo.Comments; /*count by date*/ SELECT CONVERT(DATE, c.CreationDate) AS CreationDate, COUNT(*) AS records FROM dbo.Comments AS c GROUP BY CONVERT(DATE, c.CreationDate) ORDER BY records DESC /*count by post id*/ SELECT c.PostId, COUNT(*) AS records FROM dbo.Comments AS c GROUP BY c.PostId ORDER BY records DESC /*count by user id*/ SELECT c.UserId, COUNT(*) AS records FROM dbo.Comments AS c GROUP BY c.UserId ORDER BY records DESC /*top 1000 high scores*/ SELECT TOP 1000 c.Score FROM dbo.Comments AS c ORDER BY c.Score DESC |
Users
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 |
/*Users*/ SELECT TOP 10 * FROM dbo.Users; /*Top 10 ages*/ SELECT TOP 10 Age, COUNT(*) AS records FROM dbo.Users AS u GROUP BY u.Age ORDER BY records DESC /*count by creation date*/ SELECT CONVERT(DATE, u.CreationDate) AS CreationDate, COUNT(*) AS records FROM dbo.Users AS u GROUP BY CONVERT(DATE, u.CreationDate) ORDER BY records DESC /*top 10 reputation count*/ SELECT TOP 10 u.Reputation, COUNT(*) AS records FROM dbo.Users AS u GROUP BY u.Reputation ORDER BY records DESC /*count by last access date*/ SELECT CONVERT(DATE, u.LastAccessDate) AS CreationDate, COUNT(*) AS records FROM dbo.Users AS u GROUP BY CONVERT(DATE, u.LastAccessDate) ORDER BY records DESC |
Votes
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 |
/*Votes*/ SELECT v.PostId, COUNT(*) AS records FROM dbo.Votes AS v GROUP BY v.PostId ORDER BY records DESC; SELECT COUNT(*) AS records FROM dbo.Votes AS v WHERE v.UserId IS NULL; SELECT COUNT(*) AS records FROM dbo.Votes AS v WHERE v.UserId IS NOT NULL; SELECT v.PostId, v.BountyAmount FROM dbo.Votes AS v WHERE v.BountyAmount IS NOT NULL GROUP BY v.PostId, v.BountyAmount ORDER BY v.BountyAmount DESC; SELECT v.UserId, v.BountyAmount, COUNT(*) AS records FROM dbo.Votes AS v WHERE v.BountyAmount IS NOT NULL AND v.UserId IS NOT NULL GROUP BY v.UserId, v.BountyAmount ORDER BY records DESC; SELECT CONVERT(DATE, v.CreationDate) AS CreationDate, COUNT(*) AS records FROM dbo.Votes AS v GROUP BY CONVERT(DATE, v.CreationDate) ORDER BY records DESC; |
Posts
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 |
/*Posts*/ SELECT TOP 10 * FROM dbo.Posts; /*questions with accepted answers, without accepted answers, plus total count*/ SELECT SUM(CASE WHEN p.AcceptedAnswerId = 0 THEN 1 END) AS NoAnswer, SUM(CASE WHEN p.AcceptedAnswerId > 0 THEN 1 END) AS YesAnswer, COUNT(*) AS records FROM dbo.Posts AS p; /*top 100 answer counts*/ SELECT TOP 100 p.AnswerCount FROM dbo.Posts AS p ORDER BY p.AnswerCount DESC /*count of closed dates*/ SELECT CONVERT(DATE, p.ClosedDate) AS ClosedDate, COUNT(*) AS records FROM dbo.Posts AS p GROUP BY CONVERT(DATE, p.ClosedDate) ORDER BY records DESC /*top 100 comment counts*/ SELECT TOP 100 CommentCount FROM dbo.Posts AS p ORDER BY p.CommentCount DESC /*count of community owned dates*/ SELECT CONVERT(DATE, p.CommunityOwnedDate) AS ClosedDate, COUNT(*) AS records FROM dbo.Posts AS p GROUP BY CONVERT(DATE, p.CommunityOwnedDate) ORDER BY records DESC /*count of creation dates*/ SELECT CONVERT(DATE, p.CreationDate) AS ClosedDate, COUNT(*) AS records FROM dbo.Posts AS p GROUP BY CONVERT(DATE, p.CreationDate) ORDER BY records DESC /*top 100 posts by favorite count*/ SELECT TOP 100 p.FavoriteCount FROM dbo.Posts AS p ORDER BY p.FavoriteCount DESC /*count of last activity dates*/ SELECT CONVERT(DATE, p.LastActivityDate) AS ClosedDate, COUNT(*) AS records FROM dbo.Posts AS p GROUP BY CONVERT(DATE, p.LastActivityDate) ORDER BY records DESC /*count of last edit dates*/ SELECT CONVERT(DATE, p.LastEditDate) AS ClosedDate, COUNT(*) AS records FROM dbo.Posts AS p GROUP BY CONVERT(DATE, p.LastEditDate) ORDER BY records DESC /*count of last editor ids (there's also a last editor displayname column)*/ SELECT TOP 100 p.LastEditorUserId, COUNT(*) AS records FROM dbo.Posts AS p GROUP BY p.LastEditorUserId ORDER BY records DESC /*count by owner user id*/ SELECT OwnerUserId, COUNT(*) AS records FROM dbo.Posts AS p GROUP BY p.OwnerUserId ORDER BY records DESC /*count by post type*/ SELECT p.PostTypeId, COUNT(*) AS records FROM dbo.Posts AS p GROUP BY p.PostTypeId ORDER BY records DESC /*top 100 posts by score*/ SELECT TOP 100 Score FROM dbo.Posts AS p ORDER BY p.Score DESC |
Thanks for reading!
Brent says: if you’re still stumbling along with that crappy ol’ AdventureWorks because the Stack database is too big, good news! We’ve released a new StackOverflow2010 database that’s only a 1GB download, no torrenting required, and extracts to a comfy 10GB database with data from the first few years of Stack Overflow’s existence, 2008-2010. That’s also kinda nice as a presenter because you don’t have to worry about the data changing and breaking your demos.
11 Comments. Leave new
You get a gold star!
And you get a gold star!
Erik gets a gold star!
Everyone gets a gold star!
Thank You, Erik, for, uh, un-thinking of these queries a user would write to blow up our performance.
Apparently domain knowledge is important? Who knew…
Sorry. Relationships. We can work on some bad queries.
I’m confused.
Is the sql presented performance badness or just the relationships that we can build, as listed?
Thank You
I tried to outline usable relationships between the tables (that’s the part at the top), and then give you some queries to explore the data distributions in different columns you may want to query when writing demos. The performance of those is neither here nor there 😉
Got it. Thank You, Erik. You’re awesome to all of us.
Thanks, Darrell. Tonight I’ll be drinking to remember ?
hi, thanks for sharing this useful resources.
I need to confirm about LastActivityDate, is this field refers to latest answer/comment posted against a question?
Muneeb — follow the documentation link. It has the answer.
Hey can you please answer this question relating to SEDE ?
https://stackoverflow.com/questions/58046151/get-all-posts-with-self-accepted-answers-in-sede-stack-exchange-data-explorer/58047428#58047428
You already have a correct answer on the question. Glad it worked out for you! Take care.