Stack Overflow, the place where most of your production code comes from, publicly exports their data every few months. @TarynPivots (their DBA) tweets about it, and then I pull some levers and import the XML data dump into SQL Server format.
Stack Overflow’s database makes for great blog post examples because it’s real-world data: real data distributions, lots of different data types, easy to understand tables, simple joins. Some of the tables include:
- Badges: 40,338,942 rows; 592.8MB
- Comments: 80,673,644 rows; 14.4GB
- PostHistory: 141,277,451 rows; 242.0GB; 221.9GB LOB
- Posts: 53,086,328 rows; 137.6GB; 26.1GB LOB
- Users: 14,839,627 rows; 1.4GB; 4.5MB LOB
- Votes: 213,555,899 rows; 2.8GB, making for fun calculations and grouping demos
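If you want a feel for how wide each table's rows are, the numbers above are enough for some quick arithmetic. Here's a minimal Python sketch that divides each table's total size by its row count. It assumes the sizes are binary units (1GB = 1024^3 bytes), which the list doesn't actually say, and the names `TABLES` and `bytes_per_row` are just made up for illustration.

```python
# Row counts and total sizes copied from the table list above.
# Assumption: "MB" and "GB" are binary units (1 GB = 1024**3 bytes).
UNIT_BYTES = {"MB": 1024**2, "GB": 1024**3}

TABLES = {
    "Badges": (40_338_942, 592.8, "MB"),
    "Comments": (80_673_644, 14.4, "GB"),
    "PostHistory": (141_277_451, 242.0, "GB"),
    "Posts": (53_086_328, 137.6, "GB"),
    "Users": (14_839_627, 1.4, "GB"),
    "Votes": (213_555_899, 2.8, "GB"),
}


def bytes_per_row(table: str) -> float:
    """Average bytes per row: total table size divided by row count."""
    rows, size, unit = TABLES[table]
    return size * UNIT_BYTES[unit] / rows


if __name__ == "__main__":
    # Widest tables first.
    for name in sorted(TABLES, key=bytes_per_row, reverse=True):
        print(f"{name:<12} {bytes_per_row(name):>10,.1f} bytes/row")
```

These are averages over the total size, LOB data included, so treat them as a rough guide to which tables are wide (Posts, PostHistory) and which are narrow (Votes, Badges), not as exact row sizes.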
This isn’t the exact same data structure as Stack Overflow’s current database. They’ve changed their own database over the years, but they still provide the data dump in the same style as the original site’s database, so your demo queries still work over time. If you’d like to find demo queries or inspiration for queries to write, check out Data.StackExchange.com, a public query repository.
New this month: I built it with page-level data compression, which requires SQL Server 2016 Service Pack 1 or newer (but doesn’t require Enterprise Edition). I don’t have a before-and-after across all of the tables, but the Badges table was 2GB before and 0.5GB afterwards. Woohoo! Every little bit helps with a database this size.
I distribute the database over BitTorrent because it’s so large. To get it, open the torrent file or magnet URL in your preferred BitTorrent client, and the 54GB download will start. After that finishes, you can extract it with 7-Zip to get the SQL Server 2016 database. It’s 4 data files and 1 log file, adding up to a ~401GB database.
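Before you kick off the torrent, it's worth checking that the target drive can hold both the 54GB archive and the ~401GB of extracted files at the same time. Here's a rough Python sketch of that check. The 10% headroom is an arbitrary assumption of mine, the function names are hypothetical, and it treats GB as 1024^3 bytes.

```python
import shutil

# Sizes quoted in the post, in GB.
DOWNLOAD_GB = 54
DATABASE_GB = 401


def required_bytes(download_gb=DOWNLOAD_GB, database_gb=DATABASE_GB, headroom=0.10):
    """Bytes needed to hold the archive and the extracted files together.

    headroom is an assumed safety margin (10% by default), not a figure
    from the post.
    """
    total_gb = (download_gb + database_gb) * (1 + headroom)
    return int(total_gb * 1024**3)


def has_room(path="."):
    """True if the drive containing `path` has enough free space."""
    free = shutil.disk_usage(path).free
    return free >= required_bytes()


if __name__ == "__main__":
    needed_gb = required_bytes() / 1024**3
    print(f"Need about {needed_gb:,.0f} GB free; enough room here: {has_room('.')}")
```

If you download to one drive and extract to another, check each path separately against its own number instead of the combined total.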
Want a smaller version to play around with?
- Small: 10GB database as of 2010: 1GB direct download, or torrent or magnet. Expands to a ~10GB database called StackOverflow2010 with data from the years 2008 to 2010. If all you need is a quick, easy, friendly database for demos, and to follow along with code samples here on the blog, this is probably all you need.
- Medium: 50GB database as of 2013: 10GB direct download, or torrent or magnet. Expands to a ~50GB database called StackOverflow2013 with data from 2008 to 2013. I use this in my Fundamentals classes because it’s big enough that slow queries will actually be kinda slow.
- For my training classes: specialized copy as of 2018/06: 47GB torrent (magnet.) Expands to a ~180GB SQL Server 2016 database with queries and indexes specific to my training classes. Because it’s so large, I only distribute it with BitTorrent, not direct download links.
As with the original data dump, these are provided under the CC BY-SA 4.0 license. That means you are free to share it and adapt it for any purpose, even commercially, but you must attribute it to the original authors (not me):