Stack Overflow, the place where most of your production code comes from, publicly exports their data every few months. @TarynPivots (their DBA) tweets about it, and then I pull some levers and import the XML data dump into SQL Server format.
Stack Overflow’s database makes for great blog post examples because it’s real-world data: real data distributions, lots of different data types, easy-to-understand tables, and simple joins, as shown in the sample query after this list. Some of the tables include:
- Users – now up over 14 million rows
- Posts – over 52 million rows’ worth of questions & answers, 143GB in the clustered index alone
- Votes – over 208 million rows, making for fun calculations and grouping demos
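For example, a first demo query might join two of these tables. This is just a sketch against the public data dump’s schema (where a PostTypeId of 2 marks an answer):

```sql
-- Top answerers by answer count: a quick demo join across Users and Posts.
-- Column names follow the public data dump schema; PostTypeId 2 = answer.
SELECT TOP 10 u.DisplayName, COUNT(*) AS Answers
FROM dbo.Users AS u
INNER JOIN dbo.Posts AS p ON p.OwnerUserId = u.Id
WHERE p.PostTypeId = 2
GROUP BY u.Id, u.DisplayName
ORDER BY Answers DESC;
```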
This isn’t exactly the same data structure as Stack Overflow’s current database – they’ve changed their schema over the years – but they still provide the data dump in the same style as the original site’s database, so your demo queries keep working over time. If you’d like ready-made demo queries, or inspiration for queries to write, check out Data.StackExchange.com, a public query repository.
I distribute the database over BitTorrent because it’s so large. To get it, open the torrent file or magnet URL in your preferred BitTorrent client, and the 53GB download will start. After that finishes, you can extract it with 7-Zip to get the SQL Server 2016 database. It’s 4 data files and 1 log file, adding up to a ~411GB database.
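If you attach the extracted files rather than restoring a backup, the T-SQL looks roughly like this. The file names and paths below are hypothetical – match them to whatever actually comes out of the archive:

```sql
-- Attach 4 data files and 1 log file (hypothetical names and paths;
-- adjust them to the files extracted from the 7-Zip archive).
CREATE DATABASE StackOverflow ON
    (FILENAME = N'D:\Data\StackOverflow.mdf'),
    (FILENAME = N'D:\Data\StackOverflow_1.ndf'),
    (FILENAME = N'D:\Data\StackOverflow_2.ndf'),
    (FILENAME = N'D:\Data\StackOverflow_3.ndf'),
    (FILENAME = N'D:\Data\StackOverflow_log.ldf')
FOR ATTACH;
```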
Want a smaller version to play around with?
- Small: 10GB database as of 2010: 1GB direct download, or torrent or magnet. Expands to a ~10GB database called StackOverflow2010 with data from the years 2008 to 2010. If all you need is a quick, easy, friendly database for demos, and to follow along with code samples here on the blog, this is probably all you need.
- Medium: 50GB database as of 2013: 10GB direct download, or torrent or magnet. Expands to a ~50GB database called StackOverflow2013 with data from 2008 to 2013. I use this in my Fundamentals classes because it’s big enough that slow queries will actually be kinda slow.
- For my training classes: specialized copy as of 2018/06: 47GB torrent (magnet). Expands to a ~180GB SQL Server 2016 database with queries and indexes specific to my training classes. Because it’s so large, I only distribute it with BitTorrent, not direct download links.
If you only have a limited amount of bandwidth, you don’t have to keep seeding the database after you get it – I’ve got it hosted on a handful of seedboxes around the world.
As with the original data dump, these are provided under the cc-by-sa 4.0 license. That means you are free to share them and adapt them for any purpose, even commercially, but you must attribute them to the original authors (not me).
Excellent – thank you for supplying these very useful database dumps 🙂
Thank You Brent
Can you please let me know the time period of this dataset? Like from 2014-2016?
It’s since the beginning of Stack Overflow.
Hi Brent, I downloaded the 4 .bak files for the Stack Overflow DB from your site (StackOverflow_1of4.bak…).
How do I import them into SQL Server? Are they one backup split into 4? How do I merge them to restore? Just starting to explore your training videos. Thanks!
Hi! Those are for the Mastering classes. Read the prerequisites module for the Mastering class you’re enrolled in.
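In general, .bak files named like that are one backup striped across several files, and you restore them by listing every file in a single RESTORE statement – a sketch with hypothetical paths:

```sql
-- Restore one backup that was striped across four files
-- (hypothetical paths; point these at the downloaded files).
RESTORE DATABASE StackOverflow
FROM DISK = N'D:\Backups\StackOverflow_1of4.bak',
     DISK = N'D:\Backups\StackOverflow_2of4.bak',
     DISK = N'D:\Backups\StackOverflow_3of4.bak',
     DISK = N'D:\Backups\StackOverflow_4of4.bak'
WITH RECOVERY;
```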
Hi Brent, I downloaded the Medium: 50GB database as of 2013. In the Users table, why is LastAccessDate later than 2013 – for example, 2018-08-28? And is the Reputation in the Users table the value as of 2013 or 2018?
The database was created after 2013. To create it, we just deleted all *new* users, posts, comments, etc. that had CreationDates after 2013.
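In T-SQL terms, that trimming looks roughly like this – a sketch, not the exact script used, and the cutoff date is assumed:

```sql
-- Sketch: trim the database back to end-of-2013 data.
-- Delete dependent rows first; the real script may differ.
DELETE dbo.Votes    WHERE CreationDate >= '2014-01-01';
DELETE dbo.Comments WHERE CreationDate >= '2014-01-01';
DELETE dbo.Posts    WHERE CreationDate >= '2014-01-01';
DELETE dbo.Users    WHERE CreationDate >= '2014-01-01';
```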
Hi Brent, I just put the medium dataset online for browsing at https://squil.azurewebsites.net.
I use it as an example for SQuiL, my SQL Server database browser.
Thanks for the dumps!