I use a Microsoft SQL Server version of the public Stack Overflow data export for my blog posts and training classes because it’s way more interesting than a lot of sample data sets out there. It’s easy to learn, has just a few easy-to-understand tables, and has real-world data distributions for numbers, dates, and strings. Plus, it’s open source and no charge for you – just choose your size:
- Small: 10GB database as of 2010: 1GB direct download, or torrent or magnet. Expands to a ~10GB database called StackOverflow2010 with data from the years 2008 to 2010. If all you need is a quick, easy, friendly database for demos, and to follow along with code samples here on the blog, this is probably all you need.
- Medium: 50GB database as of 2013: 10GB direct download, or torrent or magnet. Expands to a ~50GB database called StackOverflow2013 with data from 2008 to 2013. I use this in my Fundamentals classes because it’s big enough that slow queries will actually be kinda slow.
- Large: 150GB database as of 2019: 18GB torrent or magnet. Expands to a ~150GB SQL Server 2016 database built by Erik Darling. Does not include the PostHistory, PostLinks, and LinkTypes tables because those are pretty rarely used, and row compression has been applied to all of the clustered indexes for larger tables.
- Extra-Large: current 381GB database as of 2020/06: 46GB torrent (magnet.) Expands to a ~381GB SQL Server 2008 database. Because it’s so large, I only distribute it with BitTorrent, not direct download links.
After you download it, extract the .7z files with 7-Zip. (I use that for max compression to keep the downloads a little smaller.) The extract will have the database MDF, NDFs (additional data files), LDF, and a Readme.txt file. Don’t extract the files directly into your SQL Server’s database directories – instead, extract them somewhere else first, and then move or copy them into the SQL Server’s database directories. You’re going to screw up the database over time, and you’re going to want to start again – keep the original copy so you don’t have to download it again.
Then, attach the database. It’s in Microsoft SQL Server 2008 format (2005 for the older torrents), so you can attach it to any 2008 or newer instance. It doesn’t use any Enterprise Edition features like partitioning or compression, so you can attach it to Developer, Standard, or Enterprise Edition. If your SSMS crashes or throws permissions errors, you likely tried extracting the archive directly into the database directory, and you’ve got permissions problems on the data/log files.
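If you’d rather script the attach than click through SSMS, something like this should work. It’s a minimal sketch: the D:\MSSQL\DATA path and the file names are assumptions, so check them against what actually came out of your archive. The larger databases ship with several data files, and each one gets its own FILENAME line.

```sql
-- Sketch only: the paths and file names below are assumptions, not the real layout.
-- List every data file (the MDF plus any NDFs) and the LDF from your extract.
CREATE DATABASE StackOverflow2010
ON
    (FILENAME = N'D:\MSSQL\DATA\StackOverflow2010.mdf'),
    (FILENAME = N'D:\MSSQL\DATA\StackOverflow2010_log.ldf')
FOR ATTACH;
GO

-- Confirm the attach worked and see where each file landed.
SELECT name, physical_name, type_desc
FROM StackOverflow2010.sys.database_files;
```

If this throws an access-denied error, check that the SQL Server service account has permissions on the copied files.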
As with the original data dump, this is provided under the cc-by-sa 4.0 license. That means you are free to share this database and adapt it for any purpose, even commercially, but you must attribute it to the original authors (not me):
What’s Inside the StackOverflow Database
I want you to get started quickly while still keeping the database size small, so:
- All tables have a clustered index on Id, an identity field
- No other indexes are included (nonclustered or full text)
- The log file is small, and you should grow it out if you plan to build indexes or modify data
- It only includes StackOverflow.com data, not data for other Stack sites
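Since the log file ships small and there are no nonclustered indexes, a first session often starts with growing the log and then building an index. Here’s a rough sketch of that, assuming the Small database: the logical log file name, the 10GB target, and the index name are all placeholders I made up, so use the name the first query returns and pick a size that fits your drive.

```sql
USE StackOverflow2010;  -- assumed database name (the Small download)
GO

-- Find the log file's logical name and current size.
-- size is stored in 8KB pages, so multiply by 8 and divide by 1024 for MB.
SELECT name, type_desc, CAST(size AS BIGINT) * 8 / 1024 AS size_mb
FROM sys.database_files;
GO

-- Grow the log before heavy work.
-- 'StackOverflow2010_log' is an assumed logical name, and 10GB is an arbitrary
-- example; the new size has to be larger than the current size.
ALTER DATABASE StackOverflow2010
MODIFY FILE (NAME = N'StackOverflow2010_log', SIZE = 10GB);
GO

-- Example nonclustered index; IX_DisplayName is just an illustrative name.
CREATE INDEX IX_DisplayName ON dbo.Users (DisplayName);
```

Growing the log once up front is usually gentler than letting it autogrow in small steps in the middle of an index build.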
To get started, here are a few helpful links:
- This Meta.SE post explains the database schema.
- If you want to learn how to tune queries, Data.StackExchange.com is a fun source for queries written by other people.
- For questions about the data, check the data-dump tag on Meta.StackExchange.com.
I also keep past versions online in case you need to see a specific version for a demo.
- 2020-06 – 46GB torrent (magnet.) Expands to a ~381GB SQL Server 2008 database. Yes, smaller torrent and larger database because I went wild and crazy with the compression. Took freakin’ 36 hours to compress.
- 2019-12 – 52GB torrent (magnet.) Expands to a ~361GB SQL Server 2008 database.
- 2019-09 – 43GB torrent (magnet.) Expands to a ~352GB SQL Server 2008 database. This is the last export licensed with the cc-by-sa 3.0 license.
- 2019-06 – 40GB torrent (magnet.) Expands to a ~350GB SQL Server 2008 database.
- 2018-12 – 41GB torrent (magnet.) Expands to a ~323GB SQL Server 2008 database.
- 2018-09 – 39GB torrent (magnet.) Expands to a ~312GB SQL Server 2008 database.
- 2018-06 – 38GB torrent (magnet.) Expands to a ~304GB SQL Server 2008 database. Starting with this version & newer, the giant PostHistory table is included. As you can probably guess by the name, this would make for excellent partitioning and archival demos. As you might not guess, the NVARCHAR(MAX) datatypes of the Comment and Text fields make those demos rather…challenging.
- 2017-12 – 19GB torrent (magnet.) Expands to a ~137GB SQL Server 2008 database.
- 2017-08 – 16GB torrent (magnet), 122GB SQL Server 2008 database. Starting with this version & newer, each table’s Id fields are identity fields. This way we can run real-life-style insert workloads during my Mastering Query Tuning class. (Prior to this version, the Id fields were just INTs, so you needed to select the max value or some other trick to generate your own Ids.)
- 2017-06 – 16GB torrent (magnet), 118GB SQL Server 2008 database. Starting with this torrent & newer, I broke this up into multiple SQL Server data files, each in their own 7z file, to make compression / decompression / distribution a little easier. You need all of those files to attach the database.
- 2017-01 – 14GB torrent (magnet), 110GB SQL Server 2008 database
- 2016-03 – 12GB torrent (magnet), 95GB SQL Server 2005 database
- 2015-08 – 9GB torrent (magnet), 70GB SQL Server 2005 database
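To see the identity change from the 2017-08 note in action, you can list the identity columns and try an insert that lets SQL Server hand out the Id. This is only a sketch: the database name is a placeholder, and the Badges column list (Name, UserId, Date) is my assumption about the schema, so check it against the schema post before relying on it. The insert is wrapped in a transaction and rolled back so the data stays untouched.

```sql
USE StackOverflow;  -- assumed name; use whatever you attached the database as
GO

-- List the identity columns and the last value each one handed out.
SELECT OBJECT_NAME(object_id) AS table_name,
       name AS column_name,
       last_value
FROM sys.identity_columns
WHERE OBJECTPROPERTY(object_id, 'IsUserTable') = 1;
GO

-- Insert without supplying an Id; the identity property generates it.
-- Column names here are assumptions about dbo.Badges.
BEGIN TRANSACTION;

INSERT INTO dbo.Badges (Name, UserId, [Date])
VALUES (N'Demo Badge', 1, GETDATE());

SELECT SCOPE_IDENTITY() AS new_badge_id;

ROLLBACK TRANSACTION;  -- undoes the row, but the identity value stays consumed
```

On the versions before 2017-08, the same insert would fail because Id would have no value, which is why you needed the select-the-max trick back then.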
Why are Some Sizes/Versions Only On BitTorrent?
BitTorrent is a peer-to-peer file distribution system. When you download a torrent, you also become a host for that torrent, sharing your own bandwidth to help distribute the file. It’s a free way to get a big file shared amongst friends.
The download is relatively large, so it would be expensive for me to host on a server. For example, if I hosted it in Amazon S3, I’d have to pay around $5 USD every time somebody downloaded the file. I like you people, but not quite enough to go around handing you dollar bills. (As it is, I’m paying for multiple seedboxes to keep these available, heh.)
Some corporate firewalls understandably block BitTorrent because it can use a lot of bandwidth, and it can also be used to share pirated movies/music/software/whatever. If you have difficulty running BitTorrent from work, you’ll need to download it from home instead.