Time to make SQL Server demos a little more fun. Now you can download a torrent of the SQL Server database version of the Stack Overflow data dump.
- Large: current 312GB database as of 2018/09: 39GB torrent (magnet.) Expands to a ~312GB SQL Server 2008 database. Because it’s so large, we distribute it with BitTorrent – if you’re new to that, keep reading below for more detailed torrent instructions.
- Medium: 50GB database as of 2014: 10GB direct download, torrent (magnet.) Expands to a ~50GB database called StackOverflow2013. It’s the same database schema as the full size StackOverflow database, but it’s only got the 2008-2013 data.
- Small: 10GB database as of 2010: 1GB direct download, torrent (magnet.) Expands to a ~10GB database called StackOverflow2010. Again, same schema as above, but it’s only got the 2008-2010 data.
As with the original data dump, this is provided under the CC BY-SA 3.0 license. That means you are free to share this database and adapt it for any purpose, even commercially, but you must attribute it to the original authors (not us).
How to Get the Database
- Install a BitTorrent client – I recommend qBittorrent, an ad-free open-source client.
- Download & open this .torrent file – it’s a small metadata file that tells your BitTorrent client where to connect and start downloading the files.
- Wait. The big file may take a few hours to download depending on your internet connection and how many other people are seeding the torrent.
- Extract the .7z files with 7-Zip – it will create the database MDF, NDFs (additional data files), LDF, and a Readme.txt file. Don’t extract the files directly into your SQL Server’s database directories – instead, extract them somewhere else first, and then move or copy them into the SQL Server’s database directories. (This just avoids permissions hassles.)
- Attach the database – it’s in Microsoft SQL Server 2008 format (2005 for the older torrents), so you can attach it to any 2008 or newer instance. It doesn’t use any Enterprise Edition features like partitioning or compression, so you can attach it to Developer, Standard, or Enterprise Edition. (If your SSMS crashes or throws permissions errors, you likely tried extracting the archive directly into the database directory, and you’ve got permissions problems on the data/log files.)
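If you’d rather script the attach step than click through SSMS, here’s a rough Python sketch that builds a CREATE DATABASE ... FOR ATTACH statement from a list of extracted files. The function name, the drive paths, and the file names are all made up for illustration; use the real file names from your own extraction (the Readme.txt is the place to check). It lists the MDF first, then the NDFs, then the LDF.

```python
from pathlib import PureWindowsPath

# List the primary data file (.mdf) first, then secondary data
# files (.ndf), then the log file (.ldf).
_FILE_ORDER = {".mdf": 0, ".ndf": 1, ".ldf": 2}


def build_attach_statement(database_name, file_paths):
    """Build a T-SQL CREATE DATABASE ... FOR ATTACH statement as a string."""
    if not file_paths:
        raise ValueError("need at least the primary .mdf file")

    def sort_key(path):
        suffix = PureWindowsPath(path).suffix.lower()
        return (_FILE_ORDER.get(suffix, len(_FILE_ORDER)), path.lower())

    # Double any single quotes so the paths are safe inside N'...' literals.
    file_specs = ",\n".join(
        "    (FILENAME = N'{}')".format(path.replace("'", "''"))
        for path in sorted(file_paths, key=sort_key)
    )
    safe_name = database_name.replace("]", "]]")
    return "CREATE DATABASE [{}]\nON\n{}\nFOR ATTACH;".format(safe_name, file_specs)


# Hypothetical paths: replace with wherever you moved the extracted files.
print(build_attach_statement("StackOverflow", [
    r"D:\MSSQL\Data\StackOverflow_log.ldf",
    r"D:\MSSQL\Data\StackOverflow_2.ndf",
    r"D:\MSSQL\Data\StackOverflow_1.mdf",
]))
```

Paste the printed statement into a query window and run it. If it fails with an access-denied error, that’s usually the same permissions problem described above.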
If you don’t have Internet quotas, please leave the torrent up and running – seeding the torrent helps other folks get it faster.
Why I’m Using BitTorrent
BitTorrent is a peer-to-peer file distribution system. When you download a torrent, you also become a host for that torrent, sharing your own bandwidth to help distribute the file. It’s a free way to get a big file shared amongst friends.
The download is relatively large, so it would be expensive for me to host on a server. For example, if I hosted it in Amazon S3, I’d have to pay around $1-$2 USD every time somebody downloaded the file. I like you people, but not quite enough to go around handing you dollar bills. (As it is, I’m paying for a seedbox to get this thing started.)
Some corporate firewalls understandably block BitTorrent because it can use a lot of bandwidth, and it can also be used to share pirated movies/music/software/whatever. If you have difficulty running BitTorrent from work, you’ll need to download it from home instead.
What’s Inside the StackOverflow Database
I want you to get started quickly while still keeping the database size small, so:
- All tables have a clustered index on Id, an identity field
- No other indexes are included (nonclustered or full text)
- The log file is small, and you should grow it out if you plan to build indexes or modify data
- It only includes StackOverflow.com data, not data for other Stack sites
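Growing the log ahead of time could look something like the sketch below, which builds an ALTER DATABASE ... MODIFY FILE statement. The logical file name StackOverflow_log and the 20GB target are assumptions for illustration only, so run the lookup query against your attached database first to see the real logical names and current sizes.

```python
# Run this in the attached database to find the real logical file names.
# sys.database_files reports size in 8KB pages, hence the * 8 / 1024.
FIND_FILES_SQL = (
    "SELECT name, type_desc, size * 8 / 1024 AS size_mb "
    "FROM sys.database_files;"
)


def build_grow_log_statement(database_name, logical_log_name, target_size_gb):
    """Build a T-SQL statement that sets the log file to target_size_gb."""
    if not isinstance(target_size_gb, int) or target_size_gb <= 0:
        raise ValueError("target_size_gb must be a positive whole number")
    return "ALTER DATABASE [{}] MODIFY FILE (NAME = N'{}', SIZE = {}GB);".format(
        database_name.replace("]", "]]"),
        logical_log_name.replace("'", "''"),
        target_size_gb,
    )


print(FIND_FILES_SQL)
# Hypothetical logical name and size: adjust both for your copy.
print(build_grow_log_statement("StackOverflow", "StackOverflow_log", 20))
```

SQL Server rejects a MODIFY FILE size that is not larger than the file’s current size, so pick a target above what the lookup query reports.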
To get started, here are a few helpful links:
- This Meta.SE post explains the database schema.
- If you want to learn how to tune queries, Data.StackExchange.com is a fun source for queries written by other people.
- For questions about the data, check the data-dump tag on Meta.StackExchange.com.
I also keep past versions online in case you need to see a specific version for a demo.
- 2018-06 – 38GB torrent (magnet.) Expands to a ~304GB SQL Server 2008 database. Starting with this version & newer, the giant PostHistory table is included. As you can probably guess by the name, this would make for excellent partitioning and archival demos. As you might not guess, the NVARCHAR(MAX) datatypes of the Comment and Text fields make those demos rather…challenging.
- 2017-12 – 19GB torrent (magnet.) Expands to a ~137GB SQL Server 2008 database.
- 2017-08 – 16GB torrent (magnet), 122GB SQL Server 2008 database. Starting with this version & newer, each table’s Id fields are identity fields. This way we can run real-life-style insert workloads during my Mastering Query Tuning class. (Prior to this version, the Id fields were just INTs, so you needed to select the max value or some other trick to generate your own Ids.)
- 2017-06 – 16GB torrent (magnet), 118GB SQL Server 2008 database. Starting with this torrent & newer, I broke this up into multiple SQL Server data files, each in their own 7z file, to make compression / decompression / distribution a little easier. You need all of those files to attach the database.
- 2017-01 – 14GB torrent (magnet), 110GB SQL Server 2008 database
- 2016-03 – 12GB torrent (magnet), 95GB SQL Server 2005 database
- 2015-08 – 9GB torrent (magnet), 70GB SQL Server 2005 database