Time to make SQL Server demos a little more fun. Now you can download a torrent of the SQL Server database version of the Stack Overflow data dump.
- Full size circa 2017/12: 19GB torrent (magnet). Expands to a ~137GB SQL Server 2008 database. Because it’s so large, we distribute it with BitTorrent – if you’re new to that, keep reading below for more detailed torrent instructions.
- Mini size circa 2010/12: 1GB direct download, expands to a ~10GB database called StackOverflow2010. It’s the same database schema as the full size StackOverflow database, but it’s only got the 2008-2010 data.
- SuperUser.com 2017/12 database – 1GB backup file, expands to a ~3GB SQL Server 2008 database. SuperUser has the same database schema as Stack Overflow, but it’s just a much smaller database since the site has less activity.
As with the original data dump, this is provided under the CC BY-SA 3.0 license. That means you are free to share this database and adapt it for any purpose, even commercially, but you must attribute it to the original authors (not us).
How to Get the Database
- Install a BitTorrent client – I recommend qBittorrent, an ad-free open-source client.
- Download & open this .torrent file – it’s a small metadata file that tells your BitTorrent client where to connect and start downloading the files.
- Wait. The big file may take a few hours to download depending on your internet connection and how many other people are seeding the torrent.
- Extract the .7z files with 7-Zip – it will create the database MDF, NDFs (additional data files), LDF, and a Readme.txt file. Don’t extract the files directly into your SQL Server’s database directories – instead, extract them somewhere else first, and then move or copy them into the SQL Server’s database directories. (This just avoids permissions hassles.)
- Attach the database – it’s in Microsoft SQL Server 2008 format (2005 for the older torrents), so you can attach it to any 2008 or newer instance. It doesn’t use any Enterprise Edition features like partitioning or compression, so you can attach it to Developer, Standard, or Enterprise Edition. (If your SSMS crashes or throws permissions errors, you likely tried extracting the archive directly into the database directory, and you’ve got permissions problems on the data/log files.)
Please leave the torrent up and running – seeding the torrent helps other folks get it faster.
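If you’d rather script the attach step than click through SSMS, here’s a minimal Python sketch that only builds the T-SQL text for a `CREATE DATABASE ... FOR ATTACH` statement. The folder and file names are hypothetical placeholders, not the real names inside the archive, so swap in whatever 7-Zip actually extracted and run the resulting statement yourself in SSMS or sqlcmd.

```python
from typing import Sequence


def quote_literal(value: str) -> str:
    """Wrap a value in a T-SQL N'...' literal, doubling any single quotes."""
    return "N'" + value.replace("'", "''") + "'"


def build_attach_sql(database: str, file_paths: Sequence[str]) -> str:
    """Build a CREATE DATABASE ... FOR ATTACH statement listing every file."""
    if not file_paths:
        raise ValueError("at least one data file path is required")
    # Bracket-quote the database name, escaping any closing bracket.
    name = "[" + database.replace("]", "]]") + "]"
    file_specs = ",\n".join(
        f"    (FILENAME = {quote_literal(path)})" for path in file_paths
    )
    return f"CREATE DATABASE {name}\nON\n{file_specs}\nFOR ATTACH;"


# Hypothetical paths: replace these with the files you actually extracted
# and then moved into your SQL Server data directory.
example_files = [
    r"D:\MSSQL\Data\StackOverflow_1.mdf",
    r"D:\MSSQL\Data\StackOverflow_2.ndf",
    r"D:\MSSQL\Data\StackOverflow_log.ldf",
]

print(build_attach_sql("StackOverflow", example_files))
```

List every extracted file (the MDF, each NDF, and the LDF) so the generated statement points at all of them.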
Why I’m Using BitTorrent
BitTorrent is a peer-to-peer file distribution system. When you download a torrent, you also become a host for that torrent, sharing your own bandwidth to help distribute the file. It’s a free way to get a big file shared amongst friends.
The download is relatively large, so it would be expensive for me to host on a server. For example, if I hosted it in Amazon S3, I’d have to pay around $1-$2 USD every time somebody downloaded the file. I like you people, but not quite enough to go around handing you dollar bills. (As it is, I’m paying for a seedbox to get this thing started.)
Some corporate firewalls understandably block BitTorrent because it can use a lot of bandwidth, and it can also be used to share pirated movies/music/software/whatever. If you have difficulty running BitTorrent from work, you’ll need to download it from home instead.
What’s Inside the StackOverflow Database
I want you to get started quickly while still keeping the database size small, so:
- All tables have a clustered index on Id, an identity field
- No other indexes are included (nonclustered or full text)
- The log file is small, and you should grow it out if you plan to build indexes or modify data
- It only includes StackOverflow.com data, not data for other Stack sites
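Because the log file ships small and there are no nonclustered indexes, a common first step is to grow the log and then create whatever index your demo needs. Here’s a rough Python sketch that generates the T-SQL for both. The logical log file name, the target size, the index name, the `dbo` schema, and the `OwnerUserId` key column are all assumptions for illustration, so check `sys.database_files` and the schema post before you run anything.

```python
from typing import Sequence


def grow_log_sql(database: str, logical_log_name: str, size_gb: int) -> str:
    """Build an ALTER DATABASE statement that sets the log file to size_gb."""
    if size_gb <= 0:
        raise ValueError("size_gb must be a positive whole number")
    return (
        f"ALTER DATABASE [{database}]\n"
        f"MODIFY FILE (NAME = N'{logical_log_name}', SIZE = {size_gb}GB);"
    )


def create_index_sql(table: str, index_name: str, key_columns: Sequence[str]) -> str:
    """Build a CREATE NONCLUSTERED INDEX statement on a dbo table."""
    if not key_columns:
        raise ValueError("at least one key column is required")
    columns = ", ".join(f"[{column}]" for column in key_columns)
    return (
        f"CREATE NONCLUSTERED INDEX [{index_name}]\n"
        f"ON [dbo].[{table}] ({columns});"
    )


# Assumed values: the logical log name and 10GB size are placeholders.
print(grow_log_sql("StackOverflow", "StackOverflow_log", 10))
print(create_index_sql("Posts", "IX_OwnerUserId", ["OwnerUserId"]))
```

Growing the log once up front avoids a string of small autogrowth events while a big index build is running.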
To get started, here are a few helpful links:
- This Meta.SE post explains the database schema.
- If you want to learn how to tune queries, Data.StackExchange.com is a fun source for queries written by other people.
- For questions about the data, check the data-dump tag on Meta.StackExchange.com.
I keep past versions online too in case you need to see a specific version for a demo.
- 2017-08 – 16GB torrent (magnet), 122GB SQL Server 2008 database. Starting with this version & newer, each table’s Id fields are identity fields. This way we can run real-life-style insert workloads during my Mastering Query Tuning class. (Prior to this version, the Id fields were just INTs, so you needed to select the max value or some other trick to generate your own Ids.)
- 2017-06 – 16GB torrent (magnet), 118GB SQL Server 2008 database. Starting with this torrent, I broke the database up into multiple SQL Server data files, each in its own 7z file, to make compression / decompression / distribution a little easier. You need all of those files to attach the database.
- 2017-01 – 14GB torrent (magnet), 110GB SQL Server 2008 database
- 2016-03 – 12GB torrent (magnet), 95GB SQL Server 2005 database
- 2015-08 – 9GB torrent (magnet), 70GB SQL Server 2005 database