I use a Microsoft SQL Server version of the public Stack Overflow data export for my blog posts and training classes because it’s way more interesting than a lot of sample data sets out there. It’s easy to learn, has just a few easy-to-understand tables, and has real-world data distributions for numbers, dates, and strings. Plus, it’s openly licensed and costs you nothing – just choose your size:
- Small: 10GB database as of 2010: 1GB direct download, or torrent or magnet. Expands to a ~10GB database called StackOverflow2010 with data from the years 2008 to 2010. If all you need is a quick, easy, friendly database for demos and for following along with code samples here on the blog, this is probably all you need.
- Medium: 50GB database as of 2013: 10GB direct download, or torrent or magnet. Expands to a ~50GB database called StackOverflow2013 with data from 2008 to 2013. I use this in my Fundamentals classes because it’s big enough that slow queries will actually be kinda slow.
- Large: 150GB database as of 2019: 18GB torrent or magnet. Expands to a ~150GB SQL Server 2016 database built by Erik Darling. It does not include the PostHistory, PostLinks, and LinkTypes tables because those are pretty rarely used, and row compression has been applied to the clustered indexes of the larger tables.
- Extra-Large: current 381GB database as of 2020/06: 46GB torrent (magnet). Expands to a ~381GB SQL Server 2008 database. Because it’s so large, I only distribute it via BitTorrent, not direct download links.
After you download it, extract the .7z files with 7-Zip. (I use 7-Zip for max compression to keep the downloads a little smaller.) The extracted archive will contain the database MDF, NDFs (additional data files), LDF, and a Readme.txt file. Don’t extract the files directly into your SQL Server’s database directories – instead, extract them somewhere else first, then move or copy them into the SQL Server’s database directories. You’re going to screw up the database over time, and you’re going to want to start again – keep the original copy so you don’t have to download it again.
Then, attach the database. It’s in Microsoft SQL Server 2008 format (2005 for the older torrents), so you can attach it to any 2008 or newer instance. It doesn’t use any Enterprise Edition features like partitioning or compression, so you can attach it to Developer, Standard, or Enterprise Edition. (The large 150GB version is the exception: it’s in 2016 format with row compression, so it needs SQL Server 2016 or newer.) If your SSMS crashes or throws permissions errors, you likely extracted the archive directly into the database directory, and you’ve got permissions problems on the data/log files.
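If you’d rather script the attach than click through SSMS, here’s a minimal sketch – the paths and file names are assumptions, so use whatever your extract actually contains, and list every data file (MDF and NDFs) for the larger versions:

```sql
-- Minimal attach sketch; paths and file names below are assumptions.
-- The larger versions ship several data files - list every MDF/NDF here.
CREATE DATABASE StackOverflow2010
ON (FILENAME = N'D:\MSSQL\Data\StackOverflow2010.mdf'),
   (FILENAME = N'D:\MSSQL\Data\StackOverflow2010_log.ldf')
FOR ATTACH;
```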
As with the original data dump, this is provided under the cc-by-sa 4.0 license. That means you are free to share this database and adapt it for any purpose, even commercially, but you must attribute it to the original authors (not me).
What’s Inside the StackOverflow Database
I want you to get started quickly while still keeping the database size small, so:
- All tables have a clustered index on Id, an identity field
- No other indexes are included (nonclustered or full text)
- The log file is small, and you should grow it out if you plan to build indexes or modify data (see the sketch after this list)
- It only includes StackOverflow.com data, not data for other Stack sites
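For example, to pre-grow the log before an index-building session – a sketch only, since the logical file name is an assumption you should verify against sys.database_files first:

```sql
-- Find the logical name of the log file:
SELECT name, type_desc, size * 8 / 1024 AS SizeMB
FROM StackOverflow2010.sys.database_files;

-- Then pre-grow it (the logical name here is an assumption):
ALTER DATABASE StackOverflow2010
MODIFY FILE (NAME = N'StackOverflow2010_log', SIZE = 20GB);
```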
To get started, here are a few helpful links:
- This Meta.SE post explains the database schema.
- If you want to learn how to tune queries, Data.StackExchange.com is a fun source for queries written by other people.
- For questions about the data, check the data-dump tag on Meta.StackExchange.com.
Past Versions
I keep past versions online too, in case you need a specific version for a demo.
- 2020-06 – 46GB torrent (magnet). Expands to a ~381GB SQL Server 2008 database. Yes, a smaller torrent and a larger database, because I went wild and crazy with the compression. Took freakin’ 36 hours to compress.
- 2019-12 – 52GB torrent (magnet). Expands to a ~361GB SQL Server 2008 database.
- 2019-09 – 43GB torrent (magnet). Expands to a ~352GB SQL Server 2008 database. This is the last export licensed under the cc-by-sa 3.0 license.
- 2019-06 – 40GB torrent (magnet). Expands to a ~350GB SQL Server 2008 database.
- 2018-12 – 41GB torrent (magnet). Expands to a ~323GB SQL Server 2008 database.
- 2018-09 – 39GB torrent (magnet). Expands to a ~312GB SQL Server 2008 database.
- 2018-06 – 38GB torrent (magnet). Expands to a ~304GB SQL Server 2008 database. Starting with this version & newer, the giant PostHistory table is included. As you can probably guess by the name, it would make for excellent partitioning and archival demos. As you might not guess, the NVARCHAR(MAX) datatypes of the Comment and Text fields make those demos rather… challenging.
- 2017-12 – 19GB torrent (magnet). Expands to a ~137GB SQL Server 2008 database.
- 2017-08 – 16GB torrent (magnet), 122GB SQL Server 2008 database. Starting with this version & newer, each table’s Id field is an identity field. This way we can run real-life-style insert workloads during my Mastering Query Tuning class. (Prior to this version, the Id fields were just INTs, so you needed to select the max value or use some other trick to generate your own Ids – see the sketch after this list.)
- 2017-06 – 16GB torrent (magnet), 118GB SQL Server 2008 database. Starting with this torrent & newer, I broke the database up into multiple SQL Server data files, each in its own 7z file, to make compression/decompression/distribution a little easier. You need all of those files to attach the database.
- 2017-01 – 14GB torrent (magnet), 110GB SQL Server 2008 database
- 2016-03 – 12GB torrent (magnet), 95GB SQL Server 2005 database
- 2015-08 – 9GB torrent (magnet), 70GB SQL Server 2005 database
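To illustrate the identity change called out in the 2017-08 entry above, here’s a toy sketch (a stand-in table, not the real schema):

```sql
-- Toy table standing in for the real schema:
CREATE TABLE dbo.Demo
(
    Id      INT IDENTITY(1,1) PRIMARY KEY,
    Payload VARCHAR(20) NOT NULL
);

-- 2017-08 and newer: the identity property generates the Id for you.
INSERT INTO dbo.Demo (Payload) VALUES ('new row');

-- Older versions: Id was a plain INT, so inserts had to generate their own,
-- e.g. the race-prone max-plus-one trick:
-- SELECT ISNULL(MAX(Id), 0) + 1 FROM dbo.Demo;
```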
Why are Some Sizes/Versions Only On BitTorrent?
BitTorrent is a peer-to-peer file distribution system. When you download a torrent, you also become a host for that torrent, sharing your own bandwidth to help distribute the file. It’s a free way to get a big file shared amongst friends.
The download is relatively large, so it would be expensive for me to host on a server. For example, if I hosted it in Amazon S3, I’d have to pay around $5 USD every time somebody downloaded the file. I like you people, but not quite enough to go around handing you dollar bills. (As it is, I’m paying for multiple seedboxes to keep these available, heh.)
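(For rough numbers: assuming S3 egress at about $0.09/GB – an assumption, since rates vary by region and volume – a 46GB download works out to roughly 46 × $0.09 ≈ $4.14, so ~$5 per download is the right ballpark.)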
Some corporate firewalls understandably block BitTorrent because it can use a lot of bandwidth, and it can also be used to share pirated movies/music/software/whatever. If you have difficulty running BitTorrent from work, you’ll need to download it from home instead.
Comments
Thanks
Thanks for sharing, and kudos for choosing torrent.
Hi there. For those who want to play with the data on their own system, but via the Stack Exchange Data Explorer UI/web app and not boring ol’ SSMS, you can run that locally as well :). Just download the web app at:
https://github.com/StackExchange/StackExchange.DataExplorer
So I’m curious… in my googling around, I couldn’t find what I thought was a good answer. Why is the StackOverflow database set up as case-sensitive? It was actually a nice little learning opportunity for me, as I didn’t even know you could make column names, etc., case-sensitive. Thanks.
Hi Dayton. That would be a good question to ask on http://meta.stackexchange.com/ (be sure to tag it with [data-explorer]).
For now, here is a related question from there that doesn’t answer why case-sensitivity was chosen, but shows how to deal with it (be sure to read the comments as well 🙂):
http://meta.stackexchange.com/questions/119304/why-is-the-like-operator-case-sensitive-in-data-explorer
Dayton – it’s a function of the database importer. You set the collation when you create a database. I use case sensitivity on all of my database servers because I want my scripts to work on everyone’s servers, and there’s a surprisingly large number of case-sensitive servers out there. Forcing my stuff to be case-sensitive from the start means I get fewer support calls on my scripts down the road.
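A quick illustration of what case sensitivity means for string comparisons – the collation names below are the common SQL Server defaults:

```sql
-- Case-sensitive collation: 'Users' and 'users' are different strings.
SELECT CASE WHEN 'Users' = 'users' COLLATE SQL_Latin1_General_CP1_CS_AS
            THEN 'equal' ELSE 'not equal' END;  -- returns 'not equal'

-- Case-insensitive collation: they compare as the same string.
SELECT CASE WHEN 'Users' = 'users' COLLATE SQL_Latin1_General_CP1_CI_AS
            THEN 'equal' ELSE 'not equal' END;  -- returns 'equal'
```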
Well, that’s a very good answer then, Brent. I’m curious though: how do you handle text searching? StackOverflow is super fast, but there is no way it would be that fast if you are UPPER()ing or LOWER()ing all of the text searches. That’s a quick way to defenestrate the sargability. Although, I have to admit, I’m a bit of a padawan when it comes to SQL. Side note: it would be pretty cool if you could include the actual indexes used (as a script). Thanks!
Dayton – the text search is done in ElasticSearch.
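For what it’s worth, Dayton’s sargability point can be sketched like this – assuming a hypothetical nonclustered index on Users.DisplayName:

```sql
-- Non-sargable: wrapping the column in a function blocks an index seek.
SELECT Id FROM dbo.Users WHERE UPPER(DisplayName) = 'BRENT OZAR';

-- Sargable: comparing the bare column lets SQL Server seek the index
-- (and on a case-insensitive collation, no UPPER/LOWER is needed anyway).
SELECT Id FROM dbo.Users WHERE DisplayName = 'Brent Ozar';
```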
When you say the actual indexes, can you elaborate on what you mean?
Thanks Brent. The actual indexes that are on the production DB. I’m guessing there are more than just a clustered index on each table. Something like a script that has all the create index statements.
Dayton – ah, no, that wouldn’t be appropriate.
Okey Dokey. Thanks for the db/torrent etc.
I don’t understand, then, how we are to use this with the Random_Q stress-test SP if none of the tables have indexes.
Luis – that’s part of your job – to tune them by finding the right indexes during your load tests. (Hey, if I gave you a perfect database with perfect queries and perfect indexes, then you wouldn’t learn anything, hahaha.)
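For instance, if your load test keeps filtering Users by Location, a first-cut index might look like this – a sketch only, since the right indexes depend entirely on your workload:

```sql
-- Hypothetical first cut for a query that filters on Location:
CREATE NONCLUSTERED INDEX IX_Users_Location
ON dbo.Users (Location)
INCLUDE (DisplayName, Reputation);
```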
That’s amazing! Thank you, Brent, for sharing this!
Denis – you’re welcome, glad I could help.
Why is this a zipped MDF & LDF and not a BAK file? I would have thought a full database backup would have been a more natural way to distribute the data set than an unattached MDF/LDF.
Wyatt – great question! Because in order to get the file size down, I would still have to compress the backup with 7z. That means you would have to have enough space for the 7z, the backup, and the MDF/LDF. The extraction time would also be longer, because restoring a 70GB database takes a long time. This way, you need less space (just the 7z and MDF/LDF) and much less time (extract, then attach).
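(To put numbers on it with the 2015-08 version listed above: the attach route needs the ~9GB 7z plus the ~70GB MDF/LDF, about 79GB total. A backup route would need the ~9GB 7z, the ~70GB .bak inside it, and the ~70GB restored MDF/LDF – roughly 149GB, nearly double the disk space.)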
I’m seeding this now, I’ll let it run as long as the wife doesn’t yell at me for the internet being slow.
Hahaha, thanks!
I have the following error during extraction:
“Unsupported compression method error for the file…”
Does it mean that I need to download the file once again?
When I’m checking the file, the method is set to 21.
Listing archive: F:\StackOverflow201508.
Method = 21
Solid = –
Blocks = 3
Physical Size = 9240856556
Headers Size = 228
———-
Path = Readme.txt
Size = 1365
Packed Size = 788
Modified = 2015-09-27 16:06:56
Attributes = ….A
CRC = BC21062A
Encrypted = –
Method = 21
Block = 0
Path = StackOverflow.mdf
Size = 69038243840
Packed Size = 9206452306
Modified = 2015-09-27 15:59:26
Attributes = ….A
CRC = 7E857A08
Encrypted = –
Method = 21
Block = 1
Path = StackOverflow_log.ldf
Size = 524165120
Packed Size = 34403234
Modified = 2015-09-27 15:59:26
Attributes = ….A
CRC = 330FDAC2
Encrypted = –
Method = 21
Block = 2
Krzysztof – sorry, I can’t troubleshoot that for you. You may want to try a different extraction tool as well.
Added to my 1 gigabit seedbox. I’ll have it up for at least a month, so it should help anyone who wants to get it.
Thanks sir!
I noticed that the most recent users and posts in your data dump are from September 14th, 2014 – not 2015. So this snapshot is over a year old, correct?
There are some missing tables, like CloseAsOffTopicReasonTypes and TagSynonyms. Any thoughts?
Jay – they’re not in the public data dump, right?
@Brent Ozar.
There are many missing tables, especially linked/lookup tables.
I have downloaded the 2019 version from your website, hoping to find the missing tables, but they don’t exist.
I have a script to create all the tables, but it would be nice if those tables were filled with data as well.
This is a link to my project. The idea is to convert some interesting queries to LINQ to EF.
https://github.com/codesanook/CodeSanook.StackOverflowEFQuery/blob/master/create-a-database.sql
Thank you so much.
@Aaron – the tables aren’t missing – they’re not given out by Stack Overflow. To see the tables that Stack Overflow makes public, click the related link in the post to see their original data export.
@Brent Ozar
Thank you so much for your reply, and thank you for your blog/articles/videos – they are very helpful.
I am thinking of joining your class soon.
BTW, this link contains all the StackOverflow tables (schema only):
https://github.com/codesanook/CodeSanook.StackOverflowEFQuery/blob/master/create-a-database.sql
However, if we want to have some data in those tables, do you mean we need to work that out ourselves?
My goal is to convert some interesting SQL queries from “https://data.stackexchange.com/stackoverflow/queries” to LINQ to EF.
I think it is very interesting to learn how to create real-world, meaningful queries and pretty complex LINQ to EF.
OK, cool, good luck!
Helloo,
Does this mean we have the source code of the website? Or is it just the database? What can we do with this database? Thank you
Just the database. If you don’t know what you can do with a database, then this isn’t really for you. Thanks!
I am trying to download the BitTorrent file to follow along with the training videos, but it’s been trying to connect to peers for over two hours. Can I also get the same file from https://archive.org/details/stackexchange?
James – unfortunately we can’t troubleshoot that remotely. It’s working fine here though.
I have downloaded the file and extracted it, but when I try to attach it, it asks for a full-text catalog to be added, and there isn’t one.
How can I solve this?
Shaui – there isn’t one. You can skip that part.
Thanks.
My mistake.
The real problem was a database size limitation: the instance was SQL Express, and the limit is a 10GB database.
I am moving the files to another server.
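A quick way to check whether a database fits under Express’s 10GB cap (the cap applies to data files, not the log):

```sql
USE StackOverflow;   -- database name is an assumption
GO
EXEC sp_spaceused;   -- database_size here includes data + log
```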
There are two different torrent files linked in the article. The top one is newer (going by the name).
Thanks for compiling and publishing this DB – it’s a great help for learning!
James – great catch! Fixed. Thanks!
Uh oh, doesn’t look like either of the torrent links works at this time.
Working fine here. Your office may be blocking BitTorrent.
Seems there aren’t enough seeds? FrostWire is telling me the download will be completed in an infinite number of days, hours, minutes and seconds from now!! I’ve managed to get 57.7KB so far…
Still pointing to the old (201603) torrent – in the download instructions, that is.
Thanks, fixed.
Thanks Brent Ozar team! This is helpful as always. Do you have any recommendations for free sample databases that are a bit smaller, such as 10-20 GB? I like the Stack dump, but it’s now at the point where it’s surpassing the size of many laptop SSDs, which is where I’d like to mess around with it.
M – you can buy a 1TB laptop SSD for under $250:
http://amzn.to/2l0bMNz
Here is a direct download link: http://ovh.to/D4JmSb6 (hosted on Hubic in France)
WQW – awesome, thanks! I’ve added that to the post.
Is it possible to use the stackoverflow database in SQL Server 2016?
Yep, 2005 and up.
Hey Brent,
We enjoy and learn a lot with the StackOverflow DB. Thanks for the great work. After downloading, we rebuilt the clustered indexes as clustered columnstore and the size was reduced by around 30%. Would it be possible to do that in future torrent releases? It would save us a lot of space.
Kannan – glad you like it. No, not all versions support columnstore indexes.
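For anyone curious, the conversion Kannan describes looks roughly like this – the constraint and index names are assumptions, and it only works on versions/editions that support clustered columnstore:

```sql
-- Drop the existing rowstore clustered index first
-- (if it's a plain index rather than a PK constraint, use DROP INDEX instead):
ALTER TABLE dbo.Badges DROP CONSTRAINT PK_Badges__Id;

-- Rebuild the table as a clustered columnstore index:
CREATE CLUSTERED COLUMNSTORE INDEX CCI_Badges ON dbo.Badges;
```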
Why are there only 9 tables? Did I set up the database the wrong way?
The tables I can see are:
PostTypes
Votes
Users
Posts
Comments
Badges
PostLinks
LinkTypes
VoteTypes
Yep. What other tables were you expecting to see, just curious?
The original database on Data.StackExchange has many more tables. One of them is dbo.Tags, which is used in one of your first lessons. I wanted to use those lessons to run live SQL training for my colleagues.
I was expecting to see all of them.
I might reconsider using the AdventureWorks database (or the new MS training db?)
I might use the SE database for some query tuning training or something like that.
Hmmm, which lesson? I don’t remember using an existing dbo.Tags table (although I know in one of my lessons, I have you CREATE one.) Can you point me to it? I want to make sure I get that fixed. Thanks!
Also, to be clear – the Data.StackExchange.com database is a backup of production, whereas the process to create the public data dump (XML) is a little different, and has never contained all of the production tables/columns.
Brent, it’s this lesson:
https://www.brentozar.com/learn-query-sql-server-stackoverflow-database/learn-query-part-1-getting-data-select/from-getting-data-table/
Gotcha. That class is specifically designed for folks using Data.StackExchange.com, not the Stack Overflow data dump. (You’ll notice that the instructions all focus on Data.StackExchange.com.) That’s different from the data dump.
The StackExchange database is an excellent DBA “toy”. Thank you for providing an easy way to access it, Brent!
One question that I am curious about: how do you import the XML files into your SQL Server?
Jakob – we use the Stack Overflow Data Dump Importer: https://github.com/BrentOzarULTD/soddi
Do these datasets contain the questions, and the tags assigned to them?
Yep!
Hi
Thanks for this – just downloaded it and attached it to a SQL Server 2017 instance.
I see that the nonclustered indexes have been removed, but does that also go for foreign key constraints? Or are there none of those in the prod DB?
Just wanted to mention: when I attached the mdf file (the first one – there were 3 .mdf files in the download), it complained if the log file from the download was already in the log-file folder. It seems it wanted to create it.
Gert – the data dump isn’t a direct backup of Stack Overflow’s production database. They export the data to XML, and then we import it into SQL Server format. The tables aren’t necessarily identical in structure to Stack’s live schema – highly similar, but not identical.
Sounds like you didn’t quite follow the process of normal database attachment, but that’s totally okay. Have fun with it!
Thank you for this great information.
Thank you Brent… Firewall at work blocks all links and it makes it look like they are broken.
Hi!
Is there a way to connect this SQL Server database to some sort of web GUI that can be set up on-premises, as an offline stackoverflow.com alternative?
Yep! That GUI is SQL Server Management Studio or Azure Data Studio, and you can run whatever queries you like against the database to get your answers while you’re offline.
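For example, once the database is attached, an offline “search” is just a query:

```sql
-- Top-scoring questions mentioning deadlocks, straight from the local copy:
SELECT TOP 10 Title, Score, CreationDate
FROM dbo.Posts
WHERE PostTypeId = 1              -- 1 = questions (see dbo.PostTypes)
  AND Title LIKE '%deadlock%'
ORDER BY Score DESC;
```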
I am running into a virus issue when trying to download qBittorrent. Any help on this?
Sorry, this file is infected with a virus
Only the owner is allowed to download infected files.
Sure, your antivirus client may not like that torrent app, and you may need to try a different one.
Anyone still seeding the big boy (40GB)? Want to get the full meal deal, but no peers. 🙁
Yes, I’ve got a few seed boxes – you may just be firewalled off.
Switched to a different torrent client. Working now thanks!
I am seeing the same thing right now. Can you download one of the smaller (older) versions through torrent as a way of testing for firewalls and other limitations?
It seems like the 350GB DB is only 92GB when decompressed? Or am I doing something wrong?
Sounds like you’re doing something wrong.
Thanks for the quick reply. I was already in the process of hitting myself.
Eh… I am doing something wrong here… so never mind, I will start hitting myself.
Is it necessary to use 7-Zip to unpack the SO databases? Will WinZip work, or is the compression/decompression specific to 7-Zip only? I ask because the 7-zip.org site is blocked by our URL filters, since it’s been identified as having malware, spyware, or phishing.
I’m not familiar with whether WinZip extracts 7z files, sorry.
WinZip Pro worked.
I am having issues trying to download the Stack Overflow Database
Okay, can you be more specific, like with error messages?
Never mind, I got it figured out. Thanks!
Brent, I thought I got it figured out – I am able to download the Stack Overflow database, but I can’t unzip it and attach it in Management Studio. Is it possible to get a BAK file?
Loren – if you have a Live Class Season Pass, you can go to the class prerequisites page for Mastering Index Tuning, follow the instructions there, and get a bak file.
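For reference, restoring such a .bak is the standard RESTORE pattern – the paths and logical file names below are assumptions, so run RESTORE FILELISTONLY against the backup to see the real ones:

```sql
-- Inspect the logical file names inside the backup first:
RESTORE FILELISTONLY FROM DISK = N'D:\Backups\StackOverflow.bak';

-- Then restore, moving the files wherever you want them:
RESTORE DATABASE StackOverflow
FROM DISK = N'D:\Backups\StackOverflow.bak'
WITH MOVE N'StackOverflow'     TO N'D:\MSSQL\Data\StackOverflow.mdf',
     MOVE N'StackOverflow_log' TO N'D:\MSSQL\Data\StackOverflow_log.ldf';
```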