New Stack Overflow Public Database Available (2018-06)

Stack Overflow
5 Comments

@Taryn and the kind folks at Stack Overflow publish their data export periodically with your questions, answers, comments, user info, and more. It’s available as an XML data dump, which I then take and import into SQL Server for teaching performance tuning.

You can download the 38GB torrent (magnet), which gives you a series of 7Zip files that you can extract to produce a ~304GB SQL Server 2008 database. You can then attach it to any 2008-2017 SQL Server. (If you want a smaller 10GB version, check out the mini 10GB version circa 2010.

Stack Overflow
The place that saves your job

The data goes up to 2018-06-03 and includes:

  • Badges – 27M rows, 1GB data.
  • Comments – 66M rows, 22GB
  • PostHistory -106M rows, 174GB (new)
  • Posts – 41M rows, 106GB
  • Users – 9M rows, 1GB.
  • Votes – 151M rows, 5GB.
  • And a few smaller supporting tables: PostHistoryTypes, PostLinks, PostTypes, and Tags.

As always, it’s open source, licensed under Creative Commons Attribution-Share Alike 3.0, as is the source data dump.

The new PostHistory table makes this one way bigger.

The Posts table holds questions and answers. The PostHistory table tracks events that have happened to a post over time, like edits. One of my favorite features of the Stack sites is the ability for people to edit each others’ questions and answers. You can jump right in and improve someone’s question by adding more details, or edit their answers, too.

To join them together:

Results for a very active question:

PostHistory

I haven’t included this table in the past because it’s so large, and I wanted people to be able to do Stack query demos on smaller machines. However, we’ve solved that problem by distributing a separate 10GB StackOverflow2010 database (1GB 7zip) showing data from 2008-2010.

To learn more about using the database:

  • BrentOzar.com/go/querystack – my page about the SQL Server export with more info about how I produce the database.
  • Data.StackExchange.com – a web-based SSMS where you can run your own queries against a recently restored copy of the Stack databases, or run other folks’ queries.
  • Watch Brent Tune Queries – free sessions where I take different queries from Data.StackExchange.com and tune it live.
  • How to Think Like the Engine – free videos where I show the Users table to explain clustered indexes, nonclustered indexes, statistics, sargability, and more.
Previous Post
Indexing Strategies For Cross Apply
Next Post
Nodes Offline: Why Your SQL Server VM Can’t Use All Its Virtual Hardware

5 Comments. Leave new

  • Jakob Bindslet
    June 12, 2018 3:40 pm

    Great stuff.

    I will download and seed.

    Reply
  • Wow, great blog post! Time to go looking!

    On a related subject, I have read/heard that a VMWare host with a mix of large and small core count guests, for example 8+ core SQL Servers and 2 core application servers, can cause CPUI starvation for the larger servers. Have you heard of that phenomenon?

    Reply
  • Hello Brent, So I downloaded the torrent successfully and had several files in it with this naming convention
    StackOverflow_1
    StackOverflow_2
    StackOverflow_3
    StackOverflow_4
    StackOverflow_log
    But when I did try to extract the files I had an error of corrupted file after which just 1,3nad 4 with the log extracted correctly but StackOverflow_2 seems to be corrupted is there anyway way you may know what the problem could be . Thank you may Love, Peace and Positive Reflection be within you.

    Reply
    • Your Internet connection probably had a glitch while downloading it. You could try again, or try a smaller one (like the 10GB 2010 one) if your Internet connection isn’t that fast or reliable. Hope that helps!

      Reply

Leave a Reply

Your email address will not be published. Required fields are marked *

Fill out this field
Fill out this field
Please enter a valid email address.