@Taryn and the kind folks at Stack Overflow publish their data export periodically with your questions, answers, comments, user info, and more. It’s available as an XML data dump, which I then take and import into SQL Server for teaching performance tuning.
You can download the 38GB torrent (magnet), which gives you a series of 7Zip files that you can extract to produce a ~304GB SQL Server 2008 database. You can then attach it to any 2008-2017 SQL Server. (If you want a smaller 10GB version, check out the mini 10GB version circa 2010.
The data goes up to 2018-06-03 and includes:
- Badges – 27M rows, 1GB data.
- Comments – 66M rows, 22GB
- PostHistory -106M rows, 174GB (new)
- Posts – 41M rows, 106GB
- Users – 9M rows, 1GB.
- Votes – 151M rows, 5GB.
- And a few smaller supporting tables: PostHistoryTypes, PostLinks, PostTypes, and Tags.
The new PostHistory table makes this one way bigger.
The Posts table holds questions and answers. The PostHistory table tracks events that have happened to a post over time, like edits. One of my favorite features of the Stack sites is the ability for people to edit each others’ questions and answers. You can jump right in and improve someone’s question by adding more details, or edit their answers, too.
To join them together:
SELECT pht.Type, ph.*
FROM dbo.PostHistory ph
INNER JOIN dbo.PostHistoryTypes pht ON ph.PostHistoryTypeId = pht.Id
WHERE ph.PostId = 11227809
ORDER BY ph.CreationDate;
Results for a very active question:
I haven’t included this table in the past because it’s so large, and I wanted people to be able to do Stack query demos on smaller machines. However, we’ve solved that problem by distributing a separate 10GB StackOverflow2010 database (1GB 7zip) showing data from 2008-2010.
To learn more about using the database:
- BrentOzar.com/go/querystack – my page about the SQL Server export with more info about how I produce the database.
- Data.StackExchange.com – a web-based SSMS where you can run your own queries against a recently restored copy of the Stack databases, or run other folks’ queries.
- Watch Brent Tune Queries – free sessions where I take different queries from Data.StackExchange.com and tune it live.
- How to Think Like the Engine – free videos where I show the Users table to explain clustered indexes, nonclustered indexes, statistics, sargability, and more.