How to Query the StackExchange Databases

Last Updated 8 years ago

Update – October 2015 – You can download a Torrent of a database (rather than a data dump) now.

Most of my demos involve my favorite demo database: Stack Overflow. The Stack Exchange folks are kind enough to make all of their data available via BitTorrent for Creative Commons usage as long as you properly attribute the source.

There’s two ways you can get started writing queries against Stack’s databases – the easy way and the hard way.

The Hard Way to Query StackOverflow

First, you’ll need to download a copy of the most recent XML data dump. These files are pretty big – around 15GB total – so there’s no direct download for the entire repository. There’s two ways you can get the September 2013 export:

Download them from Mega – which lets you pick the specific sites you’d like to download.
Download them all at once via BitTorrent with the StackExchange torrent link at Archive.org

I strongly recommend working with a smaller site’s data first like DBA.StackExchange. If you decide to work with the monster StackOverflow.com’s data, you’re going to temporarily need:

~15GB of space for the download
~60GB after the StackOverflow.com exports are expanded with 7zip. They’re XML, so they compress extremely well for download, but holy cow, XML is wordy.
~50GB for the SQL Server database (and this will stick around)

Next, you need a tool to load that XML into the database platform of your choosing. For Microsoft SQL Server, I use the Stack Overflow Data Dump Importer.

The SODDI user interface expects two things:

The XML files have to be stored in a very specific folder name: MMYYYY SITEINITIALS, like 092013 SO. SODDI will import multiple sites, and it creates a different schema for each site.
All of the XML files for the site have to be present in that folder: Badges, Comments, PostHistory, PostLinks, Posts, Tags, Users, and Votes.

When you run SODDI.exe without parameters, this GUI pops up (assuming that you named your Stack Overflow demo folder 092013 SO):

Source is the folder where you saved the data dumps. It expects to see subfolders in there for 092013 SO.

Target is the connection string for your database server. I’m using a local SQL Server (note that I picked SqlClient in the Provider dropdown) with a database named StackOverflow, so my connection string is:

Data Source=(local); Initial Catalog=StackOverflow; Integrated Security=True

If you want to use a remote SQL Server, you’d put its name in there instead of (local). You’ll also need to pre-create the database you want to use.

Click Import, and after a lot of disk churn, you’re rewarded with a StackOverflow database with tables for Badges, Comments, Posts, PostTypes, Users, Votes, and VoteTypes.

The resulting database is about 50GB. SQL Server’s data compression doesn’t work too well here because most of the data is off-row LOBs. Backup compression works well, though, with the resulting backup coming in at around 13GB.

Why Go to All This Work?

When I’m teaching performance tuning of queries and indexes, there’s no substitute for a local copy of the database. I want to show the impact of new indexes, analyze execution plans with SQL Sentry Plan Explorer, and run load tests with HammerDB.

That’s what we do in our SQL Server Performance Troubleshooting class – specifically, in my modules on How to Think Like the Engine, What Queries are Killing My Server, T-SQL Anti-patterns, and My T-SQL Tuning Process. Forget AdventureWorks – it’s so much more fun to use real StackOverflow.com data to discover tag patterns, interesting questions, and helpful users.

The Easy Way to Query StackOverflow.com

Point your browser over to Data.StackExchange.com and the available database list shows the number of questions and answers, plus the date of the database you’ll be querying:

At the time of this writing, the databases are updated every Monday. If you want even more recent data, you can use the Stack Exchange API, but that’s a story for another day.

Click on the site you’d like to query, and you’ll get a list of queries you can start with, or click Compose Query at the top right. As an example, let’s look at a query that compares the popularity of tags:

Yes, this is a lot like SQL Server Management Studio in the browser. At the top, we’ve got our query, plus space for a couple of parameters. One of the fun parts about Data Explorer is that you can design queries to take parameters to show information for different users, date ranges, tags, etc.

At the bottom, notice the tabs for Results, Messages, and Graph. If your results look graph-friendly, Data Explorer is smart enough to figure that out:

And yes, Data Professional, that last tab does indeed say Execution Plan, and it renders in your browser right down to the ability to hover your mouse over parts of the plan and see more details:

Data Explorer is really easy to use – check out the one-page documentation. It’s even open source, so to learn more about how it works, visit StackExchange.DataExplorer on Github.

Some system commands (like SET STATISTICS IO ON) are allowed, but you can’t create indexes, and there aren’t many indexes to begin with. You can also shoot yourself in the foot by writing an extraordinarily ugly query, and the system won’t stop you – for example, SELECT * FROM Posts will start running, but then may crash your browser as they start returning data. Jeremiah and I managed to repeatedly kill our Chrome browsers while tuning queries for fun.

I like using this to go poking around for unusual questions or answers. For example, I like to find questions that are viewed a lot, but don’t have any upvoted answers yet. (That’s prime territory for a SQL Server geek like me that wants to find tough questions to solve.)

Careful Adding Indexes with Always On Availability Groups

sp_Blitz® v32: Emailing, Serializable Locking, Simultaneous Agent Jobs, and More

27 Comments. Leave new

Peter Burkimsher
January 23, 2014 11:16 pm

Thank you for the good explanation of the “hard way” to download and use the StackOverflow database!

StackStash is an iPhone app that uses compression, and brings the database size down to only 3.6 GB.
http://stackstash.com

How do they do it? I have an iPhone, but I also want to be able to access the StackOverflow database while offline on my MacBook Pro. I’m already doing this for Wikipedia using Wiki2Touch, and Google Maps using MOBAC+phpMBtilesLite.

StackDump uses a local web server, which would be suitable for cross-platform sharing.
http://stackapps.com/questions/3610/stackdump-an-offline-browser-for-stackexchange-sites

However, StackDump also takes up 30GB of space. I know that StackOverflow is large, but the English Wikipedia is only 7.5 GB when used with Wiki2Touch! Are you aware of a StackOverflow query app that allows for fast searching within a limited index, and brings the database size below 5 GB?

My 64GB iPhone is running out of space, with music (no more than 128 kbps MP3), photos (scaled down to iPhone screen resolution, I have all my 20,000 photos in 3.5 GB), Galileo Offline Maps (about 2GB per country), and Wiki2Touch. Please let me know if there’s a suitable web app that can access a compressed database! If there isn’t, would you like me to make one?

Reply
- Brent Ozar
  January 24, 2014 5:24 am
  
  Peter – wow, you’re really enthusiastic! Here’s more questions in here than I can answer, but the short story is no, I don’t need an app. My purpose for using the Stack databases is to teach people a database product (Microsoft SQL Server, the same one we use at Stack).
  
  Thanks though, and enjoy the journey!
  
  Reply
  - Peter Burkimsher
    March 13, 2014 2:34 pm
    
    By the way, I finally did write an offline compressed StackOverflow database app!
    
    https://github.com/peterburk/stackoverflowlocal
    
    The database can be further compressed to only 2.3GB if you run some other scripts to remove unnecessary XML, and unanswered questions.
    
    I’ve used it a lot on my Mac for the last few weeks.
    
    Reply
    - Brent Ozar
      March 14, 2014 6:11 am
      
      Ooo, thanks for the heads up! I’ll have to play around with that.
      
      Reply
Smrite
April 7, 2014 3:52 pm

Hi Brent,

I tried to do import for Photo stack-exchange on 2011 data dump. Tables for Users, Posts, Badges, Comments and Votes were created successfully. However, as there is no xml file for PostLink in the dump (instead there is PostHistory.xml), so for that table I received an error. How could I modify SODDI to run successfully on other stack exchange sites besides Stack overflow for my 2011 data dump. I am using SODDI version 1.0.

PS: you could delete my previous irrelevant comments to your post!

Thanks

Reply
- Brent Ozar
  April 7, 2014 6:50 pm
  
  Smrite – unfortunately this is beyond the scope of something I can help you with in blog comments. Try posting this on http://meta.stackoverflow.com.
  
  Reply
Jonathan
April 18, 2015 6:41 pm

Hello, i’m trying to do this, but SODDI tells me this: “Cannot open database xxxx requested by the login. The login failed. Login failed for user HOST\user”.

I’m using SQL Server 2014, and VS 2010, what happens?

I just want to see a community shema, not the data, could you please send me a script of the model schema? thanks:D!

Reply
- Brent Ozar
  April 19, 2015 8:08 am
  
  Hi Jonathan. To see the basic schema, head over to http://data.stackexchange.com. You can query it online. Enjoy!
  
  Reply
  - Jonathan
    April 19, 2015 11:08 pm
    
    Ok, i’ve got a question, you work at SE right?, is there any super db, i mean, for each community there’s a db, but “Users” table has a fk “AccountId” is that related to any mother db? does SE has a db to manage all others?, where’s the account of SE stored?, do we have access to that db?
    
    Reply
    - Brent Ozar
      April 20, 2015 6:45 am
      
      Jonathan – I’m a consultant for SE, and yeah, there are other databases that aren’t included in the exports.
      
      Reply
Jeff Taylor
July 17, 2015 10:52 am

I ran into an issue while attempting to extract the large files using the 7Zip UI. It ate up hard drive space in my user temp folder filling up the hard drive. Changing the temp folder in the UI didn’t work. My workaround was just using the shell extension to extract directly to my destination.

Reply
Eyal Golombek
August 2, 2015 5:32 am

Hello,
When I press the import button in the SODDI GUI, I get an error that the source path is invalid

Reply
Denis Reznik
September 2, 2015 4:52 pm

Wow! Brent, Thank you! I’ve followed a “hard way” this year 🙂 (but anyway I needed that :)), but querying in browser with all that great tools and information, it is awesome!

Reply
Mark Hions
September 11, 2015 3:13 pm

Just wanted to say that you can also start the torrent download from Archive.Org, then tell your bittorrent client to ignore most files and just get the select few you’re interested in.

Reply
Facebook Cloaker
October 9, 2015 3:40 am

Hello,
When I press the import button in the SODDI GUI, I get an error that the source path is invalid

Reply
- Brent Ozar
  October 9, 2015 7:05 am
  
  That means that your source path is invalid. It doesn’t point to where the Stack database exports live.
  
  Reply
Igor Alves
March 23, 2016 2:42 pm

Hi, nice explanation.
I do not understand the difference between #1 – archive.org/details/stackexchange and #2 – https://data.stackexchange.com/stackoverflow/query/new .

For each database in #1 we only have 8 tables, while in #2, stackExchange folks, provide 27 tables.

Why do we have this difference?

Thanks

Reply
- Brent Ozar
  March 23, 2016 4:46 pm
  
  Igor – that’s a great question. Why don’t you ask it over at http://meta.stackoverflow.com/ where Stack questions are discussed?
  
  Reply
Gioi Hoang
January 17, 2017 2:08 am

Hello Brent Ozar,
When I press the import button in the SODDI GUI, I get an error with field “RevisionGUID” of PostHistory table.
Error Detail: The given value of type String from the data source cannot be converted to type uniqueidentier of the specified target column.
Thanks,

Reply
- Brent Ozar
  January 17, 2017 5:55 am
  
  Gioi – hi, and welcome to the site. For support on open source tools, you’ll want to go to the Github repo for that tool, and open an issue. Include as much detail as you can, like the exact build of SODDI that you’re using, and the data dump you’re importing.
  
  Reply
Paul Hewson
May 11, 2017 9:53 am

Hi All,

Thanks Brent for a very useful post! If you want an alternative way of importing the tables individually, (as mentioned above SODDI.exe requires that all the xml files are available) I have written a blog post on how to do it here, http://www.sqlserversnippets.com/2017/05/how-to-import-large-xml-file-into-sql_95.html. Hope it is of use.

Reply
Santa
October 14, 2017 2:03 pm

Am I just stupid or is there no soddi.exe on GitHub (some thing needs to be complied perhaps). All I wanted was a large database (I am a DBA, or at least I like to think I am).

Reply
- Brent Ozar
  October 14, 2017 2:19 pm
  
  Santa – yeah, if you’re not comfortable compiling it, you’re better off just downloading the database:
  
  https://www.brentozar.com/archive/2015/10/how-to-download-the-stack-overflow-database-via-bittorrent/
  
  Reply
  - Santa
    October 14, 2017 2:27 pm
    
    Wow, I can believe you replied so fast, Brent. Thanks. Also, I just installed VisualStudio and got the soddi.exe with a bit of fooling around. Thanks again 🙂
    
    Reply
pratim
November 28, 2017 12:00 pm

Hello Everyone,

I wanted to know if you could connect StackOverflow directly from SSMS? I think that would be helpful and easier to browse through the data.

Reply
- Brent Ozar
  November 28, 2017 12:08 pm
  
  It would be easier for you to connect to someone else’s database server? Yes, yes, I’m sure that it would. Who would you like to pay the licensing costs for that?
  
  Reply
  - pratim
    November 28, 2017 12:22 pm
    
    Thank you, I like the “Easy way” you suggested 🙂 I will also plan for the hard way.
    
    I also saw the SPID in different tabs came same (61) and the credential was also not my ID but a common one, STACKEXCHANGE\svc_sede
    
    Reply