Blog

During next week’s Watch Brent Tune Queries webcast, I’m using my favorite demo database: Stack Overflow. The Stack Exchange folks are kind enough to make all of their data available via BitTorrent for Creative Commons usage as long as you properly attribute the source.

There’s two ways you can get started writing queries against Stack’s databases – the easy way and the hard way.

The Easy Way to Query StackOverflow.com

Point your browser over to Data.StackExchange.com and the available database list shows the number of questions and answers, plus the date of the database you’ll be querying:

StackExchange Database List

StackExchange Database List

At the time of this writing, the databases are updated every Monday. If you want even more recent data, you can use the Stack Exchange API, but that’s a story for another day.

Click on the site you’d like to query, and you’ll get a list of queries you can start with, or click Compose Query at the top right. As an example, let’s look at a query that compares the popularity of tags:

Querying StackOverflow

Querying StackOverflow

Yes, this is a lot like SQL Server Management Studio in the browser. At the top, we’ve got our query, plus space for a couple of parameters. One of the fun parts about Data Explorer is that you can design queries to take parameters to show information for different users, date ranges, tags, etc.

At the bottom, notice the tabs for Results, Messages, and Graph. If your results look graph-friendly, Data Explorer is smart enough to figure that out:

Query Results in Graph Format

Query Results in Graph Format

And yes, Data Professional, that last tab does indeed say Execution Plan, and it renders in your browser right down to the ability to hover your mouse over parts of the plan and see more details:

Execution Plan

Execution Plan

Data Explorer is really easy to use – check out the one-page documentation. It’s even open source, so to learn more about how it works, visit StackExchange.DataExplorer on Github.

Some system commands (like SET STATISTICS IO ON) are allowed, but you can’t create indexes, and there aren’t many indexes to begin with. You can also shoot yourself in the foot by writing an extraordinarily ugly query, and the system won’t stop you – for example, SELECT * FROM Posts will start running, but then may crash your browser as they start returning data. Jeremiah and I managed to repeatedly kill our Chrome browsers while tuning queries for fun.

I like using this to go poking around for unusual questions or answers. For example, I like to find questions that are viewed a lot, but don’t have any upvoted answers yet. (That’s prime territory for a SQL Server geek like me that wants to find tough questions to solve.)

The Hard Way to Query StackOverflow.COM

First, you’ll need to download a copy of the most recent XML data dump. These files are pretty big – around 15GB total – so there’s no direct download for the entire repository. There’s two ways you can get the September 2013 export:

I strongly recommend working with a smaller site’s data first like DBA.StackExchange. If you decide to work with the monster StackOverflow.com’s data, you’re going to temporarily need:

  • ~15GB of space for the download
  • ~60GB after the StackOverflow.com exports are expanded with 7zip. They’re XML, so they compress extremely well for download, but holy cow, XML is wordy.
  • ~50GB for the SQL Server database (and this will stick around)

Next, you need a tool to load that XML into the database platform of your choosing. For Microsoft SQL Server, I use Jeremiah’s improved version of the old Sky Sanders’ SODDI. Sky stopped updating his version a few years ago, and it’s no longer compatible with the current Stack dumps. Jeremiah’s current download is here, and it works with the January 2014 data dump. (The previous version works on the September 2013 data dump.)

The SODDI user interface expects the XML files to be stored in a very specific folder name: MMYYYY SITEINITIALS, like 092013 SO. SODDI will import multiple sites, and it creates a different schema for each site. I just want to import Stack Overflow by itself, all in its own database, so I like naming my folder “092013 dbo”. That way, it creates the tables in the dbo schema, which is just a little friendlier for demos.

When you run SODDI.exe without parameters, this GUI pops up (assuming that you named your Stack Overflow demo folder 092013 dbo):

SODDI with my settings

SODDI with my settings

Source is the folder where you saved the data dumps. It expects to see subfolders in there for 092013 dbo.

Target is the connection string for your database server. I’m using a local SQL Server (note that I picked SqlClient in the Provider dropdown) with a database named StackOverflow, so my connection string is:

Data Source=(local); Initial Catalog=StackOverflow; Integrated Security=True

If you want to use a remote SQL Server, you’d put its name in there instead of (local). You’ll also need to pre-create the database you want to use.

Ah, the StackOverflow Database

Ah, the StackOverflow Database

Click Import, and after a lot of disk churn, you’re rewarded with a StackOverflow database with tables for Badges, Comments, Posts, PostTypes, Users, Votes, and VoteTypes.

The resulting database is about 50GB. SQL Server’s data compression doesn’t work too well here because most of the data is off-row LOBs. Backup compression works well, though, with the resulting backup coming in at around 13GB.

Why Go to All This Work?

When I’m teaching performance tuning of queries and indexes, there’s no substitute for a local copy of the database. I want to show the impact of new indexes, analyze execution plans with SQL Sentry Plan Explorer, and run load tests with HammerDB.

That’s what we do in our SQL Server Performance Troubleshooting class – specifically, in my modules on How to Think Like the Engine, What Queries are Killing My Server, T-SQL Anti-patterns, and My T-SQL Tuning Process. Forget AdventureWorks – it’s so much more fun to use real StackOverflow.com data to discover tag patterns, interesting questions, and helpful users.

↑ Back to top
  1. Pingback: My links of the week – January 19, 2014 | R4

  2. Thank you for the good explanation of the “hard way” to download and use the StackOverflow database!

    StackStash is an iPhone app that uses compression, and brings the database size down to only 3.6 GB.
    http://stackstash.com

    How do they do it? I have an iPhone, but I also want to be able to access the StackOverflow database while offline on my MacBook Pro. I’m already doing this for Wikipedia using Wiki2Touch, and Google Maps using MOBAC+phpMBtilesLite.

    StackDump uses a local web server, which would be suitable for cross-platform sharing.
    http://stackapps.com/questions/3610/stackdump-an-offline-browser-for-stackexchange-sites

    However, StackDump also takes up 30GB of space. I know that StackOverflow is large, but the English Wikipedia is only 7.5 GB when used with Wiki2Touch! Are you aware of a StackOverflow query app that allows for fast searching within a limited index, and brings the database size below 5 GB?

    My 64GB iPhone is running out of space, with music (no more than 128 kbps MP3), photos (scaled down to iPhone screen resolution, I have all my 20,000 photos in 3.5 GB), Galileo Offline Maps (about 2GB per country), and Wiki2Touch. Please let me know if there’s a suitable web app that can access a compressed database! If there isn’t, would you like me to make one?

  3. Hi Brent,

    I tried to do import for Photo stack-exchange on 2011 data dump. Tables for Users, Posts, Badges, Comments and Votes were created successfully. However, as there is no xml file for PostLink in the dump (instead there is PostHistory.xml), so for that table I received an error. How could I modify SODDI to run successfully on other stack exchange sites besides Stack overflow for my 2011 data dump. I am using SODDI version 1.0.

    PS: you could delete my previous irrelevant comments to your post!

    Thanks

  4. Pingback: The One Pound SQL Server: Dell Venue Pro 8 - Brent Ozar | Brent Ozar

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

css.php