How to Import the StackOverflow XML into SQL Server


UPDATE 2013: This code is no longer available. The size of the StackOverflow export has grown beyond what you can import with this method.

Want to play around with the StackOverflow database export?  Here’s how to import the XML files into SQL Server, and some notes about the tables and data schema.

Script to Import StackOverflow XML to SQL Server

This T-SQL script will create six stored procedures:

  • usp_ETL_Load_Badges
  • usp_ETL_Load_Comments
  • usp_ETL_Load_Posts
  • usp_ETL_Load_Users
  • usp_ETL_Load_Votes
  • usp_ETL_Load_PostsTags (which isn’t one of the StackOverflow tables – more on that in a minute)

The XML import code is from an excellent XML tutorial by Denny Cherry.  The scripts create one table per XML file (Badges, Comments, Posts, Users, Votes).  The schema matches the XML files with one exception: I added an identity field to the Badges table.  The rest already had Id fields.  The tables don’t have any indexes, so queries won’t be fast out of the box.

I would highly recommend that you not change the schema of any of these tables, because I’ll be giving out more scripts over the coming days and weeks that rely on the base tables.  If you want to add more data, add additional tables.  That approach keeps your importing clean anyway: you can dump and reload the StackOverflow data repeatedly as long as you keep your own data separate.
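For reference, each proc follows the same general pattern from Denny’s tutorial: bulk-load the XML file into an XML variable, then shred it into rows.  Here’s a simplified sketch for the Badges table – the file path, element names, and column types are illustrative, not the procs’ exact code, so check your own copy of the scripts:

```sql
-- Illustrative sketch of the load pattern, not the exact proc code.
DECLARE @xml XML;

-- Read the whole XML file in as a single blob.
SELECT @xml = BulkColumn
FROM OPENROWSET(BULK 'C:\StackOverflow\badges.xml', SINGLE_BLOB) AS b;

-- Shred the <row> elements into relational rows.
INSERT INTO dbo.Badges (UserId, Name, CreationDate)
SELECT r.value('@UserId', 'INT'),
       r.value('@Name',   'NVARCHAR(50)'),
       r.value('@Date',   'DATETIME')
FROM @xml.nodes('/badges/row') AS x(r);
```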

After importing, the database is about 2GB of data.  Be aware that depending on your database’s recovery model and how you run these stored procs, your log file may grow to about 2GB as well.

If the table already exists when the stored proc runs, the table contents are deleted using the TRUNCATE TABLE command, which requires hefty permissions.  If you don’t have admin rights on the box, substitute DELETE for the five TRUNCATE TABLE commands.  Using DELETE will take significantly longer to run.  For reference, with TRUNCATE TABLE, the stored procs take around 10 minutes on my faster machines, and around half an hour on my slower virtual machines.
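The swap is mechanical – shown here for Badges, and you’d repeat it for the other four tables:

```sql
-- Original: TRUNCATE TABLE requires ALTER permission on the table.
TRUNCATE TABLE dbo.Badges;

-- Substitute: DELETE only needs DELETE permission,
-- but it's fully logged and much slower on big tables.
DELETE FROM dbo.Badges;
```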

These stored procs only work for the new database dump released on Monday morning, not the one released last week.  If you get invalid XML errors while importing, you’ve got the older database dump.  Go get the fresh hotness.

Now for some schema notes, and I’m going to go out of alphabetical order because everything links back to the Users table.  I’m only going to cover the fields that aren’t immediately obvious:

Users Table

  • Id – primary key, identity field from the original StackOverflow database.  Id 1 is “Community”, which is a special user that denotes community ownership, like wiki questions and answers.
  • LastAccessDate – this is useful because it tells you when the data export was last updated.  If you’re doing queries for things like the last 30 days, check the most recent date here.
  • Age – the user enters this manually, so it’s not terribly reliable as I discovered earlier.
  • AboutMe – I’m using an nvarchar(max) field here, but you can go with a shorter field like nvarchar(2000).
  • UpVotes and DownVotes – the number of votes this user has cast.
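To find that export cutoff date, something like this should do once the tables are loaded:

```sql
-- The most recent LastAccessDate approximates when the export was generated.
SELECT MAX(LastAccessDate) AS ExportCutoff
FROM dbo.Users;

-- Use it as your anchor for "recent" queries, e.g. users active
-- in the 30 days before the export was cut:
SELECT COUNT(*) AS RecentlyActiveUsers
FROM dbo.Users
WHERE LastAccessDate >=
      DATEADD(DAY, -30, (SELECT MAX(LastAccessDate) FROM dbo.Users));
```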

Posts Table

In StackOverflow, questions and answers are both considered posts.  If a record has a null ParentId field, then it’s a question.  Otherwise, it’s an answer, and to find the matching question, join the ParentId field up to Posts.Id.
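That makes question-and-answer queries a self-join on Posts.  A quick sketch:

```sql
-- Pair each answer with its parent question via ParentId.
SELECT q.Id    AS QuestionId,
       q.Title AS QuestionTitle,
       a.Id    AS AnswerId,
       a.Score AS AnswerScore
FROM dbo.Posts AS q
INNER JOIN dbo.Posts AS a
    ON a.ParentId = q.Id
WHERE q.ParentId IS NULL;  -- questions only on the left side
```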

  • Id – primary key, identity field from the original StackOverflow database.
  • Title – the title of the question.  Answer titles will be null.
  • OwnerUserId – joins back to Users.Id.  If OwnerUserId = 1, that’s the community user, meaning it’s a wiki question or answer.
  • AcceptedAnswerId – for questions, this points to the Posts.Id of the officially accepted answer.  This isn’t necessarily the highest-voted answer, but the one the questioner accepted.
  • Tags – okay, time to blow out of the bullet points for a second.

StackOverflow limits you to five tags per question (answers aren’t tagged), and all five are stored in this field.  For example, for question 305223, the Tags field is “<offtopic><fun><not-programming-related><jon-skeet>”.  It’s up to you to normalize these.  If you’d like to normalize them out into a child table, check out the usp_ETL_Load_PostsTags stored proc, which creates a PostsTags table with PostId and Tag fields.  Each Posts record (questions only) will then have several child records in PostsTags.
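Once usp_ETL_Load_PostsTags has run, tag queries become simple joins against that child table.  For example, to count questions per tag (using the PostId and Tag fields described above):

```sql
-- Top tags by question count, using the normalized PostsTags table.
SELECT TOP 20
       pt.Tag,
       COUNT(*) AS QuestionCount
FROM dbo.PostsTags AS pt
GROUP BY pt.Tag
ORDER BY COUNT(*) DESC;
```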

Next, check the contents of the Tags field carefully.  StackOverflow allows periods in tag names, like the .net and asp.net tags, but in the database export the period comes through as “û” – so asp.net is stored as “aspûnet”.  Just something to be aware of.
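If you want the tags to read the way they do on the site, a REPLACE at query time is enough – assuming “û” is the only substituted character, which is all I’ve run into so far:

```sql
-- Translate the export's 'û' back into a period,
-- so 'aspûnet' reads as 'asp.net'.
SELECT Id,
       REPLACE(Tags, N'û', N'.') AS CleanTags
FROM dbo.Posts
WHERE Tags LIKE N'%û%';
```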

Comments Table

  • Id – primary key, identity field from the original StackOverflow database.
  • PostId – the parent post for this comment.  Joins to the Posts.Id field.
  • UserId – who left the comment.  Joins to the Users.Id field.

Badges Table

  • Id – an identity field for a primary key.  This number is meaningless – I just added it for some referential integrity.
  • UserId – joins back to Users.Id to show whose badge it is.
  • Name – the name of the Badge, like Teacher or Nice Answer.
  • CreationDate – when the user achieved the badge.

Votes Table

This stores the votes cast on posts, but the key field is VoteTypeId.  The VoteType table wasn’t included in the export, so this table isn’t too useful yet, but if the guys give me the OK I’ll post the contents of that table here.  The Votes table doesn’t include *who* cast the votes, and I’ve got my hands full analyzing the other tables anyway, so I haven’t been interested in the VoteTypes yet.

All of the Id fields except for Badges.Id are from StackOverflow’s original database.  In theory, these numbers will not change, which means if you build your own child table structures like UserBaconPreferences, and you join via User.Id, you should be able to blow away and reload the Users table with every new StackOverflow database dump.  That’s the theory, but in reality, you shouldn’t rely on anybody else’s ID fields, because there’s no reason to believe these won’t completely change down the road.  Who knows – Jeff might switch over to GUIDs as primary keys.

Sample Questions Query

Once you’ve got it all together, you can do some fun stuff. Let’s look at some overall statistics about questions (not answers):
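The query itself didn’t survive the trip into this page, but a sketch that produces the same kind of numbers looks like this – column names like ClosedDate are from the export schema as I recall, so adjust to match your tables:

```sql
-- Overall question statistics (answers excluded via ParentId IS NULL).
SELECT COUNT(*)                           AS Questions,
       AVG(CAST(p.Score        AS FLOAT)) AS AvgScore,
       AVG(CAST(p.ViewCount    AS FLOAT)) AS AvgViewCount,
       COUNT(DISTINCT p.OwnerUserId)      AS DistinctQuestioners,
       AVG(CAST(p.AnswerCount  AS FLOAT)) AS AvgAnswerCount,
       AVG(CAST(p.CommentCount AS FLOAT)) AS AvgCommentCount,
       SUM(CASE WHEN p.ClosedDate IS NOT NULL THEN 1 ELSE 0 END)
                                          AS ClosedQuestions,
       AVG(CAST(u.Reputation   AS FLOAT)) AS AvgQuestionerReputation,
       AVG(CAST(u.Age          AS FLOAT)) AS AvgQuestionerAge
FROM dbo.Posts AS p
LEFT JOIN dbo.Users AS u
    ON u.Id = p.OwnerUserId
WHERE p.ParentId IS NULL;
```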

And some of the results are:

  • Questions – 176,137
  • Average Score – 1.89
  • Average View Count – 311
  • Distinct Questioners – 39,795 (meaning each person who has asked at least one question has asked an average of 4.4 questions – there may be some odd stuff in here around anonymous questions, though; I haven’t looked at that yet)
  • Average Answer Count – 4
  • Average Comment Count – 2.3
  • Closed Questions – 3,656 (or 2% of all questions)
  • Average Questioner Reputation – 1,506
  • Average Questioner Age – 30 (but remember, that’s unreliable)

I’m just getting started playing with it, and I’ll have a fun new StackOverflow statistics toy available for everybody to play with in a couple of days.  In the meantime, you can download the StackOverflow database dump via BitTorrent and download my ETL stored procs.

Update: Sample StackOverflow Queries in the SQLServerPedia Wiki

Jon Skeet had an excellent idea: we need a wiki to store interesting queries.  Wouldn’t you know, I happen to run one!  I added a section in SQLServerPedia for sample StackOverflow database queries.
