UPDATE 2013: This code is no longer available. The size of the StackOverflow export has grown beyond what you can import with this method.
Want to play around with the StackOverflow database export? Here’s how to import the XML files into SQL Server, and some notes about the tables and data schema.
Script to Import StackOverflow XML to SQL Server
This T-SQL script will create six stored procedures:
- usp_ETL_Load_PostsTags (which isn’t one of the StackOverflow tables – more on that in a minute)
The XML import code is from an excellent XML tutorial by Denny Cherry. The scripts create a table (named Badges, Comments, Posts, Users, Votes) for each XML file. The schema matches the XML file with one exception – I added an identity field to the Badges table. The rest already had Id fields. The tables don’t have any indexes to speed querying. I would highly recommend that you not change the schema of any of these tables, because I’ll be giving out more scripts over the coming days and weeks that rely on the base tables. If you want to add more data, add additional tables. Plus this will keep your importing clean anyway – you can dump and reload the StackOverflow data repeatedly as long as you keep that data separate.
After importing, the database is about 2gb of data. Be aware that depending on your database’s recovery model and how you run these stored procs, your log file may be 2gb as well. None of the sentences in this paragraph blend together well, which bothers me but not quite enough to stop publishing the blog entry. Anyway, on we go.
If the table already exists when the stored proc runs, the table contents are deleted using the TRUNCATE TABLE command, which requires hefty permissions. If you don’t have admin rights on the box, substitute DELETE for the five TRUNCATE TABLE commands. Using DELETE will take significantly longer to run. For reference, with TRUNCATE TABLE, the stored procs take around 10 minutes on my faster machines, and around half an hour on my slower virtual machines.
These stored procs only work for the new database dump released on Monday morning, not the one released last week. If you get invalid XML errors while importing, you’ve got the older database dump. Go get the fresh hotness.
Now for some schema notes, and I’m going to go out of alphabetical order because everything links back to the Users table. I’m only going to cover the fields that aren’t immediately obvious:
- Id – primary key, identity field from the original StackOverflow database. Id 1 is “Community”, which is a special user that denotes community ownership, like wiki questions and answers.
- LastAccessDate – this is useful because it tells you when the data export was last updated. If you’re doing queries for things like the last 30 days, check the most recent date here.
- Age – the user enters this manually, so it’s not terribly reliable as I discovered earlier.
- AboutMe – I’m using an nvarchar(max) field here, but you can go with a shorter field like nvarchar(2000).
- UpVotes and DownVotes – the number of votes this user has cast.
In StackOverflow, questions and answers are both considered posts. If a record has a null ParentId field, then it’s a question. Otherwise, it’s an answer, and to find the matching question, join the ParentId field up to Posts.Id.
- Id – primary key, identity field from the original StackOverflow database.
- Title – the title of the question. Answer titles will be null.
- OwnerUserId – joins back to Users.Id. If OwnerUserId = 1, that’s the community user, meaning it’s a wiki question or answer.
- AcceptedAnswerId – for questions, this points to the Post.Id of the officially accepted answer. This isn’t necessarily the highest-voted answer, but the one the questioner accepted.
- Tags – okay, time to blow out of the bullet points for a second.
StackOverflow limits you to five tags per question (answers aren’t tagged), and all five are stored in this field. For example, for question 305223, the Tags field is “<offtopic><fun><not-programming-related><jon-skeet>”. It’s up to you to normalize these. If you’d like to normalize them out into a child table, check out the usp_ETL_Load_PostsTags stored proc, which creates a PostsTags table with PostId and Tag fields. Each Posts record (questions only) will then have several child records in PostsTags.
Next, check the contents of the Tag field carefully. StackOverflow allows periods in the tag, like the .NET tag and ASP.NET tag. However, in the database, these are stored as “aspûnet”. Just something to be aware of.
- Id – primary key, identity field from the original StackOverflow database.
- PostId – the post parent for this comment. Joins to the Post.Id field.
- UserId – who left the comment. Joins to the User.Id field.
- Id – an identity field for a primary key. This number is meaningless – I just added it for some referential integrity.
- UserId – joins back to Users.Id to show whose badge it is.
- Name – the name of the Badge, like Teacher or Nice Answer.
- CreationDate – when the user achieved the badge.
This stores the votes cast on posts, but the key field is VoteTypeId. The VoteType table wasn’t included in the export, so this table isn’t too useful yet, but if the guys give me the OK I’ll post the contents of that table here. The Votes table doesn’t include *who* cast the votes, and I’ve got my hands full analyzing the other tables anyway, so I haven’t been interested in the VoteTypes yet.
All of the Id fields except for Badges.Id are from StackOverflow’s original database. In theory, these numbers will not change, which means if you build your own child table structures like UserBaconPreferences, and you join via User.Id, you should be able to blow away and reload the Users table with every new StackOverflow database dump. That’s the theory, but in reality, you shouldn’t rely on anybody else’s ID fields, because there’s no reason to believe these won’t completely change down the road. Who knows – Jeff might switch over to GUIDs as primary keys.
Sample Questions Query
Once you’ve got it all together, you can do some fun stuff. Let’s look at some overall statistics about questions (not answers):
SELECT COALESCE(COUNT(DISTINCT p.ID),0) AS Questions ,COALESCE(AVG(p.Score * 1.00),0) AS AvgScore ,COALESCE(AVG(p.ViewCount * 1.00),0) AS AvgViewCount ,COALESCE(COUNT(DISTINCT p.OwnerUserId),0) AS DistinctQuestioners ,COALESCE(AVG(p.AnswerCount * 1.00),0) AS AvgAnswerCount ,COALESCE(AVG(p.CommentCount * 1.00),0) AS AvgCommentCount ,COALESCE(AVG(p.FavoriteCount * 1.00),0) AS AvgFavoriteCount ,COALESCE(COUNT(ClosedDate),0) AS ClosedQuestions ,COALESCE(AVG(u.Reputation * 1.00),0) AS AvgQuestionerReputation ,COALESCE(AVG(u.Age * 1.00),0) AS AvgQuestionerAge ,COALESCE(AVG(u.UpVotes * 1.00),0) AS AvgQuestionerUpVotes ,COALESCE(AVG(u.DownVotes * 1.00),0) AS AvgQuestionerDownVotes FROM dbo.Posts p INNER JOIN dbo.Users u ON p.OwnerUserId = u.Id WHERE p.Tags IS NOT NULL
And some of the results are:
- Questions – 176,137
- Average Score – 1.89
- Average View Count – 311
- Distinct Questioners – 39,795 (meaning anyone who has asked a single question has asked an average of 4.4 questions – there may be some odd stuff in here around anonymous questions though, haven’t looked at that yet)
- Average Answer Count – 4
- Average Comment Count – 2.3
- Closed Questions – 3,656 (or 2% of all questions)
- Average Questioner Reputation – 1,506
- Average Questioner Age – 30 (but remember, that’s unreliable)
I’m just getting started playing with it, and I’ll have a fun new StackOverflow statistics toy available for everybody to play with in a couple of days. In the meantime, you can download the StackOverflow database dump via BitTorrent and download my ETL stored procs.
Jon Skeet had an excellent idea: we need a wiki to store interesting queries. Wouldn’t you know, I happen to run one! I added a section in SQLServerPedia for sample StackOverflow database queries.