Data Mining the StackOverflow Database

StackOverflow released a public dump of their database this morning. Jeff Atwood and the team believe that if you, the community, are putting the work into building this huge body of knowledge, then you should have the right to use it.

This is a great dataset to show off one of my favorite toys from the Microsoft SQL Server Data Mining team. In this fifteen-minute video, I’ll walk you through data mining the StackOverflow user list to find out more about the users and see what sets the rockstar high-reputation users apart from the worker bees like me.

If this looks interesting to you, here’s what else I’ve been doing with the StackOverflow data:

Now, back to what I did in the video – let’s talk about the tools I used.

Microsoft’s Free Data Mining Tools

For today’s demo, I’m using SQL Server Analysis Services installed on my desktop. Relax – it’s really easy. Just install SQL Server 2005 or 2008 Developer Edition, check the box for Analysis Services, and use the defaults. You don’t have to know what you’re doing to get it up and running, and it runs in the background as a service. After you’re done playing around, you can stop the service and set it to manual so it doesn’t sap your system resources. Go into Control Panel, Administrative Tools, Services, double-click on the SQL Server Analysis Services service, click Stop, and change the startup type to Manual.
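If you’d rather script that cleanup than click through Control Panel, here’s a minimal sketch that does the same thing with the Windows `sc` command. It assumes a default instance, whose service name is normally `MSSQLServerOLAPService` (named instances typically look like `MSOLAP$InstanceName`), so check the name in the Services list before you run it. The function names are mine and purely illustrative, and the commands need an elevated (administrator) prompt.

```python
import subprocess
import sys

# Assumption: default-instance service name. Named instances are
# typically "MSOLAP$<InstanceName>"; verify yours in the Services list.
DEFAULT_SSAS_SERVICE = "MSSQLServerOLAPService"


def build_commands(service_name=DEFAULT_SSAS_SERVICE):
    """Return the sc.exe commands that stop the service and set it to manual start."""
    return [
        ["sc", "stop", service_name],
        # sc.exe expects "start=" and its value as separate tokens;
        # "demand" is sc's name for the Manual startup type.
        ["sc", "config", service_name, "start=", "demand"],
    ]


def park_service(service_name=DEFAULT_SSAS_SERVICE):
    """Run the commands; requires Windows and an elevated prompt."""
    for cmd in build_commands(service_name):
        # check=False because "sc stop" returns non-zero when the
        # service is already stopped, which is fine for our purposes.
        subprocess.run(cmd, check=False)


if __name__ == "__main__":
    if sys.platform == "win32":
        park_service()
    else:
        print("This sketch only does anything on Windows.")
```

When you want to play with data mining again, start the service from the Services list (or with `sc start` and the same service name).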

Depending on your version of SQL Server and Excel, you’ll need one of these free plugins from Microsoft:

If you want to avoid the whole SQL Server Analysis Services thing altogether, you can also use Microsoft’s free SQL Server Data Mining in the Cloud plugin. Be aware that it’s a technical preview, not a fully supported, released product. Their cloud servers can (and do) go down. Also keep in mind that your data is going into the cloud, which has its own ramifications, as I’ve discussed in my previous cloud data mining tutorial.

What’s Coming Next: SQL Server 2008 R2 with BI in Excel

In the next version of SQL Server, Microsoft will deliver business intelligence to end users through Excel. At the Professional Association for SQL Server Summit last November, Donald Farmer demoed slicing and dicing of huge spreadsheets with real-time analytics that previously would have required some pretty hefty hardware.

Excel 2007 has a limit of about a million rows per worksheet (1,048,576, to be exact), but the forthcoming version will not. Some of the StackOverflow export tables, like Votes, have more than a million rows, so we can’t yet data mine those using Excel as a front end, but we can play with the Users table today.
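To find out which export tables will fit before you try to open them, you can count the rows first. This is a rough sketch, and it assumes the export is a set of XML files with one `<row>` element per record; check the files you actually downloaded and adjust `row_tag` if they differ. The function names and the file name in the usage note are illustrative only.

```python
import xml.etree.ElementTree as ET

# Excel 2007 allows 1,048,576 rows per worksheet.
EXCEL_2007_MAX_ROWS = 1_048_576


def count_rows(xml_path, row_tag="row"):
    """Stream through an XML export and count record elements.

    Assumption: each record is one <row .../> element under a single
    root element. The file is streamed, not loaded into memory at once.
    """
    context = ET.iterparse(xml_path, events=("start", "end"))
    _event, root = next(context)  # first event is the start of the root element
    count = 0
    for event, elem in context:
        if event == "end" and elem.tag == row_tag:
            count += 1
            root.clear()  # drop rows already counted to keep memory flat
    return count


def fits_in_excel_2007(row_count, header_rows=1):
    """True if the rows plus a header row fit on one Excel 2007 worksheet."""
    return row_count + header_rows <= EXCEL_2007_MAX_ROWS
```

For example, `fits_in_excel_2007(count_rows("users.xml"))` tells you whether a hypothetical `users.xml` can go straight into a worksheet.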

Subscribing or Downloading My Podcasts

If you have an MP3 player or a portable video player and you want to download my podcasts automatically, you can subscribe to the SQLServerPedia podcast feeds here:

You can also download this video to watch it later:
