StackOverflow released a public dump of their database this morning. Jeff Atwood and the guys believe that if you, the community, are putting the work into this huge body of knowledge, then you should be able to have rights to use it.
This is a great dataset to show off one of my favorite toys from the Microsoft SQL Server Data Mining team. In this half-hour video, Tom LaRock and I will walk you through data mining the StackOverflow user list to find out more about the users and see what makes the rockstar high-reputation users different from the worker bees like me.
If this looks interesting to you, here’s what else I’ve been doing with the StackOverflow data:
- StackOverflow Database Schema article at SQLServerPedia, the wiki I manage
- How to import the StackOverflow XML files into Microsoft SQL Server
Now, back to what I did in the video – let’s talk about the tools I used.
Microsoft’s Free Data Mining Tools
For today’s demo, I’m using SQL Server Analysis Services installed on my desktop. Relax – it’s really easy. Literally just install SQL Server 2005 or 2008 Developer Edition, check the box for Analysis Services, and use the defaults. You don’t have to know what you’re doing in order to get it up and running, and it just runs in the background as a service. After you’re done playing around, you can stop the service and set it to manual to prevent it from sapping your system resources. Go into Control Panel, Administrative Tools, double-click on the SQL Server Analysis Services service, and change the startup type to Manual.
Depending on your version of SQL Server and Excel, you’ll need one of these free plugins from Microsoft:
- Microsoft SQL Server 2005 Data Mining Add-Ins for Office 2007
- Microsoft SQL Server 2008 Data Mining Add-Ins for Office 2007
- And you can download the StackOverflow users spreadsheet shown in the video
If you want to avoid the whole SQL Server Analysis Services thing altogether, you can also use Microsoft’s free SQL Server Data Mining in the Cloud plugin. Be aware that it’s a technical preview, not a fully supported & released product. Their cloud servers can (and do) go down. Also know that your data is going into the cloud, which has its own ramifications as I’ve discussed in my previous cloud data mining tutorial.
What’s Coming Next: SQL Server 2008 R2 with BI in Excel
In the next version of SQL Server, Microsoft will deliver business intelligence to end users through Excel. At the Professional Association for SQL Server Summit last November, Donald Farmer demoed slicing and dicing of huge spreadsheets with real-time analytics that previously would have required some pretty hefty hardware.
Excel 2007 has a million-row limit, but the forthcoming version will not. Some of the StackOverflow export tables like Votes have more than a million rows, so we can’t yet data mine those using Excel as a front end, but we can play with the Users table today.
Subscribing or Downloading My Podcasts
If you have an MP3 player or a portable video player and you want to download my podcasts automatically, you can subscribe to the SQLServerPedia podcast feeds here:
- MP4 (Apple) Video Feed
- WMV (Microsoft) Video Feed
- MP3 Audio-Only Feed
- Zune One-Click Subscribe for Video
- Zune One-Click Subscribe for MP3
You can also download this video to watch it later: