Tag Archive: datamining

How to Import the StackOverflow XML into SQL Server

Want to play around with the StackOverflow database export?  Here’s how to import the XML files into SQL Server, and some notes about the tables and data schema.

Script to Import StackOverflow XML to SQL Server

This T-SQL script will create six stored procedures:

  • usp_ETL_Load_Badges
  • usp_ETL_Load_Comments
  • usp_ETL_Load_Posts
  • usp_ETL_Load_Users
  • usp_ETL_Load_Votes
  • usp_ETL_Load_PostsTags (which isn’t one of the StackOverflow tables – more on that in a minute)

The XML import code is from an excellent XML tutorial by Denny Cherry.  The scripts create a table (named Badges, Comments, Posts, Users, Votes) for each XML file.  The schema matches the XML file with one exception – I added an identity field to the Badges table.  The rest already had Id fields.  The tables don’t have any indexes to speed querying. I would highly recommend that you not change the schema of any of these tables, because I’ll be giving out more scripts over the coming days and weeks that rely on the base tables.  If you want to add more data, add additional tables.  Plus this will keep your importing clean anyway – you can dump and reload the StackOverflow data repeatedly as long as you keep that data separate.

After importing, the database is about 2gb of data.  Be aware that depending on your database’s recovery model and how you run these stored procs, your log file may be 2gb as well. None of the sentences in this paragraph blend together well, which bothers me but not quite enough to stop publishing the blog entry. Anyway, on we go.

If the table already exists when the stored proc runs, the table contents are deleted using the TRUNCATE TABLE command, which requires hefty permissions.  If you don’t have admin rights on the box, substitute DELETE for the five TRUNCATE TABLE commands.  Using DELETE will take significantly longer to run.  For reference, with TRUNCATE TABLE, the stored procs take around 10 minutes on my faster machines, and around half an hour on my slower virtual machines.

These stored procs only work for the new database dump released on Monday morning, not the one released last week.  If you get invalid XML errors while importing, you’ve got the older database dump.  Go get the fresh hotness.

Now for some schema notes, and I’m going to go out of alphabetical order because everything links back to the Users table.  I’m only going to cover the fields that aren’t immediately obvious:

Users Table

  • Id – primary key, identity field from the original StackOverflow database.  Id 1 is “Community”, which is a special user that denotes community ownership, like wiki questions and answers.
  • LastAccessDate – this is useful because it tells you when the data export was last updated.  If you’re doing queries for things like the last 30 days, check the most recent date here.
  • Age – the user enters this manually, so it’s not terribly reliable as I discovered earlier.
  • AboutMe – I’m using an nvarchar(max) field here, but you can go with a shorter field like nvarchar(2000).
  • UpVotes and DownVotes – the number of votes this user has cast.

Posts Table

In StackOverflow, questions and answers are both considered posts.  If a record has a null ParentId field, then it’s a question.  Otherwise, it’s an answer, and to find the matching question, join the ParentId field up to Posts.Id.

  • Id – primary key, identity field from the original StackOverflow database.
  • Title – the title of the question.  Answer titles will be null.
  • OwnerUserId – joins back to Users.Id.  If OwnerUserId = 1, that’s the community user, meaning it’s a wiki question or answer.
  • AcceptedAnswerId – for questions, this points to the Post.Id of the officially accepted answer.  This isn’t necessarily the highest-voted answer, but the one the questioner accepted.
  • Tags – okay, time to blow out of the bullet points for a second.

StackOverflow limits you to five tags per question (answers aren’t tagged), and all five are stored in this field.  For example, for question 305223, the Tags field is “<offtopic><fun><not-programming-related><jon-skeet>”.  It’s up to you to normalize these.  If you’d like to normalize them out into a child table, check out the usp_ETL_Load_PostsTags stored proc, which creates a PostsTags table with PostId and Tag fields.  Each Posts record (questions only) will then have several child records in PostsTags.

Next, check the contents of the Tag field carefully.  StackOverflow allows periods in the tag, like the .NET tag and ASP.NET tag.  However, in the database, these are stored as “aspûnet”.  Just something to be aware of.

Comments Table

  • Id – primary key, identity field from the original StackOverflow database.
  • PostId – the post parent for this comment.  Joins to the Post.Id field.
  • UserId – who left the comment.  Joins to the User.Id field.

Badges Table

  • Id – an identity field for a primary key.  This number is meaningless – I just added it for some referential integrity.
  • UserId – joins back to Users.Id to show whose badge it is.
  • Name – the name of the Badge, like Teacher or Nice Answer.
  • CreationDate – when the user achieved the badge.

Votes Table

This stores the votes cast on posts, but the key field is VoteTypeId.  The VoteType table wasn’t included in the export, so this table isn’t too useful yet, but if the guys give me the OK I’ll post the contents of that table here.  The Votes table doesn’t include *who* cast the votes, and I’ve got my hands full analyzing the other tables anyway, so I haven’t been interested in the VoteTypes yet.

All of the Id fields except for Badges.Id are from StackOverflow’s original database.  In theory, these numbers will not change, which means if you build your own child table structures like UserBaconPreferences, and you join via User.Id, you should be able to blow away and reload the Users table with every new StackOverflow database dump.  That’s the theory, but in reality, you shouldn’t rely on anybody else’s ID fields, because there’s no reason to believe these won’t completely change down the road.  Who knows – Jeff might switch over to GUIDs as primary keys.

Sample Questions Query

Once you’ve got it all together, you can do some fun stuff. Let’s look at some overall statistics about questions (not answers):

SELECT COALESCE(COUNT(DISTINCT p.ID),0)           AS Questions
       ,COALESCE(AVG(p.Score * 1.00),0)           AS AvgScore
       ,COALESCE(AVG(p.ViewCount * 1.00),0)       AS AvgViewCount
       ,COALESCE(COUNT(DISTINCT p.OwnerUserId),0) AS DistinctQuestioners
       ,COALESCE(AVG(p.AnswerCount * 1.00),0)     AS AvgAnswerCount
       ,COALESCE(AVG(p.CommentCount * 1.00),0)    AS AvgCommentCount
       ,COALESCE(AVG(p.FavoriteCount * 1.00),0)   AS AvgFavoriteCount
       ,COALESCE(COUNT(ClosedDate),0)             AS ClosedQuestions
       ,COALESCE(AVG(u.Reputation * 1.00),0)      AS AvgQuestionerReputation
       ,COALESCE(AVG(u.Age * 1.00),0)             AS AvgQuestionerAge
       ,COALESCE(AVG(u.UpVotes * 1.00),0)         AS AvgQuestionerUpVotes
       ,COALESCE(AVG(u.DownVotes * 1.00),0)       AS AvgQuestionerDownVotes
FROM   dbo.Posts p
       INNER JOIN dbo.Users u
         ON p.OwnerUserId = u.Id
WHERE p.Tags IS NOT NULL

And some of the results are:

  • Questions – 176,137
  • Average Score – 1.89
  • Average View Count – 311
  • Distinct Questioners – 39,795 (meaning anyone who has asked a single question has asked an average of 4.4 questions – there may be some odd stuff in here around anonymous questions though, haven’t looked at that yet)
  • Average Answer Count – 4
  • Average Comment Count – 2.3
  • Closed Questions – 3,656 (or 2% of all questions)
  • Average Questioner Reputation – 1,506
  • Average Questioner Age – 30 (but remember, that’s unreliable)

I’m just getting started playing with it, and I’ll have a fun new StackOverflow statistics toy available for everybody to play with in a couple of days.  In the meantime, you can download the StackOverflow database dump via BitTorrent and download my ETL stored procs.

Update: Sample StackOverflow Queries in the SQLServerPedia Wiki

Jon Skeet had an excellent idea: we need a wiki to store interesting queries.  Wouldn’t you know, I happen to run one!  I added a section in SQLServerPedia for sample StackOverflow database queries.

Brent Ozar

Brent specializes in performance tuning for SQL Server, VMware, and storage. He's one of the very few Microsoft Certified Masters of SQL Server, a published author, and a Microsoft MVP. He likes travel, Jeeps, Apple gear, jokes, and writing about himself in the third person. Read more and contact Brent.

Website - Twitter - Facebook - More Posts

StackOverflow Data Mining: Cleansing the Data

The first stage of mining is a dirty, ugly business.

My Datacenter

My Datacenter

Miners don’t emerge from tunnels bearing armfuls of shiny diamonds.  They come out with filthy, misshapen rocks that might be something valuable – but might be worthless junk.  There’s no way to tell what you’ve really got until you’ve spent some time analyzing and polishing.

Take one of my early findings in the StackOverflow database export: the average age of StackOverflow users is 31, but in May, the average age of the person asking a question tagged “hook” was 59.  That’s a serious deviation.  At the other end of the scale, people asking questions tagged “ec2″ had an average age of, uh, zero.  While there is the possibility that RockhardAwesome is hard at work building virtual machines in Amazon Ec2, I’m voting that one down.

That’s what I get for jumping into mining without cleaning off my rocks first.

Out of the 86,110 users in the database export, only 22,747 provided their age – and the key phrase is “provided their age.”  You can’t trust any data you get from human beings, especially these particular folks:

Ed – Age 256
svec – Age 109
deuseldorf – Age 89
Coding the Wheel – Age 89
Keng – Age 89
Will Dean – Age 89
kokos – Age 89
ColinYounger – Age 89
Lars Truijens – Age 89
dydx – Age 89
Confused Computer Guy – Age 89
Ian Kelling – Age 89
davr – Age 89
Smirking Liberal – Age 89
Sam Meldrum – Age 89
DrStalker – Age 89
Frans – Age 89
Mark Bessey – Age 89
Tony Andrews – Age 89
Pat – Age 89
J-P – Age 89
Simon – Age 89
danb – Age 89
dhislop – Age 89
Matt Rogish – Age 89
Josh – Age 89
pozdziemny – Age 89
chinna – Age 89
Alan Storm – Age 89
Joseph Ducreux – Age 89
jamesh – Age 89
toobstar – Age 89
markd – Age 89
Atif Aziz – Age 89
Peter Boughton – Age 89
que que – Age 89
DJ – Age 89
Cliff – Age 89
gaoshan88 – Age 89
King Avitus – Age 89
alden – Age 89
Alan – Age 89
yx – Age 89
ElephantMoss – Age 89
Loki – Age 89
Tautologistics – Age 89
Alkini – Age 89
h_power11 – Age 89
Click Upvote – Age 89
Salty – Age 89
Sean James – Age 89
kenneth – Age 89
ysangkok – Age 89
Pod – Age 89
Edward – Age 89
MedicineMan – Age 89
Heikki Toivonen – Age 89
Stuart – Age 89
ForceMagic – Age 89
Jane Sales – Age 89
hanesjw – Age 89
xx – Age 89
Silfheed – Age 89
noob source – Age 89
Snickers – Age 89
davefb – Age 89
markti – Age 89
sampablokuper – Age 89
afitzpatrick – Age 89
mishac – Age 89
Computer Security – Age 89
oofoe – Age 89
Tyler Egeto – Age 89
jeffa00 – Age 89
Nikola Jevtic – Age 89
Dave – Age 89
monkeysword – Age 89
wowus – Age 89
sgargan – Age 89
saidireddy – Age 89
Bobby Fever – Age 89
Zaakk – Age 88
Gary – Age 88
rlb.usa – Age 88
tan – Age 88
Kieranmaine – Age 88
Ainab – Age 88
Sleep Deprivation Ninja – Age 88
joelhardi – Age 87
Simon H – Age 86
Nick Hildebrant – Age 86
alanl – Age 84
Dustin – Age 81
jeffamaphone – Age 80
molf – Age 80

I applaud these folks for their civil disobedience, and curse them for same.  There’s an interesting underlying correlation: people who ask questions about hooks seem to be more likely to lie about their age.  I’ll leave that as an exercise for the reader.

On the bright side, I’ve found some other interesting bits of data, although these are still very much rocks that haven’t been cleansed yet:

  • Questions tagged beginner get significantly higher upvotes than other questions (avg 391, sitewide avg 120), which might indicate that if you wanted an upvoted question, write one for beginners.
  • Questions tagged routing, resources, video, programming or google are favorited more than twice as often as the average.
  • Questions tagged svn are asked by people who do more downvoting than other users (avg 18, sitewide avg 10).  Conversely, questions tagged vim or interop are asked by people who do more upvoting (avg 324 and 303, sitewide avg 119.)
  • Questions tagged homework are asked by younger users (avg age 24, sitewide question avg 29).  Makes sense.

I’ll dig more into this tomorrow, but now I’m off to see my dad to celebrate his 60th birthday.  Hmmm – you know, come to think of it, I haven’t actually seen his driver’s license…

Brent Ozar

Brent specializes in performance tuning for SQL Server, VMware, and storage. He's one of the very few Microsoft Certified Masters of SQL Server, a published author, and a Microsoft MVP. He likes travel, Jeeps, Apple gear, jokes, and writing about himself in the third person. Read more and contact Brent.

Website - Twitter - Facebook - More Posts

Data Mining the StackOverflow Database

StackOverflow released a public dump of their database this morning. Jeff Atwood and the guys believe that if you, the community, are putting the work into this huge body of knowledge, then you should be able to have rights to use it.

This is a great dataset to show off one of my favorite toys from the Microsoft SQL Server Data Mining team. In this half-hour video, Tom LaRock and I will walk you through data mining the StackOverflow user list to find out more about the users and see what makes the rockstar high-reputation users different from the worker bees like me.

If this looks interesting to you, here’s what else I’ve been doing with the StackOverflow data:

Now, back to what I did in the video – let’s talk about the tools I used.

Microsoft’s Free Data Mining Tools

For today’s demo, I’m using SQL Server Analysis Services installed on my desktop. Relax – it’s really easy. Literally just install SQL Server 2005 or 2008 Developer Edition, check the box for Analysis Services, and use the defaults. You don’t have to know what you’re doing in order to get it up and running, and it just runs in the background as a service. After you’re done playing around, you can stop the service and set it to manual to prevent it from sapping your system resources. Go into Control Panel, Administrative Tools, double-click on the SQL Server Analysis Services service, and change the startup type to Manual.

Depending on your version of SQL Server and Excel, you’ll need one of these free plugins from Microsoft:

If you want to avoid the whole SQL Server Analysis Services thing altogether, you can also use Microsoft’s free SQL Server Data Mining in the Cloud plugin. Be aware that it’s a technical preview, not a fully supported & released product. Their cloud servers can (and do) go down. Also know that your data is going into the cloud, which has its own ramifications as I’ve discussed in my previous cloud data mining tutorial.

What’s Coming Next: SQL Server 2008 R2 with BI in Excel

In the next version of SQL Server, Microsoft will deliver business intelligence to end users through Excel. At the Professional Association for SQL Server Summit last November, Donald Farmer demoed slicing and dicing of huge spreadsheets with real-time analytics that previously would have required some pretty hefty hardware.

Excel 2007 has a million-row limit, but the forthcoming version will not. Some of the StackOverflow export tables like Votes have more than a million rows, so we can’t yet data mine those using Excel as a front end, but we can play with the Users table today.

Subscribing or Downloading My Podcasts

If you have an MP3 player or a portable video player and you want to download my podcasts automatically, you can subscribe to the SQLServerPedia podcast feeds here:

You can also download this video to watch it later:

Brent Ozar

Brent specializes in performance tuning for SQL Server, VMware, and storage. He's one of the very few Microsoft Certified Masters of SQL Server, a published author, and a Microsoft MVP. He likes travel, Jeeps, Apple gear, jokes, and writing about himself in the third person. Read more and contact Brent.

Website - Twitter - Facebook - More Posts

Index Fragmentation Findings: Part 2, Size Matters

Last week, I blogged about the basics of SQL Server index fragmentation: why it happens, how to fix it, and how often people are fixing it.  I left you with a cliffhanger: it seemed that the frequency of defrag jobs didn’t appear to affect fragmentation levels:

  • Databases with no index defragmentation were an average of 5% fragmented
  • Monthly – 17% fragmented
  • Weekly – 3% fragmented
  • Daily – 6% fragmented

At first glance, that would seem to indicate that your database got worse off if you defragmented! But like all good novels (and most bad ones), the plot thickens.

Enter Data Mining with Excel and SQL Server

Data Mining with Open Source Tools

Your Grandfather In His Cubicle

Data mining is a lot like diamond mining, only there’s no monopoly on the market, and the ladies don’t seem to appreciate a quality KPI.  Otherwise, they’re identical: there’s a whole lot of money in it, but that money doesn’t usually go to the people who do the actual mining.  It goes to the executives and salespeople who take advantage of the mined products to make better decisions.

The people doing the mining, on the other hand, are forced to spend their lives in tiny, dark caves (or “cubicles”) trying to extract beautiful gems (or “data”) while risking painful lung ailments (or “carpal tunnel”) due to toiling with terribly unsafe and outdated hardware (or “hardware”).

For today’s demo, I will be the miner, and you’ll be the executive who takes advantage of my work. (It’s okay, I’m used to it – I work for a vendor now.)

In my podcast Data Mining with Excel in Four Minutes, I explained how to set up Microsoft’s free data mining add-ins for Excel 2007.  It’s an Excel plugin that hooks up to any SQL Server Analysis Services server on your network, either SQL Server 2005 or 2008, and makes data mining a point-and-click affair.  It doesn’t require high-end horsepower – even a desktop or laptop works great for this.  If you can’t be bothered to set up an SSAS instance, then check out my Data Mining in the Cloud writeup on how to get started without using a server at all.

Help SSAS Help You: Explain Your Numbers

While data mining is really easy to set up, you can get much better results if you “prequalify” your data and turn some of the numbers into basic categories.

If I was working with United States salary data, for example, my source data might have a column for Hourly Wage.  I would add another column and call it Tipped Employees:

  • Under $6.55 per hour – Tipped Employees  = Yes.  You can pay someone less than minimum wage if they get tips, and in that case, you really just can’t go by their hourly wage alone.
  • $6.55 per hour and over – Tipped Employees = “Unknown.”  In a perfect world, I’d have enough data to find out if these people get tips, but that’s not always the case.

By adding a new attribute to my data, something that’s not clear from the numbers alone, I might get better insight from my data mining efforts.

By the way, if you’re reading this and it’s after July 2009, the minimum wage has risen to $7.25 per hour.  If you’re a VB developer, you should immediately ask for a pay increase to match the new standard – unless of course they’ve got a tip jar by your desk.

Explaining Our Index Fragmentation Numbers

In the case of our index fragmentation numbers, one of the source data fields is Page Count – the number of pages that an object has.  Size matters with fragmentation: small objects with only a handful of pages may appear to have very high fragmentation numbers, but they can’t actually be defragmented.  There’s only so much defragmentation you can do when a table only has three pages.  I’ve actually been on support escalation calls where customers demand to know why a defrag job doesn’t reduce all types of fragmentation to absolute zero, even for tables with just one page.

Microsoft’s best practices on SQL Server 2000 index defragmentation notes that:

“Generally, you should not be concerned with fragmentation levels of indexes with less than 1,000 pages. In the tests, indexes containing more than 10,000 pages realized performance gains, with the biggest gains on indexes with significantly more pages (greater than 50,000 pages).”

With that in mind, I added a Page Count Group column and calculated it with a formula:

=IF(Table1[[#This Row],[page_count]]>50000,"Large",(IF(Table1[[#This Row],[page_count]]<10000,"Small","Medium")))

That adds a text label for Small, Medium or Large depending on the size of the table.

Suddenly, The Data Makes More Sense

Fragmentation Pivot Table

Fragmentation Pivot Table

Even before doing data mining, if we just add a Pivot Table, we can suddenly make more sense out of the numbers.

For Large tables, we see an average 44% fragmentation when the database has no defragmentation jobs set up.  Monthly defrag drops that to 14%, and daily drops it to just 2%!  The Weekly data is a bit of an outlier here, but it’s still less than no defrag jobs at all, so we’ll have to dig deeper.

For Medium tables, we see the type of data distribution we would hope for: the more often we defrag, the lower our fragmentation gets.

For Small tables, the data is all over the place, but we know why: it has to do with the way smaller tables behave.

Adding this bit of human interpretation helped us get better results from our data – and we haven’t even started mining!

More Reading on SQL Server Fragmentation

If you liked this article, check out:

Brent Ozar

Brent specializes in performance tuning for SQL Server, VMware, and storage. He's one of the very few Microsoft Certified Masters of SQL Server, a published author, and a Microsoft MVP. He likes travel, Jeeps, Apple gear, jokes, and writing about himself in the third person. Read more and contact Brent.

Website - Twitter - Facebook - More Posts

Contest: Data Mine the DMVs

The Microsoft SQL Server Data Mining team is looking for ways to help DBAs with data mining, and they came up with an interesting idea: data mine the DMVs to find interesting information.  I’m going to start by data mining index fragmentation statistics, and I need your help.

I’ve got a DMV query to gather information about indexes and index fragmentation.  It dumps out a list of objects (without names) and their fragmentation percentages.  I need you, buddy old pal, to run this query on your databases, export the results to CSV, and send it to me.  If you could, include a note about whether or not you use any index maintenance jobs, like SQL’s built-in index rebuild maintenance plans.

Your submitted data will be completely confidential – I’ll put it into my data warehouse without any information to tie it back to you.  It won’t even go to the Quest or Microsoft offices – actually, you’ll be emailing it to brento@brentozar.com because I don’t have a big enough inbox at Quest anyway, hahaha.  Yes, Quest sells index defrag software, but no, I won’t pass your contact info on to the sales team.  (I only have one reputation, and it’s worth more to me than my salary.)

I’ll draw names this Monday, January 26, and 5 random winners will get a SQLServerPedia t-shirt (or, if you prefer, one of my Twitter shirts).  To enter, run the below script on as many databases as you want, store each db’s results in a CSV file, and email it to me.  You get one entry per database, and yes, you can win more than once.  Now would be a good time to grab your nearest developer, tell them you want to train them to be a DBA, and get them to do all the querying for you!

The overall findings will be published in blog entries at SQLServerPedia.

Here’s the script.  This will grab some locks while it does the analysis, so I wouldn’t run this on your high-volume production boxes except after hours.  It only produces data for the current database, so the easiest way to run it is to set SSMS to export results to file, and then run it in each database.  Yes, this is a hassle.  No, I don’t expect you to do it in every database you have.  Don’t complain – if it was painless, I wouldn’t be offering t-shirts, hahaha.

SELECT  ix.[type]
,ix.[type_desc]
,ix.[is_unique]
,ix.[data_space_id]
,ix.[ignore_dup_key]
,ix.[is_primary_key]
,ix.[is_unique_constraint]
,ix.[fill_factor]
,ix.[is_padded]
,ix.[is_disabled]
,ix.[is_hypothetical]
,ix.[allow_row_locks]
,ix.[allow_page_locks]
,ps.*
FROM sys.dm_db_index_physical_stats (DB_ID(), NULL, NULL , NULL, N’Limited’) ps
INNER JOIN sys.indexes ix ON ps.object_id = ix.object_id AND ps.index_id = ix.index_id

Brent Ozar

Brent specializes in performance tuning for SQL Server, VMware, and storage. He's one of the very few Microsoft Certified Masters of SQL Server, a published author, and a Microsoft MVP. He likes travel, Jeeps, Apple gear, jokes, and writing about himself in the third person. Read more and contact Brent.

Website - Twitter - Facebook - More Posts

PASS Session on Analytics by Donald Farmer (#sqlpass)

George Box said, “Essentially, all models are wrong, but some are useful.”  Donald illustrated this point with a brief history lesson.  Early models of the universe said the sun revolved around the earth, and with that model, marvelous things were possible in architecture, celestial navigation and science.  Even though the model was wrong, it was still extremely useful.  Data modeling holds to that same statement: your models will never be right, but your key is to figure out which ones are useful enough.

Donald holds four things to be important: data mining must be actionable, innovative, trustworthy and seamless.  I think the innovative part is what keeps data mining in the back closet: data mining gives companies an edge, and when they get good at it, they don’t necessarily show it off for fear of losing their newfound competitive advantage.

He noted that last year, Microsoft’s BI lifecycle charts had steps for integration, reporting and analytics.  The new ones include steps for data entry, because that’s also a part of the process.  This points to how Excel is being integrated into the process because end users have data of their own that they weren’t necessarily integrating into the data mining lifecycle.  The IT team might not be able to integrate it in time, and the user wants to go go go.  The users want to take our historical data in the data warehouse, toss in some new data (like maybe about a new ad campaign) and make predictions about what’s going to happen.

He explained that clustering is the science of finding bad data by looking for outliers.  For example, your income data might fall into a model like this:

  • Young people have low income
  • Middle-aged people have high income
  • Older people have low income as they go into retirement

If Britney Spears comes in to apply for a loan, her data might be an outlier.  In data mining, you need to figure out if it’s valid data or not.  Malcolm Gladwell’s recent book Outliers wasn’t mentioned in the speech, but for a pop science version of data mining, take a look at it.

Donald showed how to use analytics to predict outliers in a web form and explained that you don’t have to hard-code the rules to find outliers.  For example, you don’t have to hard-code ages and income ranges.  That’s helpful in case your business changes dramatically – like if you merge with a bigger company with more customers, or move geographic ranges – you don’t have to recode your hard-coded business rules.  The database engine just uses the analytics to determine the rules.

In SSIS, you can add a data mining query step which is essentially making a prediction.  It calls out to an SSAS mining model to guess what your missing values are (or to create additional values) as part of the data flow.  You might have a sales promotion to be emailed to customers aged 25-35, but not all of your source data has the customer’s age.  A data mining task in SSIS could fill those gaps in your data.

He demoed the Excel Table Analysis Tools with the passenger manifest from the Titanic.  Very funny.  Later on, after the session, he happened to run across me and some other DBAs sitting in a hallway with our Macbooks open.  One question led to another, and next thing you know, he had his laptop open and we were data mining the Titanic survivors to see if older women were more likely to survive than younger women.  (As it turns out, the answer was no.)

Jeremiah Peschka of Facility9.com summed it up when he turned to me during the session and said that he wanted to go find out as much as he possibly could about data mining.  I’m fall into that same cluster, so to speak – data mining isn’t my “job” either, but it has so many cool benefits that I have to figure out how to integrate it into my workflow.  That was essentially the message in Donald Farmer’s session: predictive analytics works best when it’s an integral part of our daily jobs, built into the tools we use every day.

Brent Ozar

Brent specializes in performance tuning for SQL Server, VMware, and storage. He's one of the very few Microsoft Certified Masters of SQL Server, a published author, and a Microsoft MVP. He likes travel, Jeeps, Apple gear, jokes, and writing about himself in the third person. Read more and contact Brent.

Website - Twitter - Facebook - More Posts

SQL Server Data Mining in the cloud!

Fun stuff from the SQL Server Data Mining team – they’re showing off SQL Server data mining in the cloud.  You can connect to a Microsoft SQL Server Analysis Services machine from your local desktop with an Excel 2007 add-in, analyze your data on their servers, all at no charge (for now).

I just published a short demo of SQL Server Data Mining in the cloud on SQLServerPedia.com.

I am reeeeeally impressed.  Wiping the drool off my keyboard.

Brent Ozar

Brent specializes in performance tuning for SQL Server, VMware, and storage. He's one of the very few Microsoft Certified Masters of SQL Server, a published author, and a Microsoft MVP. He likes travel, Jeeps, Apple gear, jokes, and writing about himself in the third person. Read more and contact Brent.

Website - Twitter - Facebook - More Posts