George Box said, “Essentially, all models are wrong, but some are useful.” Donald illustrated this point with a brief history lesson. Early models of the universe said the sun revolved around the earth, and with that model, marvelous things were possible in architecture, celestial navigation and science. Even though the model was wrong, it was still extremely useful. Data modeling holds to the same principle: your models will never be right, but the key is to figure out which ones are useful enough.
Donald holds four things to be important: data mining must be actionable, innovative, trustworthy and seamless. I think the innovative part is what keeps data mining in the back closet: data mining gives companies an edge, and when they get good at it, they don’t necessarily show it off for fear of losing their newfound competitive advantage.
He noted that last year, Microsoft’s BI lifecycle charts had steps for integration, reporting and analytics. The new ones include steps for data entry, because that’s also a part of the process. This points to how Excel is being integrated into the process because end users have data of their own that they weren’t necessarily integrating into the data mining lifecycle. The IT team might not be able to integrate it in time, and the user wants to go go go. The users want to take our historical data in the data warehouse, toss in some new data (like maybe about a new ad campaign) and make predictions about what’s going to happen.
He explained that clustering is the science of finding bad data by looking for outliers. For example, your income data might fall into a model like this:
- Young people have low income
- Middle-aged people have high income
- Older people have low income as they go into retirement
If Britney Spears comes in to apply for a loan, her data might be an outlier. In data mining, you need to figure out if it’s valid data or not. Malcolm Gladwell’s recent book Outliers wasn’t mentioned in the speech, but for a pop science version of data mining, take a look at it.
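The session used the SSAS clustering algorithm for this; as a rough back-of-the-envelope sketch of the same idea in plain Python, here is a minimal k-means that flags the record farthest from its cluster center. The data, the initialization, and the flagging rule are all my own illustrative assumptions, not anything shown in the talk:

```python
import math
from statistics import mean, pstdev

def zscore_columns(rows):
    """Scale each column to zero mean / unit variance so age and
    income contribute comparably to the distance metric."""
    stats = [(mean(col), pstdev(col)) for col in zip(*rows)]
    return [tuple((v - m) / s for v, (m, s) in zip(row, stats)) for row in rows]

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, init, iters=25):
    """Minimal k-means with caller-supplied starting centroids
    (a real run would use k-means++ or several random restarts)."""
    centroids = list(init)
    for _ in range(iters):
        # Assign each point to its nearest centroid...
        labels = [min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
                  for p in points]
        # ...then move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, lbl in zip(points, labels) if lbl == c]
            if members:
                centroids[c] = tuple(mean(col) for col in zip(*members))
    return centroids, labels

# Made-up (age, income) records following the pattern above,
# plus one suspiciously rich 27-year-old.
records = [(22, 25_000), (25, 22_000), (27, 28_000),   # young, low income
           (45, 90_000), (50, 95_000), (55, 85_000),   # middle-aged, high income
           (68, 35_000), (72, 30_000), (75, 38_000),   # older, low income again
           (27, 200_000)]                              # the Britney-style record

norm = zscore_columns(records)
# Seed one centroid in each apparent age band.
centroids, labels = kmeans(norm, init=[norm[0], norm[3], norm[6]])
distances = [dist(p, centroids[lbl]) for p, lbl in zip(norm, labels)]
suspect = records[distances.index(max(distances))]
print(suspect)  # the record farthest from its cluster center
```

The flagged record still has to go through the “is this valid data?” judgment call from the paragraph above; the model only tells you it doesn’t fit the clusters.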
Donald showed how to use analytics to flag outliers in a web form, and explained that you don’t have to hard-code the rules to find them. For example, you don’t have to hard-code ages and income ranges. That’s helpful in case your business changes dramatically – like if you merge with a bigger company with more customers, or expand into new geographic regions – you don’t have to rewrite your hard-coded business rules. The database engine just derives the rules from the analytics.
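To make the “no hard-coded rules” idea concrete, here’s a sketch where the acceptable range is computed from the data at hand rather than baked into the code. The mean-plus-three-standard-deviations band is my own stand-in assumption, not what the SSAS engine actually computes:

```python
from statistics import mean, pstdev

def learned_range(values, k=3.0):
    """Derive an acceptable range from the data itself instead of
    hard-coding limits; after a merger or a demographic shift,
    retraining on the new data updates the rule automatically."""
    m, s = mean(values), pstdev(values)
    return m - k * s, m + k * s

# Made-up incomes pulled from existing customer records.
incomes = [25_000, 22_000, 28_000, 90_000, 95_000,
           85_000, 35_000, 30_000, 38_000]
lo, hi = learned_range(incomes)

def validate(income):
    """What a web form might call: warn (not reject) on outliers."""
    return lo <= income <= hi

print(validate(95_000), validate(5_000_000))  # True False
```

When the customer base changes, you rerun `learned_range` over the new data instead of recoding `if income > 150000` checks by hand.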
In SSIS, you can add a data mining query step which is essentially making a prediction. It calls out to an SSAS mining model to guess what your missing values are (or to create additional values) as part of the data flow. You might have a sales promotion to be emailed to customers aged 25-35, but not all of your source data has the customer’s age. A data mining task in SSIS could fill those gaps in your data.
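Here’s a toy Python version of that SSIS scenario. The customer rows, the `segment` column, and the per-segment average age standing in for a real SSAS mining model are all invented for illustration – the point is just the train-then-fill shape of the data flow:

```python
from statistics import mean

# Made-up customer rows; age is sometimes missing (None).
customers = [
    {"id": 1, "segment": "student", "age": 27},
    {"id": 2, "segment": "student", "age": 29},
    {"id": 3, "segment": "family",  "age": 38},
    {"id": 4, "segment": "family",  "age": 42},
    {"id": 5, "segment": "student", "age": None},
    {"id": 6, "segment": "family",  "age": None},
]

# "Train": learn a predicted age per segment from the complete rows.
known = [c for c in customers if c["age"] is not None]
model = {seg: mean(c["age"] for c in known if c["segment"] == seg)
         for seg in {c["segment"] for c in known}}

# "Predict": fill the gaps mid data flow, the way an SSIS data mining
# query step would by calling out to an SSAS mining model.
for c in customers:
    if c["age"] is None:
        c["age"] = model[c["segment"]]

# Now the 25-35 promotion can also target rows that arrived with no age.
eligible = [c["id"] for c in customers if 25 <= c["age"] <= 35]
print(eligible)  # [1, 2, 5]
```

Customer 5 had no age in the source data, but the learned per-segment prediction pulls it into the promotion anyway – which is exactly the gap-filling the SSIS data mining query step is for.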
He demoed the Excel Table Analysis Tools with the passenger manifest from the Titanic. Very funny. Later on, after the session, he happened to run across me and some other DBAs sitting in a hallway with our MacBooks open. One question led to another, and next thing you know, he had his laptop open and we were data mining the Titanic survivors to see if older women were more likely to survive than younger women. (As it turns out, the answer was no.)
Jeremiah Peschka of Facility9.com summed it up when he turned to me during the session and said that he wanted to go find out as much as he possibly could about data mining. I fall into that same cluster, so to speak – data mining isn’t my “job” either, but it has so many cool benefits that I have to figure out how to integrate it into my workflow. That was essentially the message in Donald Farmer’s session: predictive analytics works best when it’s an integral part of our daily jobs, built into the tools we use every day.