When starting on a Predictive Analytics engagement, a data mining modeler may need help getting pointed in the right direction: “Here, find me something good.” This may be the case even after countless hours of interviewing subject matter experts (SMEs) and DBAs¹. This is because ideally we would like to find things hidden in a mountain of data that people don’t already know. In such cases, the SME can only provide guidance, much like how Sacagawea provided guidance to Lewis and Clark.
At this stage of the game, these are the things to keep in mind:
- The purpose of predictive analytics is to help us make better decisions towards some goal.
- Goals are the purpose for the existence of a business or any sort of process or system, from making coffee to winning a war.
- A decision is an action we intend to take. When we actually execute on that decision, a series of events happen which impact the system.
- The impact consists of good and bad consequences, each of which is an event. A good decision has minimal bad consequences or at least doesn’t profoundly break something.
- Every event is the result of a set of conditions (each of which is an event itself).
- The idea behind Predictive Analytics is that we automatically identify sets of conditions that lead to events of interest.
The last item in that list means there are rules for everything that happens in the world. Water will freeze when the air pressure is one atmosphere and the temperature drops to 32 °F. I will go to work when it’s a workday, I’m physically capable of making the journey, and, of course, I have a job. I will lose weight when my intake of calories is less than my expenditure.
All of those examples in the previous paragraph are obvious to us, but Predictive Analytics is used to discover rules not obvious to us and then apply them to our decision-making process. However, since Predictive Analytics is not Artificial Intelligence, the effort still requires some human guidance. We need to ensure the effort does provide value towards the goals and that the necessary data is available and in a format consumable by the data mining algorithms.
Faced with the mountain of data, I take a step back and ask a generic question: What events happen together? A fundamental concept of common sense is “what fires together wires together”. I look at everything as an event. Purchasing lots of potato chips is an event. Purchasing lots of beer is an event. Having your blood pressure checked is an event. Being diagnosed with hypertension is an event. Being prescribed Capoten is an event. Failing to fill the prescription is an event. Then the person dying at age 45 is an event. One could even say that the human in question is an event of all his parts existing in that configuration.
All of those events happen to be somewhat related: a bad diet, hypertension, and an early death often go hand in hand. In other words, a bad diet and hypertension together are a good predictor of early death. But each of those factors probably exists in a separate database table. For example, my purchases are in some sales database, my doctor visits in another, and a record of my visits to the pharmacy in yet another. How can I find relationships such as these across an arbitrary set of tables?
One way is to combine a set of columns from across a set of tables into a table of object/value pairs and run that table through Predixion Software’s Associate function or the Analysis Services Association algorithm. The association algorithm is most popularly used for a widely known technique called “Market Basket Analysis”, which discovers products commonly purchased together.
The difference here is that the classic use case utilizes a “purchases” (or sales) table, which is conveniently already a single table and easily plugged into the Association algorithm. But if I wanted to correlate purchases at a deeper level, I may want to include data only available in other tables, such as the type of day of the purchase (holiday, weekend, weekday). With that information, I could then see things such as beer and vodka having a stronger correlation on holidays than beer and pizza have on weekdays (just making this up).
In that case, I’d need to add the type of day to my object/value table, and that attribute probably lives in another table (a Dates table). Now I would need to do some data preparation (data prep). Fortunately, the data prep isn’t difficult, requiring only relatively simple SQL that UNIONs attributes from various tables.
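As a minimal sketch of such a UNION, here is one run against an in-memory SQLite database from Python. The `Sales` and `Dates` tables, their columns, and the rows are all hypothetical, invented only to illustrate the shape of the query; a real engagement would point the same pattern at its own tables.

```python
import sqlite3

# Hypothetical schema and rows, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Sales (SaleID INTEGER, SaleDate TEXT, Product TEXT);
    CREATE TABLE Dates (DateKey TEXT PRIMARY KEY, DayType TEXT);

    INSERT INTO Sales VALUES
        (1, '2012-07-04', 'Beer'),
        (1, '2012-07-04', 'Vodka'),
        (2, '2012-07-05', 'Beer'),
        (2, '2012-07-05', 'Pizza');
    INSERT INTO Dates VALUES
        ('2012-07-04', 'Holiday'),
        ('2012-07-05', 'Weekday');
""")

# Each SELECT contributes one kind of attribute for a case (here, a sale).
# UNION stacks them into a narrow (CaseID, Attribute) table and drops duplicates,
# so the day type appears once per sale even though the join yields it per line item.
BAG_SQL = """
    SELECT SaleID AS CaseID, 'Product=' || Product AS Attribute
    FROM Sales
    UNION
    SELECT s.SaleID, 'DayType=' || d.DayType
    FROM Sales AS s
    JOIN Dates AS d ON d.DateKey = s.SaleDate
    ORDER BY 1, 2
"""

bag = conn.execute(BAG_SQL).fetchall()
for case_id, attribute in bag:
    print(case_id, attribute)
```

Prefixing each value with its attribute name (`Product=`, `DayType=`) is a design choice in this sketch: it keeps values from different source columns from colliding once they share a single column.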
The idea is to create one “bag o’ attributes” of all sorts: a narrow table of cases and an “open set” of the cases’ associated characteristics. UNION whatever attributes you think may provide insight. This stage is about exploring for relationships. Predictive Analytics is about relationships and the strength of the relationships.
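To get a rough feel for what “strength of a relationship” can mean on such a bag of attributes, the sketch below counts how often pairs of attributes occur in the same case and computes support and lift for each pair. This is plain pair counting, not the Analysis Services Association algorithm or Predixion’s Associate function, and the rows and the `pair_stats` helper are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (case, attribute) rows, shaped like the narrow table described above.
rows = [
    (1, "Product=Beer"), (1, "Product=Vodka"), (1, "DayType=Holiday"),
    (2, "Product=Beer"), (2, "Product=Pizza"), (2, "DayType=Weekday"),
    (3, "Product=Beer"), (3, "Product=Vodka"), (3, "DayType=Holiday"),
    (4, "Product=Pizza"), (4, "DayType=Weekday"),
]

def pair_stats(rows):
    """Return {(a, b): (support, lift)} for every attribute pair that co-occurs."""
    cases = defaultdict(set)
    for case_id, attribute in rows:
        cases[case_id].add(attribute)

    n = len(cases)
    single = defaultdict(int)  # number of cases containing each attribute
    pair = defaultdict(int)    # number of cases containing both attributes
    for attrs in cases.values():
        for a in attrs:
            single[a] += 1
        # Sorting gives each pair one canonical key order.
        for a, b in combinations(sorted(attrs), 2):
            pair[(a, b)] += 1

    stats = {}
    for (a, b), count in pair.items():
        support = count / n
        # Lift above 1 means the pair occurs together more often than
        # it would if the two attributes were independent.
        lift = support / ((single[a] / n) * (single[b] / n))
        stats[(a, b)] = (support, lift)
    return stats

stats = pair_stats(rows)
for (a, b), (support, lift) in sorted(stats.items(), key=lambda kv: -kv[1][1]):
    print(f"{a} & {b}: support={support:.2f}, lift={lift:.2f}")
```

On real data, pairs with high lift but very low support are usually noise, so a minimum support threshold would be a sensible addition before showing results to anyone.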
Most of the relationships will probably be obvious to the subject matter experts. But there should be relationships that will make the SMEs say, “I thought that may be the case” (confirming a suspicion with empirical evidence) or even “I didn’t know that” (learning something new). Discovery of new relationships could form the basis for entirely new strategies, or at least let us push on with a current strategy with confidence that it is the right path.
For the data mining modeler, the relationships point to areas worth exploring to a deeper level. This entails the procurement of more data and most likely more data prep. But at least there are clues as to where to focus valuable time and effort.
For another example of how I use the Association algorithm, please see my old blog, Picking Stocks with the Association Algorithm.
¹ Note that in the Predictive Analytics world, “DBA” is now an overloaded term. It can mean “Database Administrator” or “Doctor of Business Administration”.