Clustering entities in a project’s data mining domain is another of the more exploratory exercises I do for predictive analytics engagements. By “clustering”, I mean running the entities such as customers, products, or locations through the Cluster algorithm (at least that’s what it’s called in the SSAS world) before building the real predictive models.
Related to the notion of such exploratory exercises, please see my blog on using the Association algorithm to browse for correlations during the early modeling phases.
Many people see this exercise as a waste of time, expecting all the goodies to come from simply letting the data do the talking. Although there are occasions where an easily obtained dataset (i.e., one that doesn’t involve significant data prep – just something like a sales transaction table sitting out there) on its own does provide great data mining insights, that low-hanging fruit is limited and becoming increasingly rare. Here are several enticing benefits of performing this preliminary clustering exercise:
- It will often reveal more effective segmentation of entities than the “Categories” that come with the entities (e.g., an employee category in the database table). Meaning, the members of the clusters will behave more alike in relation to what we are trying to predict.
- The discovered similarities among members of the clusters can shed light on sources of additional data that could improve prediction performance, or on the forces driving what we are attempting to predict.
- Clustering can distill many attributes into a single attribute. For example, we could include in a decision tree an attribute named “Soccer Mom” as opposed to all the attributes that go into determining whether one is a parent who purchases a minivan for the purpose of shuttling around many children.
- Clustering can discover outlier entities. These are entities that end up with no strong probability of membership in any cluster while most others do. For example, a customer may not have an outlier level of purchases, but may be so unlike everyone else that it’s better to remove that customer from the data mining. An example would be a customer purposefully trying to undermine the data mining through engineered actions.
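The last two benefits can be sketched in code. The SSAS Cluster algorithm itself isn’t scriptable here, so this uses scikit-learn’s GaussianMixture as a stand-in, and the customer attributes (recency, frequency, monetary value) and the 0.55 affinity threshold are made up purely for illustration:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Hypothetical customer attributes: recency, frequency, monetary value.
# Two behavioral segments, regardless of any official "category" column.
customers = np.vstack([
    rng.normal([10, 5, 100], [2, 1, 20], size=(50, 3)),
    rng.normal([60, 1, 500], [5, 0.5, 50], size=(50, 3)),
])

gm = GaussianMixture(n_components=2, random_state=0).fit(customers)

# The discovered cluster label can stand in for many raw attributes
# (the "Soccer Mom" idea) when fed into a downstream decision tree.
cluster_label = gm.predict(customers)

# Outliers: members with no strong affinity to any one cluster.
affinity = gm.predict_proba(customers).max(axis=1)
outliers = np.where(affinity < 0.55)[0]
```

The single `cluster_label` column then joins the entity table as a derived attribute, and the `outliers` rows are candidates for removal before building the real predictive models.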
Categorizing things is the basis for how humans learn. It’s how we’re able to leverage our experiences to deal with whatever we’re currently facing. It’s how we infer that creatures with wings can fly even though we’ve never seen the creature before, much less actually witnessed it flying, or that an unidentified brown-colored soda will probably taste like cola and not strawberry.
Our brains are constantly in the process of sorting whatever objects, physical or conceptual, we are currently experiencing into buckets we’ve shaped through years of our unique experiences. The value of this ability is that as we encounter new things through our daily lives, such as people or experiences, we can predict with some level of certainty that this new thing will act and react like these similar things we’ve seen before.
That’s a profoundly powerful tool. For example, if we encounter an animal of a type we’ve never seen before, how should we behave? If it growls and has big teeth, we should act as if it were a bear or cougar, attempting to scare it off. That behavior wouldn’t be wise if it resembles something we’re seeking to eat, such as a deer or rabbit, which would be frightened away if we treated it like a cougar.
Categorizing is a process that mixes conscious and subconscious thought. It may be conscious because we’re trained to spot the characteristics of certain things; if we’re a doctor, we’re trained to recognize symptoms. But most of our categorizing is subconscious. We think about it no more than we think about driving while we’re driving.
Consciously or subconsciously, we assign labels to everything ranging from its color to its cost. However, in Business Intelligence or Predictive Analytics we usually think of a category as an entity attribute actually named “Category” or “Class” or “Sub-category”; for example, the category of a product. Everything else is just “something we know about it”. The problem is that these simple labels in themselves don’t provide much value towards inferring something. For example, do all blue things behave similarly or are all expensive things purchased in similar patterns?
In reality, we subconsciously categorize things into concepts or metaphors much more complex than height or hair color. For example, we may categorize Pete as being like Robert, which means much more than “Pete looks like Robert”. “Robert”, being a human, is a complicated concept. It means given the same action, I expect Pete to react similarly to Robert.
If we examined the category called “Robert”, we would find that it is composed of attributes that are relevant to things of predictable value, or at least have proven to be so in the past. The attributes may include Robert’s demeanor, the way he dresses, his job, the number of children he has, or his culture. Things of “predictable value” refer to the results of actions I may take on Pete such as hiring him.
Such categorizations are usually subconsciously made. Unless I’ve just taken a Myers Briggs course, I don’t believe I consciously assess who someone I’m just meeting reminds me of so that I can bestow all the attributes of that person upon this new person. I know it happens though because if it didn’t happen I would probably just start treating all people exactly the same way, which in theory may sound like a good thing, but in practice it isn’t.
On the other hand, creating categories such as product or employee categories is mostly a conscious effort. Such categories are created by business analysts and are based on a combination of factors such as legal requirements, physical similarities, and perceived utilization. Once these categories are created, they can create inertia similar to that of first impressions of people, which are hard to subconsciously modify. Departments and distribution systems can be built around the categories, presenting significant friction in the face of a changing world.
The problem with these made-up categorizations is that they are usually not up to date. They were probably created years ago, and since then the items in the category have probably evolved independently – sort of like identical twins starting out with the same genes but diverging into very distinct people. Products can change due to new and removed features driven by changing market needs. There can be new uses for a product, which drive how it is marketed.
So when I begin a predictive analytics engagement, I’m usually advised of the tremendous value of some “category” attribute. It probably does add value to the quality of predictions, at least to some extent. However, at this point it’s important to remember these two things:
- The “value” it contributes isn’t a black or white, yes or no thing. It’s a “for some yes, for some no, for some maybe” thing.
- The purpose of categorizing in the context of predictive analytics is to be able to make inferences: If it looks like a bird, it can fly.
For superior predictive analytics, we need to create categories where the members truly belong. Some members of a category may indeed still fit; some may not. Think of the one sibling who moves to the big city and can never really go back home. The members that no longer fit will not only muddy up the patterns, yielding inferior predictions for the members who do belong; those unfit members will themselves receive predictions that could be comically wrong.
People learn by constantly and effectively categorizing and re-categorizing. If our ability to categorize is faulty (not malleable – set in our ways; out of date – lazy; or wrong – ignorant), we’ll usually come to wrong conclusions. The same holds for predictive analytics. Running the entities of the domain through the clustering algorithm at the beginning, and periodically thereafter, will help ensure that the categorization of the entities is up to date and that the prediction models make better inferences, and subsequently better predictions, from it.
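One way to sketch that periodic check is to compare the legacy “category” labels against what clustering discovers from current behavior. This is not the SSAS tooling but a scikit-learn illustration, and the legacy labels and behavioral attributes are fabricated to simulate a category that has drifted (a quarter of the entities now behave like the other category):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(1)
# Hypothetical legacy "Category" column assigned years ago: 50 per category.
legacy_category = np.repeat([0, 1], 50)
# Current behavior: 25 members of category 1 now behave like category 0.
behavior = np.vstack([
    rng.normal(0.0, 0.5, size=(75, 4)),
    rng.normal(4.0, 0.5, size=(25, 4)),
])

discovered = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(behavior)

# Agreement between the old labels and what the data now says:
# near 1.0 means the legacy categories still hold; near 0 means they're stale.
agreement = adjusted_rand_score(legacy_category, discovered)
```

Re-running a check like this at the start of an engagement, and periodically thereafter, gives an early signal that the official categories have diverged from how the entities actually behave.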
Validating the effectiveness of the clusters is a subject for a subsequent blog; in fact, part of a multi-part series I hope to write over the next couple of months.
All these exploratory exercises (this clustering suggestion as well as the association suggestion) may seem like a lot of trouble for those who thought Predictive Analytics was ready for everyone. From a consumer’s point of view it is, since it touches all of our lives every day whether we know it or not.
From a practitioner’s point of view, I can’t say it is for everyone yet, once you get past the kicking-the-tires stage (for example, taking a silo of data such as a transaction table and running it through one of the algorithms). Predictive Analytics is not Artificial Intelligence; at least not just yet. Meaning, doing something such as predicting the future, the toughest thing to ask of a human or computer, is a partnership between humankind and machine, where humans still do the bulk of the “skill” work. When machines eventually do the vast majority of the skill work, I’d say it’s then called Artificial Intelligence.
On the other hand, Predictive Analytics is much more accessible to non-Quants than many texts, with their scary math and seemingly nonsensical jargon, would have you believe. Sure, writing your own algorithms does require a great deal of advanced math, but I never believed one needed crazy math skills to build highly usable data mining models. A great deal of common sense (the kind Billy Ray Valentine has in “Trading Places”), a mastery of database technologies (ETL is still where the heavy lifting is), lots of patience and tenacity, and the new generation of tools such as the combo of Predixion Software’s Insight and PowerPivot will get you there for most practical purposes.