The Effect Correlation Score for KPIs

Introduction

Everything we do is intended to move us towards achieving some set of goals, ranging from just satisfying our immediate hunger to achieving a revolution. These actions are executed within a strategy (or hierarchies of strategies) with varying degrees of sensibility and organization. Presumably, we constantly assess the effectiveness of our efforts towards those goals and the rhyme and reason behind why we do them. After all, the world is ever-changing. It is a complex world where nothing we do takes place in a vacuum, and it is rife with the constraints of finite resources (costs: time, money, materials, other opportunities) that force us to find creative alternatives. Further, unlike our lives on the living room couch, activities in our work lives usually involve many people and other moving parts, and failure comes with consequences.

So we operate the execution of our enterprise strategies (the activities of our work lives) within a formal Performance Management framework in order to maximize our chances for success. The intent of Performance Management is to align the efforts of everyone and everything within an enterprise towards its goals. Properly implemented, it ensures we understand our purpose and our goals, develop coherent strategies, identify, assess, and mitigate risks, understand how to measure the effectiveness of our efforts, and develop sets of contingencies.

The problem is that developing the strategy and properly thinking it through seems more like an enterprise chore than the primary key to success that it is. Strategy development tends to place too much faith in trends of the recent past and in the assumption that the actions of our strategies will not trigger cascading, unanticipated effects. Cutting intellectual/analytical corners by failing to validate the strength of beliefs in the business (relationships between tactics and outcomes) or by failing to put more realistic effort into risk mitigation (too optimistic about things that haven’t yet happened) can easily derail the Performance Management effort.

The strategies also need to be novel in order to compete. I’d love a world where I knew exactly what everyone else was going to do because it’s a “best practice” (and therefore the responsible course of action). In such a world, our statistics-based predictive analytics models would work surprisingly well. Well, they sometimes do, then quickly stop working after everyone learns about the phenomenon and alters the statistics on which the model is based.

Developing strategy relies on the arsenal of cause and effect built up in our heads over the years. We draw on those experiences to imagine what will happen if we put a series of actions into play towards a goal. It’s when we execute on that strategy that we find the world will not simply go through katas (choreographed fighting) with us. I think the way Performance Management is implemented today doesn’t possess the focus on agility required to fight in a randori (free sparring) world. In this blog, I wish to introduce an aspect to address this in a way that avoids analysis paralysis but also serves as a better warning about the sanity of our strategies.

This blog is intended simply to introduce this concept of an “Effect Correlation Score”, something I try to include as part of performance management projects I engage in. I do think that comprehensive coverage of this concept could take up an entire book, as the purpose of the ECS is profound: to ensure our efforts are not just going as planned, but are still valid. The primary take-away is that we must be cognizant of whether our endeavors towards “good” KPI statuses are still valid and sensible in a world that is constantly changing and in which “business” is usually a competition.

Current State of KPIs

The basic elements of a Performance Management implementation are the Key Performance Indicators (KPIs). KPIs are just measured values. A good everyday example of a very basic, objective KPI is the speed of your car. The value, an easily obtained 55 mph, in itself only tells us what we’re doing. But in our minds we know that the value also helps us estimate the arrival time at our destination, tells us whether we are breaking any rules (the posted speed limit), and tells us how capable we are of passing a car within a given distance. Such concepts are conveyed to us through other “perspectives” of the KPI. Currently, those other perspectives include these three things:

  1. Target – This is the value we’re shooting for. In this example, if the posted speed limit is 55 mph, we should shoot for that speed.
  2. Status – This is a formula that provides a score quantifying how well we’re doing or not doing, as opposed to the Value itself, which is what we’re doing. In this case, if the posted speed limit (the Target) is 55 mph and we’re traveling at that speed, the status is good; at 45 or 65, not exactly bad, but you’re open to getting horns tooted at you or a ticket; at 25 or 95, you’re looking at a very dangerous situation, maybe involving getting rear-ended and more than just a ticket. The KPI status is often represented by a “traffic light icon”, where the green light is good, the yellow light is warning, and the red light is bad. (A rough sketch of such a banding formula follows this list.)
  3. Trend – This tells us how we’ve been doing lately. It may be that the status is currently good at 55 mph, but maybe we’ve been following a wide-load for the past hour, so that in reality, we haven’t been doing well lately. Interestingly, trends normally look at the recent past, but we can also look at the predicted near-future trend. Maybe we see the flashing lights of the wide load truck up the freeway.
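
To make the status perspective concrete, here is a minimal T-SQL sketch of a banding based on distance from the target speed. The SpeedReading table, its columns, and the thresholds are hypothetical illustrations, not a prescribed implementation.

```sql
-- Minimal sketch of a KPI status formula, assuming a hypothetical
-- SpeedReading table (ReadingTime, Mph) and a target of 55 mph.
-- The banding thresholds are illustrative assumptions only.
DECLARE @TargetMph FLOAT = 55.0;

SELECT
    ReadingTime,
    Mph,
    CASE
        WHEN ABS(Mph - @TargetMph) <= 5  THEN 'Green'   -- on target
        WHEN ABS(Mph - @TargetMph) <= 15 THEN 'Yellow'  -- horns tooted or a ticket
        ELSE 'Red'                                      -- dangerous
    END AS SpeedStatus
FROM SpeedReading;
```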

I’m proposing another “perspective” of KPIs I call the Effect Correlation Score (ECS). This is a measurement of the correlation between the measure of our efforts and the effect we hope to achieve; for example, happy employees correlate to superior product quality. I will explain how strategies can lose their integrity “in flight”, and how hard-headedly pursuing good statuses for KPIs whose conditions have changed can be detrimental if not devastating. However, unlike target, status, and trend, the ECS is really an attribute of the relationship between two KPIs.

KPIs and Strategies in the Complex and Competitive World

Making good decisions involves a balance of learning from the past and being cognizant (even paranoid) about our assumptions. Applying what we’ve experienced in the past to similar situations today is only sensible, and for the most part, it works. However, the world is complex and things change. Additionally, reacting to what statistics tell us is the best answer is precisely what our competitors (or enemies) want us to do (see my blog, Undermined Predictive Analytics).

Strategies that we build to achieve goals are made up of webs of cause and effect. For example, at an extremely high level, happy employees lead to higher quality products, which lead to happier customers, which lead to bigger revenue. Additionally, lower costs along with the bigger revenue lead to bigger profits. Each of these links (edges), such as “happy employees lead to higher quality”, is just a theory. It’s not always true. Maybe sometimes happy employees also become complacent employees. Maybe there is only so much that happiness can do to continue raising quality; even if happiness can be infinite, a product can only become so good.

There are many reasons why strategies stop working during the course of execution, beyond simply poor execution:

  • Our competitors adjust to counter our strategies. Unfortunately in business, one company’s gain is more often than not another company’s loss. If a company notices the weak points of a competitor and builds a strategy to capitalize on those weak points, the competitor will probably adjust. Therefore, that weak point, a major assumption of that strategy, fails to correlate with success.
  • Diminishing returns. Often, a tactic works because it is filling a void. For example, Vitamin C works wonders when our bodies are short of it, but the benefits of increasing Vitamin C diminish to practically nothing very quickly.
  • Desensitization – A variant of diminishing returns. A strategy could stop working because the driving force of incentive can sometimes disappear. Would all workers initially driven to over-achieving simply from being happy to have a job continue to over-achieve?
  • Non-linear relationships. This can often be the same as diminishing returns, but more complicated. For example, what if Linus Pauling were correct and mega doses of Vitamin C kick in other mechanisms, supporting health in ways other than avoiding scurvy? We would see a rising benefit, a tapering, then a rise again.
  • People learn how to game the KPIs. An example is a call center agent goaled on successful resolutions who hangs up as soon as the call seems to be very difficult.
  • Managers mistakenly “blow out” the KPI, thinking that if X is good, X*2 is great. In this case, managers run up the score to blow away the KPI, hoping that impressive performance will show they’ve outgrown their current position. What often happens is that this stellar performance, beyond anyone’s wildest dream, comes with side effects.

I realize it’s tough to even suggest that the components of a strategy should be changed in flight (between performance review periods) when conditions change. I’m sure we’ve all made a few snide remarks about a re-org of some sort that sometimes seems like just action for the sake of doing something (I think sometimes it is). There is an inertia that builds up, and every time we adjust course there are the costs of context-switching and taking at least one step back to take two steps forward. But if a theory (a correlation) ceases to be true, we can continue to plug away anyway, attempt to destroy the cause of the change, or blend in with the change.

Continuing to plug away at a strategy in the face of empirical evidence that a correlation is no longer valid is a cousin of the “definition of insanity” (Einstein’s doing the same thing over and over and expecting different results). Sometimes we can catch an agent of adverse change (at least adverse for us) early on and nip it in the bud; or catch it late and destroy it at great expense. But very often, victoriously vanquishing the enemy only delays the inevitable. What makes most sense is for us to embrace the change, trying as best as we can to guide the change to our favor.

I want to be clear that by changing the strategy, I don’t mean making wholesale changes. I mean at least starting incrementally by identifying the KPIs for which a “good” status indicator no longer correlates to desired outcomes. This is as opposed to just finding KPIs exhibiting “bad” status and manipulating your activity such that the status becomes good. That good status means nothing if that goodness doesn’t correlate to success anymore. Hopefully, like a good software design, the components of the strategy are loosely coupled so that we can replace isolated, poorly performing business processes with superior ones with minimal side-effects.

Note too that it’s not just a weakening of a correlation that could hamper the success of a strategy. Even a strengthening of a correlation could have adverse system-wide effects. For example, if gasoline were suddenly reformulated to convert chemical energy into heat more efficiently, that’s a good thing, but many components of a car would need to be redesigned to handle that change.

The Effect Correlation Score

This Effect Correlation Score of a KPI measures the relationship between a cause and its intended effect. It would most simply be implemented as the Pearson Product-Moment Correlation Coefficient (PPMCC), a measure of linear relationship described in my prior blog, Find and Measure Relationships in Your OLAP Cube. The PPMCC is the same calculation used by Excel and the CORRELATION MDX keyword. My blog describes what is required for such a calculation, although it describes the implementation in MDX, not SQL. (I just Googled sql pearson correlation and saw many good sites demonstrating the PPMCC with SQL.)
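
For reference, here is a minimal T-SQL sketch of the PPMCC between two related KPIs over time. The KpiHistory table, its columns, and the KPI names are hypothetical assumptions; the point is only that the score reduces to a straightforward aggregate query.

```sql
-- Hypothetical KpiHistory table: (KpiName, PeriodDate, KpiValue), one row
-- per KPI per period. The KPI names below are illustrative assumptions.
WITH Pairs AS (
    SELECT c.PeriodDate,
           CAST(c.KpiValue AS FLOAT) AS CauseValue,
           CAST(e.KpiValue AS FLOAT) AS EffectValue
    FROM   KpiHistory c
    JOIN   KpiHistory e ON e.PeriodDate = c.PeriodDate
    WHERE  c.KpiName = 'Employee Satisfaction'
      AND  e.KpiName = 'Product Quality'
)
SELECT
    -- Pearson r = (n*Sxy - Sx*Sy) / (sqrt(n*Sxx - Sx^2) * sqrt(n*Syy - Sy^2))
    (COUNT(*) * SUM(CauseValue * EffectValue) - SUM(CauseValue) * SUM(EffectValue))
    / NULLIF(
          SQRT(COUNT(*) * SUM(CauseValue  * CauseValue)  - SQUARE(SUM(CauseValue)))
        * SQRT(COUNT(*) * SUM(EffectValue * EffectValue) - SQUARE(SUM(EffectValue)))
      , 0) AS EffectCorrelationScore
FROM Pairs;
```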

Of course, the PPMCC is just a simple implementation of an Effect Correlation Score. Another of my blogs, Nonlinear Regression MDX, describes a score for more complicated but more accurate non-linear relationships. For very complex relationships, we could involve predictive analytics models (regressions are just very simple Predictive Analytics models), particularly the probability returned by a decision tree query.

The ECS should result in a value ranging from -1 through +1. That is the range of the PPMCC, and we could also use the 0 through 1 probability of a predictive analytics model. This range of -1 to +1 matches the conventional figures used for a status and trend. For the ECS, -1 should mean there is an inverse correlation (one goes up, the other goes down), and +1 should mean there is a direct correlation (they go up and down together). Anything in between is a grade of correlation, with the value of 0 meaning no correlation at all.

The ECS in the Performance Management Planning Process

When we’re sitting with our team in a conference room in front of a whiteboard for a Performance Management strategy planning session, we develop our strategies as webs of arrows pointing from one box (a cause) to another box (an effect), where each of those arrows implies an existing correlation. Again, happier employees lead to higher product quality. Or better upstream testing leads to lower costs.

Normally during the PM planning process, at some point before setting the strategy in stone, we need to determine how we can measure the values of each cause and each effect (KPI). If we can’t measure our actions, we’re running blindly. We also need to define what good and bad performance means (the status of the KPI) and the targets (which are really the goals of sub-objectives).

(As a side-note, some effects are causes as well – for example, higher product quality leads to better customer satisfaction, which leads to more return customers, and so on.)

For the ECS, what we should be sure to do is work on pairs of related KPIs, as opposed to determining how to obtain the values of all KPIs first, then the statuses of all KPIs, and so on. For each pair of related KPIs, we should validate that there is indeed a correlation between them (the cause and effect), because if there isn’t a correlation, our logical reasoning down the line could be faulty.

The correlation could be tested between either the statuses or the direct values of the KPIs. I tend to think it makes more sense to test the correlation between the statuses of the two KPIs since that is semantically richer. However, the status could be calculated such that the correlation (or lack of correlation) is misleading. So I usually run a correlation on both to see which makes more sense.

A problem could be that the values for KPI statuses over time are not recoverable. For example, there may not be a record of target values (usually a component of a KPI status calculation) for sales for all time periods (months, quarters), just the latest. But KPI values such as sales over time should always be recoverable to run a correlation.

Whichever is chosen, this is validation of at least part of the overall theory. For the future (during the course of the execution of the strategy), we would want to periodically check if the correlation continues to exist, which is the main idea of this blog.

Where Does the ECS Go?

On a Scorecard, there is one row for each KPI. Each row shows the KPI name, the value, the target, and a cute icon for the status and trend. They fit nicely on a single row because there is a one-to-one relationship between a KPI and its value, target, status, and trend. In reality though, most causes affect more than one thing.

However, in a Performance Management system, there is probably one primary effect for each cause. If we want to preserve the nice matrix display of one KPI per row, we could select the primary effect and add to each row the name of the primary effect and a cute icon for the ECS. For any actions we take, there is usually a primary reason. For example, I know right away that I need to earn money to pay the bills, although there are many other minor reasons.

I like to denote strong correlations (ECS value closer to 1) with green, no correlation (ECS value nearer to 0) with cyan, and an inverse correlation (closer to -1, one value goes up, the other goes down) with blue.
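
As a concrete illustration of that color coding, the banding might be implemented with a CASE expression like the one below; the KpiEcs table and the cut-off points of 0.5 and -0.5 are my own assumptions for this sketch.

```sql
-- Hypothetical color banding of an ECS value in the range -1 to +1.
-- The thresholds (0.5 and -0.5) are assumptions chosen for illustration.
SELECT
    CauseKpi,
    EffectKpi,
    Ecs,
    CASE
        WHEN Ecs >=  0.5 THEN 'Green'  -- strong direct correlation
        WHEN Ecs <= -0.5 THEN 'Blue'   -- inverse correlation
        ELSE 'Cyan'                    -- little or no correlation
    END AS EcsColor
FROM KpiEcs;   -- hypothetical table of ECS values per KPI pair
```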

Performance Considerations

This PPMCC calculation should be not much more expensive than a typical trend calculation, commonly implemented as a slope of recent values. That would require the SUMs of many facts over, say, the past few time periods (days, weeks, months, quarters, years, etc.) and a few squares and square roots of this and that. The PPMCC is really just a step beyond calculating a slope. In a relational database, the heavy work will be the IO involved in loading the same data required for the trend (slope); the PPMCC just performs a few more calculations on top of it. Predictive Analytics models should be even quicker since they are really distilled into a set of rules (from a large amount of data), as opposed to calculating off that raw data itself.
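
As a rough illustration of how little extra arithmetic is involved, the sketch below computes both a least-squares trend slope (KPI value against a time index) and the corresponding PPMCC from the same single pass of aggregates. The MonthlyKpi table and its columns are hypothetical.

```sql
-- One pass over a hypothetical MonthlyKpi table (MonthIndex INT, KpiValue FLOAT)
-- yields the aggregates needed for both the trend (slope) and the PPMCC.
WITH Agg AS (
    SELECT
        CAST(COUNT(*) AS FLOAT)                     AS N,
        SUM(CAST(MonthIndex AS FLOAT))              AS SumT,
        SUM(KpiValue)                               AS SumY,
        SUM(MonthIndex * KpiValue)                  AS SumTY,
        SUM(CAST(MonthIndex AS FLOAT) * MonthIndex) AS SumTT,
        SUM(KpiValue * KpiValue)                    AS SumYY
    FROM MonthlyKpi
)
SELECT
    -- Trend: least-squares slope of KpiValue against MonthIndex.
    (N * SumTY - SumT * SumY) / NULLIF(N * SumTT - SumT * SumT, 0) AS TrendSlope,
    -- PPMCC between MonthIndex and KpiValue: same sums, one extra SQRT and divide.
    (N * SumTY - SumT * SumY)
        / NULLIF(SQRT(N * SumTT - SumT * SumT) * SQRT(N * SumYY - SumY * SumY), 0)
        AS Ppmcc
FROM Agg;
```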

Which KPI Relationships Need an ECS?

Implementing these correlation scores is not a matter of indiscriminately running the PPMCC calculation between every pair of KPIs. Not all KPIs would benefit from an ECS; for that matter, not all KPIs require a target, status, or trend. In fact, an ECS for some KPIs could be completely meaningless or, worse yet, misleading. For example, it would not be entirely true to say that a great status for the revenue KPI results in a correspondingly improved status for the profit KPI. That’s because revenue is just one factor of profit.

An ECS doesn’t need to be constantly monitored either. At the least, we should periodically validate that relationships incorporated into our strategies are still valid and haven’t slipped into mythical status. We also need to ensure that granularity and seasonality are taken into account. Correlations may not exist at a day-to-day level but may appear at the monthly level. There may be seasonality involved, such that a correlation exists only at certain times of the year.

Correlations are Not Silver Bullets

If correlations such as those found in calculations like PPMCC values or in data mining models worked very well in reality, it would certainly be a different world. After all I wrote above, I do need to include a disclaimer about the folly of careless analysis of data. There are so many ways to misinterpret correlations, so many reasons why strong correlations could be meaningless, and so many ways a higher-level lack of correlation could mask a lower-level one of great consequence.

Nonetheless, when someone at work tells me something such as “an MBA does or does not equate to higher project management success”, I know I can run a few queries to help prove or disprove such a statement. I know I need to think through any possible misinterpretation. At some point I need to be satisfied that the numbers back up the statement, go against it, or are still inconclusive. If I do come to the conclusion that there is a positive correlation, then the possession of an MBA goes into the plus column for my strategy to hire superior project managers. My point is to continue checking that correlation to ensure that it remains true. Otherwise, as I mentioned above, our actions become a case of the definition of insanity.

Other KPI Ideas

There are many aspects of KPIs that are not sufficiently addressed beyond trend, status, and the ECS. For example, I mentioned earlier that trends normally show recent activity, allowing us to infer a future state of the KPI. But a predicted trend would help us better infer that future state. In fact, multiple future trends dependent on a range of possible inputs would be even better. Another example is that statuses could differ from different points of view. Could a sales KPI drastically exceeding expectations mean the same thing to the sales manager and to the inventory manager who must deal with the unexpected demand? Would it help for a sales manager to know that the sales KPI for the inventory manager is bad?

By the way, Map Rock does a lot of this stuff …


Cluster Drift

A deficiency I notice in practically every implementation of clustering (segmentation) is the snapshot mentality. For example, a vendor of a product would segment their customers in an attempt to isolate the ones who would be most likely to buy their product. This captures a snapshot of the groups of similar customers right now, but it doesn’t capture how the groupings and salient points of similarity have changed over time. It’s change that triggers action.

The clustering technique of data mining mimics our brain’s constant categorization process. Whenever we encounter something in life, such as being about to be caught in the rain, our brain notices sets of characteristics (dark clouds, far from home) and tries as hard as it can to match them to something encountered in the past, even if that often means pounding a square peg into a round hole. Once we find the closest match between what we are currently seeing and a group of similar phenomena from the past, we can reasonably think that what is associated with that past phenomenon applies to the current one. If those associations do hold true, that grouping gains strength; otherwise, we formulate a new grouping. That’s learning, the basis of our ability to apply metaphor.

Figure 1 illustrates one of the simplest forms of clustering, the Magic Quadrant. Here we see a bubble chart of countries in 2006 clustered on two dimensions – the barrels of oil consumed per 1000 people per year and the GDP/Capita. The size of the bubble actually conveys a third dimension of population.

Figure 1 – Magic Quadrants are a simple clustering technique.

The very simple bubble chart clusters countries into four distinct groups. In the lower-left corner are the countries with low GDP/Capita and low oil consumption. The huge circle in the lower-left is actually two countries, China and India (the huge populations). In the upper right are the high GDP/Capita, high oil consuming countries. The relatively large circle there is the USA. Generally, we attempt to cluster (categorize) for a reason. In this case, the primo quadrant is the lower-right: high GDP/Capita, low oil consumption.

Can you guess the country with the relatively large circle, the higher GDP/Capita, and the lower oil consumption?

I love Magic Quadrants on reports as a graphic. In fact, when implementing BI I’ve trained myself to think in terms of Magic Quadrants as my default view, as opposed to a bar chart or line graph. But they are limited to two dimensions, even though we could extend the clustering to 3D by taking into account the bubble size (population). We can see that the lower-right quadrant could easily be further clustered into those three larger circles and the other small ones.

However, the real world is more complex than that, and a magic quadrant quickly becomes an inadequate tool for helping us perform effective categorization. We need to consider many more factors when determining what something such as a potential customer or competitor is like. Arrays of magic quadrants help, but clustering works through that manual process automatically – well, once you’re able to provide the necessary factors.

Clustering as it’s usually implemented these days discovers these categories from scratch, usually as part of the lowest-hanging-fruit application of target marketing. In those implementations, the clustering disregards the aspect of time, forgetting the categories of the past as though they are completely obsolete – or, more importantly, as though the evolution of clusters doesn’t have any meaning or value.

People change over time. They get married, have children, are promoted at work, suffer injuries and disease, become weary from work, change careers, become religious or unreligious, become health conscious, become older and wiser or cynical. As with any change, these changes are driven by each person’s sequence of life events. Sometimes there is a typical progression, sometimes there are unexpected setbacks or windfalls. Companies and even countries go through changes as well.

It may seem that all that matters is what the customer is right now. The past is the past. What does it matter how we got here; we’re here? That’s like saying a photo of two baseballs crossing paths at roughly the same place in an instant of time is all that is important. But the past dictates the future. The baseballs have trajectories that represent more value than the simple fact that they are in relatively close proximity. Customers along similar paths are more likely to behave similarly.

Normally, this trajectory problem (acknowledgment that things change) is handled by “refreshing” the clusters periodically; for example, before a targeted marketing campaign. Refreshing the clusters using the current state of the customers means we know what they are right now. This may work well for an immediate marketing campaign, but what if I’m attempting to develop a product that will take months to get to market? Could I predict my audience for that time?

Sometimes change is not progressive. For example, we become completely different people in the presence of parents, co-workers, or a club where we have authority. In this case, change is dependent upon circumstance, not just a natural progression driven by time.

Whether changes are predictably progressive or not, people, countries, and companies change, and thus how they react to our efforts to service or engage them changes as well. So when I apply clustering during a predictive analytics engagement, I take the clustering down to a deeper level than just the entities we’re clustering. For example, how do these customers behave during recessions or boom times, under fierce competition or when they are secure? How do people behave on vacation versus at work?

For this blog, I simply created clusters for each country for each year. Instead of simply clustering countries using their current states, I consider each country during each year from 1971 through 2006 as separate cases. What I’m trying to identify are the levels of “development” among the countries. How do countries progress from tyranny or poverty to democracy and/or wealth? The cluster model incorporates four measures that seem like plausible measures of such terms as freedom, poverty, and wealth:

  • Enrollment Rate is the percentage of school-aged children enrolled in grade school. Education is certainly a measure of wealth, whether it causes it or is a result of it.
  • Life Expectancy, the fourth measure, appears alongside the others in the cluster results shown in Figure 2.
  • GDP per Capita has two forms. “GDP/Capita” is the GDP divided by the population of the country for each year. I chose to use a per-capita figure to eliminate the factor of the sheer size of a country; otherwise, for example, the USA would always be in a cluster of its own. Part of what I want to see is which countries live at the level of US citizens.
  • The other form is “GDP/Capita Rate”, the change in GDP/Capita from the average of the past four years to the current year. Moods are very different when things are improving or degrading, even if things are currently good or bad. (A SQL sketch of one way to derive this rate follows the list.)
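
Here is a minimal T-SQL sketch of deriving that rate with a window function. The CountryYear table, its columns, and the choice to express the “change” as a simple difference (rather than, say, a ratio) are assumptions for illustration.

```sql
-- Hypothetical CountryYear table: one row per country per year
-- (Country, Yr, GdpPerCapita), with GdpPerCapita assumed to be FLOAT.
-- GDP/Capita Rate here = current GDP/Capita minus the average of the prior four years.
SELECT
    Country,
    Yr,
    GdpPerCapita,
    GdpPerCapita
      - AVG(GdpPerCapita) OVER (
            PARTITION BY Country
            ORDER BY Yr
            ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING
        ) AS GdpPerCapitaRate
FROM CountryYear;
```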

Change, Categorization, Correlation and Map Rock

The clusters and graphs shown in this blog were created from a hodge-podge of SQL, SSAS data mining models, DMX, and Excel charts. I call this technique Cluster Drift. Cluster Drift is actually one of the techniques inherent to Map Rock. Map Rock’s core themes are:

  • Change is the primary trigger that gets our brains’ attention. Change is also used as a factor in the same sense that a key diagnostic question is “What has changed since … ?”
  • Categorization is recognition. We recognize things by categorizing them into things from the past. Categorization gives us the ability to create a fuzziness about things. We then don’t require direct hits in order to consider something recognized.
  • Correlation. Once we are alerted to something through change and recognize what is present through categorization, we then attempt to notice whether these things, or just a subset of them, were present together in the past. If they did happen together in the past, it’s reasonable to consider that anything else associated with them back then applies here as well.

I chose to present this technique without Map Rock, as I like to present techniques I’ve developed and implemented in their raw form. “Raw” means that my data prep involved importing data into SQL Server, performing a significant amount of transforms, generating data mining models until I found a set that yielded somewhat coherent results, and then running the models and pasting the results into Excel for analysis.

Data Mining Disclaimer

Before continuing, I need to digress and present this disclaimer, especially with this blog since I’m writing it during the time that all the NSA data mining crap is front-page news. So at this time, I want to be extra careful about anything said about data mining.

This blog is not about presenting any findings. It is about presenting a predictive analytics technique for which I’ve had significant success. Unfortunately, as predictive analytics provides a strategic advantage to my customers, I’m never at liberty to present the real data and results of my customers. Thus, the data I use here is downloaded from “free” sites and I really have no way (or time that I would invest if this blog were about the findings) to adequately validate it for the purposes of research. I certainly would have wanted to share both a data mining technique and research results, but there is only so much time. My hope is that the data is at least coherent enough to communicate this technique.

This is as good a time as any to mention that the data shown here only goes to 2008 because the quality of the data beyond 2008 was surprisingly just too strange.

Additionally, and I should include this paragraph in every blog I do on predictive analytics, predictive analytics should never become a substitute for thinking. It should be considered something like glasses or contact lenses, a tool that enhances our vision, nothing more. It is true that numbers don’t lie. But the human brain exists to resolve problems and with the pressures we are all under, we’re only too eager to buy into something that seems even remotely plausible so we can move on to resolving another problem on our plates.

There is so much data out there that I’m positive we could throw together a set of graphs to support anything conceivable. If one were to switch between Fox News and MSNBC a few times, one would immediately see what I mean. Therefore, the biggest discipline for helpful predictive analytics is the ability to not jump to conclusions.

Data Mining is not easy. It is extremely difficult. But most people who ask me about it think it is easy for the wrong reasons and hard for the wrong reasons. Many think it’s easy because they only know about the data gathering part and building the data mining models. They aren’t really aware of the less glamorous (it doesn’t have a cool market name like “Big Data”), extremely tedious “data prep” and validation stages. Validation is usually given short shrift. In fact, it’s a much more developed concept in Map Rock, where validation goes beyond testing the results against a test set to providing tools to “stress a correlation” discovered through analysis. Building the models is by far the easiest part, not much more than executing an algorithm. Procuring the data is still very difficult, but the tools and techniques are well-developed.

So focus on the notion that change is the key to engaging our attention and that change requires time. The data is just to support our conversation here. Phew, that disclaimer rivaled any big pharma commercial.

Cluster Drift

Back to my presentation. Using the Analysis Services Cluster algorithm, I generated a cluster model of ten clusters from data that includes about 120 countries from 1971 through 2006. As with most data mining, the vast majority of the results were uninteresting, very obvious, and sometimes confusing or contradictory. After perusing the results, I chose to focus on comparing the USA and Mexico for this blog since the resultant clustering of the two countries illustrates my main point.

Figure 2 shows a sample of the clusters. The selected clusters are the ones that the USA and Mexico belonged to at some points from 1971 through 2006.

Figure 2 – Clusters of the USA and Mexico. 

It’s easy to see the variance of values for each measure (look at the values for each measure from left to right). For example, Life Expectancy varies among the clusters in a range of about 30 years (51 through 80) and Enrollment Rate from about 37% to about 88%. However, the clusters aren’t too interesting in that for each cluster, the measures tend to go up and down together. It would have been very interesting to find a cluster where Life Expectancy is very high, but GDP/Capita is very low. We intuitively already know that high enrollment rate, high life expectancy, and high GDP would usually go hand in hand. Nothing new there.

Nonetheless, the clusters help measure when countries break thresholds that place them into different brackets. Figure 3 shows how the clusters of the USA and Mexico have drifted from 1971 through 2006.

Figure 3 – Cluster Drift of the USA and Mexico.

The row axis is the probability that the USA or Mexico belongs to the color-coded cluster. Looking at the USA in 1971, we see that the USA most closely fit Cluster 4, with about a .83 probability. What stands out about Cluster 4 is the higher GDP/Capita Rate. We also see a small probability of fitting into Cluster 5 in 1971, and we see that Cluster 5 trends upwards while Cluster 4 trends downwards until, around 1976, the USA became more Cluster 5 than Cluster 4. The main differentiator between Cluster 4 and Cluster 5 is that for Cluster 5, the GDP/Capita is much higher, but the GDP/Capita Rate slowed a bit. During the early 1980s, we see the USA start to resemble what the USA is today, Cluster 7, firing on all cylinders.

From the Cluster Drift of the USA, notice that around 1976 and 1985, the USA was in Cluster 5 at .7 probability. However, these were very different Cluster 5s: one on the way up from the lower but rapidly growing GDP/Capita of Cluster 4, and the other giving way to the very wealthy but more even GDP/Capita of Cluster 7.

Looking at Mexico, we see it starting out in Cluster 10 and moving on to Cluster 2, with its higher Enrollment Rate, Life Expectancy, and GDP/Capita. Interestingly, the variance of the GDP/Capita Rate of Cluster 10 is wider than it is for Cluster 2. My first thought would be that the transition from Cluster 10 to Cluster 2 may hint at signs of stabilization.

What is really interesting is that around the mid-1990s, Mexico started to fall into Cluster 4, where the USA was in 1971. And as of 2006, Cluster 5 began to trend upwards against Cluster 4. Does that observation hold in real life? Over the past ten years I’ve been to practically every corner of the USA, but to only Monterrey in Mexico. So I don’t know, but it’s precisely because I don’t directly know that I would resort to predictive analytics to arrive at a best guess from an indirect angle.

Figure 4 shows the USA against Japan. It’s interesting to see how what were the two biggest economies for a long period of time mirrored each other.

Figure 4 – Cluster Drift of the USA and Japan.

Contrast Japan’s mirroring of the USA to another “Cluster 7 2006” member, Portugal in Figure 5.

Figure 5 – Cluster Drift of Portugal.

Although the chart for Portugal suggests the people enjoy a similar quality of life to that of folks in the USA, Portugal’s grasp of Cluster 7 status isn’t very tight. It drifts between Cluster 5 and Cluster 7, whereas the USA and Japan are tightly in Cluster 7 despite Japan’s stagnant economy.

Figure 6 shows China. What is most interesting is that unlike the USA, Japan, and Mexico, from about 1975 on, China couldn’t fit very well (a probability of, say, over .85) into any cluster. China certainly is unusual due to its sheer population size alone. But the USA’s portion of the world’s economy is similarly disproportionate to China’s portion of the world’s population. Yet the cluster algorithm could fit the USA nicely into clusters much of the time.

Figure 6 – Cluster Drift of China.

China may be the 2nd biggest economy in the world today, but as of 2006, it hadn’t started to resemble the USA at any point as Mexico had (at least as the USA was in the 1970s – which was still very good). Remember, I didn’t include any cluster factors that speak to volume, such as GDP itself (not per capita) or population size. China may be the 2nd largest economy today, but its GDP/Capita is still very small. Notice, though, in the lower-right corner of the graph in Figure 6, that in 2003, Cluster 4 (as the USA was in 1971) starts to rise to a .1 probability.

Figure 7 shows the clusters China has bounced between, plus Cluster 7 (the USA in 2006) for comparison.

Figure 7 – China’s clusters plus Cluster 7 (current USA).

So What?

If it looks like a duck and quacks like a duck, it’s probably a duck, but there’s a chance it is a duck decoy. Predators getting prey to believe they are something they are not and prey getting predators to believe they are something they are not shaped life on Earth as it is today as much as the nature of water and the average temperature.

Whether we’re going to war, interviewing for a job, or making our pitch to a customer, our odds for success are greatly improved if we not only know more about the customers but also effectively differentiate ourselves from the competitors. If you’re a software vendor, you would want to know things such as:

  • How is the shape of my client-base changing?
  • How is the shape of my clients’ clients changing?
  • What are my competitors doing to outflank me?
  • What are the warning signs of the “death” of my customer?

For a software vendor, Cluster Drift is even more useful if the rate of birth and death in the domain served is high, such as with restaurants. Could we answer something like, “Are we the software vendor of choice for what is now the walking dead?” If we found the proper clustering model, we could study the cluster drift of restaurants about to die and either help prevent it or move on to something else.

At the end of the day, what we want is intelligence on the entities we deal with. We want to know the nature of our relationship with these entities and how they are changing so we don’t interact inappropriately. Therefore, our cluster models must consist of measures that characterize those relationships. This includes things such as the number and type of contacts, the number of successful and failed encounters, and the volume of business, all broken down by year so we can study the change in the relationship.

The most compelling predictive analytics are implemented within a Darwinian meritocracy where business is a competition. For example, one of the big criticisms I’ve heard about Microsoft as a software company is that there are many groups working on the same thing. One example would be something like the workflow aspects of SSIS, BizTalk, Workflow Foundation, and even Visio, developed redundantly throughout. It is true that it isn’t an optimal way to run a business, but it’s also beautifully Darwinian – attacking the same problem from different angles where each group is fighting for its own survival, eventually converging into a richer solution than had one group been charged with the problem. I know that sounds like heresy these days, but for a high-tech business that sort of Darwinian aspect seems to be a defining characteristic of an innovative entity, not a fault.

More Data Mining Caveats

It’s important to keep in mind as well that these clusters alone do not define a country or whatever is being clustered. This sounds obvious, but it’s fairly easy to get caught up in the fun of playing with these results. In reality, in our brains we cluster things from many angles. Most things operating in real life are too complex to be effectively captured via a single cluster. Some of these clusters are valuable, some not, either because they were misinterpreted from the start or have become obsolete.

The reason I love data mining is that it should force us to honestly reflect how we think and what drives us. But there is also a huge audience looking to data mining as alchemy; magic math that will reveal the secrets of the world providing riches grossly disproportionate to the effort invested (like winning the lottery). That magic math does indeed exist. The problem is that when we live in a world where intelligent creatures do not passively flow through life like trees and insects, the math will not calculate wins for everyone. Meaning, there is no way everyone can be the winner. Can a lion and gazelle both win? Life on Earth is based upon competition, an endless sequence of actions and changes.


Is OLAP Terminally Ill?

Someone told me yesterday that “OLAP is dead”. “Everyone is choosing tabular/in-memory.” I know it’s not dead, though maybe it’s at least sick. But did I underestimate the timing of the tipping point, the curve in the hockey stick, where the vast majority of users will “sensibly choose” the tabular/in-memory option over OLAP?

I realize some, including me, think this topic has been beaten to death. From the point of view that OLAP is my bread and butter, my top skill (I should have become a cop like my dad wanted), of course I took it to heart, and I take things (including “official” word from MSFT) with a grain of salt. But I also realize the person who told me this is very bright and knows the market. So I had to take a time-out today to revisit this issue as a reality check on my own professional strategy; a good thing to do every now and then.

When I first became aware of the OLAP is dead controversy a little over two years ago, I wasn’t too afraid of this since 256 GB of RAM was still really high-end. Today, 2 TB is “really high-end” (a few Moore’s Law iterations), well beyond the size of all but a few OLAP cubes I’ve dealt with (not even considering in-memory compression!).  And there were a couple of issues I still had not fully digested at that time.

One of those issues was not fully appreciating the value and genius of the in-memory compression. At first, I was only thinking that RAM with no IO is simply faster. But the compression/decompression cost that occurs in the CPUs, which results in a whole lot more CPU utilization, isn’t really much of a cost since those cores were under-utilized anyway. Another was the volatility issue of RAM. At the time, solid state memory was still fringe, and my thought was that even though volatility wouldn’t be much of an issue in the read-only BI world, it would be an issue in the OLTP world. Well, that doesn’t seem to be the case with Hekaton.

After thinking for much of the night, here are two key questions I came up with that will determine whether OLAP (specifically SQL Server Analysis Services OLAP) will die:

  1. Is the question really more about whether hard drives (the kind we use today with the spinning platters and all those moving parts) will become obsolete? RAM and/or flash could neutralize all the advantages of disks (cheaper, bigger, non-volatile) relatively soon.
  2. Will OLAP become minor enough in terms of utilization and product pull-through that Microsoft will no longer support a dev team? I can quickly think of a few Microsoft products with a strong but relatively small following that just didn’t warrant an infrastructure and were dumped.

An important thing to keep in mind is that there are really two separate issues. One is the underlying structure, OLAP versus in-memory; the other is tabular versus multi-dimensional. The first issue, the underlying structure, is a far more powerful argument for the death of OLAP. The underlying structure really will be seamless to the end-user, and it won’t require any guru-level people to implement properly, messing with all those IO-related options.

However, I still don’t completely buy the “tabular is easier to understand than multi-dimensional” argument. I buy it to the point that, yes, it is true, but I don’t think this is the way it should be. My feeling is that the multi-dimensional concepts encapsulated in MDX and OLAP are more along the lines of how we think than what is encapsulated with SQL and relational databases. What comes to mind is the many times I’ve engaged a customer with “thousands of reports” that were really variations of a couple dozen and were mostly replaced with a cube or two.

As a side note, one exercise I use to demonstrate the elegance of MDX is to think about the syntax of how Excel handles multiple dimensions. Excel is multi-dimensional, but capped at two dimensions. With a cap on dimensionality, it’s easy to use the A1 (column A, row 1) syntax. But what about three dimensions? A sheet (Sheet1$A1). Four dimensions? A different xlsx document. Five? A different directory. That’s not at all elegant. But MDX elegantly “scales” in the number of dimensions; it looks the same from zero through 128 dimensions.

The tabular model reminds me of when I started my OLAP career in 1998 as a developer on the “OLAP Services” (the SQL Server 7.0 version of Analysis Services) team at Microsoft. OLAP for SQL Server 7.0 was really just the core OLAP, no frills, just strong hierarchies and aggregations. It was very easy to understand, but users quickly hit walls with it. That reminds me of how VB was so easy to learn. One could learn to build pretty good applications quickly, but would run into problems venturing beyond the 80/20 point. Eventually .NET (C#/VB.NET) came along, still relatively easy to use (compared to C++), but still a quantum leap in complexity. For OLAP Services, that leap was SQL Server 2005 Analysis Services, with the weak hierarchies, many-to-many relationships, MDX Script, KPIs, etc.

I guess what I’m saying is this is a case of taking a step backwards to take two steps forward. The spotlight (tabular) isn’t currently on the high-end where I normally make my living. However, it doesn’t mean there isn’t a high-end. The high-end as we know it today (OLAP) will eventually die or at least severely morph, but requirements of yet unknown sorts on the low-end will push the complexity back up. How will Big Data affect the kinds of analysis that are done? Will 2 TB of RAM then be sufficient for the “masses”?

At the moment, I do believe that in terms of raw new BI implementations, tabular is giving a whooping to OLAP. It should, since the idea is to expand the accessibility of BI to a much broader audience. I’ve lived through the rise of Windows 3.1 and the dot-com crash. This is a minor disruption; it’s not as if I didn’t begin moving on years ago – in fact, skill-wise, I learned to always be moving on to some extent.

BTW, he also told me that “no one makes money on Big Data and that Predictive Analytics is limited to one or two people”. Those are in fact the two skills I’ve been shifting towards the past few years in light of the sickness of OLAP. While I really don’t know about the former claim (and would find it short-sighted to base strategy on that), I do have a couple of opinions on the latter:


Figure 1 – Even my cat wonders whether cubes could be on the way out.


Map Rock Problem Statement – Part 5 of 5

This is Part 5 of 5, the last of the Map Rock Problem Statement series. Part 4 closed with a discussion on the limitations of logic, in particular how the quality of our decisions is limited to whatever information we have available to us. No matter how smart we are, what we know is a drop in the bucket. We’ll often encounter situations for which an answer “does not compute” due to this imperfect information. I close this series with a short essay on Imagination. The previous installments can be viewed here:

  • Part 1 – Preface to the 5-part series.
  • Part 2 –  I describe Map Rock’s target audience and the primary business scenario for Version 1. It is not just a tool for quants, wonks, and power-users.
  • Part 3 – We delve into a high-level description of the Map Rock software application, where it fits in the current BI framework, and how it differentiates from existing and emerging technologies. This is really the meat of the series.
  • Part 4 – We explore strategy, complexity, competition, and the limitations of logic.

Imagination, the Key to Human Success

The more we know, the more we know we don’t know. That is the thinker’s equivalent of the “Serenity Prayer”. I try to remember that every day, but often fail to in the heat of troubleshooting a brainteaser of a bug or performance issue in my applications. My Western, Aristotelian, reductionist-science upbringing naturally forces me towards employing increasingly intense logic and a tighter grasp over the situation, to no success. With billions of my brain’s “CPUs” pegged at 100% (our brains are massively parallel), I finally resort to the counter-intuitive action of taking a long walk or going home to sleep while the problem is still very much there wreaking havoc on the world.

However, as hundreds of thousands of clusters of neurons in my brain wind down their Gatling Gun fire, the very faint signal of a “weak tie” still lingers. It is a faded event from my childhood so unrelated to this situation and illogical in this context that it’s “sensible” to dismiss as ludicrous. But I still playfully think, “What if … ?” I begin to imagine relationships that I had never thought of or learned. These imagined relationships plugged into the web of what I know would result in a diagnosis of my problem, from which I can then engineer a solution! I run back into the building, to my laptop, to test if these imagined relationships do indeed exist. After resolving the problem, with my room service Mobley Burger as a reward, I marvel again at how the reliance on just pure logic applied to what I know I know (at least what is at the forefront of my thoughts) failed me.

The weak memory served as a metaphor, a template from which I can creatively substitute similar pieces with their corresponding current pieces or even add or delete pieces. Old ideas may not apply in exactly the same way in a new situation. They are a starting point towards a solution. Meaning, that old memory is not a “direct hit” recognition of what we are currently experiencing. It’s technically a mistake, like how I keep mistaking David Proval for Al Pacino. Our brain has to support us in a complex world where nothing is certain. But it’s these fuzzy, “close, but no cigar” recognitions that are the basis for imagination, our primary weapon. 

It’s our ability to imagine that raises us above all other creatures on Earth in terms of our ability to manipulate the environment to our liking. The options our brains are capable of producing aren’t limited to hard-wired instinct or what relatively little we’ve experienced (in comparison to all there is to experience). Being bi-pedal, having opposable thumbs and a large brain are all just characters within the bigger story of how our imagination propelled humanity to being the current lords of this planet.

If something hasn’t yet happened, our knee-jerk reaction is that we believe it could never happen. Our brain doesn’t have that relationship, so any inferences we make are computed without that relationship. This is the limitation of logic. Logical conclusions are based on the facts that are available. The problem is that the world is a complex system and thus we never really have all of the facts required to make fool-proof predictions.

Imagination means we are able to draw a relationship between two or more things that were never explicitly stated to us or so weakly connected that we normally disregard it as a silly thought or even paranoia. This is how humans have our cake and eat it too. We have the power to apply logic through symbolic thinking, enabling us to manipulate the environment to our benefit (within a limitation of space and time), yet overcome the limitations of logic in a complex world.

When we think, we do more than just process values in a Bayesian manner. We also audition ideas by playing what-if experiments in our mind. We safely experiment in our heads before committing to a physically irreversible action. We can change our mind when we suspect that the cost of being wrong is too great. This ability means that the results of our thinking are not as mechanically pre-ordained as those of any purely logical device such as a machine, rigid equation, or set of best practices. Imagination breaks us out of the rather Newtonian condition of reptiles, birds, and insects, whose lives ride out a deterministic fate.

Armed with imagination, each human is a force of the universe, cognizant of our own deaths, thus not really willing to settle for just the greater good of the species – even though that would be admirable. We are capable of subverting the laws of physics through our contraptions. We can imagine an outcome other than what the vectors of each object dictate. We can manipulate other objects towards a more desirable outcome.

However, imagination is being expunged from our feature-list as a flaw. We are goaded into “best practices” and chastised for not following procedure, even if that insubordination resulted in a superior outcome. Imagination is seen as a childish characteristic, something that should be out-grown.

Why is this? Going back to the beginning of the Problem Statement series, the world is spinning faster and faster. There are a lot more people with a sufficient amount of wealth and information about all sorts of things, wanting more things faster. This equates to chaos. Now, Map Rock isn’t about how to fix this chaos. I honestly don’t have an answer for that. What I am offering is how to better compete in this chaos.

Map Rock addresses imagination by guiding the user through brainstorming activity, integration of rules from many sources (particularly Predictive Analytics models), and the SCL mechanism for inference from multiple perspectives.

The Will to Compete

Thinking is hard and usually associated with some sort of pain or problem we need to address. We engage the act of thinking to resolve problems. Similarly, hunger is painful. Our ancestors prior to agriculture walked many miles and risked their lives engaging much larger and stronger animals to alleviate the pain of hunger. If hunger weren’t unpleasant, I’d probably be thin since I wouldn’t care to go through the trouble of shopping and cooking. We’d often rather make ourselves believe that a solution currently in play, and whatever our beliefs resolve to, will in the end succeed than face a string of burdensome fire drills – i.e., keep telling yourself that and eventually you’ll believe it.

Thankfully, we can invent or learn a procedure to easily alleviate pain without needing to re-invent it the next time. These procedures (tips and tricks, best practices) can be followed to alleviate pain with a chance for success ranging from good to great without needing to engage thinking. These procedures are incorporated into our lives, programmed into our psyche. Any time someone tries something different, it can upset the delicate machine that is serving us well, forcing us to think.

But here is the kicker. Here are two major thoughts upon which life on Earth has evolved:

  1. Every creature makes its living devouring other creatures and every creature tries to reproduce (or grow) as much as it can, exerting constant pressures driving other things to take action to survive. “Grow or Die” as they say.
  2. The world is a complex system with so many moving parts that predicting anything beyond a very limited volume of space and time is at best unreliable.

Putting aside the bacteria that consume raw minerals off rocks and dirt, every creature hunts and devours other creatures. We are all both predator and prey, the hunters and the hunted. There are two sides to our intellectual lives: how to be better hunters and how to better avoid our hunters. Add to that the competitor relationship: peers in contention for the same resources. This predator-and-prey relationship, both trying to outdo each other, means that for life to keep going, creatures must be able to evolve to adapt to new tactics applied by the “enemy”. Creatures all try to over-reproduce because there is constant pressure upon them from their predators as well as peer-competitors.

We reactively adapt to change (forced upon us by our predators and competitors) and we proactively engineer and mold conditions to our favor. Players at the top are at the top because conditions favor their strategies. The conditions were either already there yet unseen or they were imagined and forged into place. In either case, the company at the top intuitively defends these current conditions since different conditions may not favor them anymore. They don’t know for sure if different conditions would indeed favor them because the world is too complex to really predict such things. That’s why we still run the experiments and we still “play the game” even if we’re pretty sure about who will win.

We can’t just sit idly by doing nothing because we don’t know for sure. We can’t stubbornly hold out for a “just one number” answer or refuse to take “It depends …” as an answer. We take a best guess by running our thought experiments safely in our heads, weeding out scenarios with the possibility of unacceptable outcomes, before we decide to take a physically irreversible action in the real world. Even if we’re wrong, meaning that during the execution of our action we recognize a negative condition (intended or unintended) developing, our second line of defense is that we can make on-the-fly adjustments.

The thing is, an action isn’t just an isolated action. It sets into motion chains of events that change things forever, even if just subtly at first. And the complexity of things makes it impossible to predict what the results will be even just a few steps down the line.

Competition Is a Good Thing

I feel we have not just a right to compete or to defend ourselves against “aggression” but an obligation, as creatures of this planet, to keep driving evolution. If we all simply “go with the flow”, offering no resistance to a force (of competition or even aggression), eventually everything will settle into a dull static pattern. It’s the resistance offered by the countless interacting things in the world that results in the dynamic system we call life, the action we live in. In note #1, I mention bowing before judo matches.

I get the feeling competition and aggression are dirty words these days. There are strongly driven criminals (of many sorts) who have been publicized for cheating in one way or another, resulting in harm to others. There are bullies. I certainly don’t like cheaters, as that minority of people imposes a cost on all of us: we all go through the inconveniences of the processes meant to hinder their efforts (like TSA screenings and passwords on all the web sites we use). But we cannot lump the sportsman-like competition of commerce in with the criminals and bullies.

Even when we do say “competition”, I think we’re trained to think only of the ridiculous sort on reality TV such as American Idol and Shark Tank. They are competitions, but the kind I’d think of as games, where the rules are pretty well set. To me, a game is something like that, where set rules very much dominate and there are specific criteria. Surely, landing a contract or a job is like one of those shows in that the “judges” (customer or employer) are looking for specific qualities and the winner will be the one who fits those traits the best. We don’t think of the sort of competition that goes hand in hand with a stress that moves things in a different direction, avoiding that dull, lifeless pattern.

What I’m trying to get at (dancing around so as not to sound paranoid), as tongue-in-cheek as I can, is that, yes, “they” are out to get us. Life is an eternal struggle between predator and prey, and most creatures are both to some extent. It’s usually “nothing personal, just business”, as they say. Everyone (every creature) is driven to survive, which in turn means seeking resources that are both consumed by the creature and taken from it. This drive to survive is the churning, live action that fuels evolution. Problems, mostly some form of contention for resources, pop up at every turn. Humans are problem solvers, better at it than anything else on the planet except nature herself, who will still kick our ass in the end.

Keep in mind that “evolution” doesn’t necessarily imply “improvement”, that things are now superior to how they used to be. For example, plants that made it across oceans to the sterile lands of Hawaii initially had thorns, but lost them as there was no longer any need for them in an environment devoid of reptiles and mammals. Is that better or worse? Neither. Evolution means adapting to changing conditions. In the short term, adapting may usually seem like improvement since we are more comfortable after the adaptation.

This mechanism of evolution, simultaneous destruction and creation, is the reason why increasing complexity doesn’t completely destroy our planet. Our planet takes hits, but the force of evolution eventually heals the wound and things become vibrant again, even though they will not heal exactly as they were before. For me, the big lesson is that neither the extreme of resistance to change nor the passivity of going with the flow (no resistance to change) is good. The secret is to blend in with the system, a collection of relationships, in a jujitsu-like manner.

Map Rock’s central unit is strategy. Strategies are to Map Rock as DNA is to the various forms of life. It is through designed strategies that humans excel. For the most part though, strategies don’t come out of a person’s head fully baked. They involve a great deal of trial and error and an ability to recover from error. This is what Map Rock is all about.

Conclusion

From the heart, my primary motivation for developing Map Rock is to fight for “humanness” as variance is efficiently and systematically purged from our lives in the name of optimization. But what else can we do? The way of life as it was a couple hundred years ago with only a few hundred million people isn’t scalable to seven billion. We need to be more efficient and sacrifice some things. However, as I mention whenever I’ve previously described two sides of a coin, neither side is “bad”.

The way we think reflects both our unique sets of individual experiences through our lives and the “harder-coded” functions we more or less share, which served us well when we were still living in caves. Flaws and all, our way of thinking has done pretty well for us. Our imperfections are not just quaint or nostalgic. It’s these imperfections that break the tyranny of myopic logic.

Coming Up:

  • Map Rock Proof of Concept – This blog, following the Problem Statement series, will describe how to assess the need for Map Rock, readiness, a demo, and what a proof-of-concept could look like.

Notes:

  1. I didn’t fully understand bowing to our opponent before a judo match until I was much older. I had taken it as simply a polite gesture or maybe even just telling each other it’s nothing personal. But it’s also a promise to give you my best effort and in all sincerity ask that you do the same so that we do not waste each other’s time. In fact, poor practice doesn’t just waste your time, it makes you worse.

Map Rock Problem Statement – Part 4 of 5

This is Part 4 of 5 of the Map Rock Problem Statement. Strategy, complexity, competition, and the limitations of logic make up the soup that led to humans being as smart as we are, in the way that we are. We’ve obviously done very well for ourselves. However, I feel there is an over-emphasis on speed, simplicity, and control that will essentially lead us to lose these “powers”. The previous installments can be viewed here:

  • Part 1 – Preface to the 5-part series.
  • Part 2 –  I describe Map Rock’s target audience and the primary business scenario for Version 1. It is not just a tool for quants, wonks, and power-users.
  • Part 3 – We delve into a high-level description of the Map Rock software application, where it fits in the current BI framework, and how it differentiates from existing and emerging technologies. This is really the meat of the series.

Map Rock’s Main Approach

We live in a complex world powered by the relentless struggle for survival of all things at all levels (individual, herd/tribe/country, species), each following relatively simple rules, with no top-down control. However, we humans have an ability to manipulate our surroundings to our liking, at least in the short term (seconds to days), by applying logic, which works within restricted space and time. In the moderate term (months to years), we can have our way to a lesser extent through the development and implementation of malleable strategies. Beyond the timeframe of a couple of years, even a seemingly workable prediction is of little use.

Map Rock’s goal is ambitious, to say the least. As I illustrated in the Part 3 section, “How is Map Rock Different?”, it touches so many things. The biggest challenge was to avoid developing a hodge-podge, “chop suey” application analogous to the “Homer”, the car with every feature imaginable designed by Homer Simpson. My approach was to take many steps back to see the common threads tying together all of those things listed in the previous section. Instead of looking for an opportunity to fill an underserved aspect of BI, I wanted to see if there is a way to tie the pieces of the BI world together.

In the end, we want to be smarter, we want to make better decisions. A good place to start is to ask why we humans are smarter than other animals. The world has done very well for a few billion years without BI. Simple rules, followed by trillions of agents, result in the life around us. It’s certainly not just our huge brains, which in themselves are just huge hard drives (albeit with a more intricate structure). But our intelligence involves more complex desires towards success than simply hunger and reproduction. We’ve become symbolic thinkers, which means we can play what-if games in our heads; virtual experiments to test an outcome before we take physically irreversible actions.

At the lowest levels of Map Rock’s design are the notions of rules and troubleshooting. Rules are really about recognizing things, both tangible (like a rock) and intangible (like an impending accident). Troubleshooting is the process of resolving problems: identifying symptoms, recognizing a diagnosis, and applying a treatment.

Troubleshooting isn’t something restricted to “technical” people such as your mechanic or SQL Server performance tuning expert. It’s just the term used by those technical people for “figuring out what’s wrong”, which we all do every day. We’re barely conscious of many of our troubleshooting efforts which can be as mundane as recalling where we left the toothpaste or as complex as figuring out why my poha hasn’t yet fruited.

Identifying symptoms is relatively easy; symptoms are simply recognized sets of attributes, or the answers to relatively simple questions. The biggest challenge with identifying symptoms isn’t the answering of the question itself. It is that maybe we aren’t looking for the right things and/or are looking for the wrong things; in other words, asking the wrong questions. For example, while the amateur investors are looking for solid performance numbers, the professionals are looking for bubbles about to burst. And the right and wrong things are different under different contexts.

After we’ve taken inventory of our situation (identified the symptoms), we can “label” the “situation”, consider it a macro object of its own, which is a diagnosis. Has anyone ever seen this set of symptoms before? Yes. Does it have a name? Hodgkin’s Disease?

If we’re fortunate enough to find that someone else has seen these symptoms, we can leverage their experience by applying a treatment used in those previous cases or at least pick up a few more clues from that previous case. Declaring a diagnosis is also relatively easy, but it’s important to note a couple of additional things about symptoms, the components of a diagnosis. The symptom could itself be the result of a diagnosis, and our certainty about each symptom may not be as plain as day (meaning, it could just be a best guess).

Treatment is the most difficult part. If we’re lucky, what we are treating has happened many times before and has been rectified through a tried and true process. But out in the wild, because life is a complex system, nothing ever happens exactly the same way. Two events  may look very similar, but they are only similar to some extent, not exact. Therefore:

  • This inherently means that a diagnosis, no matter how many times it has worked in the past, is always at the risk of being incorrect. The devil is in the details. If it looks like a duck and quacks like a duck, it may just be a decoy deployed by a hunter.
  • We must also consider the cost for being wrong. This consideration is too often just a side-note since what could go wrong hasn’t yet happened, and therefore doesn’t seem as important as what is happening right now. And, we’re very good at twisting facts in our head or conveniently sweeping facts under the rug to justify (at least in our minds) why we shouldn’t worry about these things.
  • There may be important data points unknown to us that are required to mitigate risk or at least figure out how to deal with the risk. It’s not that we are negligent, but that it’s fair to say no one in their “right mind” would have thought about it.

If we’re facing a completely novel situation, inventing a treatment is usually more involved than simply applying some IF-THEN logic. Even more, we need to be mindful of what could go wrong and “what am I missing”?

There is an elegant process by which our symbolic thinking works that I attempted to implement in my SCL language, a language I developed based on the Prolog AI language attempting to reflect not just the essence of logic, but the distributed nature of knowledge and effort. I discuss it in general terms in my mini-blog, The Four Levels of SCL Intelligence. Map Rock could be thought of as a more specialized user interface for SCL than a more general version I had been working on that I named “Blend” (as in the blending of rules).

At the core of the processes by which we use Map Rock are three main questions:

How are things related? Relationships are the core of how our brains work. As we go through life, the things we encounter together at any moment, our experiences, are recorded in our brains as related. A web of relationships of many types (correlation, attribute-of, etc.) forms the protein molecules (I chose “protein” to convey complexity and variation) of our applied logic.

How are things similar? This question is the basis for metaphor, which is what opens the door to our thought versatility. Metaphor is our ability to recognize something that is merely similar to another thing. A direct match isn’t necessary. The idea is that if something is similar, we can cautiously infer similar behavior. Without this flexible capability, for anything to happen, there would need to be a direct, unambiguous recognition, which is a very brittle system.

What has changed? Noticing change is what engages many mechanisms. All animals respond to change. Birds sit up high looking for the slightest change, movement, in the scene before them. When attempting to troubleshoot a problem, such as a doctor attempting to resolve an issue, one of the first questions after “How can I help you?” is “What has changed recently?”

Strategy

My goal with Map Rock is to put the “I” back into BI. This notion reflects my career’s roots, which began shortly before the AI and Expert System craze of the 1980s. However, the context in which I think of “I” is not the same as truly replicating human intelligence. Back then I was still naïve enough to think that implementation of such concepts was feasible at the time. So maybe I’m a little unfair, since BI was perhaps never really thought of as the corporate version of the sort of software I imagine the CIA must use to facilitate their primary functions. See Note #1. But I’m also referring to moving beyond simply gathering data for analysis by human analysts. As I mentioned in Part 3, I’m after what I call “pragmatic AI”.

With that said, BI has seemed somewhat lackluster to me since the dot-com bust. The “doing more with less” mantra is more about not losing than about winning. We’re also very fearful of failure (and lately even seem to look down on winning). Any mistakes we make become very public and will haunt us forever, as rigid data mining algorithms filter us out based on key words on our record, superficially failing to take into account that we all make mistakes and that the only reliable way to uncover vulnerabilities is through mistakes. It’s one thing to take criminal or even negligent (and that’s a questionable word) action; it’s another to take well-intentioned risks towards the goal of winning fairly.

Every single conscious action we decide to take is calculated within the context of a strategy. A strategy is a path meandering through a web of cause and effect that takes a situation from one presumably undesirable state to another, desired state. A massive web of webs of cause and effect builds in our brains from our experiences. Some “causes” are things we can directly alter, like a volume dial, and some we cannot alter, at least not directly. Some effects are good and some are bad. So all day long we try to hit the good effects and avoid the bad ones through logic, these paths, these cascading links of cause and effect.

On the other hand, subconscious actions (like driving) are not performed in the context of a strategy but in a context of sequences of recognize/react pairs determined through sheer statistical weights. We drive until we hit an “exception”, something out of bounds of the predictive analytics models running in our heads. That engages our thinking and our formulating of strategies.

It’s important to realize as well that “bad” effects are not to be avoided at all cost. Remember, almost every strategy involves costs. Some can be relatively painful. For example, curing cancer usually involves trading in several “merely” major pains in exchange for one severely major pain. This is called “investment” or “sacrifice”. The reason I mention this is because “Scorecards” sometimes fail to illustrate that some “yellow” KPIs (ex: the typical traffic light showing a yellow light) reflect some pain we actually intended to bear as an investment towards a goal. It is not something that should be rectified. This is very similar to how we may be inadvertently subverting nature’s reactions to a sprained ankle by taking measures to bring down the swelling.

Now is a good time to note that immediately after I say “cause and effect”, someone reminds me that “correlation does not necessarily imply causation”. Yet it’s rare to find two or more completely unrelated phenomena that correlate. Usually, strong correlations do share a common driver, even though one may not cause the other. For example, higher levels of disease and mental stress may correlate with higher population densities.

In fact, one of the primary functions of Map Rock is to help assess the probability for causation, part of a procedure I call “Stressing the Correlation”. This procedure, which illuminates and eliminates false positives, includes tests for signs of bias, consistency of the correlation, chronological order of the two events, and identifying common factors.

Please keep in mind that excessive false positives (the TSA screening practically everyone) are the major negative side-effect of avoiding false negatives (missing the true terrorist). At least we can deal with what we see (false positives). One of the major goals of Map Rock is to expose things we wouldn’t think about (false negatives). If we had to choose, I’d say I’d rather deal with excessive false positives than suffer a false negative when the cost of being wrong is extreme.

I’m often told that there are patterns out there, that numbers don’t lie. Yes, nature has done very well for herself without human strategy. And getting data faster is a crucial element of the execution of a strategy. The point is, those patterns work beautifully well as long as you understand that at the level of detail below those patterns, a percentage of things are mercilessly plowed into the field for recycling.

Complicated and Complex Systems

The key to appreciating the value of Map Rock is to recognize the fundamental difference between a complicated system and a complex system. The September 2011 edition of the Harvard Business Review included several very nice articles on “embracing complexity”. Paraphrasing one of the articles, the main reason why our solutions still fail or perhaps work but eventually fall apart is that we apply a solution intended for a complicated system to a problem that is really complex. I think we generally use these terms interchangeably, usually using either term to refer to what is really “complicated”.

Machines and the specific things to which they apply are complicated. A screwdriver, a screw, the hole into which the screw is driven, and the parts it’s fastening are a complicated machine. At the other end of the spectrum of sophistication, even something as sophisticated as a hive of Hadoop servers is still a complicated system. What makes a system complicated and not complex is that we can predict an outcome with a complicated system, even if it takes Newton to run the calculations. We’re able to predict things because all the parts of a complicated system have a specific, tightly-coupled relationship to each other.

The Industrial Revolution is about qualities such as precision, speed, endurance, all of which machines are infinitely better at than humans. We build machines ranging from bicycle tire pumps to semi-conductor fab plants that can output products (air bursts and chips, respectively) with greater productivity than is possible with just the hands of people. Today, we still continue an obsession with optimization of these systems by eliminating all variance (most of which is human error), minimizing waste, especially defects and down time.

This distinction between complicated and complex is incredibly profound when making business decisions because:

We cannot settle for “just one number, please” predictions in a complex system. We can make accurate predictions within a consistent context, and complex systems do not offer a consistent context. For example, we can develop data mining algorithms to predict how to attract a patient into a dental office based on the patterns of that office’s patients. However, that same model will probably fail miserably for a practice in another neighborhood, state, or country. The best we can do is hope that the context changes slowly enough that our models still work to some extent, at least for a while.

Strictly speaking, there are probably no complicated systems in real life. Really, I can’t think of anything on earth that operates in a vacuum. Everything is intertwined in a “Butterfly Effect” way. Even a vacuum, as we generally mean it, is a vacuum only in that it is devoid of matter; it is not devoid of things like gravity and light passing through. Every complicated system I can think of is only an illusion we set up. We draw a box around it and limit our predictions to a limited space and time, hoping that nothing within that time will change enough to affect our predictions.

Figure 1 illustrates how we create the illusion of a closed system. We encase the system (the green circle representing a car) within a layer of protective systems (the blue layer) protecting it from the complexity of the world. I purposefully chose a dark gray background in Figure 1 to convey the notion of an opaque complex world, a “dark-gray” box, not quite a “black box”.


Figure 1 – Complicated systems are closed systems. We create “virtual” closed systems by protecting them through various mechanisms from the complexity of the real world.

Of course, the protective systems cannot protect the car from everything conceivable. They will not protect it from a bridge falling out from under it or from a menacing driver in another car.

The Complexity is Only Getting Worse

A system’s complexity grows with the addition of moving parts and of obstacles which complicate the movement of those moving parts. Following are a few examples of forces adding moving parts, which directly increases complexity:

Globalization. Each country has its own sets of laws, customs, and culture. Working with different sets of rules means we need to be agile and compromise, which complicates our efforts.

Accumulating regulations and Tightening Controls. Constraints only add to complexity. They may not be moving parts, but they act as roadblocks to direct options. There are so many regulations in play (collectively millions of them at all levels of government in the US alone) that most of them must be in conflict with others. I wouldn’t be surprised if we all inadvertently broke some law every day. Ironically, regulations are an attempt to simplify things by removing variance and events that can cause a lot of trouble.

Growing population and affluence. Each person is a moving part of our global economy. More affluence means more people make more decisions with wider scope, are more active, touching more things, whether it’s as a consumer or as a worker.

The number of “smart” devices, which can even be considered semi-autonomous. Each of these devices that make decisions (even though the decisions may be mundane) is also a moving part. See note #2 for an example of one that happened to me today.

Increasing Pace of the world. Even if no moving parts were added, the increasing pace of things adds as much to growing complexity as the number of moving parts. The faster things are going, the more spectacularly they will crash. Not too many things scale linearly and increased load will add complication as different rules for different scales engage.

More demands on us. With more and more regulations and responsibilities foisted upon us, we’re forced to prioritize things, which opens many cans of worms. In essence, prioritization means we are choosing what may not get done or at best will be done half-heartedly, with probably less than minimal effort. That can result in resentment from the folks we threw under the bus or other sorts of things that add more problems. It forces us to spend the least amount of energy and resources possible so we can take on the other tasks. We learn to multi-task, but that may lower the quality of our efforts, at least for the tougher tasks (easy tasks may not degrade in quality of effort).

De-hierarchization/de-centralization of corporate life. Last, but definitely not least. This leads to more and more moving parts as decision control is placed in the hands of more people (and even machines) who are now better trained and tied in through effective collaboration software. However, decentralization is really a good thing: it mitigates, if not removes, bottlenecks, enriches the pool of knowledge from which decisions within the enterprise are made, and drastically improves the agility of a corporation. Decentralization is really the distribution of knowledge across an array of people who can proceed with sophisticated tasks minimally impeded by bottlenecks. See Note #3 for more on this concept and Note #4 on Slime Mold.

Embracing Complexity

When I’m in a room with other engineers brainstorming a solution, we’ll agree that a suggestion is too complicated or complex. We then back away from that suggestion, usually ending up sacrificing feature requests of varying importance to the “nice to have” pile (and never actually getting to them). I have no problem with that at all.

Although I believe many who know me will disagree, I don’t like complicated answers (see Note #5). Complications mean more restrictions, which means brittleness. Complications happen when you try to have your cake and eat it too, which is a different issue from getting too fancy with all sorts of bells and whistles. What I mean is that when we want to accomplish something but there are constraints, we need to include safeguards to protect those constraints, and constraints handicap our options. We can usually engineer some way to have our cake and eat it too, but eventually we will not be able to patch things up and the whole thing blows up.

When I began developing SCL way back when, my thought was how to embrace complexity, tame it, and conjure up ways to deal with the side-effects. The problem is that to truly embrace complexity, we need to be willing to let go of things, and we often have no choice as to which of those things go. But it’s one thing to be a non-self-aware species of bird that goes extinct as species more fit to current circumstances thrive, and another to be a self-aware person fighting for survival among billions of other self-aware beings. In a sense, everyone is a species of one.

I am incredibly far from having the answers. But what I do claim (at least I think I do) is that I have a starting point. It involves decentralizing control to networks of “smarter” information workers acting as a “loosely coupled” system (an approach that works very well for developing complicated software systems). Most important, at least for Map Rock Version 1, is to accept and deal with the limitations of logic.

The Limitations of Logic

Whatever we personally know (the knowledge in our individual heads) consists of the things we’ve experienced: we only know what we know. Obviously, we cannot know things we haven’t experienced personally or that haven’t been conveyed to us (learned directly) through some trusted mechanism (e.g., someone we trust). Everything we know is a set of relationships. For example, an apple is recognized when we see a fruit with the classic shape, smell, color, etc.

That’s all fine and dandy until we attempt to infer new knowledge from our current knowledge, meaning we take what we know, apply logic, and indirectly come up with a new piece of knowledge. How many times have we found something we were sure about to be wrong, and when we figure out what went wrong we say, “I did not know that!” Meaning, had we known that, we would have come to a different conclusion.

Our mastery of the skill of logic relative to other animals is the secret sauce behind our so-called “superiority” over them. However, in inter-human competition, where everyone has this power of logic, one needs superior logical capability as well as superior knowledge from which to draw inferences. Logic is great as we use it to invent ways to outsmart nature (at least for the moment), who isn’t preying on us (nature herself isn’t out to get us). But just as Superman was nothing when facing enemies with the same powers in Superman 2 (General Zod, et al), we need to realize our logic can be purposefully tripped up by our fellow symbolically-thinking creatures. As we wrap a worm around a hook to catch a fish, our own kind does the same to us out in the world of politics and commerce. I wrote about this in my blog, Undermined Predictive Analytics.

The limitations of our beloved logic stem from the fact that we cannot possibly know everything about a system. There is no such thing as perfect information. The complexity of the world means things are constantly changing, immediately rendering much of what we “know” obsolete. However, our saving grace is that for the most part in our everyday world, a system will be stable enough over a limited volume of space and time for something we’ve learned to apply from one minute or day or year to the next.

When I mention this, people usually quickly quip (say that five times), “Garbage in, garbage out”, which entirely misses the point. Of course bad information leads to bad decisions. But even perfectly “correct”, perfectly accurate data points (perfect information) can lead to naïve decisions in the future. The inferences our logical minds make are limited to the relationships accumulated in our brains over the course of our lives: our experiences.

We usually think of things in terms of a complicated system, even if the system is complex, because animal brains evolved to be effective within a limited space and time. That limited space and time is all that’s needed for most creatures just out to make it through another day. Decisions still seem to work because within that limited space and time, underlying conditions can remain relatively static, meaning something that worked two days ago has a good probability of working today. Additionally, the basic interests of humans are relatively stable and thus provide some level of inertia against relentless change, which adds to the probability that what worked yesterday still has a chance to work a year from now.

Our brains evolved to solve problems with a horizon not much further out than the next time we’re hungry. For anything we do, we can pretty much ignore most things outside the immediate problem as just extraneous noise. Thinking long term is unnatural, so we don’t care about any butterfly effect. Thus we really don’t have a genuine sense of space spanning more than what we encounter in our day-to-day lives, nor of time spans much beyond a few human lifespans.

Software Sclerosis

Software Sclerosis – An acute condition of software whereby its ability to be adapted to the changing needs of the domain for which it was written is severely hindered by the scarring from the excessive addition of logic over time.

As the name of my company, Soft Coded Logic, implies, my primary focus is how to write software that can withstand inevitable changes through built-in malleability. I’m not talking about just change requests or new features added in a service pack or “dot release” (like Version 1.5). I’m talking about logic, those IF-THEN rules that are the basis of what “code” is about. Changes are inevitable because we live in a complex world, and logic has readily apparent limitations in a complex world. How can software adjust to rule changes without ending up a victim of “Software Sclerosis”, a patchwork of rules, a domain so brittle that most rules probably contradict something? On the other hand, flexibility can sometimes be “wishy-washy”, which means the software cannot perform as optimally as it could.

Soft-coded logic had always been my passion. I mentioned earlier that I began my software development career in 1979 and was heavily influenced by the Expert System craze of the 1980s. But software projects became modeled under the same paradigms as those used to build a bridge or a building. The bridge or building must fulfill a set of requirements, which beyond functional requirements includes regulatory requirements, a budget, and a timeframe. Software has similar requirements, except that because it is rather ethereal, not as tangible and rigid as a bridge, it is the path of least resistance and so is the natural choice for what must yield when things change. It’s easier to modify software to operate on another operating system than it would be to retrofit a bridge to act as an airplane runway as well.

Short of developing a genuine AI system, one that genuinely learns and adjusts its logic (the latter much harder than the former), we can only build in systems to ameliorate the sclerosis. The problem is that the value of these systems or methods is not readily apparent and just as importantly they weigh the system down when it’s running (not streamlined). So such systems/methods are quickly deemed “nice to haves” and are the first things to be cut in a budget or time crunch.

BI systems are rather rigid too:

  • OLAP cubes reflect a fixed set of data, which means they can pre-aggregate in a predictable manner, thus fulfilling their prime mission of snappy (usually sub-second) query response.
  • Data Marts and Data Warehouses are still based primarily on relational databases which store entities as a fixed set of attributes (tables).
  • “Metadata” still primarily refers to things like the database, table, and field names of an entity attribute, as opposed to the “Semantics” of an attribute.
  • Definitions of calculations and business rules are still hard-coded. The great exception is data mining models, where the selection of factors and their relationships can be automatically updated to reflect new conditions … at least to an extent.
  • Users still mostly consume BI data as pre-authored reports, not through analytics tools – based on the feedback I get about practically any analytics tool being a quant’s tool.
  • Basic concepts such as slowly-changing dimensions are still more of an afterthought.

Technologies I mentioned in the Part 3 topic, “Why is Map Rock Different?”, such as metadata management and predictive analytics, as well as technologies like columnar databases and the Semantic Web, will help to reduce the “plaque of quick fixes” in today’s software. But I hope Map Rock can “surface” the notions of malleability higher up the “stack” to information workers, that is, beyond current Self-Service BI. Developing Map Rock, I did my best to incorporate these things into its DNA while at the same time avoiding the overhead of going “metadata crazy” and, more importantly, developing systems that ameliorate the terrible side-effects of being metadata-driven.

Coming Up:

  • Part 5 – We close the Problem Statement with a discussion on imagination, which is how we overcome the limitations of logic, and how it is incorporated into Map Rock.
  • Map Rock Proof of Concept – This blog, following the Problem Statement series, will describe how to assess the need for Map Rock, readiness, a demo, and what a proof-of-concept could look like.

Notes:

  1. Obviously, I’ve never worked for the CIA because I seriously doubt I’d be able to even publicly suggest what sort of software they use. I would imagine their needs are so unique and secret that their mission-critical applications are home-grown. But then, it’s not like I’ve never been surprised by learning a BI system consists of hundreds of Excel spreadsheets.
  2. Junk mail filters are one of these semi-autonomous decision makers. Today one made a decision that could have profoundly affected my life. It placed a legitimate response regarding a position in which I was sincerely interested into my junk mail. It was indeed a very intriguing position. I don’t usually scan my junk mail carefully, so it could very easily have been deleted. My point is that such semi-autonomous software applications or devices do affect things, adding to the complexity of the world.
  3. Dehierarchization, distribution of decision-making, is very analogous to a crucial design concept in software architecture known as “loosely coupled”. Instead of a monolithic, top-down, controlled software application, functions are delegated to independent components each with the freedom to carry out their function however the programmer wishes and as unobtrusively as possible (plays well with the other components). Each is responsible for its maintenance and improvement. Without this architectural concept, the capabilities of software would be hampered due to the inherent complexity of a centrally controlled system.
  4. Slime mold is one of the most fascinating things out there. You’ve probably seen it in your yard or on a hike at some time and thought it to be some animal vomit. It is a congregation of single-celled creatures that got together into what could be called a single organism. When food is plentiful, these cells live alone and are unnoticed by our naked eye. When food is scarce, they organize into this mass and can even develop mechanisms to hunt for food.
  5. I think engineers are often thought to over-think things and are prone to over-engineering – which I think is a good thing. But it’s often because we are aware of many things that can go wrong even if things may seem great on paper. I believe we also innately realize that there is an underlying simplicity to things and that if something is too hard, it’s often not right. When faced with something I need to engineer, I can only start with the technique I know works best. I may find it’s not good enough (or suspect it’s not good enough), which will lead me to research a better way or I may need to invent a better way. In any case, engineering involves the dimension of time, a limited resource. So we engineers weigh “good enough for now” with “so bad that there must be a better way”.

Polynomial Regression MDX

About a year and a half ago I posted a blog on the value of correlations and the Correlation MDX function titled, Find and Measure Relationships in Your OLAP Cubes. However, the Correlation function calculates only linear relationships, which means that the polynomial nature of many of the juicier relationships out there is somewhat poorly measured.
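For reference, here is a minimal sketch of that linear measurement using the native Correlation function. The cube, hierarchy, and measure names assume the standard AdventureWorks sample cube used later in this post, and in practice you would restrict the month set to a period of interest:

    // Linear correlation (r) between two measures, per product subcategory.
    // A sketch against the AdventureWorks sample cube; empty months will
    // influence the result, so restrict the set as needed.
    WITH MEMBER [Measures].[Linear r] AS
        Correlation(
            [Date].[Calendar].[Month].Members,    // the set of (x, y) points
            [Measures].[Internet Sales Amount],   // y
            [Measures].[Sales Amount]             // x
        )
    SELECT { [Measures].[Linear r] } ON COLUMNS,
           [Product].[Subcategory].[Subcategory].Members ON ROWS
    FROM [Adventure Works];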

Most real relationships change as the values scale to larger or smaller extremes. For example, working sixteen hours per day will not be twice as productive as only eight hours. It’s only natural as hardly anything is ever allowed to grow indefinitely. There are tempering and exacerbating factors, some form of “diminishing return” or on the other hand a snowball effect. Figure 1 illustrates the polynomial (red line) and linear (green line) relationship between the fuel consumption (mpg) and your speed on the freeway (mph). It shows that fuel efficiency rises until it hits a peak at about 45 mph, then declines after about 60 mph, in significant part due to changes in aerodynamics at those higher speeds.


Figure 1 – Miles per Gallon vs Miles per Hour. Sections are linear enough.

The polynomial relationship is very tight with a correlation strength (R2) of .9382. However, the linear relationship shows up rather weak with an R2 of .2782. The polynomial and linear relationships are so different that they actually contradict!  From the graph, it’s easy to see that the polynomial figure makes more sense. The need for polynomial measurement can often be avoided if we stick to a limited range of measure. In Figure 1, the orange circles show that the correlation is pretty linear between 5 and 35 mph and again between about 55 and 75. But from 5 to 75, it follows a fairly tight polynomial curve.

So polynomial relationship calculations are superior, but the problem is that they are more calculation-intensive. And for Analysis Services folks, there isn’t a native MDX function for polynomial regression as there is for linear regression (Correlation, or LinRegR2 for the R2 value). We need to reach back to high school algebra and write out an old-fashioned polynomial (y=ax2+bx+c) for this using calculated measures. The calculations are all pretty simple, mostly just a bunch of SUMing and squaring of x and y in all sorts of manners.
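For reference, the “SUMing and squaring” boils down to the standard least-squares normal equations for a second-order fit, plus the usual definition of R2 (written here in LaTeX notation, where the hatted y is the fitted value and the barred y is the mean of y):

    % Fit y = a x^2 + b x + c over n points (x_i, y_i): solve this 3x3 system for a, b, c.
    \begin{aligned}
      \sum y_i         &= a \sum x_i^{2} + b \sum x_i     + c\,n \\
      \sum x_i y_i     &= a \sum x_i^{3} + b \sum x_i^{2} + c \sum x_i \\
      \sum x_i^{2} y_i &= a \sum x_i^{4} + b \sum x_i^{3} + c \sum x_i^{2}
    \end{aligned}
    % Strength of the fit:
    R^{2} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^{2}}{\sum_i (y_i - \bar{y})^{2}},
    \qquad \hat{y}_i = a x_i^{2} + b x_i + c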

There are certainly tons of other methods, most of them better, for calculating polynomial regressions. However, this method pushes SSAS and MDX about as far as I’d feel comfortable doing, and it still performs fairly well. I should also point out that this blog is focused on finding relationships between combinations of measures (or even members) and doesn’t go to the next step of using it for forecasting (plugging x into the y=ax2+bx+c formula to get y), which is better served by data mining models.

Incidentally, as a side note, the blog I mentioned earlier, Find and Measure Relationships in Your OLAP Cubes, was what inspired me to develop Map Rock. I realized that once one plays with exploring correlations, some of the things you would want to do aren’t straightforward in what were then the typical OLAP browsers. An example is how easily these correlations can supply misinformation, as we saw above. Those issues were the subject of a follow-up blog which I didn’t end up posting, as I realized they would make a pretty neat application.

The MDX

The MDX used to describe this technique is in the form of a SELECT statement that can be downloaded and run in SQL Server Management Studio. Here are a few points before we get into a walk-through of the MDX:

  • The technique is equivalent to Excel’s Polynomial to the 2nd Order Trend Line, as Figure 2 illustrates.
  • Because this is just algebra, I will not go deeply into an explanation of the calculations as it is fairly elementary – although I needed to think it through myself after not seeing the actual equations for so long.
  • This MDX sample uses the AdventureWorks sample cube.
  • I’m using SQL Server 2008 R2 and Excel 2010.


Figure 2 – Excel Trendline Option, Polynomial to the 2nd order.

Figure 3 illustrates where we will end up with this walk-through. It shows the relationship between [Internet Sales Amount] and [Sales Amount] for the Product Subcategory, Shorts.


Figure 3 – Measure correlation for “Shorts”.

The R2 value of .6615 demonstrates moderate correlation, but it is deceptive as there is a clear outlier toward the bottom (Jul-04) that is skewing the result somewhat. I left the outlier in because I can’t stress enough how this technique doesn’t take into account the removal of outliers. Figure 4 shows that removing the outlier yields an almost non-existent correlation.


Figure 4 – Measure correlation for “Shorts” without the outlier. The correlation isn’t good at all without that outlier.

If you’d like to play along, open up an MDX window in SQL Server Management Studio (I’m using SQL Server 2008 R2) and  open this script, which will be described in the following paragraphs.

There are three main sets of calculations in the MDX. Figure 5 shows what I call the parameters. The three parameters mean that we are looking for the relationship between [Internet Sales Amount] and [Sales Amount] based on the months from August 2003 through July 2004. Notice that there is a line commented out for [Internet Tax Amount]. That is to test a different measure for “Y”, [Internet Tax Amount] instead of [Sales Amount]. (If you do try [Internet Tax Amount], you will see a perfect correlation since the tax amount is directly proportional to the sales.)

Figure 5 – “Parameters” of the MDX demonstrating polynomial relationship calculation.

Figure 6 shows some of the intermediate calculations for finding the “a, b, and c” (remember, y=ax2+bx+c) values of the polynomial. Again, it’s just a bunch of squaring and summing, the same old stuff I’m sure people have implemented many times, Excel included. Figure 6 doesn’t show the more important “a, b, and c” calculations because they are rather verbose and I didn’t want to include such a large snapshot.

Figure 6 – Intermediate calculations for determining the a, b, and c values of the polynomial.
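Since Figures 5 and 6 are screenshots, here is a rough sketch of the shape of those “parameter” and intermediate definitions. The member names are mine and the month keys are illustrative; the real definitions are in the downloadable script:

    // Sketch of the parameters (X, Y, month range) and a few of the intermediate sums.
    WITH
      SET [Correlation Months] AS
          // August 2003 through July 2004 (keys shown are illustrative)
          [Date].[Calendar].[Month].&[2003]&[8] : [Date].[Calendar].[Month].&[2004]&[7]
      MEMBER [Measures].[X] AS [Measures].[Internet Sales Amount]
      MEMBER [Measures].[Y] AS [Measures].[Sales Amount]
      // MEMBER [Measures].[Y] AS [Measures].[Internet Tax Amount]   -- the commented-out alternate Y
      MEMBER [Measures].[n]      AS Count([Correlation Months])
      MEMBER [Measures].[SumX]   AS Sum([Correlation Months], [Measures].[X])
      MEMBER [Measures].[SumX2]  AS Sum([Correlation Months], [Measures].[X] ^ 2)
      MEMBER [Measures].[SumX3]  AS Sum([Correlation Months], [Measures].[X] ^ 3)
      MEMBER [Measures].[SumX4]  AS Sum([Correlation Months], [Measures].[X] ^ 4)
      MEMBER [Measures].[SumY]   AS Sum([Correlation Months], [Measures].[Y])
      MEMBER [Measures].[SumXY]  AS Sum([Correlation Months], [Measures].[X] * [Measures].[Y])
      MEMBER [Measures].[SumX2Y] AS Sum([Correlation Months], [Measures].[X] ^ 2 * [Measures].[Y])
      // a, b, and c then come from solving the normal equations shown earlier
      // (e.g., via Cramer's rule), and Relationship (R2) = 1 - SSres / SStot.
    SELECT { [Measures].[n], [Measures].[SumX], [Measures].[SumY],
             [Measures].[SumXY], [Measures].[SumX2Y] } ON COLUMNS,
           [Product].[Subcategory].[Subcategory].Members ON ROWS
    FROM [Adventure Works];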

Figure 7 shows the actual SELECT part of the MDX along with the star of the show, the “Relationship” (R2) calculation. This MDX will show the strength of the correlation between the [Internet Sales Amount] and [Sales Amount] for each product subcategory.

Figure 7 – The business end of the MDX.

Notice as well that there is a line commented out for the customer Gender/Education level. You can try this out after this walkthrough focused on product subcategory.
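As a sanity check while experimenting, the native LinRegR2 function returns the linear version of the same R2 number, so you can compare the linear and polynomial “Relationship” values side by side. Again a sketch with illustrative month keys; the commented lines show the Gender/Education alternative for the rows:

    // Linear R2 per product subcategory over the same month range, for comparison
    // with the polynomial Relationship value from the script.
    WITH
      SET [Correlation Months] AS
          [Date].[Calendar].[Month].&[2003]&[8] : [Date].[Calendar].[Month].&[2004]&[7]
      MEMBER [Measures].[Linear R2] AS
          LinRegR2([Correlation Months],
                   [Measures].[Internet Sales Amount],   // y
                   [Measures].[Sales Amount])            // x
    SELECT { [Measures].[Linear R2] } ON COLUMNS,
           [Product].[Subcategory].[Subcategory].Members ON ROWS
           // CrossJoin([Customer].[Gender].[Gender].Members,
           //           [Customer].[Education].[Education].Members) ON ROWS
    FROM [Adventure Works];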

If you are playing along, I should mention there are two queries in the script, this query and a test query. Be sure to highlight the one you intend to run.

Executing the MDX will yield what is shown in Figure 8 (partial results). The Relationship column shows values from 0 through 1, where 0 is absolutely no correlation and 1 is a perfect correlation.

Figure 8 – Result of the MDX. The “Relationship” column shows the R2 value.

I’ve highlighted (red circle) “Shorts” as the product subcategory we will test. Notice though that there are many quirky values (a null-handling sketch follows this list):

  • Lights and Locks show a value of 1.000, a perfect correlation. However, that’s because all of the values are null.
  • Mountain Frames shows -1.#IND. In this case, the Internet Sales Amount is null for all months, but there are values for Sales Amount.
  • You can’t see it here, but some of the Relationship values will not match Excel. That is because for some of the product subcategories, the values for Jul-04 are null. “Mountain Bikes” are an example.
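All three quirks trace back to null cells. One way to see them, and eventually to sidestep them, is to evaluate the month set inside a calculated member, filtered to months where both measures have values, so the filter respects the current row context. A sketch, with illustrative month keys:

    // Count the months that actually have both an X and a Y value for the current row.
    // The same Filter() can be reused inside the sum calculations so that empty
    // months don't produce the 1.000 or -1.#IND artifacts.
    WITH MEMBER [Measures].[Usable Months] AS
        Count(
            Filter(
                [Date].[Calendar].[Month].&[2003]&[8] : [Date].[Calendar].[Month].&[2004]&[7],
                Not IsEmpty([Measures].[Internet Sales Amount]) AND
                Not IsEmpty([Measures].[Sales Amount])
            )
        )
    SELECT { [Measures].[Usable Months] } ON COLUMNS,
           [Product].[Subcategory].[Subcategory].Members ON ROWS
    FROM [Adventure Works];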

Figure 9 shows the MDX used in the first step to test the Relationship values against what Excel will calculate (as illustrated in Figure 3). Notice that I slice (WHERE clause) to return values for “Shorts”.

Figure 9 – Test the R2 value for shorts.

Figure 10 shows the month by month values used to derive the polynomial used to calculate the relationship strength.

Figure 10 – The Internet Sales Amount and Sales Amount for Shorts by month.

Follow these steps to duplicate what is shown in Figure 3 (well, almost – I did do some cleaning up):

  1. Copy/Paste the entire contents of the Result pane into an Excel spreadsheet.
  2. In the Excel spreadsheet, select just the Internet Sales Amount and Sales Amount columns.
  3. Click on the Insert tab, click the Scatter icon, and select the plain Scatter plot (the one in the upper-left corner).
  4. Right-click on any of the plotted points and select “Add Trendline”.
  5. Select Polynomial and check the “Display R-Squared value on Chart” and “Display Equation on chart” items. Close.

Limitations

Implementing these calculations into the MDX script is easy for the most part. Just add the calculations, setting the appropriate visibility. What will be clumsy is dynamically selecting the measures for X and Y. There isn’t a straightforward way to select two measures from most cube browsers. My only thought right now would be to set up two pseudo measures dimensions where each member is mapped to a real measure (using SCOPE). Then we can select X and Y from those dimensions. That’s a blog in itself.
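To make that idea a bit more concrete, here is a purely hypothetical sketch of the mapping, assuming a utility dimension named [X Measure] (one member per real measure) and a placeholder measure [Measures].[X] have been added to the cube; the assignments would live in the cube’s MDX script:

    // Map the placeholder [Measures].[X] to a real measure for each member of the
    // hypothetical [X Measure] utility dimension (repeat for a [Y Measure] dimension).
    SCOPE ([Measures].[X], [X Measure].[X Measure].[Internet Sales Amount]);
        THIS = [Measures].[Internet Sales Amount];
    END SCOPE;

    SCOPE ([Measures].[X], [X Measure].[X Measure].[Sales Amount]);
        THIS = [Measures].[Sales Amount];
    END SCOPE;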

Additionally, in “If You Give a Mouse a Cookie” fashion, after you begin playing with relationships, you’re going to want to:

  • Drill down to the details of the correlation set, as we did in the example.
  • Select the same hierarchy across rows and columns. For example, we may want to look for the correlation between the [Internet Sales Amount] for each product and the associated ad campaign cost (assuming product-level costs exist).
  • Handle outliers.
  • Have more control over the nuances of the correlation algorithm (from a calculation and performance point of view) than is allowed through MDX.

Those are in fact among the initial thoughts I had a year and a half ago when I first created the Visual Studio 2010 project for Map Rock. Please do take a look at the Map Rock Problem Statement for much more on those thoughts.

From a performance point of view, each cell involves many calculations, so the total number of cell calculations is large. The good news is that the calculations aren’t the sort that generates thousands of “Query Subcube” events. Currently, the MDX is pretty snappy (even on a cold cache), but modifications to handle the quirks I described in the walk-through would have noticeable effects.
