Anomalies in data can skew analysis, leading to incorrect conclusions and subsequent actions that may be useless or even damaging because they address the wrong problem. Anomalies in data are usually attributed to human data-entry error, or they may be the result of complex ETL calculations gone absurdly wrong (like a NaN value or −2,147,483,648, the minimum 32-bit INT value). Even though manual measurements of things such as vital signs during a physical exam are becoming a thing of the past (electronic devices now link directly to a computer), the devices used to take measurements are still prone to the occasional anomaly. An anomaly could represent a reading taken under very unusual circumstances, or it could represent a bug in the device.
One of the most obvious hints of an anomaly is a value that exceeds the normal range. So someone analyzing data based on these measurements may wish to exclude facts with such anomalous values from her analysis. Removing these anomalies from the analysis without incurring some sort of great cost isn’t as easy as it sounds (I always say the BI world is a zero-sum game where everything is a trade-off). To illustrate, let’s say the measurement in question is a patient’s heart rate during a visit to a hospital. We’ve built a cube with the visits as a fact table along with various dimensions such as demographic attributes and symptoms.
As we slice and dice, say looking at visits for a set of patient age groups by a set of presented symptoms, we may wish to see the average heart rate for each tuple, excluding the visits with heart rates greater than the value at the 98th percentile or less than the value at the 2nd percentile (roughly speaking, for the sake of argument, two standard deviations). For example, if the 18-34 age group that presented a fever includes 1,000 visits (sorted by heart rate), we want to exclude the visits where the heart rate is greater than the heart rate of the visit at position 980 or less than the heart rate of the visit at position 20.
It wouldn’t be good enough to simply remove facts with heart rates outside a specified range from the cube. The average heart rate will be different under different conditions (as specified by slicing and dicing through the cube). Say the average resting heart rate for healthy people is about 70, with values ranging from 50 to 90. For people with fevers, the average may be higher, around 80. Simply removing any facts with heart rates less than 50 or greater than 90 will not correctly reflect the average heart rate of facts meeting particular conditions.
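To make the distinction concrete, here is a small Python illustration (the readings are made up) of why a fixed “valid range” filter and per-slice percentile trimming give different answers:

```python
# Illustrative only (made-up readings): a fixed global cutoff vs. trimming
# the tails of each slice by position.
def trimmed_avg(values, lower=0.1, upper=0.9):
    """Average after dropping values below the lower-percentile position
    and above the upper-percentile position (10%/90% here for a small set)."""
    ordered = sorted(values)
    n = len(ordered)
    kept = ordered[int(n * lower):int(n * upper)]
    return sum(kept) / len(kept)

# Feverish patients: 91 and 95 bpm are legitimate for this group,
# while 40 and 200 are anomalies.
fever = [75, 78, 80, 82, 85, 88, 91, 95, 40, 200]

# A fixed 50-90 cutoff wrongly discards the valid 91 and 95 readings.
fixed = [v for v in fever if 50 <= v <= 90]
print(sum(fixed) / len(fixed))   # 81.33... understates the feverish group
print(trimmed_avg(fever))        # 84.25: trims only this slice's own tails
```

The fixed cutoff throws away legitimate high readings from the feverish slice, while trimming by position adapts to whatever slice is being examined.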
Overview of the Algorithm
The algorithm that is the subject of this post applies to non-additive values such as unit prices of many sorts, measurements such as heart rates or cholesterol levels, or analog configuration settings. Such measures are not additive; for example, it makes no sense to offer the sum of all heart rates for an age group.
The most straightforward (but terribly inefficient) approach for incorporating this capability in an OLAP cube (for the heart rate at visit example) is this procedure:
- For each tuple, create a set of leaves (lowest granularity cells), which essentially is a set at the fact level. This set can potentially include as many items as there are facts.
- ORDER the set in ascending order by heart rate.
- Find how many visits comprise the top 2% of rows and the bottom 2% of rows.
- Get the first 98% of the set using the HEAD function, which removes the top 2% of rows (the highest heart rates).
- Shave off the bottom 2% of rows using the TAIL function.
- Use the AVG function to get the average heart rate of that trimmed set.
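Sketched in Python rather than MDX (the names are illustrative), the naive procedure amounts to:

```python
# A plain-Python sketch of the naive leaf-level procedure above: materialize
# every fact, sort, trim each tail, and average what remains.
def naive_trimmed_avg(heart_rates, pct=0.02):
    ordered = sorted(heart_rates)            # ORDER the set ascending
    n_trim = int(len(ordered) * pct)         # rows in each 2% tail
    head = ordered[:len(ordered) - n_trim]   # HEAD: drop the top 2% of rows
    trimmed = head[n_trim:]                  # TAIL: shave off the bottom 2%
    return sum(trimmed) / len(trimmed)       # AVG of the trimmed set
```

The sketch makes the cost obvious: `heart_rates` holds every leaf-level fact, which is exactly what makes this approach so slow in a cube.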
The problem with this approach, as mentioned in Step 1 above, is that there could be an incredible number of items in the set of leaves. This calculation could take many seconds or even many minutes to resolve since it reads at the fact level. A better solution centers around a hidden dimension that under normal usage would have very little analytical value, a Count measure, and a few hidden calculated measures, which include the use of the dreaded FILTER function. It can also involve a neat recursive calculation for optionally providing the values at the 2nd and 98th percentile marks.
I came up with this better technique a few years ago while helping a healthcare-focused ISV with a very similar problem (something more exotic than heart rates). I later adapted the technique to AdventureWorksDW for calculating the average Unit Price without the outlier values and incorporated it into my MDX training workshop. However, it is a very advanced technique, and only one class ever got far enough to use it as a live exercise. The need recently came up again, so I had to dig this up after not thinking about it for a couple of years. I was very pleased it came up, as it means the sophistication of the customer’s analysis is improving.
In a nutshell, the idea is to create an inexpensive way to obtain a compressed list of sales transactions by distinct unit price in the fact table, then determine at which unit price the top and bottom 2% of sales transactions begin. Figure 1 shows the end result of these efforts, slicing by product category. Note, though, that for the purposes of this example I’m using the 10th and 90th percentiles instead of the 2nd and 98th as the cutoffs, since the data in AdventureWorksDW doesn’t have many different unit prices (42 unique prices).
Figure 1 – Percentile prices and positions.
The last column of Figure 1 shows the number of sales transactions for each product category. For example, the Accessories category has 36,092 sales transactions. The second column (Sales Count At 10 %tile) shows how far down the ordered list of sales transactions the 10 %tile mark falls. For Accessories, the 3,609th transaction is at the 10 %tile mark. The third column (Unit Price At 10 %tile) shows the unit price of that 3,609th sales transaction. The fourth and fifth columns show the corresponding data for the 90 %tile.
Figure 2 verifies this. It shows the count of sales transactions at each unit price for Accessories. The cumulative count of the prices below 3.99 is 249+328+733+1,396 = 2,706, which is less than 3,609, so the 3,609th transaction falls in the 3.99 bucket. Likewise, the 3,609th transaction from the other end of the list has a unit price of 34.99.
Figure 2 – Where is the 10 percentile mark?
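The lookup that Figure 2 illustrates can be sketched in Python. Only the four counts (249, 328, 733, 1,396) and the 3,609 position come from the example above; the lower prices and the count at 3.99 are made up for illustration:

```python
# Find the unit price whose cumulative transaction count first reaches the
# percentile position, given distinct prices sorted ascending with counts.
def price_at_position(price_counts, position):
    """price_counts: list of (unit_price, sales_count), ascending by price."""
    running = 0
    for price, count in price_counts:
        running += count
        if running >= position:   # this price's bucket holds the Nth transaction
            return price
    return price_counts[-1][0]

# Counts 249/328/733/1,396 are from the article; the prices below 3.99 and
# the 1,500 count at 3.99 are hypothetical stand-ins.
accessories = [(2.29, 249), (2.99, 328), (3.49, 733), (3.79, 1396), (3.99, 1500)]
print(price_at_position(accessories, 3609))   # 3.99
```

The cumulative counts run 249, 577, 1,310, 2,706, 4,206, so the 3,609th transaction lands in the 3.99 bucket, matching Figure 2.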
Figure 3 shows the average unit price of Accessories without the outlier values of 2.99, 35, 54.99, 120, and 159 and the average unit price overall.
Figure 3 – Prices by category without outliers.
The first step is to create a new dimension of the unique Unit Prices from the FactInternetSales fact table (InternetSales measure group) of AdventureWorksDW. Figure 4 shows this dimension which has an attribute named [Unit Price] (with [FactInternetSales].[UnitPrice] as the key column) as its dimension key. So each member of the [Unit Price] attribute is a unique value of the [UnitPrice] column of the [FactInternetSales] table.
Figure 4 – Unit Price dimension.
The [Unit Price Iterate] attribute is set up exactly like the [Unit Price] attribute. Be sure to check that its OrderBy property is set to Key as well.
Other important property settings for the [Unit Price] and [Unit Price Iterate] attributes include:
- The OrderBy property is set to Key so that the set of unit prices is in ascending order. An important component of the algorithm is that the members are sorted.
- You should set the AttributeHierarchyVisible properties to False since slicing and dicing these attributes by the users is probably not useful. With all attributes of the [Unit Price] dimension set to be hidden, the dimension itself will not display in the OLAP browser.
Add the [Unit Price] dimension to the [AdventureWorks] cube. It should automatically link correctly to the FactInternetSales measure group via the [UnitPrice] column as shown in Figure 5.
Figure 5 – Dimension usage for the Unit Price dimension.
The final step is to add the necessary calculations to the MDX script.
First, we need to know which positions in the set of sales transactions mark the 10 and 90 %tiles. Figure 6 shows two calculations for these values. They simply multiply the sales transaction count by .1 and .9, respectively.
Figure 6 – MDX calculation for Sales Count at the 10 percentile mark.
Figure 7 shows the calculation that is the end product of all this nonsense. It divides the sum of all unit prices for sales transactions excluding the ones with unit prices outside the selected %tile ranges by the number of sales transactions also excluding the ones with unit prices outside the selected %tile ranges.
Figure 7 – MDX calculation for average unit price without outliers.
The transactions outside the %tile range are FILTERed out by iterating through the unit prices (from smallest to largest) and excluding the [Unit Price] members whose cumulative number of sales transactions is less than (for the 10 %tile) or greater than (for the 90 %tile) the number of sales transactions that comprise the respective %tile. The key is that we get the number of sales transactions for each unit price not by counting what could be millions of sales facts at query time, but by summing the aggregation of sales transaction counts for each unit price. Because [Fact Internet Sales Count] is an additive measure, this should be relatively quick if the aggregations include this measure.
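Sketched outside of MDX, the FILTER-based calculation boils down to something like this (Python, with names echoing the calculated measures; the boundary handling is approximate):

```python
# Trimmed weighted average from the compressed (price, count) list, rather
# than from millions of raw facts. Keep only the prices whose running count
# falls between the 10 %tile and 90 %tile positions.
def avg_without_outliers(price_counts, lower_pct=0.1, upper_pct=0.9):
    total = sum(c for _, c in price_counts)   # like [Fact Internet Sales Count]
    lo_pos = total * lower_pct                # like [Sales Count At 10 %tile]
    hi_pos = total * upper_pct                # like [Sales Count At 90 %tile]
    running = kept_amount = kept_count = 0
    for price, count in price_counts:         # ascending by price
        running += count                      # like [RunningCount]
        if lo_pos <= running <= hi_pos:       # the FILTER condition
            kept_amount += price * count      # sum of unit prices of kept facts
            kept_count += count               # count of kept facts
    return kept_amount / kept_count
```

Note that the loop touches only one row per distinct unit price; that compression is the entire point of the technique.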
Normally, I avoid the FILTER function like the plague, but in this case, there are relatively few members of [Unit Price] and it still beats going to the leaf level by a long shot.
Figure 8 shows the supporting calculations. [Unit Price Amount] is simply the member name of the [Unit Price] attribute cast as a “double” (numeric) value. [Rank] obtains the ordinal position of the [Unit Price Iterate] member that the FILTER function is on so that the [RunningCount] calculation knows where to SUM to in the set of [Unit Price] members.
Figure 8 – Supporting MDX calculations.
Figures 9 and 10 show a set of optional calculations that return the unit price at the 10 percentile mark. It is a recursive calculation that is triggered by [Unit Price at 10 %tile] (Figure 9) and calls [Unit Price at 10 %tile Recursive] (Figure 10). The important difference between the two is that the former kicks off the recursive calculation starting with the first member of the [Unit Price] attribute and the latter picks up from that member. The gist of this recursive calculation is that it iterates through the set of unit prices until the running count of sales transactions is greater than or equal to the sales transaction count at the 10 %tile mark ([Sales Count At 10 %tile]). If it’s less, it continues to iterate by attempting to get the value for the next unit price by calling the recursive [Unit Price At 10 %tile Recursive] calculation.
Figure 9 – Starting MDX Calculation for Unit Price at 10 percentile.
Figure 10 – Recursive part of MDX calculation for Unit Price at 10 percentile.
[Unit Price At 10 %tile Recursive] will then call itself until the [RunningCount] at the current [Unit Price Iterate] member is greater than or equal to what it should be at the 10 %tile mark. At that point, it returns [Unit Price Amount] and pops up the recursion stack.
A set of calculations to return the unit price at the 90 %tile mark involves the exact same calculations as for the unit price at the 10 %tile mark, except referring to [Sales Count At 90 %tile] rather than [Sales Count At 10 %tile].
The important thing to note about utilizing the recursive technique is that it offers a way to iterate through a set and “break” out of the iteration like the break keyword used in foreach or while loops. In this case, we iterate through the set of unit prices (using the [Unit Price Iterate] attribute) until the running sum of the sales counts matches or exceeds the number of sales transactions at the 10 percentile mark.
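The recursion is analogous to this Python function (an analogy only; the actual MDX is in Figures 9 and 10). It walks the sorted price list and “breaks” by returning as soon as the running count reaches the target position:

```python
# Recursive walk down the sorted (price, count) list, mirroring
# [Unit Price At 10 %tile Recursive]: stop recursing ("break") as soon as
# the running count reaches the target position, then unwind the stack.
def unit_price_at_pctile(price_counts, target, index=0, running=0):
    price, count = price_counts[index]
    running += count
    if running >= target:        # the "break": we've reached the %tile mark
        return price             # pops back up the recursion stack
    return unit_price_at_pctile(price_counts, target, index + 1, running)
```

As in the MDX, the first call plays the role of [Unit Price At 10 %tile], kicking things off from the first member, and each recursive call picks up from the next member.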
Most of these calculations should probably be hidden to shield users from the complexity. I would think only [Avg Unit Price wo Outliers] and/or the optional [Unit Price At 10 %tile] and [Unit Price At 90 %tile] would be exposed to end users.
The drawbacks of this technique are:
- There is a WITH TIES problem. For example, what if the unit price marking where the top 2% of sales transactions starts includes many transactions? That 2% mark could actually account for the top 10% or more. If this is the case, the cutoffs should perhaps be more like .1% and 99.9% instead.
- It could still be slow compared to dealing with well-aggregated, additive measures. But it should be orders of magnitude faster than going to the leaf levels.
- Some scenarios, such as stock prices, may have a great many unique prices, ranging from 1 cent to around $100,000 (for Berkshire Hathaway) in intervals of a cent. In these cases, perhaps the values could be rounded to reduce the number of members: to the nearest dollar for anything over $10, the nearest $100 for anything over $10,000, etc. That would involve creating a named calculation in the fact table that includes such logic and using that named calculation as the dimension key. Incidentally, we can create a third “Unit Price” attribute that discretizes the unit prices into buckets in order to break up products by “Costly Things”, “Low Priced Things”, etc. Unlike the hidden attributes used to implement the algorithm, this one would be analytically useful and exposed to the users.
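A minimal sketch of one such rounding rule (the thresholds here are arbitrary choices for illustration, not from any cube; in SSAS this logic would live in the named calculation used as the dimension key):

```python
# Hypothetical rounding rule to tame a high-cardinality price column:
# round to the cent under $10, to the dollar up to $10,000, and to the
# nearest $100 above that.
def bucket_price(price):
    if price < 10:
        return round(price, 2)
    if price < 10_000:
        return float(round(price))
    return float(round(price / 100) * 100)
```

Coarser buckets at the high end cost little accuracy there while keeping the member count of the hidden dimension manageable.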
Additionally, while not a drawback of the algorithm, I should mention that the AdventureWorksDW cube doesn’t present a good example for the use of this algorithm as it doesn’t include erroneous unit prices. So, nothing interesting will actually show up. In fact, most cubes probably don’t include anomalies as they are filtered out as part of an ETL cleansing process. In the same vein, I should advise that there is a difference between anomalies and errors. Values that can be identified as errors certainly should be filtered out. It’s just that sometimes it’s hard to tell through simple rules which is which.
Exceeding a range of values shouldn’t be taken as a sole red flag. Such anomaly values could indeed be correct and an indicator of important things to come, whether they are impending “black swan” events, growing “weak links”, or “tipping points” that could drastically undermine a business model. It’s for these reasons that it can be a good idea not to filter out facts with such values at the ETL level with such a simple rule.
Here is a good example of why not to filter out outliers in the ETL process: Yesterday I was listening to BSU lose their chance at a BCS game and even the lowly Mountain West championship. The announcers were talking about an offense that had just a few yards rushing, but it was important to point out how misleading that statistic is in that it included three plays with big losses (outliers). I thought that if I were helping the coach pick a play, I wouldn’t be able to just throw out those “outliers”. They expose a huge problem that would result in a double whammy: an incorrect interpretation of the low rushing yardage leading to the wrong efforts to fix it, and ignoring the real problem, allowing it to continue happening.
A better way to define an anomaly is to consider what was expected versus the actual value. A shockingly off value is certainly a sign of something to investigate. Of course, that’s laughably much easier said than done. However, this needs to be addressed as opposed to retreating back to some sort of “lowest common denominator” solution – one that can be easily implemented and that most people can understand.
That is a big subject and essentially the purpose for the application I’ve been building that I’ve named Map Rock. To me, Map Rock is the Holy Grail of BI that I’ve been trying to build for the past ten years; usually overshooting the maturity of the BI market. More on Map Rock coming up …
Figure 11 – Map Rock logo.