Polynomial Regression MDX

About a year and a half ago I posted a blog on the value of correlations and the CORRELATION MDX function titled, Find and Measure Relationships in Your OLAP Cubes. However, the CORRELATION function calculates only linear relationships, which means that the polynomial nature of more of the juicier relationships out there are somewhat poorly measured.

Most real relationships change as the values scale to larger or smaller extremes. For example, working sixteen hours per day will not be twice as productive as only eight hours. It’s only natural as hardly anything is ever allowed to grow indefinitely. There are tempering and exacerbating factors, some form of “diminishing return” or on the other hand a snowball effect. Figure 1 illustrates the polynomial (red line) and linear (green line) relationship between the fuel consumption (mpg) and your speed on the freeway (mph). It shows that fuel efficiency rises until it hits a peak at about 45 mph, then declines after about 60 mph, in significant part due to changes in aerodynamics at those higher speeds.

Miles per Gallon vs Miles Per Hour

Figure 1 – Miles per Gallon vs Miles per Hour. Sections are linear enough.

The polynomial relationship is very tight with a correlation strength (R2) of .9382. However, the linear relationship shows up rather weak with an R2 of .2782. The polynomial and linear relationships are so different that they actually contradict!  From the graph, it’s easy to see that the polynomial figure makes more sense. The need for polynomial measurement can often be avoided if we stick to a limited range of measure. In Figure 1, the orange circles show that the correlation is pretty linear between 5 and 35 mph and again between about 55 and 75. But from 5 to 75, it follows a fairly tight polynomial curve.

So polynomial relationship calculations are superior, but the problem is that they are more calculation-intensive. And for Analysis Services folks, there isn’t a native MDX function for polynomial regression as there is for linear regression (CORRELATE). We need to reach back to high school algebra and write out an old-fashioned polynomial (y=ax2+bx+c) for this using calculated measures. The calculations are all pretty simply, mostly just a bunch of SUMing and squaring x and y in all sorts of manners.

There are certainly tons of other methods, most better, for calculating polynomial regressions. However, this method pushes SSAS and MDX as far as I’d feel comfortable doing as it still performs fairly well. I should also point out that this blog is focused on finding relationships between combinations of measures (or even members) and doesn’t go to the next step of using it for forecasting (plugging in x into the y=ax2+bx+c formula to get y) – which is better served using data mining models.

Incidentally, as a side note, the blog I mentioned earlier, Find and Measure Relationships in Your OLAP Cubes, was what inspired me to develop Map Rock. I realized that once one plays with exploring correlations, some of the things you would want to do aren’t straight-forward in what were then the typical OLAP browsers. An example is how easily prone to supplying misinformation these correlations can be as we saw above. Those issues were the subject of a follow-up blog which I didn’t end up posting as I realized that would make a pretty neat application.

The MDX

The MDX used to describe this technique is in the form of a SELECT statement that can be downloaded and run in SQL Server Management Studio. Here are a few points before we get into a walk-through of the MDX:

  • The technique is equivalent to Excel’s Polynomial to the 2nd Order Trend Line, as Figure 2 illustrates.
  • Because this is just algebra, I will not go deeply into an explanation of the calculations as it is fairly elementary – although I needed to think it through myself after not seeing the actual equations for so long.
  • This MDX sample uses the AdventureWorks sample cube.
  • I’m using SQL Server 2008 R2 and Excel 2010.

Excel Trendline Options.

Figure 2 – Excel Trendline Option, Polynomial to the 2nd order.

Figure 3 illustrates where we will end up with this walk-through. It shows the relationship between [Internet Sales Amount] and [Sales Amount] for the Product Subcategory, Shorts.

Miles per Gallon vs Miles Per Hour

Figure 3 – Measure correlation for “Shorts”.

The R2 value of .6615 demonstrates moderate correlation, but it is deceptive as there is a clear outlier toward the bottom (Jul-04) that is skewing the result somewhat. I left the outlier in because I can’t stress enough how this technique doesn’t take into account the removal of outliers. Figure 4 shows that removing the outlier yields an almost non-existent correlation.

Miles per Gallon vs Miles Per Hour

Figure 4 – Measure correlation for “Shorts without the outlier”. The correlation isn’t good at all without that outlier.

If you’d like to play along, open up an MDX window in SQL Server Management Studio (I’m using SQL Server 2008 R2) and  open this script, which will be described in the following paragraphs.

There are three main sets of calculations in the MDX. Figure 5 shows what I call the parameters. The three parameters mean that we are looking for the relationship between [Internet Sales Amount] and [Sales Amount] based on the months from August 2003 through July 2004. Notice that there is a line commented out for [Internet Tax Amount]. That is to test a different measure for “Y”, [Internet Tax Amount] instead of [Sales Amount]. (If you did try [Internet Tax Amount], you will be a perfect correlation since the tax amount is directly proportional to the sales.)

Figure 5 – “Parameters” of the MDX demonstrating polynomial relationship calculation.

Figure 6 shows some of the intermediate calculations for finding the “a, b, and c” (remember, y=ax2+bx+c) values of the polynomial. Again, it’s just a bunch of squaring and summing, the same old stuff I’m sure people have implemented many times, Excel included. Figure 6 doesn’t show the more important “a, b, and c” calculations because they are rather verbose and I didn’t want to include such a large snapshot.

Figure 6 – Intermediate calculations for determining the a, b, and c values of the polynomial.

Figure 7 shows the actual select part of the MDX along with the star calculation, “Relationship” (R2). This MDX will show the strength of the correlation between the [Internet Sales Amount] and [Sales Amount] for each product subcategory.

Figure 7 – The business end of the MDX.

Notice as well that there is a line commented out for the customer Gender/Education level. You can try this out after this walkthrough focused on product subcategory.

If you are playing along, I should mention there are two queries in the script, this query and a test query. Be sure to highlight the one you intend to run.

Executing the MDX will yield what is shown in Figure 8 (partial results). The Relationship values show a value from 0 through 1, where 0 is absolutely no correlation and 1 is perfect correlation.

Figure 8 – Result of the MDX. The “Relationship” column shows the R2 value.

I’ve highlighted (red circle) “Shorts” as the product subcategory we will test. Notice though that there are many quirky values:

  • Lights and Locks show a value of 1.000, a perfect correlation. However, that’s because all of the values are null.
  • Mountain Frames shows -1.#IND. In this case, the Internet Sales Amount for all months are null, but there are values for Sales Amount.
  • You can’t see it here, but some of the Relationship values will not match Excel. That is because for some of the product subcategories, the values for Jul-04 are null. “Mountain Bikes” are an example.

Figure 9 shows the MDX used in the first step to test the Relationship values against what Excel will calculate (as illustrated in Figure 3). Notice that I slice (WHERE clause) to return values for “Shorts”.

Figure 9 – Test the R2 value for shorts.

Figure 10 shows the month by month values used to derive the polynomial used to calculate the relationship strength.

Figure 10 – The Internet Sales Amount and Sales Amount for Shorts by month.

Follow these steps to duplicate what is shown in Figure 3 (well, almost – I did do some cleaning up):

  1. Copy/Paste the entire contents of the Result pane into an Excel spreadsheet.
  2. In the Excel spreadsheet, select just the Internet Sales Amount and Sales Amount columns.
  3. Click on the Insert tab, click the Scatter icon, and select the plain Scatter plot (the one in the upper-left corner).
  4. Right-click on any of the plotted points and select “Add Trendline”.
  5. Select Polynomial and check the “Display R-Squared value on Chart” and “Display Equation on chart” items. Close.

Limitations

Implementing these calculations into the MDX script is easy for the most part. Just add the calculations, setting the appropriate visibility. What will be clumsy is dynamically selecting the measures for X and Y. There isn’t a straight-forward way to select two measures from most cube browsers. My only thought right now would be to set up two pseudo Measures dimensions where each member is mapped to a real measure (using SCOPE). Then we can select x and y from those dimensions. That’s a blog in itself.

Additionally, in “If You Give a Mouse a Cookie” fashion, after you begin playing with relationships, you’re going to want to:

  • Drill down to the details of the correlation set, as we did in the example.
  • Select the same hierarchy across rows and columns. For example, we may want to cast for the correlation between the [Internet Sales Amount] for each product and the associated ad campaign cost (assuming product-level costs exist).
  • Handle outliers.
  • Have more control over the nuances of the correlation algorithm (from a calculation and performance point of view) than is allowed through MDX.

Those are in fact among the initial thoughts I had a year and a half ago when I first created the Visual Studio 2010 project for Map Rock. Please do take a look at the Map Rock Problem Statement for much more on those thoughts.

From a performance point of view, each cell involves many calculations, so the number of cells calculations are many. The good news is that the calculations aren’t the sort that generates thousands of “Query Subcube” events. Currently, the MDX is pretty snappy (even on cold cache), but modifications to handle the quirks I described in the walk-through would have noticeable effects.

About Eugene

Business Intelligence and Predictive Analytics on the Microsoft BI Stack.
This entry was posted in BI Development, SQL Server Analysis Services and tagged , , , , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s