Thoughts Around “Aggregate Fact Tables”

I’ve encountered many situations where a set of complications in a cube model was eased by using an aggregation table, derived from the base fact table, as the measure group’s fact table. The complications leading to that choice usually include expensive on-the-fly calculations, distinct count measures, or even many-to-many relationships.

The main idea is that we avoid the query-time cost by performing the calculations once, capturing the results in a database table, and using that table as the fact table. For example, a calculated measure could be something like a KPI status that compares an employee against the average of the employee’s group. If we were to view a large number of employees, calculating those results could take several seconds under the best conditions, and many more seconds if the SSAS server is under heavy load. Pre-calculating those KPIs for every conceivable way they could be sliced and diced yields very quick responses.
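
As a minimal sketch of what that pre-calculation might look like relationally (all table and column names here are hypothetical, not from an actual model), the status could be computed once in SQL and stored at the employee-by-month grain:

    -- Illustrative sketch: pre-compute a KPI status per employee per month,
    -- comparing each employee's sales to the average for that employee's
    -- department. All object names are hypothetical.
    INSERT INTO FactKpiStatus (EmployeeKey, DepartmentKey, MonthKey, SalesAmount, GroupAvgSales, KpiStatus)
    SELECT
        f.EmployeeKey,
        e.DepartmentKey,
        f.MonthKey,
        SUM(f.SalesAmount) AS SalesAmount,
        AVG(SUM(f.SalesAmount)) OVER (PARTITION BY e.DepartmentKey, f.MonthKey) AS GroupAvgSales,
        CASE WHEN SUM(f.SalesAmount) >=
                  AVG(SUM(f.SalesAmount)) OVER (PARTITION BY e.DepartmentKey, f.MonthKey)
             THEN 1 ELSE -1 END AS KpiStatus
    FROM FactSales AS f
    JOIN DimEmployee AS e ON e.EmployeeKey = f.EmployeeKey
    GROUP BY f.EmployeeKey, e.DepartmentKey, f.MonthKey;

A query sliced by employee and month then reads the stored status instead of recomputing the group average on the fly.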

Another good example of a situation calling for an aggregate table is a many-to-many relationship associating dimensions with large numbers of members, which means the intermediate fact table (the bridge table) can be very large. Under these conditions, performance can be very sluggish. Applying the Many-to-Many Compression Algorithm can result in outstanding improvements (an order of magnitude in some cases), but a pre-aggregated fact table can usually still outperform it. Additionally, there are times the algorithm cannot be effectively applied (see note #1 below).
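
To make that concrete, here is a minimal sketch (with hypothetical object names) of resolving a many-to-many bridge up front so the cube receives a plain, pre-aggregated fact table with the distinct counts already computed:

    -- Illustrative sketch: resolve an account-to-customer bridge up front and
    -- store distinct customer counts per date and account type, so the cube
    -- never has to traverse the huge bridge table at query time.
    INSERT INTO FactCustomerCountAgg (DateKey, AccountTypeKey, DistinctCustomerCount)
    SELECT
        f.DateKey,
        a.AccountTypeKey,
        COUNT(DISTINCT b.CustomerKey) AS DistinctCustomerCount
    FROM FactAccountBalance AS f
    JOIN DimAccount AS a            ON a.AccountKey = f.AccountKey
    JOIN BridgeAccountCustomer AS b ON b.AccountKey = f.AccountKey
    GROUP BY f.DateKey, a.AccountTypeKey;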

There is one really big caveat to this technique, along with some lesser drawbacks: data explosion. I’ve seen base fact tables of a few hundred million rows balloon into aggregation tables of tens of billions or even hundreds of billions of rows. At tens of billions of facts, cube processing can take an exceptionally long time; at hundreds of billions, sometimes days.

Generally, these calculations are not additive measures, which means there needs to be one row for each combination of attributes by which we may want to view data. Essentially, we take over from SSAS the burden of maintaining aggregations and all the complexity that comes with it. It won’t take many iterations of maintaining these tables to appreciate what SSAS does for us in regard to aggregations.
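
To illustrate that combinatorial burden (again with hypothetical names), each attribute combination we expect users to browse has to be materialized explicitly, for example with GROUPING SETS:

    -- Illustrative sketch: because the measure is not additive, each attribute
    -- combination users may browse needs its own pre-computed rows.
    SELECT
        e.DepartmentKey,
        e.EmployeeKey,
        d.MonthKey,
        SUM(f.SalesAmount) AS SalesAmount   -- the non-additive KPI logic would be applied per grouping set
    FROM FactSales AS f
    JOIN DimEmployee AS e ON e.EmployeeKey = f.EmployeeKey
    JOIN DimDate AS d     ON d.DateKey = f.DateKey
    GROUP BY GROUPING SETS (
        (e.DepartmentKey, d.MonthKey),                  -- department by month
        (e.DepartmentKey, e.EmployeeKey, d.MonthKey),   -- employee by month
        (e.DepartmentKey)                               -- department totals
    );

Every attribute added to the list multiplies the number of grouping sets, and rows, we have to maintain by hand.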

In fact, the main point of an OLAP engine is to encapsulate all the ugliness involved in maintaining and optimizing a set of aggregate tables (one for each granularity) so that analytic (SQL) queries can be tremendously sped up. Users opting for this old-school, home-brewed approach to OLAP will soon figure out (to paraphrase the great book, “If You Give a Mouse a Cookie”):

If you create an aggregation table on a fixed set of attributes, you may need more tables for other sets of attributes. Eventually you will have so many that you must set up a mechanism to determine which are most important. Then you will need a mechanism to create them on the fly when one you need doesn’t exist. You will then realize there are many optimizations you can implement, such as bitmaps for quickly determining which tuples do not exist, pre-caching more data than was explicitly asked for, and so on.

I could go on and on, but you get my point. An OLAP engine is the encapsulation of a ton of optimization and maintenance techniques and algorithms around the bread-and-butter analytical patterns (slicing and dicing data in countless ways). That’s a ton of work and refinement, and it will continue to progress. (See my counter-point in note #2 below.) There is a whole lot more going on under the hood than 90% of SSAS practitioners will ever need to know. SSAS is an ingenious engine, and the best developers I’ve ever known (and I know very many) are the ones who designed and developed it.

Usually, the customer can live with the consequences of an aggregation table of ten or twenty billion rows. But I always ask whether it’s likely that other attributes will be added, which would exponentially explode the number of rows. It’s important to remember that BI customers usually think of new things to analyze by once the juices get flowing with whatever is currently in the cube. That’s a very good sign your efforts are paying off for the customer. BI is a learning process; we usually figure out things we didn’t think about as we perform analysis. So this usually ends up not being a scalable solution.
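
To put rough, made-up numbers on that: 10,000 employees × 500 products × 36 months is already a ceiling of 180 million combinations; add a single 100-member attribute, say customer segment, and the ceiling jumps to 18 billion.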

When deadlines and budget overruns are looming, factors such as scalability (people often say in very sarcastic tones, “those ‘abilities’”) can easily take a back seat. It’s easy to convince yourself that another dimension will never be added or there will never be a request to go from three to ten years of history. Future events, no matter how probable and significant, just cannot compete with “clear and present” issues.

Other drawbacks include:

  • Maintenance complexity – each granularity by which a query may be issued needs an MDX SCOPE assignment to fetch the proper tuple (row) and a new SQL INSERT statement to populate that granularity (see the sketch after this list).
  • Limited ability to leverage “block computation” – every query will be resolved cell by cell.
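
For example, here is a hypothetical sketch of the relational half of that chore: a newly requested granularity (department by month, with employees rolled up) gets its own hand-written INSERT, using -1 as an illustrative “All Employees” surrogate key, while the matching MDX SCOPE assignment in the cube script has to be written and maintained alongside it.

    -- Illustrative sketch: populate the new department-by-month granularity.
    -- The surrogate key -1 is a hypothetical "All Employees" member.
    INSERT INTO FactKpiStatus (EmployeeKey, DepartmentKey, MonthKey, SalesAmount, GroupAvgSales, KpiStatus)
    SELECT
        -1 AS EmployeeKey,
        e.DepartmentKey,
        f.MonthKey,
        SUM(f.SalesAmount) AS SalesAmount,
        AVG(SUM(f.SalesAmount)) OVER (PARTITION BY f.MonthKey) AS GroupAvgSales,
        CASE WHEN SUM(f.SalesAmount) >=
                  AVG(SUM(f.SalesAmount)) OVER (PARTITION BY f.MonthKey)
             THEN 1 ELSE -1 END AS KpiStatus
    FROM FactSales AS f
    JOIN DimEmployee AS e ON e.EmployeeKey = f.EmployeeKey
    GROUP BY e.DepartmentKey, f.MonthKey;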

That really isn’t so bad considering the query performance increase. Again, the big issue is the data explosion potential of the aggregated fact table.

As with everything else in SSAS, it’s all an “it depends” question and a complex web of balancing competing factors – a “web of trade-offs”. In general, I try to stick with the “native” SSAS features unless I’m pushed beyond what they can handle, since they are already optimized and could be further optimized in future releases (with little or no coding changes on our part). In this case, what to do is more a matter of “Which risk are you more willing to live with?” than “Which is the better choice?”

Remember, the devil is in the details. No matter how much easier one solution is to code, or how unlikely some catastrophic event may seem, a wrong inference can come down to just one little piece of missing information. SSAS (and many other complex servers) is rife with “we don’t know what we don’t know” situations. It takes a lot of experience to draw a conclusion with a decent chance of not being ill-informed. See my note #3 below.

Sometimes we are backed into a corner and there is no choice. If we’re faced with a many-to-many situation where the compression algorithm doesn’t help, we have no choice but to use the aggregated fact table. If we’ve optimized our calculations and configurations as much as possible and queries are still too slow, we have no choice but to reduce scope or, if we’re lucky, simply scale the hardware up or out.

Of course, it is worth exploring whether the aggregation table’s measure group can be partitioned so that it can be processed incrementally. Chances are fairly good that data will not change prior to some date (actually, that’s the case for any measure group). You can get kind of fancy as well. For example, I once worked through a solution that partitioned the measure group by the products most likely to be affected by a price change (identified with a time series data mining algorithm), since a price change would affect their forecasted values. It worked pretty well, but not great, as about 60% of the partitions still had to be processed each month.
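
As a simple sketch (dates, names, and the partitioning scheme are illustrative), the partitions of the aggregate measure group could be bound to date-bounded source queries so that only the recent slice is reprocessed on each load:

    -- Illustrative partition source queries for the aggregate measure group.
    -- "Historical" partition: processed once, then left alone.
    SELECT * FROM FactKpiStatus WHERE MonthKey <= 201112;

    -- "Current" partition: reprocessed on every load.
    SELECT * FROM FactKpiStatus WHERE MonthKey >= 201201;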

I think a really neat feature for SSAS would be the ability to persist a “warmed cache” – or, even better, to define a cache structure to which values calculated at query time could be added and persisted. This really means persisting the Formula Engine cache and loading it directly back into the Formula Engine (upon startup, or after cache is evicted and then needed again), similarly to how measure group aggregations are defined and loaded into the Storage Engine; in other words, the aggregation equivalent for the Formula Engine.

Semantically, that cache is very similar to those aggregate tables, so it wouldn’t save memory or disk space for exploded tables (the way compression would). But the ability to persist that cache would at least spare us the maintenance chores of modifying the tables and SCOPE statements mentioned above, letting us stick with “native” SSAS features.

Notes:

#1 The many-to-many compression algorithm will not be effective where there is almost a one-to-one relationship between the two sides. A great example is an insurance policy, where a policy holder is related to dependents – generally a parent as the policy holder and the spouse and children as dependents. It is also possible for someone to belong to more than one policy. Therefore, there is a many-to-many relationship: a policy can cover many people, and a person can belong to many policies.

In these cases, there will hardly be any compression, since each family is unique and usually has its own policy – roughly a one-to-one match.

#2 It doesn’t take much SSAS experience to figure out that OLAP doesn’t handle complex calculations as well as it handles simple aggregations such as SUM, MAX, MIN, COUNT, etc. Even the native DISTINCT COUNT has been a pain since it first appeared. This can mean degrading from sub-second response to a few seconds for a small cellset. Of course, that “few seconds” can also be minutes or many minutes if the calculations are bad enough (e.g., calculations that reach down to the leaf levels of big dimensions).

This could be a challenge for OLAP as the term “aggregation” in the OLAP world, driven by the likes of complex event processing, begins to take on a wider meaning encompassing complex scores of many types: predictions, weightings, and such. For example, rather than store a record for every heartbeat over a period of a minute, we simply store a salient fact such as “erratic” or “normal”, reducing 60–80 rows to a few “commentaries”. Imagine how slow our OLAP query would be if we needed to calculate that at query time.
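
A minimal sketch of that kind of reduction, with hypothetical names and a made-up classification threshold:

    -- Illustrative sketch: collapse raw heartbeat readings into one summary
    -- row per patient per minute; the classification threshold is made up.
    INSERT INTO FactHeartbeatSummary (PatientKey, MinuteKey, BeatCount, Commentary)
    SELECT
        PatientKey,
        MinuteKey,
        COUNT(*) AS BeatCount,
        CASE WHEN STDEV(IntervalMs) > 120 THEN 'Erratic' ELSE 'Normal' END AS Commentary
    FROM FactHeartbeat
    GROUP BY PatientKey, MinuteKey;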

If that shift happens, these “aggregate tables” will win out more often, further relegating OLAP to the background.

#3 I’m very careful about quickly dismissing the concerns of my colleagues or customers. BI terminology is so ambiguous that you can be sure you understood a comment when, in fact, a similar-sounding term with a different meaning would have produced an equally sensible statement. “Drill down” versus “drill through” is a good example. “We need the ability to drill down to finer details” is, from an implementation standpoint (maybe not so much a semantic one), very different from “We need the ability to drill through to finer details”. One is a matter of setting up the proper attributes; the other could require writing reports or ASP.NET pages and impact another server.
