Snags of the Week – January 9, 2010

Within my 60 or so hours per week of work, using a wide range of BI-related technologies at a depth that pushes my brain to its limits, I run into my share of snags. Some are genuinely crazy things I never would have guessed, while others leave me wondering how I didn’t already know that. The fact that BI is about gaining an all-encompassing view from a wide array of data sources means a BI developer wears many diverse hats, which means there are many ways to trip up in that tangled web as I page things in and out of the limited RAM in my brain. So I thought I’d try to write up my favorite snags every week or two or three in a short, painless blog.
 
I don’t know that I’ve come up with the best solutions to these snags, as I’ve usually spent way more time than I wanted on them and cannot justify researching a solution beyond "good enough" under the time and money constraints. However, I hope to at least provide a couple of ideas and maybe hear something back from folks out there on better ways.
 
I hope to keep up this series of blogs as I do run into some neat things. But it does take a significant amount of time to write …
 
CHECKIDENT with DELETE and TRUNCATE
 
I wrote a little SQL Server script last week that initializes a new data warehouse and seeds it with some reference data. Because it "initializes", I delete everything from many tables. Normally, I use TRUNCATE rather than DELETE since I am deleting everything and TRUNCATE is much more efficient at that. But for some reason, this time I decided to use DELETE on a couple of the tables, and it revealed a SQL Server quirk I had not known about: the DBCC CHECKIDENT command for reseeding an IDENTITY column behaves differently after a DELETE than after a TRUNCATE.
 
To demonstrate this, consider this script, which creates a new table with an IDENTITY column, inserts a couple of rows, and shows the results; then shows the results again after a TRUNCATE, RESEED, and re-INSERT; and finally after a DELETE, RESEED, and re-INSERT:
 

CREATE TABLE dbo.IdentTest ([ID] INT IDENTITY(1,1),[Description] VARCHAR(20))

INSERT INTO dbo.IdentTest ([Description]) VALUES('Item 1')
INSERT INTO dbo.IdentTest ([Description]) VALUES('Item 2')

--Return rows after the initial INSERTs.
SELECT * FROM dbo.IdentTest

TRUNCATE TABLE dbo.IdentTest
DBCC CHECKIDENT ('dbo.IdentTest', RESEED, 1)

INSERT INTO dbo.IdentTest ([Description]) VALUES('Item 1')
INSERT INTO dbo.IdentTest ([Description]) VALUES('Item 2')

--Return rows after TRUNCATE and RESEED=1.
SELECT * FROM dbo.IdentTest

DELETE FROM dbo.IdentTest
DBCC CHECKIDENT ('dbo.IdentTest', RESEED, 1)

INSERT INTO dbo.IdentTest ([Description]) VALUES('Item 1')
INSERT INTO dbo.IdentTest ([Description]) VALUES('Item 2')

--Return rows after DELETE and RESEED=1.
SELECT * FROM dbo.IdentTest

 

The results of the three SELECTs come out roughly like this:
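
First SELECT (after the initial INSERTs):
ID   Description
1    Item 1
2    Item 2

Second SELECT (after TRUNCATE and RESEED = 1):
ID   Description
1    Item 1
2    Item 2

Third SELECT (after DELETE and RESEED = 1):
ID   Description
2    Item 1
3    Item 2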

 
 
The result of the first SELECT shows the ID starting at 1, as we would expect. The second shows the results after truncating the table and reseeding the IDENTITY column with 1. That also started at 1, as expected, and since I had always used TRUNCATE when I really wanted to remove all rows, that’s all I’d ever seen. However, the third result shows what happened after using DELETE to remove all rows: the IDENTITY column starts from 2, which is the reseed value plus 1. You could alternatively RESEED with 0 when using DELETE (sketched below), but I would just use TRUNCATE.
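For completeness, here is a minimal sketch of that RESEED-with-0 alternative against the same dbo.IdentTest table; after a DELETE, reseeding to 0 makes the next inserted row get ID 1 again:

DELETE FROM dbo.IdentTest

--After a DELETE, RESEED sets the current identity value, so the next
--INSERT gets 0 + 1 = 1.
DBCC CHECKIDENT ('dbo.IdentTest', RESEED, 0)

INSERT INTO dbo.IdentTest ([Description]) VALUES('Item 1')  --gets ID = 1
SELECT * FROM dbo.IdentTest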
 
I think I decided to use DELETE because I originally didn’t want to remove all of the data upon initialization. I was trying to make this script smarter by not just doing an all-out initialization, but optionally something in between.
 
Missing Columns Using MDX in a TSQL OPENQUERY
 
I had a "turning of the year" (Y2K-like) problem this past week. It appeared in a script that I’d written last Fall that performs about thirty SQL statements using the results of an OPENQUERY function which issues an MDX statement. All but two of these queries return roughly the same rows (the ON ROWS axis of the MDX statement). Those queries take a relatively long time to process (in the minutes) as they perform crossjoins and incorporate many "rules" (complex calculations and SCOPEs in the MDX script). At the time I wrote the script, my focus was completely on optimization. It seemed at the time taking care of a problem that shouldn’t was very unlikely to happen (and if it did, not until later) placed it low on the priority list … which we all know means it will probably fade away until it bites me in the ass someday.
 
The problem I put off arises when the MDX inside the OPENQUERY returns no rows. If no rows are generated, the only "columns" returned by the flattened rowset of the MDX are those on the COLUMNS axis; the flattened rowset will not include columns for the attributes on the ROWS axis. Since my outer SQL specified the columns by name, the empty rowset I got last week resulted in a SQL Server error complaining about invalid columns.
 
To illustrate, consider a situation where a group of marketers are using a planning application. One of its features is to return a list of the most profitable customers in a particular state. However, we also want to see their total sales across all geographies in order to assess how marketing efforts may cannibalize sales elsewhere.
 
This MDX, based on [Adventure Works 2008 DW], is such a query:
 

SELECT
      {[Measures].[Internet Sales Amount],[Measures].[Internet Gross Profit]} ON COLUMNS,
      FILTER
      (
            ORDER
            (
                  [Customer].[Customer].[Customer].Members,[Measures].[Internet Gross Profit],DESC
            ),
            ([Measures].[Internet Gross Profit],[Geography].[State-Province].&[CA]&[US])>1000
      ) ON ROWS
FROM
      [Adventure Works]

 
However, because the application combines data from a relational database with the OLAP data, this MDX is called from a SQL Server stored procedure, with a SQL statement that executes the MDX through the OPENQUERY function:
 

SELECT
      CAST([[Customer]].[Customer]].[Customer]].[MEMBER_CAPTION]]] as NVARCHAR(50)) as [Customer],
      CAST([[Measures]].[Internet Sales Amount]]] AS FLOAT) as [Internet Sales Amount],
      CAST([[Measures]].[Internet Gross Profit]]] AS FLOAT) as [Internet Gross Profit]
FROM OPENQUERY(AdventureWorksDW,
'SELECT
      {[Measures].[Internet Sales Amount],[Measures].[Internet Gross Profit]} ON COLUMNS,
      FILTER
      (
            ORDER
            (
                  [Customer].[Customer].[Customer].Members,[Measures].[Internet Gross Profit],DESC
            ),
            ([Measures].[Internet Gross Profit],[Geography].[State-Province].&[CA]&[US])>1000
      ) ON ROWS
FROM
      [Adventure Works]')

The result of that query is one row per qualifying customer, with the reformatted column names and cast data types from the SQL above.

 
Notice that in the SQL part of the query I specify the columns rather than simply using an asterisk to return all columns. There are a few reasons for this:
 
  1. I can reformat the column names, which would otherwise be those ugly bracket-laden names you see when using an asterisk instead of specifying the columns. (By the way, run it with the asterisk first so you can see the exact names of the columns, as sketched just after this list.)
  2. The DBAs at this customer site do not like asterisks because they can mask problems that show up later.
  3. I can cast the column types to my needs.
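For example, this is the asterisk version of the same query; a quick way (just a sketch) to discover the exact flattened-rowset column names before writing the outer column list:

SELECT *
FROM OPENQUERY(AdventureWorksDW,
'SELECT
      {[Measures].[Internet Sales Amount],[Measures].[Internet Gross Profit]} ON COLUMNS,
      FILTER
      (
            ORDER
            (
                  [Customer].[Customer].[Customer].Members,[Measures].[Internet Gross Profit],DESC
            ),
            ([Measures].[Internet Gross Profit],[Geography].[State-Province].&[CA]&[US])>1000
      ) ON ROWS
FROM
      [Adventure Works]')
--The column names come back like:
--  [Customer].[Customer].[Customer].[MEMBER_CAPTION]
--  [Measures].[Internet Sales Amount]
--  [Measures].[Internet Gross Profit]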
However, if no rows are returned from the MDX, I will receive this error:
 

Msg 207, Level 16, State 1, Line 2
Invalid column name '[Customer].[Customer].[Customer].[MEMBER_CAPTION]'.

You can try this by changing this part of the query above from:
 

    ([Measures].[Internet Gross Profit],[Geography].[State-Province].&[CA]&[US])>1000

 

to
 

    ([Measures].[Internet Gross Profit],[Geography].[State-Province].&[CA]&[US])>1000000

 
which will not return any rows. The reason is that there are no customers with gross profit in CA of over $1,000,000, so the ROWS axis resolves to an empty set. Because it’s an empty set, the SQL will not have any metadata on the dimension attributes comprising the ROWS axis; it will only see the members of the COLUMNS axis. I did try to figure out whether there is some little-known way to force the MDX to return metadata on the dimension attributes (at least the column names), but I didn’t find anything … at least in the time I thought was reasonable to search.
 
There are several ways I thought of to address this problem, but I went with running a minimal version of the MDX in a prior query to test how many rows are returned. In this prior query, I use an asterisk in the SQL part to avoid getting the error:
 

DECLARE @Rows BIGINT
SET @Rows=(
      SELECT COUNT(*)
      FROM (
            SELECT *
            FROM OPENQUERY(AdventureWorksDW,
            'WITH MEMBER [Measures].[Filler] as 1
            SELECT
                  {[Measures].[Filler]} ON COLUMNS,
                  TOPCOUNT(
                        FILTER
                        (
                              [Customer].[Customer].[Customer].Members,
                              ([Measures].[Internet Gross Profit],[Geography].[State-Province].&[CA]&[US])>1000
                        )
                        ,1) ON ROWS
            FROM
                  [Adventure Works]')
            ) as t
      )

 

The query sets the @Rows variable to the number of rows returned. This value can be used to avoid running the real query, or at least to branch around the error (a minimal sketch of that follows the list below). Notice that I tried to minimize the MDX as much as possible:
 
  • Removed the ORDER function. We only care whether any customers have more than $1,000 in gross profit in CA.
  • Removed the measures from the COLUMNS axis and replaced them with a calculated measure with a constant value of 1. This avoids fetching any real cell values.
  • Used TOPCOUNT to avoid rendering a bunch of cells. All we need to know is whether there are any rows at all.
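For illustration, here is a minimal sketch of how @Rows might gate the full query; the ELSE branch, which returns an empty result with the expected columns, is just one possible way to handle the no-rows case:

IF @Rows > 0
BEGIN
      --The ROWS axis is not empty, so the flattened rowset will include the
      --customer column and the outer SELECT will not fail with "Invalid column name".
      SELECT
            CAST([[Customer]].[Customer]].[Customer]].[MEMBER_CAPTION]]] as NVARCHAR(50)) as [Customer],
            CAST([[Measures]].[Internet Sales Amount]]] AS FLOAT) as [Internet Sales Amount],
            CAST([[Measures]].[Internet Gross Profit]]] AS FLOAT) as [Internet Gross Profit]
      FROM OPENQUERY(AdventureWorksDW,
      'SELECT
            {[Measures].[Internet Sales Amount],[Measures].[Internet Gross Profit]} ON COLUMNS,
            FILTER
            (
                  ORDER
                  (
                        [Customer].[Customer].[Customer].Members,[Measures].[Internet Gross Profit],DESC
                  ),
                  ([Measures].[Internet Gross Profit],[Geography].[State-Province].&[CA]&[US])>1000
            ) ON ROWS
      FROM
            [Adventure Works]')
END
ELSE
BEGIN
      --No rows: skip the expensive MDX entirely and return an empty result
      --that still has the expected column names and types.
      SELECT CAST(NULL as NVARCHAR(50)) as [Customer],
             CAST(NULL AS FLOAT) as [Internet Sales Amount],
             CAST(NULL AS FLOAT) as [Internet Gross Profit]
      WHERE 1=0
END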
Another option I considered is to create a calculated measure for each column in the SQL SELECT list. Instead of referencing the ROWS axis attributes by name, we would create a calculated measure for each of those attributes:
 

SELECT
      CAST([[Measures]].[Customer]]] as NVARCHAR(50)) as [Customer],
      CAST([[Measures]].[Internet Sales Amount]]] AS FLOAT) as [Internet Sales Amount],
      CAST([[Measures]].[Internet Gross Profit]]] AS FLOAT) as [Internet Gross Profit]
FROM OPENQUERY(AdventureWorksDW,
'WITH
      MEMBER [Measures].[Customer] as [Customer].[Customer].CurrentMember.Name
SELECT
      {
            [Measures].[Customer],
            [Measures].[Internet Sales Amount],
            [Measures].[Internet Gross Profit]
      } ON COLUMNS,
      FILTER
      (
            ORDER
            (
                  [Customer].[Customer].[Customer].Members,[Measures].[Internet Gross Profit],DESC
            ),
            ([Measures].[Internet Gross Profit],[Geography].[State-Province].&[CA]&[US])>1000
      ) ON ROWS
FROM
      [Adventure Works]')

 

 

This would work well in this situation, but there is a tradeoff: those calculated measures do take up memory since they result in more cells. In my situation, there are about ten dimension attributes on the ROWS axis, so that would be ten more cells per row, which adds up.


3 Responses to Snags of the Week – January 9, 2010

  1. Davos says:

    Very interesting post.
    I found this by searching for missing columns in openquery MDX.
    I would say this situation you are describing is more like ‘no rows returned’. I have previously used a more agricultural approach to that particular problem by creating a table variable, then using TRY/CATCH to insert the results of the MDX OPENQUERY into that @table and selecting from that @table. The query will always work, but it makes the wrong assumption that every single error is due to there being no rows and no metadata returned to OPENQUERY. There may be other errors that occur, such as linked server problems, but I think it’s fair to return the blank row anyway because it’s for an SSRS report, so the worst the user sees is a blank table rather than a nasty ASP error.

    I’ve now had a situation that I would actually call ‘missing columns’, where I actually have one fewer column returned in the OPENQUERY MDX than expected. This is because the cell is empty in the MDX, so the column is just not returned at all. The rest of the single row is there, just not that column. Whereas in SQL you get a NULL returned no matter what, in MDX it just chops it off, which is pretty frustrating.

    I see one solution to my problem, which is to use a calculated member like IIF(isempty([measure]),0,[measure]), but if I include that the time goes from 1 second to 2+ minutes (I cancelled it because I can’t wait that long for 1 row).

    I could try and count columns using a modified version of what you’ve done here, but I don’t know the column equivalent of COUNT(*), if that’s even a logical thing to say. Maybe I could run two reduced versions of the query, each returning only one of the two measures, and if either returns no rows then the combined query will be missing a column, so don’t run it.

  2. Eugene says:

    Hi Davos,
    First, you mention you created a calculated member that severely increases the query time: IIF(isempty([measure]),0,[measure])

    That calculated measure essentially removes any ability to take advantage of the usual sparsity of a cube (all cells for that calculated member have a value). I realize you did this to remedy your problem with null columns not returning. I thought I’d mention it anyway.

    Back to the original problem. Another trick I’ve used is to create a set of calculated members. In this scenario, I’m trying to retrieve a column for a quarter and the previous three. But sometimes the previous three may venture beyond my lowest member. For example, if I have data ranging from Q1/2010 through Q4/2011 and I select Q2/2010, I’ll expect a column for Q3/2009, Q4/2009, Q1/2010, and Q2/2010. But Q3/2009 and Q4/2009 will be null and I’ll get only two columns back.

    So I modified my MDX from something like this:

    SELECT [Dates].[Quarters].[Q2/2010].LAG(3) : [Dates].[Quarters].[Q2/2010] ON 0

    To

    WITH
    MEMBER [Dates].[Quarters].[L3] AS [Dates].[Quarters].[Q2/2010].LAG(3)
    MEMBER [Dates].[Quarters].[L2] AS [Dates].[Quarters].[Q2/2010].LAG(2)
    MEMBER [Dates].[Quarters].[L1] AS [Dates].[Quarters].[Q2/2010].LAG(1)
    MEMBER [Dates].[Quarters].[L0] AS [Dates].[Quarters].[Q2/2010]
    SET [Quarters] AS {[Dates].[Quarters].[L3], [Dates].[Quarters].[L2], [Dates].[Quarters].[L1], [Dates].[Quarters].[L0]}
    SELECT [Quarters] ON 0

  3. Davos says:

    Hi Eugene. Good point about the sparsity of the cube and I totally agree. I think there should be a way to do this with the OPENQUERY and MSOLAP provider rather than slowing down the MDX in such an inelegant way.

    When you create the query-scoped set as above, does it then return all column headers in the set? Is it fast? (I try to avoid using the "word" performant.)

    I ended up not using the IIF.

    In my problem above I had 2 financial measures we’ll call them A and B and then a third column on the report showing the total of A and B. The report was per-employee running on sharepoint and using their NT Login to automatically show the employee their own report. Notably, some employees have only A, some have only B and some have both. The SSRS expression took care of the total, but the problem was what to return in the case when either A column is missing or B column is missing.

    The report runs in a couple of seconds from a ReportViewer web part on SharePoint, and whilst it is essentially running two or three similar queries, it’s still fast.

    This is all well and good but I’m back googling now because I have another report with 12 calculated measures that may or may not exist week to week 😦

    Anyway this is how I ended up doing that previous one. Production code is the only code!

    The (simplified to pseudocode) stored proc looks like this:

    Declare @Temp Table (with all columns including A and B) , @ARows int, @BRows int

    -- build up the string
    @SQLA = 'select A on columns'
    @SQLB = 'select B on columns'
    @SQLBoth = 'select A, B on columns'
    @SQLEnd = ', other stuff on rows' + dateMemberParam + ' stuff ' + EmployeeLoginMemberParam

    /*test for A */
    @SQL = @SQLA + @SQLEND
    Begin Try
    Insert all columns to tempTable except for B
    End Try

    Begin Catch
    @Arows = 1
    End Catch

    /*test for B */
    @SQL = @SQLB + @SQLEND
    Begin Try
    Insert all columns to tempTable except for A
    End Try

    Begin Catch
    @Brows = 1
    End Catch

    If @Arows + @Brows = 0
    Return a row of NULLS -- so only 2 MDX queries ran
    Else if @Arows = 0
    Insert into the temp table except for row A -- so same MDX query runs as in the test and is already cached
    Else if @Brows = 0
    Insert into the temp table except for row B -- so same MDX query runs as in the test and is already cached

    Else
    Insert all including both A and B -- The third MDX query runs in this case which has both columns

    Select each field from temp table casting to appropriate types with
    isnull(A,0) as float
    isnull(B,0) as float
