Optimising Aggregations in Analysis Services 2008

In Analysis Services 2008, query and process performance can be optimised by looking at aggregations. I’ve distilled the main points from the SQL Server Best Practices guide in order to obtain the main points, and added in some other details. To add this post to Twitter, please click here.

An aggregate is a data structure that contains a pre-calculated summary of data. The aggregation size is smaller than the fact data it is based on, so it is expected that this will be quicker to query. Thus, aggregation design is an important method of optimising performance.

However, it is important to balance this benefit against the cost of updating the aggregations at processing time. Aggregations designs add metadata, and can thus impact on the performance. Thus it is sensible to have a balanced approach to aggregation designs per measure group; not too many! The number of potential aggregations is very high, so it would take a lot of space to contain all aggregations; thus, it is important to be selective when creating aggregations. It is not necessary to build lots of aggregations to get a query performance increase. A query can be resolved quickly since it can use existing aggregations, and combine them, to obtain the query result. It isn’t possible to aggregate everything e.g. country code. It is possible to set IsAggregatable to False, but this will affect all of the default members; so this step should be taken with care.

There are two main types of aggregations; rigid and flexible. Rigid aggregations are preserved as a result of dimension update; flexible aggregations are re-calculated. It is possible to force flexible aggregations to be rebuilt by using ProcessAffectedObjects when processing the dimension, or you can use the ProcessIndex command.

Aggregations can be created manually, or it is possible to run the Aggregation Design Wizard from BIDS or SSMS. It uses an estimated count of the members of all attributes of dimensions, in the corresponding measure group. If the EstimatedCount property is not set, then SSAS will query the relational database; this may increase the time taken for the wizard to run, particularly if the dimensions are relatively large in terms of attributes.

It is possible to tag attributes as a candidate for aggregation. This is done using the AggregationUsage property. Before you do this, you should look at the hierarchy of attributes in order to identify whether the hierarchies can be optimised. If the dimension has natural hierarchies, then it is possible to include only one attribute from the same dimension to create the necessary aggregations; for example, you could just include Product SubCategory, and this would be rolled up to give ProductCategory to prevent ProductCategory from becoming redundant. If the hierarchy of attributes is not natural, then you may need more aggregations to cover off the potential space of queries and it isn’t possible to re-use the smaller atomic aggregations. Further, you would need to include more than one single attribute to obtain the aggregation; thus, more aggregations may be required in order to optimize performance.

The AggregationUsage property applies to attributes, and identifies the activity that the Aggregation Design wizard can take on each attribute. It can take one of four values:

Value

Description

Full

Every aggregation should include

(a) this attribute or

(b) a related attribute, which is lower down in the attribute chain.

Unrestricted

The aggregation designer will consider this attribute to see if it is eligible for aggregation.

None

No aggregation applied to this attribute.

Default

Default rule applied, based on type of attribute and dimension. There are four rules:

Default Constraint 1

Unrestricted

Default Constraint 2

None for Special Dimension Types – data mining, many-to-many

Default Constraint 3

Unrestricted for Natural Hierarchies – unrestricted unless the attributes are nonaggregatable, in which case this property is set to Full

Default Constraint 4

None

Tips:

  • Set the AggregateUsage property to Unrestricted only for the frequently-used attribute hierarchies: The Best Practices paper suggests five – ten Unrestricted attributes per dimension.
  • If the attribute is contained in most of the queries, then set AggregateUsage should be set to Full. This only applies to attributes where there are only a few members; otherwise, the wizard will take a long time to devise a design, and implement it.
  • If the attribute isn’t used often, then this should be set to None. This reduces the overall search space.

Aggregations are partition-specific, since they are data structures stored on disk. The AggregationDesign object can be shared by different partitions. It is also possible to use a base partition which does not contain any data, but does contain the aggregation design. When a new partition is required, a clone of the base can be taken, and then fully processed, and will contain the new data and corresponding aggregations.

Another Wizard which can help is the Usage-Based optimisation wizard, which can be run via SSMS. The wizard works by evaluating the queries in the query log, identifies the hundred queries that performed the slowest, and devises aggregations to resolve these. It works in a top-down fashion; it starts with grand totals, and moves down to lower grain aggregations. If the wizard is configured to design aggregations to give a 100% performance again, then the aggregations will be devised to avoid directly hitting the partition. The resulting aggregations may be identical to existing aggregations, so the user will need to double-check for duplicates. In order for it to work, it is necessary to set up server query logs:

QueryLogSampling

Every 10th query is logged, but this is configurable

QueryLogConnectionString

The SQL Server holds a query log for SSAS

QueryLogTableName

This holds the tablename for holding the log

CreateQueryLogTable

This creates a new table, or logs to the existing one

How is it possible to measure the optimisation value of aggregations? It is possible to use SQL Profiler to identify which aggregations are used, by making use of the event GetDataFromAggregation. Examine statistics and metadata about the cube, such as member counts or fact table counts. The design optimisation algorithm will use this information when suggesting aggregations. If there is no aggregation, then the storage engine goes to the detailed data and then executes the appropriate MDX.

In the profiler, each dimension is named by a vector, which identifies whether the attribute refers to the All attribute, or the attribute itself. The All level is identified by 0, and the attribute level is identified by 1. For example, the following subcube definition (0000, 0001, 0001) describes an aggregation for the following:

  • Product – All, All, All, All
  • Geography – All, All, All, State/Province
  • Order Date – All, All, All, Year

Each aggregation is identified by a subcube. This is a vector path, which is an amalgamation of the dimension vectors, which are comma-separated. The vectors are combined in dimension order, which can be found in the XMLA definition of the cube, or by looking under the Dimensions pane in the Cube Structure.

To summarise, in Analysis Services 2008, query and process performance can be optimised by looking at aggregations; there are many different features around Aggregations in 2008, so it is worthwhile having a look at them.

Comparison between Parent-Child and User Hierarchies in Dimensions

In SQL Server Analysis Services 2008, hierarchies are important because they are critical to optimising dimensions, and thus improving the performance of the solution as a whole. Hierarchies are collections of levels, which allow users to browse cubes in a way that they will find easy to understand.

Some hierarchies are natural; this means that the attributes are intuitively related to one another. There is a clear relationship from the top of the hierarchy to the bottom. An example of this would be date: year, quarter and month follow from each other, and in part, define each other. SSAS 2008 stores the information about natural hierarchies in hierarchy stores, and these are aggregated. Other hierarchies are unnatural; this means that the attributes are not clearly related. An example of this might be geography; we may have country -> state -> city, but it is not clear where Province might sit.

By clarifying the relationships in the data, then this will help to optimise performance. Attribute relationships help performance by building indexes, and reduce query time since it is not necessary to go through the key attribute. When the dimensions are built, the engine has an increased search space of possibilities since it does not know how attributes are related to each other. The increased search space obviously adversely impacts query time. By defining attribute relationships, the overall search space is reduced since indexes are built in accordance with the attribute relationships. Further, aggregations can be used, and re-used, where attributes are related. By tagging certain attributes as having aggregations, then it is possible to optimise the performance at query time, since the aggregations are pre-calculated.

There are different types of hierarchies: Parent-Child and User-defined Hierarchies. User-defined Hierarchies allow users to create hierarchies with their own levels, and relationships between the attributes. On the other hand, Parent-Child hierarchies use a special attribute, called a parent attribute, to determine how members relate to each other. This configuration depends on table columns. A parent attribute is a self-referencing relationship, or self-join, with the main dimension table; one is the parent column, and the other is the member key column.

Next, I have tabulated the main points between Parent-Child and User Hierarchies in dimensions:

Parent-Child Hierarchy

User Hierarchy

Definition

The hierarchy depends on two table columns:

A member key

A parent column

Collection of levels based on attributes.

Hierarchy Construction

Hierarchy is constructed in two ways:

Examine underlying dimension table data

Examine parent-child relationships between records in the table.

Grouping and Ordering attributes

Name determination

NamingTemplate property

Attribute Names

Display

MemberswithData property set to NonLeafDataVisible or NonLeafDataHidden

Attribute Names

Hierarchical position of member

KeyColumns and RootMemberIf property of the parent

Attribute Ordering

Position of member in level

OrderBy attribute; this is considered to be an unbalanced hierarchy

Attribute Ordering; the hierarchy can be balanced / unbalanced / ragged

Data Type

The Parent and Child must have the same data type

Flexible; no restriction to have the same data type.

The implementation of hierarchies will help to optimise the performance of the SSAS 2008 solution as a whole, so it is worth taking some time over it. Once again, the SQL Server Best Practices guide has some very useful information.

Type 1 and 2 Relationships and Dimension Design in SSAS 2008

In the current blog, there is a summary of Type 1 and Type 2 attribute dimension changes in terms of their implementation of attribute relationships in Analysis Services 2008. Attribute relationships are an important part of dimension design for a number of reasons:

  • Correct attribute relationships help the server to optimize data storage
  • Define referential integrity within the dimension itself
  • Organise member properties

According to the SQL Server Best Practices Whitepaper, there are some optimisation guidelines for dimensions:

  • Create attribute relationships that follow the natural hierarchies found in the data
  • Set the RelationshipType property of the attribute relationships to Rigid when the data does not change between related attributes, and to Flexible where the data does change between related attributes.

When you set the property RelationshipType correctly, then the server is able to do its job of optimising the change processing methods, and aggregation re-building. In terms of aggregations, SSAS will persist aggregations during cube processing if the attribute relationship is considered to be fixed, or Rigid. If it is set to Flexible, then the aggregations will need to be processed again.

Briefly, where the relationships between attributes are not expected to change, the property RelationshipType should be set to Rigid. In contrast, RelationshipType should be set to Flexible where the attribute relationship is considered to be subject to change. By default, the RelationshipType is set to Flexible. So, for example, the relationship between Year and Quarter is not likely to change, so the RelationshipType would be set to Rigid. On the other hand, the marital status of a customer may be subject to change, so this RelationshipType would be set to Flexible.

This sounds straightforward, but becomes more complex when we consider Type 1 and Type 2 attribute changes, as you will see below.

If you’re tracking Type 1 historical changes of an attribute, then this is normally set to Flexible if the key and the name are changed at the same time, normally during ETL load. So, if a ProductCategory key and name change occurs at the same time, then this would be Flexible. One exception of this is that if the name changes, but the key does not, then it could be set to Rigid. For example, if the ProductCategory name changes, but its key does not, then this will set to Rigid; the name will simply be the new name for the ProductCategory.

If you’re tracking Type 2 historical changes of an attribute, such as Product Name that’s tied directly to the key attribute, then you change the RelationshipType property to Rigid.

If your Type 2 historical change is happening to an attribute that’s not directly related to the key attribute but is related instead to a higher level attribute, then the RelationshipType property needs to remain Flexible. Otherwise, you’ll generate an error when the dimension is re-processed.

To summarise, although the Rigid/Flexible definition appears straightforward at first, it is important to keep in mind the type of attribute change that is being made.

Dimensional Modelling: application to the life of King Henry VIIIth

Here, I will discuss the various types of Slowly Changing Dimensions have been discussed using the example of the marital activities of King Henry VIIIth, along with various suggestions of how the SCD technology might be used to implement these technologies using Microsoft SSIS 2008.

When we model environments, how is it possible to record a piece of information when characteristics about that fact are subject to change? The current article hopes to clarify this issue, by using an example: King Henry VIIIth (28 June 1491 – 28 January 1547) and his marital history. King Henry the eighth is mostly remembered for having six wives, and then disposing of them in different ways. People often assume that he divorced five times and his sixth wife outlived him. However, that’s not strictly true; four marriages were annulled, and the remaining two ended in death. The specific details are listed here:

Consort

How did the Marriage end?

Catherine of Aragon

Annulled

Anne Boleyn

Annulled then beheaded

Jane Seymour

Died in childbirth

Anne of Cleves

Annulled

Catherine Howard

Annulled then beheaded

Catherine Parr

Outlived King Henry VIIIth

Considering that King Henry VIIIth is someone who had a complexity of different marital status at various times, how could this behaviour tracked in a data warehouse? In more general terms, how can dimensional modeling cope with changes such as these?

Data about facts are held in dimension tables. The ‘look up’ data is held in dimensional tables, and the actual facts themselves are held in a fact table. How can dimensional tables be configured to cope with data or ‘real life’ changes?

Dimensional modeling implements a methodology called ‘Slowly Changing Dimensions’ or SCD in order to accommodate ‘real life’ changes about facts. This article also briefly describes how SCD methodology is implemented in Microsoft SSIS 2008.

There are three main types of commonly used SCD: 1, 2, and 3. Each type of SCD tries to allow the dimensional designer to cope with changes.

The Type 1 methodology simply overwrites old data with new data. Historical data is not tracked at all. This can be used when data errors are being corrected, for example, if an attribute was mis-spelled. One criticism of this rationale for SCD is that it does not allow the state change reversal to the original state. In dimensional modelling, this means that the original information is lost. This destructive behaviour on the original record may not be suitable for many business environments, which may need to preserve the history of the state over time. In Microsoft SSIS, the Type 1 implementation of the Slowly Changing Dimension transformation directs these rows to an output named Changing Attributes Updates Output.

In our example of King Henry VIIIth, a Type 1 methodology would hold that, at the time of his death, he was married and that no other history exists. However, for many purposes, this might be insufficient. For example, what about the case where we would like to track the history regarding the King Henry VIIIth’s marital status?

In a Type 2 methodology, in terms of data warehouses, existing records cannot be changed. However, it is possible to allow an update to show that the record is current or expired. A change in an attribute results in the creation of new records, which have a primary key. Thus, a new record is created as a result of an attribute change as opposed to the update of an existing record. Since a new record is inserted each time a change is made, then the history is preserved. For example, a customer could have an address in Oxford in one year, and an address in Cambridge the next. Although the customer is only one person, according to the dimensional model the customer would have two different records, one for Oxford and the other for Cambridge.

Briefly, a new situation or record is created to correspond to the new state. In terms of the data warehouse, this means that we would have to create a key that ties the multiple instances together. This key is known as the Business key. In Type 2 SCD, each record shares the same Business Key, which ties the events together, even though they have a different Surrogate Key. In terms of future ‘worlds’, future events against a dimension record will be owned by the latest dimension record that changed; so a relation will exist. The fact record would be recorded against the latest dimension value, while historic records would link to the old previous record. Thus, a data update occurs, not a data overwrite.

In SSIS, the Type 2 implementation of Slowly Changing Dimension transformation directs these rows to two outputs: Historical Attribute Inserts Output and New Output. Fixed attribute changes indicate the column value must not change. The Slowly Changing Dimension transformation detects changes and can direct the rows with changes to an output named Fixed Attribute Output.

In our example of the marital complexities of King Henry VIIIth, the history would be tracked by a Type 2 Slowly Changing Dimension. The effective start and end dates would keep the history of the changes in marital history, which would be preserved.

In a Type 3 methodology, this involves a combination of the previous two interpretations. The Type 3 SCD can be expanded to include Type 2 functionality. At its lowest level, it is a case of duplicating attributes to have a ‘Current’ and ‘Previous’ version, and then using logic to set values where needed. In other words, the only outcome that is considered is the outcome that actually took place; this can be called the only true or real outcome.

This might be useful for tracking both the previous state and the existing state. For example, if the report consumer wanted to know the current status and the previous status, then this might be useful. However, this is rarely the case. In our example of King Henry VIIIth, Type 3 SCD would be used if we wanted to know his current state and his previous state. However, this isn’t necessary for most purposes, which is why Type 3 isn’t used often.

According to Type 3 SCD, the original value is overwritten and the new effective start date is added. The old record is changed, which is denoted by an updated to a column to specify the effective start date. We don’t have to worry about all changes. In fact, we are only concerned with the facts on which we are trying to report.

Microsoft SSIS does not support Type 3 changes because it involves changes to the dimension table. However, it is possible to capture the data values that are candidates for Type 3 changes by identifying columns with a fixed attribute update type.

Microsoft SSIS does permit the implementation of Type 1 and Type 2 SCD, but care needs to be taken with the implementation of Type 3 methodology. Microsoft SQL Server SSIS 2008, via package logging, does permit the additional logging of calls to the SCD transformation, which allows increased capacity to troubleshoot connections, commands and queries.

To conclude, the various types of Slowly Changing Dimensions have been discussed using the marital shenanigans of King Henry VIIIth, along with various suggestions of how the SCD technology might be used to implement these technologies.