How is it best to handle unaudited Fact Table Updates and Deletes?

It is more difficult to conduct updates and deletes on a fact table than simply to insert new rows. If the source system handles the updates and deletes, then this is the preferable option, since the mechanism is already in place. Unfortunately, however, this is not always the case. How is it possible to handle unaudited updates and deletes?
Updates
If the changed rows are flagged with a ‘Modified’ date column, then we can easily identify the rows that have been changed. If this isn’t the case, how can we go about identifying which rows have changed? There are a few options:
Truncate and re-load the fact table at every load – this isn’t a good suggestion because some fact tables may be very frequently updated. This would not provide a good service to users, who rely on the information from the fact table.
Ask your friendly DBA for help, e.g. with triggers, replication, or adding a ‘Modified’ date to the source transaction data – this is a better idea. It is better to push the issue ‘upstream’ so that the data is loaded into the new data warehouse as simply as possible. This also allows the opportunity for re-use, which is always good. If you’d like to know more about using replication to find unaudited changes in the fact table, please see Duncan Sutcliffe’s excellent post here.
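Another approach sometimes used when no ‘Modified’ column can be added is to compare a hash of each incoming source row against the hash stored at the previous load; rows whose hashes differ have changed. A minimal Python sketch of the idea – the column names and structures are illustrative, not taken from any particular system:

```python
import hashlib

def row_hash(row):
    """Deterministic hash over a row's column values, sorted by column name."""
    payload = "|".join(str(row[c]) for c in sorted(row))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def find_changed_keys(source_rows, stored_hashes):
    """Return keys whose current source hash differs from the hash stored at the last load."""
    return [key for key, row in source_rows.items()
            if stored_hashes.get(key) != row_hash(row)]
```

Note that brand-new rows also surface here, since they have no stored hash from the previous load; in practice the hashes would live in a staging table alongside the business key.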
Deletes
It’s helpful to distinguish between ‘soft’ deletes and ‘hard’ deletes.
• A ‘soft’ delete means that the row is not actually deleted, but simply flagged as deleted.
• A ‘hard’ delete – it’s gone!
Soft deletes are obviously preferable; an additional column can be added as a flag, and the flag is set to denote whether or not the record is deleted.
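As a sketch of how a soft-delete flag keeps the row in the table while hiding it from users, here is a minimal Python illustration; the column names are an assumption for the example:

```python
def soft_delete(rows, key):
    """'Soft' delete: flag the matching row rather than removing it."""
    for row in rows:
        if row["id"] == key:
            row["is_deleted"] = True

def active_rows(rows):
    """What users see: only the rows not flagged as deleted."""
    return [r for r in rows if not r.get("is_deleted", False)]
```

The deleted row survives for auditing, but disappears from every query that goes through the `active_rows` filter.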
The Kimball methodology argues that updates and deletes should be handled as transactions. However, that isn’t possible in every case, and it’s probably better to apply this strategy for incremental data such as sales. Let’s take a non-incremental example, where a company has phone record data. The phone record may hold information initially such as the start time of the call. However, any ‘on hold’ time or ‘end time’ associated with the call cannot be recorded until the call ends. Thus, the update needs to be done ‘in situ’ or there will be lots of additional rows with empty columns, all for one call.
Although the structure can be difficult to decide, Kimball usually has some good words of wisdom; the trick is knowing when to apply it to your particular circumstance and user requirements, and I do recommend reading Kimball as often as possible!


Greenfield Business Intelligence – where do we start?

Generally, the human condition means that we often don’t know when to start doing something, and don’t know when to stop. I’m sure you can think of plenty of examples in life! This happens in business a lot; as an illustrative example, some managers might not be willing to stop a strategy which the evidence shows isn’t working; conversely, they don’t know when to start a new strategy.

 
The human condition also applies to business intelligence, since it involves people. However, there is the added problem of not knowing how to start, or how to stop. Greenfield business intelligence projects might sound like a dream; completely new technology, completely new solution to design, build and test, and there’s no hangover from previous bad coding.

 
Greenfield business intelligence does give us another set of problems, however. The human condition means that users don’t know what they want. This may sound strange, but sometimes users lack the confidence to say what they think they might need now or in the future. After all, who wants to take the blame when something goes wrong later on? User input and experience is extremely valuable in producing business intelligence solutions, and it’s very important to include users early on. So, if you can’t get the user input easily, what can you do?

 
Greenfield business intelligence isn’t all about technology; there are people issues as well.  In these situations, the following advice might help:
  • As a recommended strategy, producing reports in an agile way can help users to provide feedback. This will increase their confidence in the system, in addition to making them feel that their input is valuable. It wouldn’t hurt to drive home that their input is necessary to make the project a success. This means that report changes can be done in a kaizen way – iterative, smaller changes that each constitute a part of the overall whole.
  • Remember that Excel is your friend; users love Excel. Users really appreciate being able to see something. An ideal way to do this is to produce a few mock-up reports in Excel with some made-up data. Users tend to give an almost visceral response, for example, ‘I really like that, I really need that!’ or ‘ugh, not what I wanted at all!’ This can help focus your efforts in planning the data that goes into the data warehouse.
  • Use a ‘traffic light’ system to identify what you must have, what you should have, and what’s not necessary. In project management speak, this is sometimes called the MoSCoW method: Must have, Should have, Could have, Won’t have. This is easy for everyone to understand, and can help move the requirements forward quite quickly.
  • All feedback is useful, the negative and the positive. If people don’t like a report or don’t need the data, then it’s good to be ‘lean’ and not include it at all. Why give yourself extra work to include data which isn’t necessary?
  • In Kimball’s books, there is a real emphasis on this strategy, and the first few chapters are about ways of talking to business users. If you haven’t read any Kimball, I can really recommend the wisdom you’ll find there.
 
To summarise, although getting user input can be difficult, there are some tools which can help to obtain it more easily. It’s better to try and get user input in the planning phase rather than after you’ve spent months working on a business intelligence solution that does not answer the business questions. If that happens, the users won’t accept the solution, and it will ultimately fail.

 
I’d love your comments, and if you have any other advice, please do leave a comment – I’d love to learn from your experiences!

 


Dimensional Modelling Definitions

Normally I write about stuff I know about… but this blog is an exception! I am writing about this to try and teach myself something. I’m not normally a risk-taker, so you’ll have to be patient if you need to correct me. I was reading an article which discussed quantum mechanics and its application to Oracle and data warehousing. This blog hopes to explore some of the points raised there, and apply the principles instead to Microsoft SQL Server 2008.

As we discussed in an earlier blog, descriptive data about facts is held in dimension tables: the ‘look up’ data sits in the dimension tables, while the actual facts themselves are held in a fact table. How can dimension tables be configured to cope with data or ‘real life’ changes?

Dimensional modelling implements a technique called ‘Slowly Changing Dimensions’, or SCD, in order to accommodate ‘real life’ changes about facts. It is possible to correlate the SCD technique with other areas of knowledge or data representation in other philosophical realms, such as quantum mechanics. There are three main types of commonly used SCD: 1, 2, and 3. Each type of SCD tries to help the dimensional designer cope with changes. The relationship between the SCD types, quantum mechanics, and the related implementation in Microsoft SQL Server is explored below.

The Type 1 methodology simply overwrites old data with new data. Historical data is not tracked at all; Type 1 is typically used when data errors are being corrected, for example, when an attribute was misspelled. According to the Copenhagen interpretation of quantum mechanics, every measurement causes a fundamental change in the state: a waveform collapse. One criticism of this rationale, for both SCD and the Copenhagen interpretation, is that it does not allow the state to be reversed to the original state. Applied to dimensional modelling, this means that the original information is lost. This destructive behaviour towards the original record may not be suitable for many business environments, which may need to preserve the history of the state over time. In Microsoft SSIS, the Type 1 implementation of the Slowly Changing Dimension transformation directs these rows to an output named Changing Attributes Updates Output.
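Outside SSIS, the Type 1 overwrite itself is trivial; the point is what it destroys. A minimal Python sketch of a Type 1 overwrite, with illustrative column names:

```python
def scd_type1_update(dimension, business_key, new_attributes):
    """Type 1: overwrite the matching row in place; no history survives."""
    for row in dimension:
        if row["business_key"] == business_key:
            row.update(new_attributes)
```

After the call, the misspelled value is simply gone – there is no record that it ever existed, which is exactly the waveform-collapse behaviour described above.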

In a Type 2 methodology, existing records in the data warehouse cannot be changed. However, an update is permitted to show whether the record is current or expired. A change in an attribute results in the creation of a new record, with its own primary key. Thus, a new record is created as a result of an attribute change, as opposed to an update of the existing record. Since a new record is inserted each time a change is made, the history is preserved.

In dimensional modelling, Type 2 corresponds to the Many Worlds interpretation of quantum mechanics, which allows events to reside next to one another by catering for many possibilities or ‘worlds’ that are not correlated with one another. This is analogous to Type 2 SCD, where a new record is created to correspond to the new state. The analogy breaks down, however: the Many Worlds interpretation holds that the scenarios are unrelated, but in dimensional modelling the records are related to one another, since they share the same Business Key even though they have different Surrogate Keys. In terms of future ‘worlds’, future events against a dimension record will be owned by the latest dimension record, and the Business Key ties the multiple instances together; so a relation will exist.

In SSIS, the Type 2 implementation of Slowly Changing Dimension transformation directs these rows to two outputs: Historical Attribute Inserts Output and New Output. Fixed attribute changes indicate the column value must not change. The Slowly Changing Dimension transformation detects changes and can direct the rows with changes to an output named Fixed Attribute Output.
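The mechanics described above – expire the current row, then insert a new row under a fresh surrogate key – can be sketched outside SSIS too. A minimal Python illustration, with column names that are assumptions for the example:

```python
def scd_type2_change(dimension, business_key, new_attributes,
                     effective_date, next_surrogate_key):
    """Type 2: expire the current row, then insert a new row with a fresh surrogate key."""
    for row in dimension:
        if row["business_key"] == business_key and row["is_current"]:
            row["is_current"] = False          # expire, never delete
            row["end_date"] = effective_date
    new_row = {"surrogate_key": next_surrogate_key,
               "business_key": business_key,   # shared key ties the versions together
               "start_date": effective_date,
               "end_date": None,
               "is_current": True}
    new_row.update(new_attributes)
    dimension.append(new_row)
    return new_row
```

Every version of the member survives, and the shared Business Key is what relates the ‘worlds’ to one another.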

A Type 3 methodology involves a combination of the previous two approaches, and a Type 3 SCD can be expanded to include Type 2 functionality. At its lowest level, it is a case of duplicating attributes to hold a ‘Current’ and a ‘Previous’ version, and then using logic to set the values where needed. In quantum mechanics, an analogous interpretation exists, known as the Consistent Histories interpretation. This interpretation holds that various worlds or states are possible, but it does not hold that an unending variation of potential multiple worlds can be produced from these states. In other words, the only outcome that is considered is the outcome that actually took place; this can be called the only true or real outcome.
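The ‘Current’/‘Previous’ duplication at the heart of Type 3 can be sketched in a few lines. A minimal Python illustration – the column naming convention is an assumption:

```python
def scd_type3_update(row, attribute, new_value, effective_date):
    """Type 3: shift the current value into the 'previous' column, then overwrite."""
    row["previous_" + attribute] = row.get("current_" + attribute)
    row["current_" + attribute] = new_value
    row["effective_date"] = effective_date
```

Only one step of history survives: a second change would overwrite the ‘previous’ value, which is why Type 3 cannot produce the unending variation of worlds that Type 2 can.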

Microsoft SSIS does not support Type 3 changes easily because it involves changes to the dimension table. However, it is possible to capture the data values that are candidates for Type 3 changes by identifying columns with a fixed attribute update type.

Microsoft SSIS does permit the implementation of Type 1 and Type 2 SCD, but care needs to be taken with the implementation of Type 3 methodology. Microsoft SQL Server 2005 Service Pack 2 (SP2), via package logging, does permit the additional logging of calls to the SCD transformation, which allows increased capacity to troubleshoot connections, commands and queries.

To summarise, the SCD types don’t correlate exactly to the various quantum mechanics interpretation of events; but I learned something along the way and perhaps it might spark your interest too?

Comparison between Parent-Child and User Hierarchies in Dimensions

In SQL Server Analysis Services 2008, hierarchies are important because they are critical to optimising dimensions, and thus improving the performance of the solution as a whole. Hierarchies are collections of levels, which allow users to browse cubes in a way that they will find easy to understand.

Some hierarchies are natural; this means that the attributes are intuitively related to one another. There is a clear relationship from the top of the hierarchy to the bottom. An example of this would be date: year, quarter and month follow from each other, and in part, define each other. SSAS 2008 stores the information about natural hierarchies in hierarchy stores, and these are aggregated. Other hierarchies are unnatural; this means that the attributes are not clearly related. An example of this might be geography; we may have country -> state -> city, but it is not clear where Province might sit.

Clarifying the relationships in the data helps to optimise performance. Attribute relationships help performance by building indexes, and reduce query time since it is not necessary to go through the key attribute. Without defined attribute relationships, the engine has an increased search space of possibilities when the dimensions are built, since it does not know how the attributes are related to each other. The increased search space adversely impacts query time. By defining attribute relationships, the overall search space is reduced, since indexes are built in accordance with the attribute relationships. Further, aggregations can be used, and re-used, where attributes are related. By tagging certain attributes as having aggregations, it is possible to optimise performance at query time, since the aggregations are pre-calculated.

There are different types of hierarchies: Parent-Child and User-defined Hierarchies. User-defined Hierarchies allow users to create hierarchies with their own levels, and relationships between the attributes. On the other hand, Parent-Child hierarchies use a special attribute, called a parent attribute, to determine how members relate to each other. This configuration depends on table columns: a parent attribute defines a self-referencing relationship, or self-join, within the main dimension table, where one column is the parent column and the other is the member key column.
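Because of that self-join, a member’s position in a Parent-Child hierarchy can only be found by walking the parent column up to the root. A minimal Python sketch of that walk – the table layout is illustrative, not the SSAS internal representation:

```python
def member_path(members, key):
    """Walk the self-referencing parent column from a member up to the root."""
    path = []
    while key is not None:
        member = members[key]
        path.append(member["name"])
        key = member["parent_key"]   # self-join: parent_key points back into members
    return list(reversed(path))
```

Note that different members can sit at different depths, which is why a Parent-Child hierarchy is considered unbalanced.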

Next, I have tabulated the main points between Parent-Child and User Hierarchies in dimensions:

| | Parent-Child Hierarchy | User Hierarchy |
|---|---|---|
| Definition | The hierarchy depends on two table columns: a member key and a parent column | Collection of levels based on attributes |
| Hierarchy construction | Constructed by examining the underlying dimension table data and the parent-child relationships between records in the table | Grouping and ordering of attributes |
| Name determination | NamingTemplate property | Attribute names |
| Display | MembersWithData property set to NonLeafDataVisible or NonLeafDataHidden | Attribute names |
| Hierarchical position of member | KeyColumns and RootMemberIf property of the parent attribute | Attribute ordering |
| Position of member in level | OrderBy attribute; this is considered an unbalanced hierarchy | Attribute ordering; the hierarchy can be balanced, unbalanced, or ragged |
| Data type | The parent and child must have the same data type | Flexible; no restriction to have the same data type |

The implementation of hierarchies will help to optimise the performance of the SSAS 2008 solution as a whole, so it is worth taking some time over it. Once again, the SQL Server Best Practices guide has some very useful information.

Type 1 and 2 Relationships and Dimension Design in SSAS 2008

This blog summarises Type 1 and Type 2 attribute changes in terms of their implications for attribute relationships in Analysis Services 2008. Attribute relationships are an important part of dimension design for a number of reasons:

  • Correct attribute relationships help the server to optimize data storage
  • Define referential integrity within the dimension itself
  • Organise member properties

According to the SQL Server Best Practices Whitepaper, there are some optimisation guidelines for dimensions:

  • Create attribute relationships that follow the natural hierarchies found in the data
  • Set the RelationshipType property of the attribute relationships to Rigid when the data does not change between related attributes, and to Flexible where the data does change between related attributes.

When you set the RelationshipType property correctly, the server is able to optimise its change processing and aggregation rebuilding. In terms of aggregations, SSAS will persist aggregations during cube processing if the attribute relationship is considered to be fixed, or Rigid. If it is set to Flexible, then the aggregations will need to be processed again.

Briefly, where the relationships between attributes are not expected to change, the property RelationshipType should be set to Rigid. In contrast, RelationshipType should be set to Flexible where the attribute relationship is considered to be subject to change. By default, the RelationshipType is set to Flexible. So, for example, the relationship between Year and Quarter is not likely to change, so the RelationshipType would be set to Rigid. On the other hand, the marital status of a customer may be subject to change, so this RelationshipType would be set to Flexible.

This sounds straightforward, but becomes more complex when we consider Type 1 and Type 2 attribute changes, as you will see below.

If you’re handling Type 1 changes to an attribute, the relationship is normally set to Flexible when the key and the name change at the same time, normally during the ETL load. So, if a ProductCategory key and name change occur at the same time, then this would be Flexible. One exception is that if the name changes but the key does not, it could be set to Rigid. For example, if the ProductCategory name changes but its key does not, this can be set to Rigid; the name is simply the new name for the ProductCategory.

If you’re tracking Type 2 historical changes of an attribute, such as Product Name that’s tied directly to the key attribute, then you change the RelationshipType property to Rigid.

If your Type 2 historical change is happening to an attribute that’s not directly related to the key attribute but is related instead to a higher level attribute, then the RelationshipType property needs to remain Flexible. Otherwise, you’ll generate an error when the dimension is re-processed.

To summarise, although the Rigid/Flexible definition appears straightforward at first, it is important to keep in mind the type of attribute change that is being made.

Dimensional Modelling: application to the life of King Henry VIII

Here, I will discuss the various types of Slowly Changing Dimensions, using the example of the marital activities of King Henry VIII, along with various suggestions of how they might be implemented using Microsoft SSIS 2008.

When we model environments, how is it possible to record a piece of information when characteristics about that fact are subject to change? The current article hopes to clarify this issue by using an example: King Henry VIII (28 June 1491 – 28 January 1547) and his marital history. Henry VIII is mostly remembered for having six wives, and then disposing of them in different ways. People often assume that he divorced five times and that his sixth wife outlived him. However, that’s not strictly true: four marriages were annulled, and the remaining two ended in death. The specific details are listed here:

| Consort | How did the marriage end? |
|---|---|
| Catherine of Aragon | Annulled |
| Anne Boleyn | Annulled then beheaded |
| Jane Seymour | Died in childbirth |
| Anne of Cleves | Annulled |
| Catherine Howard | Annulled then beheaded |
| Catherine Parr | Outlived King Henry VIII |

Considering that King Henry VIII was someone whose marital status changed several times, how could this behaviour be tracked in a data warehouse? In more general terms, how can dimensional modelling cope with changes such as these?

Descriptive data about facts is held in dimension tables: the ‘look up’ data sits in the dimension tables, while the actual facts themselves are held in a fact table. How can dimension tables be configured to cope with data or ‘real life’ changes?

Dimensional modeling implements a methodology called ‘Slowly Changing Dimensions’ or SCD in order to accommodate ‘real life’ changes about facts. This article also briefly describes how SCD methodology is implemented in Microsoft SSIS 2008.

There are three main types of commonly used SCD: 1, 2, and 3. Each type of SCD tries to allow the dimensional designer to cope with changes.

The Type 1 methodology simply overwrites old data with new data. Historical data is not tracked at all; Type 1 is typically used when data errors are being corrected, for example, when an attribute was misspelled. One criticism of this approach is that it does not allow a change to be reversed to the original state. In dimensional modelling, this means that the original information is lost. This destructive behaviour towards the original record may not be suitable for many business environments, which may need to preserve the history of the state over time. In Microsoft SSIS, the Type 1 implementation of the Slowly Changing Dimension transformation directs these rows to an output named Changing Attributes Updates Output.

In our example of King Henry VIII, a Type 1 methodology would hold that, at the time of his death, he was married, and that no other history exists. However, for many purposes this might be insufficient. For example, what about the case where we would like to track the history of King Henry VIII’s marital status?

In a Type 2 methodology, existing records in the data warehouse cannot be changed. However, an update is permitted to show whether the record is current or expired. A change in an attribute results in the creation of a new record, with its own primary key. Thus, a new record is created as a result of an attribute change, as opposed to an update of the existing record. Since a new record is inserted each time a change is made, the history is preserved. For example, a customer could have an address in Oxford in one year, and an address in Cambridge the next. Although the customer is only one person, according to the dimensional model the customer would have two different records, one for Oxford and the other for Cambridge.

Briefly, a new record is created to correspond to the new state. In terms of the data warehouse, this means that we need a key that ties the multiple instances together. This key is known as the Business Key. In Type 2 SCD, each record shares the same Business Key, which ties the events together, even though each has a different Surrogate Key. Future events against the dimension will be owned by the latest dimension record, so a relation will exist: new fact records are recorded against the latest dimension value, while historic fact records link to the previous dimension records. Thus, a data update occurs, not a data overwrite.

In SSIS, the Type 2 implementation of Slowly Changing Dimension transformation directs these rows to two outputs: Historical Attribute Inserts Output and New Output. Fixed attribute changes indicate the column value must not change. The Slowly Changing Dimension transformation detects changes and can direct the rows with changes to an output named Fixed Attribute Output.

In our example of the marital complexities of King Henry VIII, the history would be tracked by a Type 2 Slowly Changing Dimension. Effective start and end dates would preserve the history of the changes in marital status.
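To make the Type 2 behaviour concrete, here is a minimal Python sketch that replays the consort table above through expire-and-insert logic. This is an illustration of the idea, not an SSIS implementation, and the keys and column names are assumptions:

```python
# Consorts and outcomes, taken from the table above.
marriages = [
    ("Catherine of Aragon", "Annulled"),
    ("Anne Boleyn", "Annulled then beheaded"),
    ("Jane Seymour", "Died in childbirth"),
    ("Anne of Cleves", "Annulled"),
    ("Catherine Howard", "Annulled then beheaded"),
    ("Catherine Parr", "Outlived King Henry VIII"),
]

dimension = []
for i, (consort, outcome) in enumerate(marriages, start=1):
    # Type 2: expire the previous current row before inserting the new one.
    for row in dimension:
        if row["is_current"]:
            row["is_current"] = False
    dimension.append({
        "surrogate_key": i,
        "business_key": "HenryVIII",   # one person, many records
        "consort": consort,
        "marriage_outcome": outcome,
        "is_current": True,
    })
```

The result is six dimension records, all sharing the same Business Key, with only the final marriage (to Catherine Parr) flagged as current – the full marital history is preserved.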

A Type 3 methodology involves a combination of the previous two approaches, and a Type 3 SCD can be expanded to include Type 2 functionality. At its lowest level, it is a case of duplicating attributes to hold a ‘Current’ and a ‘Previous’ version, and then using logic to set the values where needed. In other words, only the outcome that actually took place, plus the immediately previous state, is kept.

This might be useful for tracking both the previous state and the current state. For example, if the report consumer wanted to know the current status and the previous status, then this might be useful. However, this is rarely the case. In our example of King Henry VIII, a Type 3 SCD would be used if we wanted to know his current state and his previous state. However, this isn’t necessary for most purposes, which is why Type 3 isn’t used often.

According to Type 3 SCD, the original value is overwritten and a new effective start date is added. The old record is changed, which is denoted by an update to a column specifying the effective start date. We don’t have to track all changes; in fact, we are only concerned with the attributes on which we are trying to report.

Microsoft SSIS does not support Type 3 changes because it involves changes to the dimension table. However, it is possible to capture the data values that are candidates for Type 3 changes by identifying columns with a fixed attribute update type.

Microsoft SSIS does permit the implementation of Type 1 and Type 2 SCD, but care needs to be taken with the implementation of Type 3 methodology. Microsoft SQL Server SSIS 2008, via package logging, does permit the additional logging of calls to the SCD transformation, which allows increased capacity to troubleshoot connections, commands and queries.

To conclude, the various types of Slowly Changing Dimensions have been discussed using the marital shenanigans of King Henry VIII, along with various suggestions of how they might be implemented using Microsoft SSIS.