Data Vault Certification Course in London with Hans Hultgren.

Hans Hultgren (LinkedIn) and his team are holding the Data Vault Certification course (CDVDM) again in London – June 18-20.

Please tell Hans that I sent you, and you will qualify for a discount. Note that I’m not financially incentivized in any way; Hans reached out after my recent Data Vault post, and I was happy to help.

Please ping Hans over at his LinkedIn page directly, and I hope you enjoy the course. I’ll look forward to hearing your comments.


Modelling your Data in Azure Data Lake

One of my project roles at the moment (I have a few!) is that I am architecting a major Azure implementation for a global brand. I'm also helping with the longer-term 'vision' of how that might shape up. I love this part of my job and I'm living my best life doing this piece; I love seeing a project take shape until the end users, whether they are business people or more strategic C-level stakeholders, get the benefit of the data. At Data Relish, I make your data work for organizations of every purse and every purpose, whatever your role, and I learn a lot from the variety of consulting pieces that I deliver.

If you’ve had even the slightest look at the Azure Portal, you will know that it has oodles of products that you can use in order to create an end-to-end solution. I selected Azure Data Lake for a number of reasons:

  • I have my eye on the Data Science ‘prize’ of doing advanced analytics later on, probably in Azure Databricks as well as Azure Data Lake. I want to make use of existing Apache Spark skills and Azure Data Lake is a neat solution that will facilitate this option.
  • I need a source that will cater for the shape of the data…. or the lack of it….
  • I need a location where the data can be accessed globally since it will be ingesting data from global locations.

In terms of tooling, there are always the Azure Data Lake Tools for Visual Studio; you can watch a video on this topic here. But how do you get started with the design approach? How do I go about designing solutions for Azure Data Lake? There are many different approaches, and I have been implementing Kimball methodologies for years.


For this particular situation, I will be using the Data Vault methodology. I know that there are different schools of thought, but I've learned from Dan Linstedt in particular, who has been very generous in sharing his expertise; you can find Dan's website here. I have previously delivered this methodology for an organization with a turnover of billions of USD, and they are still using the system that I put in place; it was a particularly helpful approach for an acquisition scenario, for example.

 

Building a Data Vault starts with the modelling process, and this starts with a view of the existing data model of a transactional source system. The purpose of the Data Vault modelling lifecycle is to deliver solutions to the business faster, at lower cost and with less risk, and to ensure that those solutions have a clearly supported afterlife once I've moved on to another project for another customer.

 

Data Vault is a database modelling technique where the data is considered to belong to one of three entity types: hubs, links, and satellites:

  • Hubs contain the key attributes of business entities (such as geography, products, and customers).
  • Links define the relationships between the hubs (for example, customer orders or product categories).
  • Satellites contain all other attributes related to hubs or links, including all attribute change history.
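
To make the three entity types concrete, here is a minimal T-SQL sketch of one hub, one link and one satellite. The table and column names (HubCustomer, LinkCustomerOrder, SatCustomer and so on) are hypothetical, and a production Data Vault adds further conventions (hash keys, hash diffs and so on); treat this as an illustration of the pattern rather than a full standard.

```sql
-- Hub: one row per unique business key, plus load metadata
CREATE TABLE dbo.HubCustomer (
    CustomerHubKey      BIGINT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    CustomerBusinessKey NVARCHAR(50)  NOT NULL UNIQUE,  -- the key as the business knows it
    LoadDate            DATETIME2     NOT NULL,
    RecordSource        NVARCHAR(100) NOT NULL
);

-- Link: a relationship between hubs (here, a customer placing an order)
CREATE TABLE dbo.LinkCustomerOrder (
    CustomerOrderLinkKey BIGINT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    CustomerHubKey       BIGINT        NOT NULL REFERENCES dbo.HubCustomer (CustomerHubKey),
    OrderHubKey          BIGINT        NOT NULL,         -- would reference a HubOrder table
    LoadDate             DATETIME2     NOT NULL,
    RecordSource         NVARCHAR(100) NOT NULL
);

-- Satellite: descriptive attributes and their change history, keyed on hub key + load date
CREATE TABLE dbo.SatCustomer (
    CustomerHubKey BIGINT        NOT NULL REFERENCES dbo.HubCustomer (CustomerHubKey),
    LoadDate       DATETIME2     NOT NULL,
    CustomerName   NVARCHAR(200) NULL,
    Country        NVARCHAR(100) NULL,
    RecordSource   NVARCHAR(100) NOT NULL,
    CONSTRAINT PK_SatCustomer PRIMARY KEY (CustomerHubKey, LoadDate)
);
```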

 

The result is an Entity Relationship Diagram (ERD), which consists of Hubs, Links and Satellites. Once I’d settled on this methodology, I needed to hunt around for something to use.

How do you go about designing and using an ERD tool for a Data Vault? I found a few options. For the enterprise, there is WhereScape® Data Vault Express. That looked like a good option, but I had hoped to use something open source so that other people across the team could adopt it. It wasn't clear how much it would cost, and, in general, if I have to ask then I can't afford it! So far, I've settled on SQL Power Architect so that I can get the 'visuals' across to the customer and the rest of the technical team, including my technical counterpart at the customer who picks up when I'm at a conference. This week I'm at the Data and BI Summit in Dublin, so my counterpart is picking up activities during the day, and we are touching base during our virtual stand-ups.

So, I'm still joining the dots as I go along.

If you're interested in getting started with Azure Data Lake, I hope that this gives you some pointers on the design process.

I'll go into more detail in future blog posts, but I need to stop writing this one and do some work!

Dynamic Data Masking in Azure SQL Data Warehouse

I'm leading a project which is using Azure SQL Data Warehouse, and I'm pretty excited to be involved. I love watching the data take shape, and, for the customer requirements, Azure SQL Data Warehouse is perfect.

Note that my customer details are confidential and that's why I never give details away such as the customer name and so on. I gain – and retain – my customers based on trust, and, by giving me their data, they are entrusting me with detailed information about their business.

One question they raised was with respect to dynamic data masking, which is present in Azure SQL Database. How does it manifest itself in Azure SQL Data Warehouse? What are the options for managing personally identifiable information?


As we move ever closer to the implementation of GDPR, more and more people will be asking these questions. With that in mind, I did some research and found that there are a number of options, which are listed below. Thank you to the Microsoft people who helped me to pull them together.

1. Create an Azure SQL Database spoke as part of a hub and spoke architecture.

The Azure SQL Database spoke can create external tables over Azure SQL Data Warehouse tables in order to move data into the spoke. One note of warning: it isn't possible to use DDM over an external table, so the data would have to be moved into the Azure SQL Database spoke first.
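
For reference, once the data has landed in the Azure SQL Database spoke, dynamic data masking can be applied there with the built-in masking functions. The table, column and role names below are hypothetical:

```sql
-- Mask columns on the spoke table once the data has been moved out of the external table
ALTER TABLE dbo.Customer
    ALTER COLUMN Email ADD MASKED WITH (FUNCTION = 'email()');

ALTER TABLE dbo.Customer
    ALTER COLUMN PhoneNumber ADD MASKED WITH (FUNCTION = 'partial(0,"XXX-XXX-",4)');

-- Users without the UNMASK permission see masked values; trusted roles can be exempted
GRANT UNMASK TO DataGovernanceRole;
```
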
2. Embed masking logic in views and restrict access.

This is achievable but it is a manual process.
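
As a sketch of that manual approach, the view below exposes only masked or non-sensitive columns, and access is granted on the view rather than the base table. The table, column and role names are made up for illustration, and the string columns are assumed to be NVARCHAR:

```sql
-- A hand-rolled masking view; users query the view, not the base table
CREATE VIEW dbo.vCustomerMasked
AS
SELECT
    CustomerKey,
    LEFT(Email, 2) + '*****' AS EmailMasked,           -- crude manual mask
    'XXX-XXX-' + RIGHT(PhoneNumber, 4) AS PhoneMasked,
    City,
    Country
FROM dbo.Customer;
GO

DENY SELECT ON dbo.Customer TO ReportingRole;
GRANT SELECT ON dbo.vCustomerMasked TO ReportingRole;
```
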
3. Mask the data through the ETL processes creating a second, masked, column.

This depends on the need to query the data. Here, you may need to limit access through stored procs.
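
A minimal sketch of the ETL approach, again with hypothetical names: the load process adds and populates a second, masked column, and access to the unmasked column can then be restricted (for example through stored procedures):

```sql
-- Add a masked companion column and populate it during the ETL load
ALTER TABLE dbo.Customer ADD EmailMasked NVARCHAR(256) NULL;

UPDATE dbo.Customer
SET EmailMasked = LEFT(Email, 2) + '*****@*****';
```
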
On balance, the simplest method overall is to use views to restrict access to certain columns. That said, I am holding a workshop with the customer in the near future in order to work through their preferred option. In the meantime, I thought that this might help someone else; I hope that you find something here that will help you to manage your particular scenario.

Jen’s Diary: Why are PASS doing Business Analytics at all?

As always, I don't speak for PASS. This is a braindump from the heart. I realise that we haven't communicated about BA as much as some members might like. It's a hard balance – I don't want to spam people, but I don't want to communicate too little, either. If you want to sign up for PASS BA news, here's the link. So I have to apologise here, and hold my hands up for that one. I'll endeavour to ensure we have a better BA communications plan in place, and I'm meeting the team on Friday to discuss how we can make that happen.

In the meantime, I’d like to blog about BA today. How did we get here, and where are we going? Why are PASS interested in Business Analytics at all? To answer this question, let’s look at the history of Business Intelligence, what Business Analytics means, and how PASS can be part of the story. Let’s start with the history lesson. What are the stages of Business Intelligence?

First generation Business Intelligence – this was the world of corporate Business Intelligence. You'll know this by the phrase 'the single source of truth'. This was a very technical discipline, focused on the data warehouse. It was dominated by Kimball methodology, or Inmon methodology, depending on the business requirement. However, the business got lost in all this somewhere, and they reverted to the default position of using Excel as a tool to work with Excel exports, and subverting the IT departments by storing data in email. Microsoft did – and still do – cater for the first generation of business intelligence. It has diversified into new cloud products, of course, but SQL Server still rocks. You'll have seen that Gartner identified SQL Server as the number one RDBMS for 2015. Kudos to the team! For an overview, the Computer Weekly article is interesting.

Second generation Business Intelligence – the industry pivoted to bring the Business back into Business Intelligence. You'll know this by the phrase 'self-service business intelligence'. Here, the business user was serviced with clean data sources that they could mash and merge together, and they were empowered to connect to these sources. In the Microsoft sphere, this involved a proliferation of tabular models and PowerPivot, as well as continued use of Analysis Services multidimensional models. As before, Excel remained the default position for working with data. PASS Summit 2015 has a lot of content in both of these areas.

So far, so good. PASS serves a community need by offering high quality, community education on all of these technologies. Sorted, right?

Wrong. The world of data keeps moving. Let’s look at the projected growth of Big Data by Forbes.

Well, the world of business intelligence isn't over yet; we now have business analytics on the horizon, and the world of data is changing fast. We need to keep up! But what do we do with all this data? This is the realm of Business Analytics, so why is it different from BI? The value of business analytics lies in its ability to deliver better outcomes. It's a different perspective: note that in our first generation and second generation BI times, technology was at the forefront of the discussion. In business analytics, we talk about organizational change, enabled by technology. In this sphere, we have to quantify and communicate value as the outcome, not the technology as a means to get there. So what comes next?

Third generation Business Intelligence – self-service analytics. Data visualisation software was at the forefront of second generation Business Intelligence, and it took priority. Here, the position is that businesses understand they need data visualisation technologies as well as analytical tools, so that they can use the data for different purposes.

How is Business Analytics an extension of Business Intelligence? Let’s look at some basic business questions, and see how they fall as BI or BA. Images belong to Gartner so all kudos and copyright to the team over there.

What happened?

If the promise of business intelligence is to be believed, then we have our clean data sources, and we can describe the current state of the business. Gartner call this descriptive analytics, and it answers the question: what happened? This level is our bread-and-butter business intelligence, with an emphasis on the time frame up to the current point in time.

Why did it happen?

We can also understand, to a degree, why we are where we are. This is called diagnostic analytics, and it can help pinpoint issues in the organisation. Business Intelligence is a great domain for understanding the organisation up to this point in time. However, it's a rearview impression of the data. What happens next? Now, we start to get into the remit of Business Analytics:

What will happen?

Businesses want to know what will happen next. Gartner call this predictive analytics, and it comes into play when we want to look for predictive patterns in the data. Once we understand what will happen next, what is the next question?

How can we make this happen?

This is the power of prescriptive analytics; it tells us what we should do, and it is the holy grail of analytics. It uses business intelligence data in order to understand the right path to take, and it builds on the other types of analytics.

Business Intelligence and Business Analytics are a continuum. Analytics is focused more on the forward motion of the data, and on value. People talk about ROI, TCO, and making good business decisions based on strong data. First generation and second generation BI are not going away; a cursory look around a lot of organisations will tell you that. The third generation, however, is where organisations start to struggle a bit. PASS can help folks navigate their way towards this new generation of data in the 21st century.

How do we measure value? It is not just about storing the data, protecting it and securing it. These DBA functions are extremely valuable and the business would not function without them – full stop. So how do we take this data and use it as a way of moving the organisation forward? We can work with the existing data to improve it, and understand and produce the right measures of return, profiling, or other benefits such as teamwork. Further, analytics is multi-disciplinary. It straddles the organisation, and it has side effects that you can't see immediately. This is 'long term vision', not 'operational, reactive, here-and-now'. Analytics can effect change within the organisation, as the process of doing analytics itself means that the organization solves a business problem, which it then seeks to re-apply across different silos within the organization.

SQL Server, on the other hand, is a technology. It is an on-premises relational database technology, which is aimed at a very specific task. This is a different, technology-based perspective. The perspectives in data are changing, as this Gartner illustration (taken from here) shows:

Why do we need a separate event? We need to meet different people’s attitudes towards data. DBAs have a great attitude; protect, cherish, secure data. BAs also have a great attitude: use, mix, apply learnings from data. You could see BA as a ‘special interest group’ which offers people a different choice. There may not be enough of this material for them at PASS Summit, so they get their own event. If someone wants to go ahead and have a PASS SQLSaturday event which is ‘special interest’ and focuses solely on, say, performance or disaster recovery, for example, then I don’t personally have a problem with that.  I’d let them rock on with it. It might bring in new members, and it offers a more niche offering to people who may or may not attend PASS because they don’t feel that there’s enough specialised, in depth, hard-core down-to-the-metal disaster recovery material in there for them. Business Analytics is the same, by analogy. Hundreds and hundreds of people attended my 3 hour session on R last year; so there is an interest. I see the BA event as a ‘little sister’ to the PASS ‘big brother’ – related, but not quite the same.

Why Analytics in particular? It's about PASS growth. Growing can be painful, and you take a risk. However, I want to be sure that PASS is still growing to meet the future needs of the members, as well as attracting new members to the fold. The footfall we see at PASS BA, plus our industry-recognised expert speakers, tells us that we are growing in the right direction. Let's take a look at our keynote speaker, Jer Thorpe, who has done work with NASA and the MOMA in New York; he was data artist in residence at the New York Times, and he has now set up The Office for Creative Research and is an adjunct professor at ITP. Last year, we had Mico Yuk, who is the author of Dataviz for Dummies, as well as heading up her own consultancy team over at BI Brainz. They are industry experts in their own right, and I'm delighted to add them as part of our growing PASS family who love data.

The PASS BA event also addresses the issue of new and emerging data leaders. How do you help drive your organisation towards becoming a data-oriented organisation? This means that you talk a new language: we talk about new criteria for measuring value, working out return on investment, cross-department communication, and the communication of ideas and conclusions to people throughout the organisation, even to C-level executives. PASS BA is also looking at the career trajectories of these people as well as DBA-oriented folks, and PASS BA is out there putting the 'Professional' aspect into the event. We have a separate track, Communicate and Lead, which is all about data leadership and professional development. A whole track – the little sister is smartly bringing the Professional back, folks, and it's part of our hallmark.

PASS is part of this story of data in the 21st Century. The ‘little sister’ still adds value to the bigger PASS membership, and is an area of growth for the family of PASS.

If you have any questions, I'm at jen.stirrup@sqlpass.org, or please do come to the Board Q&A and ask them there. If you can't make it, tweet me at jenstirrup and I'll see if I can catch your questions during the Q&A.

Partitioning Summary: SQL Server 2005 vs SQL Server 2008

Sometimes, I hear refusals to upgrade from SQL Server 2005 to SQL Server 2008 on the grounds that '2005 is working fine, so we don't need to upgrade – I'll keep this ticking over until it fails'. That might be suitable for some environments, but what happens in the case where the database is growing quickly? A bloated fact table, for example? The problems of a large fact table are the worst type: they are obvious to end users, since it becomes unwieldy to load and unload data and the queries just take too long. In today's world of instant results, people need answers asap.

It is worth considering partitioning as a measure to help navigate your way through this issue, and this post aims to provide a brief summary of the benefits. First introduced in SQL Server 2005 Enterprise Edition (not Standard), partitioning can help to make large tables quicker and easier to access. Essentially, table partitioning is a design feature that helps to ensure that large tables are navigated, loaded and maintained in a way that is optimised for performance whilst being easier to administer. Candidate tables for partitioning are those that contain a lot of data, are not performing well, or are becoming difficult to maintain.

In SQL Server, partitioning a large table means dividing the table and its indexes into smaller segments. This means that administrative operations can be applied to a smaller partition, rather than to the entire table. Further, the query optimizer can direct queries to only the relevant partitions of the table (partition elimination). To summarise, this means that partitioning can really help to optimise the table!

Table partitioning is horizontal – imagine a striped table, with groups of rows associated with each stripe, or partition. These stripes, or partitions, can be spread across filegroups, and each partition can be managed separately. For example, data can be switched out into a staging table, taken offline for dedicated maintenance, and then switched back into the partitioned table – how good is that?
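
As a sketch of those moving parts, here is a hedged T-SQL example with hypothetical names (FactSales, pfSalesByMonth, psSalesByMonth): a partition function defines the boundary values, a partition scheme maps the partitions to filegroups, the table is created on the scheme, and a staging table with a matching check constraint is used for switching a partition out and back in.

```sql
-- Partition function: monthly boundaries on an integer date key (YYYYMMDD)
CREATE PARTITION FUNCTION pfSalesByMonth (INT)
AS RANGE RIGHT FOR VALUES (20080101, 20080201, 20080301, 20080401);

-- Partition scheme: map every partition to a filegroup (all to PRIMARY here, for simplicity)
CREATE PARTITION SCHEME psSalesByMonth
AS PARTITION pfSalesByMonth ALL TO ([PRIMARY]);

-- Fact table created on the partition scheme
CREATE TABLE dbo.FactSales (
    DateKey     INT           NOT NULL,
    CustomerKey INT           NOT NULL,
    SalesAmount DECIMAL(18,2) NOT NULL
) ON psSalesByMonth (DateKey);

-- Staging table with the same structure, on the same filegroup, constrained to one partition's range
CREATE TABLE dbo.FactSales_Staging (
    DateKey     INT           NOT NULL
        CONSTRAINT CK_FactSales_Staging_DateKey CHECK (DateKey >= 20080101 AND DateKey < 20080201),
    CustomerKey INT           NOT NULL,
    SalesAmount DECIMAL(18,2) NOT NULL
) ON [PRIMARY];

-- Switch partition 2 (January 2008) out for maintenance, then switch it back in
ALTER TABLE dbo.FactSales SWITCH PARTITION 2 TO dbo.FactSales_Staging;
-- ... dedicated maintenance on dbo.FactSales_Staging ...
ALTER TABLE dbo.FactSales_Staging SWITCH TO dbo.FactSales PARTITION 2;
```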

Back to the main point, though – why is partitioning better in SQL Server 2008? Here is a summary of reasons:

Locking Resolution – SQL Server 2008 partitioning works with the underlying engine to introduce an intermediate level of locking: locks can be escalated to the partition level first, rather than straight to the table level. This means that a heavy operation against one partition does not have to lock the entire table.
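
The partition-level escalation behaviour is something you opt into per table in SQL Server 2008; a one-line sketch, using the hypothetical FactSales table from above:

```sql
-- Allow lock escalation to stop at the partition (HoBT) level instead of going straight to the table
ALTER TABLE dbo.FactSales SET (LOCK_ESCALATION = AUTO);
```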

Parallel Processing – SQL Server 2008 can use parallel processing across partitioned tables more effectively, rather than being restricted to one processor per partition. Thus, queries run against partitioned tables are optimised to run faster and more efficiently, using the processor resources available. More information can be found here. Partitioned queries can also be optimised by bitmap filters, which can be introduced dynamically by the SQL Server query optimiser. Note that the bitmap filtering is based on using a table that is partitioned on an integer column simulating a date.

Indexed views – in SQL Server 2008, indexed views can follow the partitioning scheme of their underlying tables. If the view is aligned correctly, it does not need to be dropped before a partition is switched out of the partitioned table – as was the case with SQL Server 2005, where the materialized data had to be dropped and created again at each switch.
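
Continuing the hypothetical FactSales/psSalesByMonth example from the earlier sketch, a partition-aligned indexed view is simply one whose clustered index is created on the same partition scheme as the base table:

```sql
-- Indexed view over the partitioned fact table
CREATE VIEW dbo.vSalesByDay
WITH SCHEMABINDING
AS
SELECT DateKey,
       SUM(SalesAmount) AS TotalSales,
       COUNT_BIG(*)     AS RowCnt   -- COUNT_BIG(*) is required for an indexed view with GROUP BY
FROM dbo.FactSales
GROUP BY DateKey;
GO

-- Creating the clustered index on the same partition scheme aligns the view with the table,
-- so partitions can be switched without dropping the view first
CREATE UNIQUE CLUSTERED INDEX IX_vSalesByDay
ON dbo.vSalesByDay (DateKey)
ON psSalesByMonth (DateKey);
```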

To summarise, moving from SQL Server 2005 to 2008 is worth considering, in order to take advantage of the new partitioning features. However, it’s always worth ensuring that it’s being implemented properly and for the right reasons, so it’s worthwhile doing your research first. Brent Ozar wrote a great blog on this a while ago, and it’s worth a look.
I hope that helps!

Indexing: Fact and Dimension guidelines

Indexing is part of the art of optimising the database structure. However, there are no real 'hard and fast' rules, since it depends, ultimately, on query use. Here are some suggestions:

Dimension tables

  • Dimension tables should have a clustered primary key index. This assumes that there is a surrogate primary key as per Kimball’s advice.
  • Dimension tables may have an index on the natural business key.
  • Dimension tables may also have a single-column index on the foreign keys.
  • Dimension tables may have non-key indexes added to support hierarchical relationships.
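
As a sketch of those guidelines against a hypothetical DimCustomer table (all names are illustrative only):

```sql
-- Clustered primary key on the surrogate key
CREATE TABLE dbo.DimCustomer (
    CustomerKey         INT IDENTITY(1,1) NOT NULL,
    CustomerBusinessKey NVARCHAR(50)      NOT NULL,  -- natural/business key from the source system
    CustomerName        NVARCHAR(200)     NOT NULL,
    Region              NVARCHAR(100)     NULL,      -- hierarchical attribute
    CONSTRAINT PK_DimCustomer PRIMARY KEY CLUSTERED (CustomerKey)
);

-- Optional non-clustered index on the natural business key (helps surrogate key lookups)
CREATE NONCLUSTERED INDEX IX_DimCustomer_BusinessKey
ON dbo.DimCustomer (CustomerBusinessKey);

-- Optional non-key index to support a hierarchical relationship
CREATE NONCLUSTERED INDEX IX_DimCustomer_Region
ON dbo.DimCustomer (Region);
```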

Caveats

It is important to gauge the size of the dimension, and its performance, when adding indexes on the natural business key and the foreign keys. It's a question of size: if the dimensions are small, the extra indexes may not be worthwhile; either way, it is important to look at the query plan for optimisations.

Fact Tables

  • If the fact table is partitioned, add a single-column index on the date key that the table is partitioned on.
  • If the fact table is not partitioned, the date key is still a good place to start when adding a single-column index.
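
The corresponding sketch for a hypothetical FactSales table (if the table is partitioned on DateKey, the index can be created on the same partition scheme so that it stays aligned):

```sql
-- Single-column index on the date key of the fact table
CREATE NONCLUSTERED INDEX IX_FactSales_DateKey
ON dbo.FactSales (DateKey);
```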

Fact Tables and Dimension Tables together

You can add a foreign-key reference between the fact tables and the dimensions; this will help to maintain referential integrity within the database itself. However, it can be expensive in terms of processing, since the constraint is checked every time the ETL load inserts a row into the fact table. If your fact table is large, and there is a substantial amount of data to be inserted, then this could take a long time. In the fact/dimension world, this check is effectively already done when the dimension key lookup takes place during the fact load – so you may not need to repeat it.
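
If you do decide to declare the relationship, one compromise is to disable the constraint for the duration of a large load and re-enable it afterwards. A hedged sketch, reusing the hypothetical FactSales and DimCustomer names from above:

```sql
-- Declare the relationship between the fact table and the dimension
ALTER TABLE dbo.FactSales
ADD CONSTRAINT FK_FactSales_DimCustomer
    FOREIGN KEY (CustomerKey) REFERENCES dbo.DimCustomer (CustomerKey);

-- If the per-row check is too expensive during a large ETL load,
-- disable the constraint for the load and re-validate it afterwards
ALTER TABLE dbo.FactSales NOCHECK CONSTRAINT FK_FactSales_DimCustomer;
-- ... bulk insert into dbo.FactSales ...
ALTER TABLE dbo.FactSales WITH CHECK CHECK CONSTRAINT FK_FactSales_DimCustomer;
```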

Recommendation: Testing and more testing! Evaluate your query plan and check out the INSERT part in particular.

Hope that helps someone!


How is it best to handle unaudited Fact Table Updates and Deletes?

It is more difficult to conduct updates and deletes on a fact table than to simply insert new rows. If the source system handles the updates and deletes, then this is the preferable option, since the mechanism is already in place. Unfortunately, however, this is not always the case. How is it possible to handle unaudited updates and deletes?
Updates
If the changed rows are flagged with a 'Modified' date column, then we can easily identify the rows that have been changed. If this isn't the case, how do we go about identifying which rows have changed? There are a few options:
Truncate and re-load the fact table at every load – this isn’t a good suggestion because some fact tables may be very frequently updated. This would not provide a good service to users, who rely on the information from the fact table.
Ask your friendly DBA to look at helping you, e.g. with triggers, replication, or adding a 'Modified' date to the source transaction data – this is a better idea. It is better to push the issue 'upstream' so that the data can be loaded into the new data warehouse as simply as possible. This also allows the opportunity for re-use, which is always good. If you'd like to know more about using replication to find unaudited changes in the fact table, please see Duncan Sutcliffe's excellent post here.
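
Where a 'Modified' date does exist on the source, the incremental detection itself is straightforward. A minimal sketch, assuming hypothetical source and watermark tables:

```sql
-- Pick up only the source rows changed since the last successful load
DECLARE @LastLoadDate DATETIME2 =
    (SELECT MAX(LastLoadDate)
     FROM etl.LoadWatermark
     WHERE TableName = 'SalesTransactions');

SELECT s.*
FROM src.SalesTransactions AS s
WHERE s.ModifiedDate > @LastLoadDate;
```
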
Deletes
It’s better to distinguish between ‘soft’ deletes and ‘hard’ deletes.
• ‘Soft’ deletes means that the row is not actually deleted, but simply flagged as deleted.
• ‘Hard’ deletes – it’s gone!
Soft deletes are obviously better; an additional column can be added as a flag, and the flag is set to denote whether the record is deleted or not.
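
A minimal sketch of the soft-delete flag, with hypothetical table and column names:

```sql
-- Add a soft-delete flag to the fact table
ALTER TABLE dbo.FactOrders ADD IsDeleted BIT NOT NULL DEFAULT (0);

-- Rows that have disappeared from the source are flagged rather than physically removed
UPDATE f
SET    f.IsDeleted = 1
FROM   dbo.FactOrders AS f
WHERE  NOT EXISTS (SELECT 1 FROM src.Orders AS s WHERE s.OrderId = f.OrderId);
```
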
The Kimball methodology argues that updates and deletes should be handled as transactions. However, that isn’t possible in every case, and it’s probably better to apply this strategy for incremental data such as sales. Let’s take a non-incremental example, where a company has phone record data. The phone record may hold information initially such as the start time of the call. However, any ‘on hold’ time or ‘end time’ associated with the call cannot be recorded until the call ends. Thus, the update needs to be done ‘in situ’ or there will be lots of additional rows with empty columns, all for one call.
Although the structure can be difficult to decide, Kimball usually has some good words of wisdom; the trick is knowing when to apply it to your particular circumstance and user requirements, and I do recommend reading Kimball as often as possible!
