Dynamic Data Masking in Azure SQL Datawarehouse

I’m leading a project which is using Azure SQL Datawarehouse, and I’m pretty excited to be involved.  I love watching the data take shape, and, for the customer requirements, Azure SQL Datawarehouse is perfect.

secret-3037639_640 Note that my customer details are confidential and that’s why I never give details away such as the customer name and so on. I gain – and retain – my customers based on trust, and, by giving me their data, they are entrusting me with detailed information about their business.

One question they raised was in respect to dynamic data masking, which is present in Azure SQL Database. How does it manifest itself in Azure SQL Datawarehouse? What are the options regarding the management of personally identifiable information?


As we move ever closer to the implementation of GDPR, more and more people will be asking these questions. With that in mind, I did some research and found there are a number of options, which are listed here. Thank you to the Microsoft people who helped me to come up with some options.

1. Create an Azure SQL Database spoke as part of a hub and spoke architecture.

The Azure SQL Database spoke can create external tables over Azure SQL Datawarehouse tables for moving data into Azure SQL Database to move data into the spoke. One note of warning: It isn’t possible to use DDM over an external table, so the data would have to move into Azure SQL Database.
2. Embed masking logic in views and restrict access.

This is achievable but it is a manual process.
3. Mask the data through the ETL processes creating a second, masked, column.

This depends on the need to query the data. Here, you may need to limit access through stored procs.
On balance, the simplest method overall is to use views to restrict access to certain columns. That said, I an holding a workshop with the customer in the near future in order to see their preferred options. However, I thought that this might help someone else, in the meantime. I hope that you find something that will help you to manage your particular scenario.

Jen’s Diary: Why are PASS doing Business Analytics at all?

As always, I don’t speak for PASS. This is a braindump from the heart. I realise that we haven’t communicated about BA as much as some members might like. It’s a hard balance – I don’t want to spam people, and I don’t want to get it too light, either. If you want to sign up for PASS BA news, here’s the link. So I have to apologise here, and hold my hands up for that one. I’ll endeavour to ensure we have a better BA communications plan in place, and i’m meeting the team on Friday to discuss how we can make that happen.

In the meantime, I’d like to blog about BA today. How did we get here, and where are we going? Why are PASS interested in Business Analytics at all? To answer this question, let’s look at the history of Business Intelligence, what Business Analytics means, and how PASS can be part of the story. Let’s start with the history lesson. What are the stages of Business Intelligence?

First generation Business Intelligence – this was the world of corporate Business Intelligence. You’ll know this by the phrase ‘the single source of truth’. This was a very technical discipline, focused on the data warehouse. It was dominated by Kimball methodology, or Imon methodology, dependent on the business requirement. However, the business got lost in all this somewhere, and they reverted to the default position of using Excel as a tool to work with Excel exports, and subverting the IT departments by storing data in email. Microsoft did – and still do – cater for the first generation of business intelligence. It has diversified into new cloud products, of course, but SQL Server still rocks. You’ll have seen that Gartner identified SQL Server as the number one RDBMS for 2015. Kudos to the team! For an overview, the Computer Weekly article is interesting.

Second generation Business Intelligence – the industry pivoted to bring the Business back into Business Intelligence. You’ll know this by the phrase ‘self-service business intelligence’. Here, the business user was serviced with clean data sources that they could mash and merge together, and they were empowered to connect to these sources. In the Microsoft sphere, this involved a proliferation of tabular models, PowerPivot as well as continued use of analysis services multidimensional models. As before, Excel remained the default position for working with data. PASS Summit 2015 has a lot of content in both of these areas.

So far, so good. PASS serves a community need by offering high quality, community education on all of these technologies. Sorted, right?

Wrong. The world of data keeps moving. Let’s look at the projected growth of Big Data by Forbes.

Well, the world of business intelligence isn’t over yet; we now have business analytics on the horizon and the world of data is changing fast. We need to keep up! But what do we do with all this data? This is the realm of Business Analytics, and why is it different from BI? The value of business analytics lies in its ability to deliver better outcomes. It’s a different perspective. Note from our first generation and our second generation BI times, technology was at the forefront of the discussion. In business analytics, we talk about organizational change, enabled by technology. In this sphere, we have to quantify and communicate value as the outcome, not the technology as a means to get there. So what comes next?

Third generation of business intelligence – self-service analytics. Data visualisation software has been at the forefront of second generation Business Intelligence, and it has taken a priority. Here, the position is taken that businesses will understand that they need data visualisation technologies as well as analytical tools, to use the data for different purposes.

How is Business Analytics an extension of Business Intelligence? Let’s look at some basic business questions, and see how they fall as BI or BA. Images belong to Gartner so all kudos and copyright to the team over there.

What happened?

If the promise of business intelligence is to be believed, then we have our clean data sources, and we can describe the current state of the business. Gartner call this descriptive analytics, and it answers the question: What happened? This level is our bread-and-butter business intelligence, with an emphasis on the time frame until this current point in time.

Why did it happen?

We can also understand, to a degree, why we are where we are. This is called diagnostic analytics, and it can help pinpoint issues in the organisation. Business Intelligence is a great domain for understanding the organisation until this point in time. However, it’s a rearview impressio of the data. What happens next? Now, we start to get into the remit of Business Analytics:

What will happen?

Businesses want to know what will happen next. Gartner call this predictive analytics, and this perception occurs when we want to try and look for predictive patterns in the data. Once we understand what will happen next, what is the next question?

How can we make this happen?

This is the power of prescriptive analytics; it tells us what we should do, and it is the holy grail of analytics. It uses business intelligence data in order to understand the right path to take, and it builds on the other types of analytics.

Business Intelligence and Business Analytics are a continuum. Analytics is focused more on a forward motion of the data, and a focus on value. People talk about ROI, TCO, making good business decisions based on strong data. First generation and second generation are not going away. A cursory look around a lot of organisations will tell you that. The Third Generation, however, is where organisations start to struggle a bit. PASS can help folks navigate their way towards this new generation of data in the 21st century.

How do we measure value? It is not just about storing the data, protecting it and securing it. These DBA functions are extremely valuable and the business would not function without them – full stop.  So how do we take this data and use it as a way of moving the organisation? We can work with the existing data to improve it; understand and produce the right measures of return, profiling, or other benefits such as team work. Further, analytics is multi-disciplinary. It straddles the organisation, and it has side effects that you can’t see, immediately. This is ‘long term vision’ not ‘operational, reactive, here-and-now’. Analytics can effect change within the organisation, as the process of doing analytics itself means that the organization solves a business problem, which it then seeks to re-apply across different silos within the organization.

SQL Server, on the other hand, is a technology. It is an on-premise relational database technology, which is aimed at a very specific task. This is a different, technologically based perspective. The perspectives in data are changing, as this Gartner illustration taken from here shows:

Why do we need a separate event? We need to meet different people’s attitudes towards data. DBAs have a great attitude; protect, cherish, secure data. BAs also have a great attitude: use, mix, apply learnings from data. You could see BA as a ‘special interest group’ which offers people a different choice. There may not be enough of this material for them at PASS Summit, so they get their own event. If someone wants to go ahead and have a PASS SQLSaturday event which is ‘special interest’ and focuses solely on, say, performance or disaster recovery, for example, then I don’t personally have a problem with that.  I’d let them rock on with it. It might bring in new members, and it offers a more niche offering to people who may or may not attend PASS because they don’t feel that there’s enough specialised, in depth, hard-core down-to-the-metal disaster recovery material in there for them. Business Analytics is the same, by analogy. Hundreds and hundreds of people attended my 3 hour session on R last year; so there is an interest. I see the BA event as a ‘little sister’ to the PASS ‘big brother’ – related, but not quite the same.

Why Analytics in particular? It’s about PASS growth. To grow, it can be painful, and you take a risk. However, I want to be sure that PASS is still growing to meet future needs of the members, as well as attracting new members to the fold However, the feetfall we see at PASS BA, plus our industry-recognised expert speakers, tell us that we are growing in the right direction. Let’s take a look at our keynote speaker, Jer Thorpe, has done work with NASA, the MOMA in New York, he was Data artist in residence at the New York Times and he’s now set up. The Office for Creative Research & adjunct professor at ITP. Last year, we had Mico Yuk, who is author of Dataviz for Dummies, as well as heading up her own consultancy team over at BI Brainz. They are industry experts in their own right, and I’m delighted to add them as part of our growing PASS family who love data.

The PASS BA event also addresses the issue of new and emerging data leaders. How do you help drive your organisation towards becoming a data-oriented organisation? This means that you talk a new language; we talk about new criteria for measuring value, working out return on investment, cross-department communication, and communication of ideas, conclusions to people throughout the organisation, even at the C-level executives. PASS BA is also looking at the career trajectories of these people as well as DBA-oriented folks, and PASS BA is out there putting the ‘Professional’ aspect into the event. We have a separate track, Communicate and Lead, which is all about data leadership and professional development. A whole track – the little sister is smartly bringing the Professional back, folks, and it’s part of our hallmark.

PASS is part of this story of data in the 21st Century. The ‘little sister’ still adds value to the bigger PASS membership, and is an area of growth for the family of PASS.

Any questions, I’m at jen.stirrup@sqlpass.org or please do come to the Board Q&A and ask questions there. If you can’t make it, tweet me at jenstirrup and I’ll see if I can catch them during the Q&A.

Partitioning Summary: SQL Server 2005 vs SQL Server 2008

Sometimes, I hear some refusals to upgrade from SQL Server 2005 to SQL Server 2008 on the grounds that ‘2005 is working fine, so we don’t need to upgrade – I’ll keep this ticking over until it fails’.That might be suitable for some environments, but what happens in the case where the database is growing quickly? A bloated fact table, for example? The problems of a large fact table are the worst type: they are obvious to end users since it becomes unwieldy to load and unload data, and the queries just take too long. In today’s world of instant results, people need answers asap.

It is worth considering partitioning as a measure to help navigate your way through this issue, and this post aims to provide a brief summary of the benefits. Firstly introduced in SQL Server Enterprise Edition 2005 (not Standard), partitioning can help to make large tables quicker and easier to access. Essentially, table partitioning is a design feature that assists in ensuring that large tables are navigated, loaded and matured in a way that is optimised for performance whilst being easier to administer. Candidate tables for partitioning are those which contain lots of data, or the table is not performing, or it is becoming difficult to maintain.

In SQL Server, partitioning a large table means dividing the table and its indexes into smaller segments. This means that administrative operations can be applied on the smaller partition, rather than on the entire table. Further, the query optimizer can mobilize the query to navigate appropriate partitions of the table. To summarise, this means that partitioning can really help to optimize the table!

Table partitioning is horizontal – imagine a striped table, with groups of rows associated with each stripe – or partition. These strips, or partitions, can be spread across filegroups. Partitioning can be managed separately. For example, data sets can be split data into staging areas, taken offline for dedicated maintenance, and then re-added to partitioned tables – how good is that? Back to the main point, though – why is partitioning better in SQL Server 2008? Here is a summary of reasons:

Locking Resolution – SQL Server partitioning functionality works with the underlying SQL Server engine to introduce an intermediate stage of locking – locking is taken to the partition level, and then escalated the table level. This means that locks can be captured before they have a real impact.

Parallel Processing – SQL Server 2008 can use parallel processing for querying partitioned tables, rather than using just one processor for each partition. Thus, queries ran against partitioned tables are optimised to run faster and more efficiently, using the processor resources available. More information can be found here.  Also partitioned queries can be optimised by bitmap filters, which can be dynamically initiated by the SQL Server Query Optimiser. Note that the bitmap filtering is based on using a table that is partitioned on an integer, simulating a date.

Indexed views – in SQL Server 2008, can follow the partitioning scheme of the underlying tables. If it is aligned correctly, the indexed view does not need to be dropped before it is switched out of the partitioned table – as is the case with SQL Server 2005, where materialized data had to be dropped and created again at each switch. 

To summarise, moving from SQL Server 2005 to 2008 is worth considering, in order to take advantage of the new partitioning features. However, it’s always worth ensuring that it’s being implemented properly and for the right reasons, so it’s worthwhile doing your research first. Brent Ozar wrote a great blog on this a while ago, and it’s worth a look.
I hope that helps!

Indexing: Fact and Dimension guidelines

Indexing is part of the art of optimising the database structure. However, there are no real ‘hard and fast’ rules since it depends, ultimately, on query use.  Here are some suggestions below:

Dimension tables

  • Dimension tables should have a clustered primary key index. This assumes that there is a surrogate primary key as per Kimball’s advice.
  • Dimension tables may have an index on the natural business key.
  • Dimension tables may also have a single-index index on the foreign keys
  • Dimension tables may have non-key indexes added to hierarchical relationships.


It will be important to gauge the size of the dimension, and its performance when adding indexes on the natural business key, and the foreign keys. It’s a question of size. If the dimensions are small, it may not be worthwhile; on the flip side, it is important to look at the query plan for optimisations.

Fact Tables

  • If the fact tables are partitioned, add a single-column index to the date key on the partition.
  • If the fact tables are not partitioned, the date key is a good place to start to add a single-column index. 

Fact Tables and Dimension Tables together

You can add a foreign-key reference between the fact tables and the dimensions; this will help to maintain referential integrity within the database itself. However, this can be expensive in terms of processing, since this will be checked every time the ETL load inserts a row into the fact table. If your fact table is large, and there is a substantial amount of data to be inserted, then this could take a long time.  In the fact/dimension world, this check is already done when the lookup of the fact table takes place – so you may not need to repeat it.

Recommendation: Testing and more testing! Evaluate your query plan and check out the INSERT part in particular.

Hope that helps someone!

Add to Technorati Favorites

How is it best to handle unaudited Fact Table Updates and Deletes?

It is more difficult to conduct updates and deletes on a fact table, than to simply insert new rows. If the source system handles the updates and deletes, then this is the preferable option since it is already in place. Unfortunately, however, this is not always the case. How is it possible to handle unaudited updates and deletes?
If the changed rows are flagged with a ‘Modified’ date column, then this means we can more easily identify the rows that have been changed. If this isn’t the case, how is it possible to go about identifying which rows have been changed? There are a few options:
Truncate and re-load the fact table at every load – this isn’t a good suggestion because some fact tables may be very frequently updated. This would not provide a good service to users, who rely on the information from the fact table.
Ask your friendly DBA to look at helping you e.g. triggers, replication, or adding a ‘Modified’ date to the source transaction data – this is a better idea. It is better to push the issue ‘upstream’ so that the data is loaded as simply into the new data warehouse as possible. This also allows the opportunity for re-use, which is always good. If you’d like to know more about using replication to find unaudited changes in the fact table, please see Duncan Sutcliffe’s excellent post here.
It’s better to distinguish between ‘soft’ deletes and ‘hard’ deletes.
• ‘Soft’ deletes means that the row is not actually deleted, but simply flagged as deleted.
• ‘Hard’ deletes – it’s gone!
Soft deletes are obviously better; an additional column can be added as a flag, and the flag is set to denote whether the record is deleted or not.
The Kimball methodology argues that updates and deletes should be handled as transactions. However, that isn’t possible in every case, and it’s probably better to apply this strategy for incremental data such as sales. Let’s take a non-incremental example, where a company has phone record data. The phone record may hold information initially such as the start time of the call. However, any ‘on hold’ time or ‘end time’ associated with the call cannot be recorded until the call ends. Thus, the update needs to be done ‘in situ’ or there will be lots of additional rows with empty columns, all for one call.
Although the structure can be difficult to decide, Kimball usually has some good words of wisdom; the trick is knowing when to apply it to your particular circumstance and user requirements, and I do recommend reading Kimball as often as possible!

Add to Technorati Favorites

Greenfield Business Intelligence – where do we start?

Generally, the human condition means that we often don’t know when to start doing, and don’t know when to stop. I’m sure you can think of plenty of examples in life! This happens in business a lot; as an illustrative example, some managers might not be willing to stop a strategy which the evidence shows isn’t working…. conversely, they don’t know when to start a new strategy.

The human condition also applies to business intelligence, since it involves people. However, there is the added problem of not knowing how to start, or how to stop. Greenfield business intelligence projects might sound like a dream; completely new technology, completely new solution to design, build and test, and there’s no hangover from previous bad coding.

Greenfield business intelligence does give us another set of problems, however. The human condition means that users don’t know what they want. This may sound strange, but sometimes users can perhaps lack confidence to say what they think that they might need now or in the future. After all, who wants to take the blame when something goes wrong later on? User input and experience is extremely valuable in producing business intelligence solutions and it’s very important to include them early on. So, if you can’t get the user input easily, what can you do?

Greenfield business intelligence isn’t all about technology; there are people issues as well.  In these situations, the following advice might help:
  • As a recommended strategy, producing reports in an agile way can help users to provide feedback. This will increase their confidence in the system, in addition to feeling that their input is valuable. It wouldn’t hurt to drive home that their input is necessary to make the project a success. This means that report changes can be done in a kaikaku way – iterative, smaller changes that constitute a part of the overall whole.
  • Remember that Excel is your friend; users love Excel. Users eally appreciate being able to see something. An ideal way to do this is to produce a few mock-up reports in Excel with some made-up data. Users tend to give an almost visceral response, for example, ‘I really like that, I really need that!’ or ‘ugh, not what I wanted  at all!’ This can help focus your efforts in planning the data that goes into the data warehouse.
  • Use a ‘traffic light’ system to identify what you must have, what you should have, and what’s not necessary. In project management speak, this is sometimes called the MosCow method: Must have, Should have, Could have. This is easy for everyone to understand, and can help move the requirements to move forward quite quickly.
  • All feedback is useful; the negative and the positive. If people don’t like a report or need the data, then it’s good to be ‘lean’ and not include it at all. Why give yourself extra work to include data which isn’t necessary?
  • In Kimball’s books, this is a real emphasis on this strategy and the first few chapters are about the ways of talking to business users. If you haven’t read any Kimball, I can really recommend the wisdom you’ll find there.
To summarise, although getting user input can be difficult, there are some tools to use which can help to obtain it more easily. It’s better to try and get user input in the planning phase rather than after you’ve spent months working on a business intelligence solution that does not answer the business questions. If that happens, the users won’t accept the solution and it’s ultimate failure.

I’d love your comments and if you’ve any other advice, please do leave a comment. I’d love to learn from your experiences!


Add to Technorati Favorites

User Input in Business Intelligence projects

Recently, I read Simon Sabin’s blog on this importance of intelligence in BI, which prompted me to revisit the idea of user inclusiveness in the process. I originally posted a blog as part of the Microsoft ‘Industry Insiders‘ series a while ago, and as I continue on my BI journey, I’ve updated my thoughts on the importance of users in BI projects as a result of going through different project experiences.

It can often be easy to underestimate users, and I’ve even heard a perspective that users should be completely insulated from the process because they are not ‘technical’. However, users can sometimes surprise you when ‘water cooler’ conversations reveal interesting things about their business. I produced some geospatial mapping graphs recently using Tableau, and I must say that I was very pleased with the results! Being based in the UK, I had imported a series of UK postcodes along with their corresponding longitude and latitude for use in the graphical representation of the data. A ‘water cooler’ discussion revealed that I didn’t need to do this; in fact, a distinct work stream at a completely separate part of the business involved on delivering GIS software, so this gave me another opportunity to use the best possible data source for the display of the data.

Users can be very creative at generating the data that they need to do their everyday jobs, and it can be useful to harness this creativity to generate proper data warehouses. User creativity can be seen in situations where there are lots of Excel spreadsheets or Access books lurking around the business, individually maintained and updated by users in the absence of a central data source that meets their requirements. Additionally, this creativity can mean that they find workarounds in manipulating their systems to do what they need. When these situations arise, it is important to collude with users to ensure that all the necessary data is in the data warehouse, in the format that they need. Thus, the business teams more likely to adopt the solution, and the business benefits overall because the data is centrally stored, backed up, and has a proper lineage. Don’t get me wrong, I love Excel; but I don’t like the situation where there are lots of Excel spreadsheets floating around, unchecked, and not validated enough to be considered a proper authentic data source.

The initial blog grew out of a discussion I’d had, where a number of people were debating the need to have a business representative or business-focused Project Manager involved in the design stages of a data warehouse. My own view is that business users should have input at every stage of the development. User buy-in is absolutely critical to the success of a business intelligence project, since they are the ultimate consumers of the information. If they don’t like it, they won’t use it – it’s as simple as that! Users need to be involved right from the start. They can save you a lot of additional, unforeseen work since they will tell you directly if anything is missing from the warehouse. I came across an example of this, where the report consumers were not involved at all until the training. During training, the users noted with a great deal of disappointment that quite a few of their key criteria were simply missing from the warehouse. This led to discouragement and a loss of user confidence in the system. The project was ultimately leading to failure; since the users could obtain these key criteria outside of the warehouse, they saw no need to use the new warehouse.

My brief was to turn this situation around to ensure successful delivery of the project. The first step in doing so was ensuring that the users all had their say in the content of the data warehouse. Since these omissions had been discovered during the training phase, the project had been at a mature stage of development at that point of time. Thus, there was a considerable amount of re-work at every stage of the project to ensure that the warehouse contained the business users’ requirements to support their decision making processes. Needless to say, if the business users had been involved at the beginning, there would have been no need for the extensive re-work.
When building a data warehouse, it can be tempting to exclude users and to ultimately dictate a solution to them. This is particularly the case where users have never been exposed to a business intelligence solution and have a long journey ahead of them to understand the difference between a data warehouse, and a database. This situation is compounded by the fact that developers are not always the most gregarious people, and may possibly even be naturally disinclined to speak with customers.

I appreciate that it can be difficult to explain complex concepts, such as cubes, to business users. One customer told me that a ‘grey fog of confusion’ descended on his brain whenever he heard the word ‘cube’! However, I have found that using Microsoft BI products have helped business users to really see, understand and use their data have the ‘Aha!’ moment and to really understand. For example, showing business users the contents of a cube in Excel or in the Analysis Services browser supports their learning journey to the ‘Aha!’ moment.
The users’ journey to understanding and confidence can be supported by actively involving them at different stages. One useful way of doing this is to get the user’s input when creating cubes. I achieved this by listing out the possible hierarchies and attributes in an Excel spreadsheet, asking the business team to organise them to reflect their business structures. Since they knew the business well, they did this easily. When they saw the completed cube, it was a real confidence boost for them to see that they had had direct input into the creation of the cube and that it was a correct translation of the data into their business needs. The business users’ confidence levels were increased by the fact that they could interact with the cube in the comfort zone of Excel. Ultimately, involving the customer at all stages directly helped them to accept the system. Getting business users to accept, understand and interact with a business intelligence solution is often one of the hardest parts of a BI project. However, in my experience as a BI practitioner, the Microsoft BI stack of products makes this easier since it is directly aimed at this group of users who will ultimately determine the success – or otherwise – of the Business Intelligence project. What’s the point of delivering a technically perfect solution if nobody will use it or can understand it? Solutions like
Tableau, XLCubed are an excellent addition to existing solutions since they enhance the analysis of the underlying data, and so they’re also worth considering.