Dynamic Data Masking in Azure SQL Data Warehouse

I’m leading a project that uses Azure SQL Data Warehouse, and I’m pretty excited to be involved. I love watching the data take shape, and, for this customer’s requirements, Azure SQL Data Warehouse is perfect.

Note that my customer engagements are confidential, which is why I never give away details such as the customer’s name. I gain – and retain – my customers based on trust, and, by giving me their data, they are entrusting me with detailed information about their business.

One question they raised was with respect to dynamic data masking (DDM), which is present in Azure SQL Database. How does it manifest itself in Azure SQL Data Warehouse? What are the options for managing personally identifiable information?


As we move ever closer to the implementation of GDPR, more and more people will be asking these questions. With that in mind, I did some research and found there are a number of options, which are listed here. Thank you to the Microsoft people who helped me to come up with some options.

1. Create an Azure SQL Database spoke as part of a hub-and-spoke architecture.

The Azure SQL Database spoke can create external tables over Azure SQL Data Warehouse tables in order to move data into the spoke. One note of warning: it isn’t possible to use DDM over an external table, so the data would have to be physically moved into Azure SQL Database first.
2. Embed masking logic in views and restrict access.

This is achievable, but it is a manual process; a minimal sketch of this approach follows the list below.
3. Mask the data in the ETL process, creating a second, masked column.

Whether this works depends on how the data needs to be queried; you may also need to limit access through stored procedures.
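To make option 2 concrete, here is a minimal sketch of a masking view, created from Python via pyodbc. The server, credentials, the dbo.Customer table and its CustomerName and Email columns are all hypothetical placeholders, and the masking here is plain T-SQL string manipulation rather than DDM itself:

```python
import pyodbc

# Hypothetical connection details -- substitute your own server, database and login.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;"
    "DATABASE=mydw;UID=myuser;PWD=mypassword"
)
cursor = conn.cursor()

# A view that exposes only masked forms of the sensitive columns.
cursor.execute("""
CREATE VIEW dbo.vCustomerMasked AS
SELECT
    CustomerId,
    -- Keep the first character of the name and mask the rest.
    LEFT(CustomerName, 1) + REPLICATE('X', LEN(CustomerName) - 1) AS CustomerName,
    -- Mask everything before the @ in the email address.
    'XXXX' + SUBSTRING(Email, CHARINDEX('@', Email), LEN(Email)) AS Email
FROM dbo.Customer;
""")

# Grant access to the view only, never to the underlying table.
cursor.execute("GRANT SELECT ON dbo.vCustomerMasked TO ReportingUser;")
conn.commit()
conn.close()
```

Reporting users then query dbo.vCustomerMasked and never see the raw values; the manual part is keeping the view in step with the table as columns are added.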
On balance, the simplest method overall is to use views to restrict access to certain columns. That said, I am holding a workshop with the customer in the near future to establish their preferred option. In the meantime, I thought that this might help someone else; I hope that you find something here that helps you to manage your particular scenario.

How do you know if your org is ready for Data Science? Starting your journey with Azure Databricks

Mentioning data science at your company may give you an air of expertise, but actually implementing enterprise-wide data science, artificial intelligence or deep learning is a business-wide transformation activity. It impacts your data and analytics infrastructure, your engineering and business interactions, and even your organizational culture. In this post, we will look at a few high-level things to watch out for before you get started, along with a suggestion that you try Azure Databricks as a starting point for your cloud and data science journey.

Note: not all companies are ready for data science; many are still struggling with Excel. If that sounds familiar, this article is meant for you.

So how can you move forward?

1. Have a data amnesty

If you’re still struggling with Excel, then data science and AI can seem pretty far off. Have a data amnesty – ask everyone to identify their key data sources so you can back them up, protect them, and share them better where appropriate.

2. Determine the Data (Im)maturity in your organization.

Take a peep at the following table: where is your organization located?

[Table: Democratization of Data]

Note: this idea was inspired by Bernard Liautaud.

Ideally, you’re headed towards a data democracy, where IT are happy in their guardianship role and users have their data. If this equilibrium isn’t in place, then it could derail your budding data science project. Working on these issues can help to make your success sustainable in the longer term.

3. All that glitters isn’t data gold

This is the ‘shiny gadget’ syndrome. Don’t get distracted by the shiny stuff. You need your data vegetables before you can have your data candy.

Focus on the business problem you’re trying to solve, not the technology. You will need to think about the success criteria.

You should be using the technology to improve a business process, with clear goals and measurable success criteria. Otherwise the effort can be disorganized, with the technology lending it only a veneer of order.


4. Fail to plan, plan to fail

If you fail… that’s ok. You learned ten things you didn’t know before. Next time, plan better, scope better, do better.

How to get started?

Starting in the cloud is a great way to begin: it means that you’re not purchasing a lot of technology and hardware that you don’t need. Abraham Maslow was once quoted as saying, “If you only have a hammer, you tend to see every problem as a nail.” Those words are truer than ever, as an increasingly complex and interconnected world makes selecting the right tools for the data estate increasingly difficult. With that in mind, the remainder of this post talks about Azure Databricks as a first step into data science for the new organization, in order to reduce risk, initial outlay and costs.


What is Microsoft Azure Databricks?

Azure Databricks was designed by Microsoft in collaboration with the creators of Apache Spark. It is designed for easy data science: one-click set up, streamlined workflows and an interactive workspace that enables collaboration between data scientists, data engineers, and business analysts. Each of these roles has a different working style, which describes how those users want to interact with, present, and share information, bearing in mind the varying skillsets of business users and IT.

So what is Apache Spark? According to Databricks, Apache Spark is the largest open source project in data processing, and it has seen rapid adoption by enterprises across a wide range of industries.

So what does Apache Spark give you? Apache Spark is a fast, in-memory data processing engine. For the serious data science organization, it gives developers expressive APIs for working with data. Information and data workers gain the ability to execute streaming analytics, long-running machine learning or SQL workloads – fast. Implemented in Azure, it means that business users can use Power BI to understand their data better.

Apache Spark consists of Spark Core and a set of libraries. The core is the distributed execution engine, and the Java, Scala, and Python APIs offer a platform for distributed ETL application development. Spark lets you quickly write applications in Java, Scala, or Python, comes with a built-in set of over 80 high-level operators, and can be used interactively to query data within the shell.
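As an illustration of those high-level operators, here is a minimal PySpark sketch; the DataFrame and its columns are invented purely for the example:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# In an Azure Databricks notebook a SparkSession named `spark` already exists;
# this line makes the sketch runnable on a plain local installation too.
spark = SparkSession.builder.appName("operators-demo").getOrCreate()

# A small, invented DataFrame purely for illustration.
sales = spark.createDataFrame(
    [("North", "Widget", 120.0), ("South", "Widget", 80.0), ("North", "Gadget", 200.0)],
    ["region", "product", "amount"],
)

# A handful of the high-level operators chained together.
summary = (
    sales.filter(F.col("amount") > 50)
         .groupBy("region")
         .agg(F.sum("amount").alias("total"), F.count("*").alias("orders"))
         .orderBy(F.desc("total"))
)
summary.show()
```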

This Apache Spark functionality is incorporated into Azure Databricks.

In addition to Map and Reduce operations, it supports SQL queries, streaming data, machine learning and graph data processing. Developers can use these capabilities stand-alone or combine them in a single data pipeline.
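As a sketch of what combining two of those capabilities looks like, here is a SQL query feeding an MLlib clustering model in one pipeline; the readings table and its columns are invented for the example:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("pipeline-demo").getOrCreate()

# An invented sensor-readings table standing in for a real source.
readings = spark.createDataFrame(
    [(1, 2.0, 3.0), (2, 2.1, 2.9), (3, 9.8, 10.1), (4, 10.0, 9.9)],
    ["id", "x", "y"],
)
readings.createOrReplaceTempView("readings")

# Step 1: a plain SQL query...
filtered = spark.sql("SELECT x, y FROM readings WHERE x IS NOT NULL")

# Step 2: ...feeding an MLlib clustering model in the same pipeline.
features = VectorAssembler(inputCols=["x", "y"], outputCol="features").transform(filtered)
model = KMeans(k=2, seed=42).fit(features)
model.transform(features).show()
```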

It supports in-memory processing to boost the performance of big data analytics applications, and it works with other Azure data stores such as Azure SQL Data Warehouse, Azure Cosmos DB, Azure Data Lake Store, Azure Blob storage, and Azure Event Hubs.
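As a sketch of that integration, this is one common way to read a file from Azure Blob storage into a Spark DataFrame on Databricks, which ships with the wasbs:// connector; the storage account, container, key and file path are all placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("blob-demo").getOrCreate()

# Placeholder storage account, container and key -- substitute your own.
storage_account = "mystorageaccount"
container = "mycontainer"
spark.conf.set(
    f"fs.azure.account.key.{storage_account}.blob.core.windows.net",
    "<storage-account-key>",
)

# Read a CSV file straight out of Blob storage using the wasbs:// scheme.
df = spark.read.csv(
    f"wasbs://{container}@{storage_account}.blob.core.windows.net/data/sales.csv",
    header=True,
    inferSchema=True,
)
df.printSchema()
```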

What is so special about Apache Spark, anyway?

For the enterprise and data architects, it offers the opportunity to have everything in one place: streaming, ML libraries, sophisticated analytics and data visualization. It means that you can consolidate under one technological umbrella while keeping your data in other sources such as Azure SQL Data Warehouse, Azure Cosmos DB, Azure Data Lake Store, Azure Blob storage, and Azure Event Hubs.

As an architect, I aim to reduce points of failure and points of complexity, so it is the neatness of the final streamlined technology solution that is appealing.

It is also fast, and people want their data fast. Spark enables applications in Hadoop clusters to run up to 100x faster in memory, and up to 10x faster even when running on disk. Spark makes this possible by reducing the number of reads and writes to disk: it stores intermediate processing data in memory. It uses the concept of the Resilient Distributed Dataset (RDD), which allows it to store data transparently in memory and persist it to disk only when needed. This cuts out most of the disk reads and writes that are the main time-consuming factors in data processing.
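A minimal sketch of that caching behaviour: persist an RDD once and let subsequent actions reuse the in-memory partitions instead of recomputing them.

```python
from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.appName("rdd-cache-demo").getOrCreate()
sc = spark.sparkContext

# An RDD with a transformation we would rather not recompute.
numbers = sc.parallelize(range(1_000_000))
squares = numbers.map(lambda n: n * n)

# Keep the intermediate result in memory, spilling to disk only if it does not fit.
squares.persist(StorageLevel.MEMORY_AND_DISK)

# Both actions reuse the cached partitions instead of recomputing the map.
print(squares.sum())
print(squares.count())
```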


Data Visualization for the Business User with Azure Databricks as a basis

Azure Databricks brings collaborative documents that multiple users can edit in real time for data engineering and data science. It also enables dashboards, and integration with Power BI, for accurate, efficient and accessible data visualization across the business.

Azure Databricks is backed by Azure Database and other technologies that enable highly concurrent access, fast performance and geo-replication, along with Azure security mechanisms.

Summary

Implementing enterprise-wide data science, artificial intelligence or deep learning is a business-wide transformation activity. This post has suggested Azure Databricks as a great starting point for your cloud and data science journey, along with some advice on getting a good grounding before you start.