Modelling your Data in Azure Data Lake

One of my project roles at the moment (I have a few!) is architecting a major Azure implementation for a global brand. I’m also helping with the longer-term ‘vision’ of how that might shape up. I love this part of my job and I’m living my best life doing this piece; I love seeing a project take shape until the end users, whether they are business people or more strategic C-level stakeholders, get the benefit of the data. At Data Relish, I make your data work for different roles in organizations of every purse and every purpose, and I learn a lot from the variety of consulting pieces that I deliver.

If you’ve had even the slightest look at the Azure Portal, you will know that it has oodles of products that you can use to create an end-to-end solution. I selected Azure Data Lake for a number of reasons:

  • I have my eye on the Data Science ‘prize’ of doing advanced analytics later on, probably in Azure Databricks as well as Azure Data Lake. I want to make use of existing Apache Spark skills, and Azure Data Lake is a neat solution that will facilitate this.
  • I need a source that will cater for the shape of the data… or the lack of it…
  • I need a location where the data can be accessed globally, since the solution will be ingesting data from global locations.

In terms of tooling, there are always the Azure Data Lake Tools for Visual Studio; you can watch a video on this topic here. But how do you get started with the design approach? How do I go about designing solutions for Azure Data Lake? There are many different approaches, and I have been implementing Kimball methodologies for years.


With this particular situation, I will be using the Data Vault methodology. I know that there are different schools of thought, but I’ve learned from Dan Linstedt in particular, who has been very generous in sharing his expertise; you can find Dan’s website here. I have delivered this methodology previously for an organization with a turnover of billions of USD, and they are still using the system that I put in place; it was a particularly helpful approach for an acquisition scenario, for example.

Building a Data Vault starts with the modeling process, and this starts with a view of the existing data model of a transactional source system. The purpose of the Data Vault modeling lifecycle is to deliver solutions to the business faster, at lower cost, and with less risk, with a clearly supported afterlife once I’ve moved on to another project for another customer.

Data Vault is a database modeling technique where the data is considered to belong to one of three entity types: hubs, links, and satellites:

  • Hubs contain the key attributes of business entities (such as geography, products, and customers).
  • Links define the relationships between the hubs (for example, customer orders or product categories).

  • Satellites contain all other attributes related to hubs or links, including the full history of attribute changes.

The result is an Entity Relationship Diagram (ERD), which consists of hubs, links, and satellites; I’ve sketched what these entities might look like below.
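To make the three entity types concrete, here is a minimal sketch in Python; the table names, column names, and the MD5-based hash keys are illustrative assumptions rather than the customer’s actual model.

```python
import hashlib
from datetime import datetime, timezone

def hash_key(*business_keys: str) -> str:
    # Data Vault style surrogate key: a hash of the normalized business key(s).
    joined = "||".join(k.strip().upper() for k in business_keys)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

load_date = datetime.now(timezone.utc).isoformat()

# Hub: one row per business entity, keyed on its business key.
hub_customer = {
    "customer_hash_key": hash_key("CUST-001"),
    "customer_business_key": "CUST-001",
    "load_date": load_date,
    "record_source": "sales_system",
}

# Link: relates two (or more) hubs via their hash keys.
link_customer_order = {
    "link_hash_key": hash_key("CUST-001", "ORD-42"),
    "customer_hash_key": hash_key("CUST-001"),
    "order_hash_key": hash_key("ORD-42"),
    "load_date": load_date,
    "record_source": "sales_system",
}

# Satellite: descriptive attributes for a hub; every change arrives
# as a new row with its own load date, which is how history is kept.
sat_customer_details = {
    "customer_hash_key": hub_customer["customer_hash_key"],
    "load_date": load_date,
    "record_source": "sales_system",
    "name": "Contoso Ltd",
    "country": "Ireland",
}

print(hub_customer, link_customer_order, sat_customer_details, sep="\n")
```

In the lake, each of these would land as its own table (or folder of files), and because every key is derived from the business keys, hubs, links, and satellites can be loaded independently and in parallel.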

Once I’d settled on this methodology, I needed to hunt around for something to use. How do you go about designing and using an ERD tool for a Data Vault? I found a few options. For the enterprise, there is WhereScape® Data Vault Express. That looked like a good option, but I had hoped to use something open source so that other people across the team could adopt it. It wasn’t clear how much it would cost, and, in general, if I have to ask then I can’t afford it! So far, I’ve settled on SQL Power Architect so that I can get the ‘visuals’ across to the customer and the wider technical team, including my technical counterpart at the customer, who picks up when I’m at a conference. This week I’m at the Data and BI Summit in Dublin, so my counterpart is picking up activities during the day, and we are touching base during our virtual stand-ups.

So, I’m still joining the dots as I go along.

If you’re interested in getting started with Azure Data Lake, I hope that this gives you some pointers on the design process.

I’ll go into more detail in future blog posts, but for now I need to stop writing and do some work!

Azure Cosmos DB, Azure Data Lake Analytics and R sessions at Microsoft Data and BI Summit

I’m excited to be speaking three times at the Data & BI Summit in Dublin, 24th – 26th April. It’s extra special for me since it will be my first event as a Microsoft Regional Director, and my first since being named one of the top 20 women in Artificial Intelligence, Data Science, Machine Learning and Big Data by Big Data Made Simple, from the team over at Crayon Data.

I’m speaking on the following topics:

  • R and Power BI
  • Azure Cosmos DB and Power BI
  • Azure Data Lake Analytics with Power BI – details to be announced, as there have been a few logistical changes.

Here are the details:

PUGDV07 – R in Power BI for Absolute Beginners

When: Tuesday, April 24, 16:00 – 17:00

Where: Liffey Meeting Room 1 (it’s on the first floor)

In this session, we will start with R right from the beginning, from installing R through to data transformation and integration, and on to visualizing data by using R in Power BI. Then we will move on to powerful but simple-to-use data types in R, such as data frames. We will also upgrade our data analysis skills by looking at R data transformation with a powerful set of tools that keeps things simple: the tidyverse. Then we will integrate our R work into Power BI and visualize our data using beautiful R visualizations in Power BI. Finally, we will share our work by publishing our Power BI project, with our R code, to the Power BI service. We will also look at refreshing our dataset so that our new dashboard has refreshed data. This session is aimed at getting beginners up to speed as gently and quickly as possible. Join this session if you are curious about R and want to know more. If you are already a Power BI expert, join to open up a whole new world of Power BI to add to your skill set. If you are new to Power BI, you will still get value from this session, since you’ll see a Power BI dashboard being built as part of an end-to-end solution.

PUGDV11 – Data Analytics with Azure Cosmos Schema-less Data and Power BI

When: Thursday, April 26, 15:00 – 16:00

Where: Liffey Meeting Room 5 (it’s on the first floor)

Good news for developers and data analysts: it’s possible to have rapid application development and analytics with the same data source, using Azure Cosmos DB and Power BI.
Azure Cosmos DB is a schemaless database, so how is it possible to analyze the data and create reports for analytics and Business Intelligence tools? A single Azure Cosmos DB database is great for rapid application development because it can contain JSON documents of various structures, but this calls for careful data visualization. In this session, we will analyze Azure Cosmos DB data and create reports with Power BI, looking at the data from both the developer and the data analyst perspectives.
In this demo-rich session, you will learn about Azure Cosmos DB, understand its use cases, and see how to work with data in a schemaless Azure Cosmos DB database for Power BI.
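As a taster of the problem this session tackles, here is a minimal Python sketch, using made-up documents and field names, of the kind of flattening that schemaless data needs before documents of different shapes can sit in a single report table:

```python
import json

# Two documents from the same (hypothetical) Cosmos DB collection;
# because the database is schemaless, they need not share a structure.
docs = [
    {"id": "1", "name": "Contoso", "address": {"city": "Dublin"}},
    {"id": "2", "name": "Fabrikam", "phones": ["+353 1 555 0100"]},
]

def flatten(doc: dict) -> dict:
    # Map every document onto one agreed set of columns,
    # leaving gaps as nulls so a tabular tool can consume the rows.
    return {
        "id": doc.get("id"),
        "name": doc.get("name"),
        "city": doc.get("address", {}).get("city"),
        "phone": (doc.get("phones") or [None])[0],
    }

rows = [flatten(d) for d in docs]
print(json.dumps(rows, indent=2))
```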

Hope to see you there!