Modelling your Data in Azure Data Lake

One of my project roles at the moment (I have a few!) is that I am architecting a major Azure implementation for a global brand. I’m also helping with the longer-term ‘vision’ of how that might shape up. I love this part of my job and I’m living my best life doing this piece; I love seeing a project take shape until the end users, whether they are business people or more strategic C-level, get the benefit of the data. At Data Relish, I make your data work for organizations of every purse and every purpose, and I learn a lot from the variety of consulting pieces that I deliver.

If you’ve had even the slightest look at the Azure Portal, you will know that it has oodles of products that you can use in order to create an end-to-end solution. I selected Azure Data Lake for a number of reasons:

  • I have my eye on the Data Science ‘prize’ of doing advanced analytics later on, probably in Azure Databricks as well as Azure Data Lake. I want to make use of existing Apache Spark skills, and Azure Data Lake is a neat solution that will facilitate this option (see the sketch after this list).
  • I need a source that will cater for the shape of the data… or the lack of it.
  • I need a location where the data can be accessed globally, since the solution will be ingesting data from global locations.
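
Since Spark skills and globally ingested raw data are the drivers here, a minimal PySpark sketch of landing files from the lake might look like the following. This assumes an ADLS Gen2 account reached over the abfss:// scheme; the account, container, and folder names are hypothetical placeholders, and credentials are assumed to be configured already:

from pyspark.sql import SparkSession

# On Azure Databricks a SparkSession already exists as `spark`;
# this line just keeps the sketch self-contained elsewhere.
spark = SparkSession.builder.appName("adls-raw-ingest").getOrCreate()

# Hypothetical ADLS Gen2 path: a `raw` container in a `mydatalake` account.
raw_path = "abfss://raw@mydatalake.dfs.core.windows.net/sales/orders/"

# Schema-on-read: land the files as-is and impose structure later,
# which suits data whose shape is loose or missing altogether.
orders = spark.read.option("header", "true").csv(raw_path)
orders.printSchema()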

In terms of tooling, there are always the Azure Data Lake Tools for Visual Studio; you can watch a video on this topic here. But how do you get started with the design approach, and how do I go about designing solutions for the Azure Data Lake? There are many different approaches, and I have been implementing Kimball methodologies for years.


With this particular situation, I will be using the Data Vault methodology. I know that there are different schools of thought, but I’ve learned from Dan Linstedt in particular, who has been very generous in sharing his expertise; here is Dan’s website. I have previously delivered this methodology elsewhere for an organization with a multi-billion USD turnover, and they are still using the system that I put in place; it was a particularly helpful approach for an acquisition scenario, for example.

 

Building a Data Vault starts with the modeling process, and this starts with a view of the existing data model of a transactional source system. The purpose of the data vault modelling lifecycle is to deliver solutions to the business faster, at lower cost and with less risk, that also have a clearly supported afterlife once I’ve moved on to another project for another customer.

 

Data Vault is a database modeling technique where the data is considered to belong to one of three entity types: hubs, links, and satellites (a minimal sketch of all three follows this list):

  • Hubs contain the key attributes of business entities (such as geography, products, and customers).
  • Links define the relations between the hubs (for example, customer orders or product categories).
  • Satellites contain all other attributes related to hubs or links, including all attribute change history.
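
To make the three entity types concrete, here is a minimal PySpark sketch that carves a hypothetical orders feed into a hub, a link, and a satellite. The layout follows common Data Vault practice (hash keys plus load metadata), but every column and table name here is an illustrative assumption rather than a prescription:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dv-sketch").getOrCreate()

# Hypothetical rows from a transactional orders source system.
orders = spark.createDataFrame(
    [("C001", "P100", "2018-05-01", 2)],
    ["customer_id", "product_id", "order_date", "quantity"],
)

load_ts = F.current_timestamp()
source = F.lit("orders_system")

# Hub: one row per business key (here, the customer), plus load metadata.
hub_customer = (
    orders.select("customer_id").distinct()
    .withColumn("customer_hk", F.sha2(F.col("customer_id"), 256))
    .withColumn("load_ts", load_ts)
    .withColumn("record_source", source)
)

# Link: the relationship between the customer and product hubs,
# identified by a hash of the combined business keys.
link_hk = F.sha2(F.concat_ws("||", "customer_id", "product_id"), 256)
link_customer_product = (
    orders.select("customer_id", "product_id").distinct()
    .withColumn("link_hk", link_hk)
    .withColumn("load_ts", load_ts)
    .withColumn("record_source", source)
)

# Satellite: descriptive attributes hanging off the link; the load
# timestamp is what preserves attribute change history over time.
sat_order = (
    orders.withColumn("link_hk", link_hk)
    .select("link_hk", "order_date", "quantity")
    .withColumn("load_ts", load_ts)
    .withColumn("record_source", source)
)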

 

The result is an Entity Relationship Diagram (ERD), which consists of Hubs, Links and Satellites. Once I’d settled on this methodology, I needed to hunt around for something to use.

How do you go about designing and using an ERD tool for a Data Vault? I found a few options. For the enterprise, I found WhereScape® Data Vault Express. That looked like a good option, but I had hoped to use something open source so that other people could adopt it across the team. It wasn’t clear how much it would cost, and, in general, if I have to ask then I can’t afford it! So far, I’ve settled on SQL Power Architect so that I can get the ‘visuals’ across to the customer and the other technical team, including my technical counterpart at the customer, who picks up when I’m at a conference. This week I’m at the Data and BI Summit in Dublin, so my counterpart is picking up activities during the day, and we are touching base during our virtual stand-ups.

So, I’m still joining dots as I go along.

If you’re interested in getting started with Azure Data Lake, I hope that this gives you some pointers on the design process.

I’ll go into more detail in future blogs, but I need to stop writing this blog post and do some work!

14 thoughts on “Modelling your Data in Azure Data Lake”

  1. I’d love to see new posts on the Data Vault methodology. At a cursory glance, I had trouble understanding it – perhaps because I’m trying to relate everything to Kimball.

    1. Hello Simon, yes, I translate everything into ‘Kimballese’ first and then I take it from there. I will write more about this ‘journey’ in the near future. I think it’s interesting, and I am seeing more Data Vault as people move into data lakes and so on.

  2. Perhaps you should look at road as it is the company standard

      1. 🙂 Thanks for getting in touch, John. I didn’t manage to put down all of the options and I might cover it separately. If my posts are too lengthy, people don’t read them! It’s probably worth a topic on its own. I hope you’re enjoying BUILD and I’ll look forward to hearing about it when I see you next.

  3. Hello Jen. My name is Juan Bengochea and I am the architect for a project that is building a modern data platform that leverages the data lake concept. It is cloud-based and hosted on Azure. The data lake component of the platform resides within ADLS. We have just completed building the raw/ingestion zone and are in the process of designing what we are calling the standardized/consumption zone. The data vault modeling approach presents an attractive option for this area as it will be based on semantic objects/domains. If possible, I would like to pick your brain on a couple of aspects of the data vault modeling approach as it pertains to ADLS. Would that be possible? If so, I am happy to follow whatever channels are required by Microsoft. Perhaps going through our Microsoft account representative would be the best route? Thanks in advance.

  4. Hello,
    I was reading this and I am also designing a DV in Azure DW, using Databricks as the “ETL” tool. Do you have any updates about your project and your challenges?
    Thanks,

  5. Hi Jen, I have a fundamental question that has been plaguing me ever since I joined a team that does Data Modelling using the Data Vault method, while storing the data as Hub, Link and Satellite tables in Hive. While it may be really handy for making sense of the data and viewing the linkages between data from various data sources when using the data modelling, my question is: in Big Data systems, such as Azure Data Lake, does it make sense to store the data in tables designated as Hub, Link, and Satellite?
    Quite often our consumers seem to use joins over these tables to get at a near-subset of the original dataset. Does it not then make sense to store a Hive table equivalent to the original dataset in Azure Data Lake?
    Is there any advantage in chopping the data into Hub, Link and Satellite tables for storage?
    Shirsha

    1. Shirsha, I really think that your team can help you with these questions and relate them to the context of your organization.
      I advise you to read the official Data Vault documentation and perhaps do the certification. Your questions are very wide-ranging and I think you’d benefit from going through the official materials. I suggest you read something by Dan Linstedt: https://www.amazon.co.uk/Building-Scalable-Data-Warehouse-Vault/dp/0128025107

  6. Hi Jen, I have read the entire documentation of Data Vault! To me it still seems like storing the data the way it is modelled in DV is not a performant way of storing things in Big Data systems. I was hoping you would have counters to that. The consumption pattern repeatedly seeks to join the data across multiple tables, when it could very well have been stored that way. It seems like the data modelling helps with data understanding but should not be the way to store the data, particularly on Big Data systems. Does your experience with storing it that way in Azure Data Lake prove otherwise?

    1. Hello there, I don’t really understand your question. You’ve specified that you’ve recently joined a team that does Data Vault, and I think you should pick it up with them. I have not seen any problems with Azure Data Lake performance using this methodology.
      Can you please go back and pick up with your team? It sounds as if you are questioning their choice of technology and modelling, and I can’t give you the reasons why they chose that route. Your questions do not have a quick answer, and you should probably ask your team for an in-depth discussion.
      Regards, Jen

  7. Hi Jen
    I am quite interested in your learnings from Data Vault on ADLS. I previously worked on DV on Oracle / Teradata systems, and the results were promising but not out of this world compared to the modelling we did as Kimball with separate history / version tables. Those were relational technologies, whereas ADLS is more of a file-based system, where “Data Modelling” is not the main concern but it is quite important to get “Data Organisation” right.
