Modelling your Data in Azure Data Lake

One of my project roles at the moment (I have a few!) is that I am architecting a major Azure implementation for a global brand. I’m also helping with the longer-term ‘vision’ of how that might shape up. I love this part of my job and I’m living my best life doing this piece; I love seeing a project take shape until the end users, whether they are business people or more strategic C-level, get the benefit of the data. At Data Relish, I make your data work for organizations of every purse and every purpose, and I learn a lot from the variety of consulting pieces that I deliver.

If you’ve had even the slightest look at the Azure Portal, you will know that it has oodles of products that you can use in order to create an end-to-end solution. I selected Azure Data Lake for a number of reasons:

  • I have my eye on the Data Science ‘prize’ of doing advanced analytics later on, probably in Azure Databricks as well as Azure Data Lake. I want to make use of existing Apache Spark skills and Azure Data Lake is a neat solution that will facilitate this option.
  • I need a source that will cater for the shape of the data…. or the lack of it….
  • I need a location where the data can be accessed globally since it will be ingesting data from global locations.

In terms of tooling, there are always the Azure Data Lake tools for Visual Studio. You can watch a video on this topic here. But how do you get started with the design approach, and how do I go about the process of designing solutions for the Azure Data Lake? There are many different approaches, and I have been implementing Kimball methodologies for years.


With this particular situation, I will be using the Data Vault methodology. I know that there are different schools of thought but I’ve learned from Dan Lindstedt in particular, who has been very generous in sharing his expertise; Dan’s website is here. I have delivered this methodology previously for an organization with a turnover of billions of USD, and they are still using the system that I put in place; it was a particularly helpful approach for an acquisition scenario, for example.

Building a Data Vault starts with the modelling process, and this starts with a view of the existing data model of a transactional source system. The purpose of the Data Vault modelling lifecycle is to deliver solutions to the business faster, at lower cost and with less risk, and to leave solutions that have a clear, supported afterlife once I’ve moved onto another project for another customer.

Data Vault is a database modelling technique where the data is considered to belong to one of three entity types: hubs, links, and satellites:

  • Hubs contain the key attributes of business entities (such as geography, products, and customers).
  • Links define the relations between the hubs (for example, customer orders or product categories).
  • Satellites contain all other attributes related to hubs or links. Satellites include all attribute change history.
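
To make the three entity types a little more concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions: the class names, the column names (`load_date`, `record_source`), the `hash_key` helper and the sample keys are all hypothetical, and a real Data Vault would live in database tables rather than Python objects.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
import hashlib


def hash_key(*business_keys: str) -> str:
    """Build a deterministic key from one or more business keys (assumed convention)."""
    normalised = "|".join(key.strip().upper() for key in business_keys)
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class Hub:
    """A business entity, identified only by its business key."""
    hub_key: str
    business_key: str
    load_date: datetime
    record_source: str


@dataclass(frozen=True)
class Link:
    """A relationship between two or more hubs."""
    link_key: str
    hub_keys: tuple
    load_date: datetime
    record_source: str


@dataclass(frozen=True)
class Satellite:
    """Descriptive attributes for a hub or link, one row per change."""
    parent_key: str
    load_date: datetime
    attributes: dict
    record_source: str


def make_hub(business_key: str, source: str, loaded: datetime) -> Hub:
    return Hub(hash_key(business_key), business_key, loaded, source)


def make_link(hubs: list, source: str, loaded: datetime) -> Link:
    business_keys = [hub.business_key for hub in hubs]
    hub_keys = tuple(hub.hub_key for hub in hubs)
    return Link(hash_key(*business_keys), hub_keys, loaded, source)


# Hypothetical sample data, purely for illustration.
first_load = datetime(2024, 1, 1, tzinfo=timezone.utc)
second_load = datetime(2024, 2, 1, tzinfo=timezone.utc)

customer = make_hub("CUST-001", "crm", first_load)
product = make_hub("PROD-042", "erp", first_load)
order = make_link([customer, product], "orders", first_load)

# Change history: a new satellite row is appended per change; nothing is updated in place.
history = [
    Satellite(customer.hub_key, first_load, {"country": "IE"}, "crm"),
    Satellite(customer.hub_key, second_load, {"country": "UK"}, "crm"),
]
```

The point of the sketch is the separation of concerns: the hub holds nothing but the business key, the link holds nothing but references to hubs, and everything descriptive (and everything that changes) sits in the satellite rows.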

 

The result is an Entity Relationship Diagram (ERD), which consists of Hubs, Links and Satellites. Once I’d settled on this methodology, I needed to hunt around for something to use.

How do you go about designing and using an ERD tool for a Data Vault? I found a few options. For the enterprise, I found WhereScape® Data Vault Express. That looked like a good option, but I had hoped to use something open-source so other people could adopt it across the team. It wasn’t clear how much it would cost, and, in general, if I have to ask then I can’t afford it! So far, I’ve settled on SQL Power Architect so that I can get the ‘visuals’ across to the customer and the other technical team, including my technical counterpart at the customer who picks up when I’m at a conference. This week I’m at Data and BI Summit in Dublin so my counterpart is picking up activities during the day, and we are touching base during our virtual stand-ups.

So, I’m still joining dots as I go along.

If you’re interested in getting started with Azure Data Lake, I hope that this gives you some pointers on the design process.

I’ll go into more detail in future blogs but I need to get off writing this blog and do some work!

Useful Data Sources for Demos, Learning and Examples

One question that pops up from time to time is the question over sample datasets for use in self-learning, creating training materials or just for playing with data. I love this question: I learn by actively trying things out too. I love the stories in the data, and this is a great way to find the stories that bring the data to life, and offer real impact.


Since I deliver real projects with customer impact, I can’t demonstrate any real customer data during any of my presentations since my projects are confidential, so I have three approaches:
  • I use sample data and have a signed NDA.
  • I ask the customer for their data, anonymised, and have a signed NDA.
  • I use their live data and have a signed NDA.
If the customer elects the first option, then I use sample data from below.
To help you get started, I’ve cobbled together some pointers here, and I hope it’s useful. Please feel free to leave more ideas in the comments.

Entrepreneur

The latest edition of Entrepreneur has an insightful article on open source (frameworks vs libraries) and it has some good pointers to datasets at the bottom of the page. https://www.entrepreneur.com/article/310965 I’ve also pasted them here for you:
Bernard Marr has an updated list of datasets here, on Forbes. I’m not going to steal Marr’s list so I recommend that you go and head over to his page, where you’ll find sixty-plus options.

Data Source Direct Connectivity to R, Python, Ruby and Stata

R has a number of APIs that connect to public datasets e.g. the World Data Bank, which allows connectivity from R, Python, Ruby and Stata. I used this for my recent demos at the Power BI event in the Netherlands, and it worked sweetly. So you’d write your script to call the package, embed it in Power BI and it will go and get the data for you. I then create the chart in R, and put it into the Power BI workbook.
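
If you prefer Python to R, the same idea can be sketched with the standard library alone. This is not the script I used in the demos: it assumes that the data source is the World Bank's public v2 REST API, that the response is a two-element JSON list of metadata followed by rows, and the function names are my own. `SP.POP.TOTL` is the World Bank indicator code for total population.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint: the World Bank's public v2 REST API.
BASE_URL = "https://api.worldbank.org/v2"


def indicator_url(country: str, indicator: str, per_page: int = 100) -> str:
    """Build the request URL for one country and one indicator."""
    query = urlencode({"format": "json", "per_page": per_page})
    return f"{BASE_URL}/country/{country}/indicator/{indicator}?{query}"


def parse_observations(payload: list) -> list:
    """Flatten the assumed [metadata, rows] response into (country, date, value) tuples."""
    if len(payload) < 2 or payload[1] is None:
        return []  # error responses and empty results carry no rows
    return [
        (row["country"]["value"], row["date"], row["value"])
        for row in payload[1]
        if row["value"] is not None  # skip periods with no reported value
    ]


def fetch_observations(country: str, indicator: str) -> list:
    """Download one page of observations (needs network access)."""
    with urlopen(indicator_url(country, indicator), timeout=30) as response:
        return parse_observations(json.load(response))
```

Calling `fetch_observations("NL", "SP.POP.TOTL")` would then return the rows to chart. Keeping the parsing separate from the download means you can test it against a saved response without going online.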

Quandl

Quandl offers financial data, and it has a hook so that R can connect directly to it as well.

Kaggle

Kaggle is owned by Google, presumably so that Google can promote TensorFlow. Since people share code based on Kaggle datasets, it’s very easy to pick code, copy it, change it, and see how it works. However, this can be an issue, since you can’t be sure that the code is correct.

Final Note

If you’re teaching or presenting using this data and/or sample code, you can be pretty sure that your training delegates have got access to the Internet too so you need to be sure that you credit people properly.
Training is not the main part of what I do, although I do training now and again. I am a consultant first and foremost. I’m meta-tracking my time with Desktime and with Trello since I am measuring exactly how I spend my time, and training does not account for a high percentage; project delivery takes the majority of my time.
Although I’m a guest lecturer for the MBA program at the University of Hertfordshire, and I’m going to be a guest lecturer on their MSc Business Analysis and Consultancy course, I do not consider myself a trainer. I am a consultant who sometimes does training as part of a larger project. I haven’t gone down the MCT route because I regard training as part of a bigger consultancy route. I never stop learning, and I don’t expect anyone else to stop learning, either.


Data-Driven or Insights-Driven? Data Analytics vs Data Science

I had an interesting conversation with one of my customers. Through my company Data Relish, I have been leading the Data Science program for some time now, and I was using Team Data Science Process as a backbone to my leadership. I feel I’m fighting the good fight for data, and I like to involve others through the process. It’s great to watch people grow, and get real insights and digital transformation improvements based on these insights.

Credit: https://pixabay.com/en/users/geralt-9301/ 

Data science projects are hard, though, and it’s all about expectations. In this case, my customer was curious to know why the current data science project took longer than he expected, and shouldn’t they just exclude the business understanding part of the data science journey? Couldn’t the analytics just clean themselves, or just cut out every piece of data that was a problem?

Being data-driven is all very well, but we need to be open to the insights from business expertise, too.

When the conversation continued, it became clear that a different data organization had been involved in conversations at some point. Apparently, another organization had told my customer that they needed Data Analytics rather than Data Science, and that the two were mutually exclusive. Data Analytics would give them the insights without involving much, if any, business knowledge, effort, or time. What my customer understood from them was that they didn’t need to match data, clean it and so on; data analytics simply meant analysing columns and rows of data in order to see what relationships and patterns could be found in the data. In essence, the customer should divorce business knowledge from the data, and the data should be analyzed in isolation. The business and the data were regarded as mutually exclusive, and the business side should be silenced in order to let the data speak.

Due to these conversations, the customer was concerned about the length of time the project was taking, and wanted to go down the ‘data analytics’ route, mix up columns, skip data cleaning and matching sources, and he was absolutely certain that insights would fall out of the data. To summarise, there were a few things behind the conversation:

  • business people are concerned about the time taken to do a data science project. They are essentially misled by their experience of Excel; they believe it should be as straightforward and quick as generating a chart in Excel.

  • business people can be easily misdirected by the findings of the data science process, without being critical about the results themselves. It seems to be enough that a data science project was done, not that it was right. The fact that it is a data science project at all is somehow ‘good enough’.

  • business people can be easily swayed by the terminology. One person said that they were going into decision science, but couldn’t articulate properly what it was, in comparison to data science. That’s another blog for another day, but it’s clear that the terminology is being bandied around and people are not always understanding, defining or delineating what the terms actually mean.

  • business people can equate certainty with doing statistics; they may say that they don’t expect 100% findings, but, in practice, that can go out of the window when the project is underway.

The thing is, this isn’t the first time I’ve had this conversation. I think that being data driven is somewhat misleading, although I do admit to using the term myself; it is very hashtaggable, after all. I think a better phrase is insights driven. If we remove the business interpretation and just throw in data, we can’t be sure if the findings are reasonable. As I responded during this conversation, if we put garbage in, we get garbage out. This is a stock phrase in business intelligence and data warehousing, and it also applies to data science. There needs to be a balance; we can be open to new ideas. Our business subject matter expertise can help to shortcut the project by working with the data – not against the data. It helps to avoid the potential of going down rabbit holes because the data said so. The insights from the business can help to make the stories in the data clearer, whilst being open to new insights from the data.

In other words, data and the business should not be mutually exclusive.

How did it end? We proceeded as normal, and as per the Data Science plan I’d put in place. Fortunately, there were strong voices from the business, who wanted to be included at all stages. I think that we are getting farther, faster, as a unified team, all moving in the same direction. We need to question. Data Science is like April Fools’ Day, every day; don’t believe everything you read. Otherwise, we will never see the wood for the trees.

Credit: https://pixabay.com/en/users/geralt-9301/ 

Book Review: Grokking Algorithms: An Illustrated Guide For Programmers and Other Curious People

Grokking Algorithms: An Illustrated Guide For Programmers and Other Curious People by Aditya Y. Bhargava

My rating: 5 of 5 stars

I’ve just finished reading the Manning book called Grokking Algorithms: An Illustrated Guide For Programmers and Other Curious People.

This is a very readable book, with great diagrams and a very visual style. I recommend this book for anyone who wants to understand more about algorithms.
This is an excellent book for the budding data scientist who wants to get past the bittiness of learning pieces of open source or proprietary software here and there, and wants to learn what the algorithms actually mean in practice. It’s fairly easy to get away with looking like a real Data Scientist if you know bits of R or Python, I think, but when someone scratches the surface of that vision, it can become very apparent that the whole theory and deeper understanding can be missing. This book will help people to bridge the gap from learning bits here and there, to learning what the algorithms actually mean in practice.
Recommended. I’m expecting to find that people might ‘pinch’ the diagrams but I’d strongly suggest that they contact the author and credit appropriately.
I’d recommend this book, for sure. Enjoy!


Doing the Do: the best resource for learning new Business Intelligence and Data Science technology

As a consultant, I get parachuted into difficult problems every day. Often, I figure it out because I have to, and I want to. Usually, nobody else can do it other than me – they are all keeping the fires lit. I get to do the thorny problems that get left burning quietly. I love the challenge of these successes!

How do you get started? The online and offline courses, books, MOOCs, papers, blogs and the forums help, of course. I regularly use several resources for learning but my number one source of learning is:

Doing the ‘do’ – working on practical projects, professional or private

Nothing beats hands-on experience. 

How do you get on the project ladder? It’s a difficult situation: without experience, you can’t get onto projects, and without projects, you can’t get experience.

Volunteer your time in the workplace – or out of it. It could be a professional project or your ‘data science citizen’ project that you care about. Your boss wants her data? Define the business need, and identify what she actually wants. If it helps, prototype to elicit the real need. Volunteer to try and get the data for her. Take a sample and just get started with descriptive statistics. Look at the simple things first.
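
Here is roughly what ‘take a sample and start with descriptive statistics’ can look like in a few lines of Python, using only the standard library. The order values are randomly generated stand-ins, and `describe` is a hypothetical helper of my own; in practice the numbers would come from whatever source your boss cares about.

```python
import random
import statistics


def describe(values):
    """Return basic descriptive statistics for a list of numbers."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4)
    return {
        "count": len(ordered),
        "min": ordered[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": ordered[-1],
        "mean": statistics.mean(ordered),
        "stdev": statistics.stdev(ordered),
    }


# Hypothetical order values; real data would come from the source system.
random.seed(42)
order_values = [round(random.uniform(5, 500), 2) for _ in range(10_000)]

# Take a small random sample and look at the simple things first.
sample = random.sample(order_values, k=200)
summary = describe(sample)
for name, value in summary.items():
    print(f"{name:>6}: {value:,.2f}")
```

Even a summary this simple tends to raise useful questions (why is the maximum so high? why are there so few rows?), and those questions are what take you back to the business need.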

Not sure of the business question? Try the AzureML Cheat Sheet for help.

[Image: Azure Machine Learning algorithm cheat sheet]

Working with data means that you will be challenged with real situations and you will read and learn more, because you have to do it in order to deliver.

In my latest AzureML course with Opsgility, I take this practical, business-centred approach for AzureML. I show you how to take data, difficult business questions and practical problems, and I show you how to create a successful outcome; even if that outcome is a failed model, it still makes you revise the fundamental business question. It’s a safe environment to get experience.

So, if this is you – what’s the sequence? There are a few sequences or frameworks to try:

  • TDSP (Microsoft)
  • KDD
  • CRISP-DM
  • SEMMA

The ‘headline’ of each framework is given below, as a reference point, so you can see for yourself that they are very different. The main thing is to simply get started.

Team Data Science Process (Microsoft)

[Image: TDSP lifecycle diagram]

KDD

[Image: KDD process diagram]

CRISP-DM

[Image: CRISP-DM process diagram]

SEMMA

[Image: SEMMA method diagram]
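
For quick reference, the phases of the four frameworks can also be written down as a simple lookup. This is a sketch using the phase names as they are commonly summarised; the exact wording varies between sources, and the variable names are my own.

```python
# Phase names as commonly summarised; wording differs slightly between sources.
FRAMEWORK_PHASES = {
    "TDSP": [
        "Business understanding",
        "Data acquisition and understanding",
        "Modeling",
        "Deployment",
        "Customer acceptance",
    ],
    "KDD": [
        "Selection",
        "Preprocessing",
        "Transformation",
        "Data mining",
        "Interpretation and evaluation",
    ],
    "CRISP-DM": [
        "Business understanding",
        "Data understanding",
        "Data preparation",
        "Modeling",
        "Evaluation",
        "Deployment",
    ],
    "SEMMA": ["Sample", "Explore", "Modify", "Model", "Assess"],
}

# Which frameworks explicitly start from the business question?
starts_with_business = [
    name
    for name, phases in FRAMEWORK_PHASES.items()
    if phases[0].lower().startswith("business")
]

for name, phases in FRAMEWORK_PHASES.items():
    print(f"{name}: " + " -> ".join(phases))
print("Start with the business question:", ", ".join(starts_with_business))
```

Laid side by side like this, the frameworks differ mostly in where they start and stop: TDSP and CRISP-DM begin with the business question and run through to deployment, while KDD and SEMMA concentrate on the data work in the middle.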

It’s important not to get too wrapped up in comparing models; this could lead to analysis paralysis, and that’s not going to help.

I’d suggest you start with the TDSP because of the fine resources, and take it from there.

I’d be interested in your approaches, so please do leave comments below.

Good luck!

Guess who is appearing in Joseph Sirosh’s PASS Keynote?

This girl! I am super excited and please allow me to have one little SQUUEEEEEEE! before I tell you what’s happening. Now, this is a lifetime achievement for me, and I cannot begin to tell you how absolutely and deeply honoured I am. I am still in shock!

I am working really hard on my demo and….. I am not going to tell you what it is. You’ll have to watch it. Ok, enough about me and all I’ll say is two things: it’s something that’s never been done at PASS Summit before and secondly, watch the keynote because there may be some discussion about….. I can’t tell you what… only that, it’s a must-watch, must-see, must do keynote event.

We are in a new world of Data and Joseph Sirosh and the team are leading the way. Watching the keynote will mean that you get the news as it happens, and it will help you to keep up with the changes. I do have some news about Dr David DeWitt’s Day Two keynote… so keep watching this space. Today I’d like to talk about the Day One keynote with the brilliant Joseph Sirosh, CVP of Microsoft’s Data Group.

Now, if you haven’t seen Joseph Sirosh present before, then you should. I’ve put some of his earlier sessions here and I recommend that you watch them.

Ignite Conference Session

MLDS Atlanta 2016 Keynote

I hear you asking… what am I doing in it? I’m keeping it a surprise! Well, if you read my earlier blog, you’ll know I transitioned from Artificial Intelligence into Business Intelligence and now I do a hybrid of AI and BI. As a Business Intelligence professional, my customers will ask me for advice when they can’t get the data that they want. Over the past few years, the ‘answer’ to their question has gone far, far beyond the usual on-premise SQL Server, Analysis Services, SSRS combo.

We are now in a new world of data. Join in the fun!

Customers sense that there is a new world of data. The ‘answer’ to the question ‘Can you please help me with my data?’ is complex, varied and it’s very much aimed at cost sensitivities, too. Often, customers struggle with data because they now have a Big Data problem, or a storage problem, or a data visualisation access problem. Azure is very neat because it can cope with all of these issues. Now, my projects are Business Intelligence and Business Analytics projects… but they are also ‘move data to the cloud’ projects in disguise, and that’s in response to the customer need. So if you are a Business Intelligence professional, get enthusiastic about the cloud because it really empowers you with a new generation of exciting things you can do to please your users and data consumers.

As a BI or an analytics professional, cloud makes data more interesting and exciting. It means you can have a lot more data, in more shapes and sizes and access it in different ways. It also means that you can focus on what you are good at, and make your data estate even more interesting by augmenting it with cool features in Azure. For example, you could add in more exciting things such as the Apache Tika library as a worker role in Azure to crack through PDFs and do interesting things with the data in there. If you bring it into SSIS, then you can spin it up and tear it down again when you don’t need it.

I’d go as far as to say that, if you are in Business Intelligence at the moment, you will need to learn about cloud sooner or later. Eventually, you’re going to run into Big Data issues. Alternatively, your end consumers are going to want their data on a mobile device, and you will want easy solutions to deliver it to them. Customers are interested in analytics and the new world of data and you will need to hop on the Azure bus to be a part of it.

The truth is, Joseph Sirosh’s keynotes always contain amazing demos. (No pressure, Jen, no pressure….. ) Now, it’s important to note that these demos are not ‘smoke and mirrors’….

The future is here, now. You can have this technology too.

It doesn’t take much to get started, and it’s not too far removed from what you have in your organisation. AzureML and Power BI have literally hundreds of examples. I learned AzureML looking at the following book by Wee-Hyong Tok and others, so why not download a free book sample?

https://read.amazon.co.uk/kp/card?asin=B00MBL261W&preview=inline&linkCode=kpe&ref_=cm_sw_r_kb_dp_c54ayb2VHWST4

How do you proceed? Well, why not try a little homespun POC with some of your own data to learn about it, and then show your boss? I don’t know about you but I learn by breaking things, and I break things all the time when I’m learning. You could download some Power BI workbooks, use the sample data and then try to recreate them, for example. Or, why not look at the community R Gallery and try to play with the scripts? You broke something? No problem! Just download a fresh copy and try again. You’ll get further next time.

I hope to see you at the PASS keynote! To register, click here: http://www.sqlpass.org/summit/2016/Sessions/Keynotes.aspx 

Jen’s Diary: What does Microsoft’s recent acquisition of Revolution Analytics mean for PASS?

Caveat: This blog does not represent the views of PASS or the PASS Board. These opinions are solely mine.

The world of data and analytics keeps heating up. Tableau, for example, keeps growing and winning. In fact, Tableau continues to grow total and licence revenue 75% year over year, with its total revenue growing to $142.9 million in Q4 of 2014. There’s a huge shift in the market towards analytics, and it shows in the numbers. Let’s take a look at some of the interesting things Microsoft have done recently, and see how they relate to PASS:

  • Acquired Revolution Analytics, an R-language-focused advanced analytics firm, which will bring customers tools for prediction and big-data analytics.
  • Acquired Datazen, a provider of data visualization and key performance indicator data on Windows, iOS and Android devices. This is great from the cross-platform perspective, and we’ll look at this in a later blog. For now, let’s discuss Revolution and Microsoft.

Why it was good for Microsoft to acquire Revolution Analytics

The acquisition shows that Microsoft is bolstering its portfolio of advanced analytics tools. R is becoming increasingly common as a skill set, and businesses are more comfortable about using open source technology such as R. It is also accessible software, and a great tool for doing analytics. I’m hoping that this will help organisations to recognise and conduct advanced analytics, and it will improve the analytics capability in HDInsight.

Microsoft has got pockets of advanced analytics capabilities built into Microsoft SQL Server, and in particular, SQL Server Analysis Services, and also in the SQL Server Parallel Data Warehouse (PDW). Microsoft also has the Azure Machine Learning Service (Azure ML) which uses R in MLStudio. However, it does not have an advanced analytics studio, and the approach can come across as piecemeal for those who are new to it. The acquisition of Revolution Analytics will give Microsoft on-premises tools for data scientists, data miners, and analysts, and cloud and big data analytics for the same crowd.

Here’s what I’d like Microsoft to do with R:

  • Please give some love to SSRS by infusing it with R. There is a CodePlex download that will help you to produce R visualisations in SSRS. I’d like to see more and easier integration, which doesn’t require a lot of hacking about.
  • Power Query has limited statistical capability at the moment. It could be expanded to include R. I am not keen for Microsoft to develop yet another programming language and R could be a part of the Power Query story.
  • Self-service analytics. We’ve all seen the self-service business intelligence communications. What about helping people to self-serve analytics as well, once they’ve cracked self-service BI? I’d like to see R made easier to use for everyone. I sense that will be a long way off, but it is an opportunity.
  • Please change the R facility in MLStudio. It’s better to use RStudio to create your R script, then upload it.

What issues do I see in the Revolution Analytics acquisition?

Microsoft is a huge organisation. Where will it sit within the organisation? Any acquisition involves a change management process. Change management is always hard. R touches different parts of the technology stack. This could be further impacted by the open source model that R has been developed under. Fortunately Revolution seem to have thought of some of these issues already: how does it scale, for example? This acquisition will need to be carefully envisioned, communicated and implemented, and I really do wish them every success with it.

What does this mean for PASS?

I hold the PASS Business Analytics Portfolio, and our PASS Business Analytics Conference is being held next week. Please use code BFFJS to get the conference at a discounted rate, if you are interested in going.

I think the PASS strategy of becoming more data platform focused is the right one. PASS exist to provide technical community education to data professionals, and I think PASS are well placed to move on the analytics journey that we see in the industry. I already held a series on R for the Data Science Virtual Chapter, and I’m confident you’ll see more material on this and related topics. There are sessions on R at the PASS BA Conference as well. The addition of Revolution Analytics and Datazen is great for Microsoft, and it means that the need for learning in these areas is more urgent, not less. That does not mean that I think that everyone should learn analytics. I don’t. However, I do think PASS can help those who are part of the journey, if they want (or need) to be.

I’m personally glad PASS are doing the PASS Business Analytics Conference because I believe it is a step in the right direction, in the analytics journey we see for the people who want to learn analytics, the businesses who want to use it, and the burgeoning technology. I agree with Brent Ozar ( b / t ) in that I don’t think that the role of the DBA is going away. I do think that, for small / medium businesses, some folks might find that they become the ‘data’ person rather than the DBA being a skill on its own. I envisage that PASS will continue to serve the DBA-specialist-guru as well as the BI-to-analytics people, as well as those who become the ‘one-stop-shop’ for everything data in their small organisation (DBA / BA / Analytics), as well as the DBA-and-Cloud person. It’s about giving people opportunity to learn what they want and need to learn, in order to keep up with the rate of change we see in the industry.

Please feel free to comment below.

Your friend,

Jen Stirrup

x