What I’m doing this week at #MSIgnite

I’m delighted to say that I’m taking on the Community Reporter role for Microsoft Ignite. This means I get to interview members of the Microsoft executive team, such as Amir Netz, James Phillips, and Joseph Sirosh. I have complete stars in my eyes! I don’t often get the chance to speak with them, so I’m delighted to do so. They are also very interesting and have a lot to say on topics I’m passionate about, so make sure to tune in. I’ll release more details about times and how you can watch as soon as I can.

What does a Community Reporter do? During Microsoft Ignite, the Community Reporters will be your go-to’s for live event updates. If you aren’t attending the conference this year, the reporters will be a great way to see what’s happening on the ground in Orlando. Check out my content here on my blog, and follow me on Twitter and LinkedIn to stay up to date on all things Microsoft Ignite!

I’d also like to meet some of you so when I get the chance, I’ll tweet out to see if any introverted people fancy sitting at a table with me for breakfast or lunch to talk about all things data.

I am also speaking at Ignite so here are the details:

When? Thursday, September 27 4:30 PM – 5:15 PM
Where? Room W330 West 2 

Artificial intelligence is popularized in fictional films, but the reality is that AI is becoming a part of our daily lives, with virtual assistants like Cortana using the technology to empower productivity and make search easier. What does this mean for organizations that are running the Red Queen’s race not just to win, but to survive in a world where AI is becoming the present and future of technology? How can organizations evolve, adapt, and succeed using AI to stay at the forefront of the competition? What are the potential issues, complications, and benefits that AI could bring to us and our organizations? In this session, we discuss the relevance of AI to organizations, along with the path to success.

Microsoft Power BI, Microsoft R, and SQL Server are being used to help tackle homelessness in London, providing actionable insights that improve both the prevention of homelessness and the processes in place to help victims. Join this session to see how Microsoft technologies are helping a data science department make a difference to the lives of families, by revealing insights into the contributors to homelessness in London and the surrounding area, and to understand more about finding stories in data. The case study also demonstrates the practicalities of using Microsoft technologies to help some of the UK’s most vulnerable people, using data science for social good.

When? Thursday, September 27 2:15 PM – 3:30 PM
Where? OCCC W222

For people who want to build careers and manage teams, it is crucial to understand diversity and how it impacts your organization. Increasing the role of women in technology has a direct impact on the women working in hi-tech, but the effects can go far beyond that. How do female tech workers influence innovation and product development? How do men benefit from having more women working in technology? Can the presence of women in tech affect a company’s profit? Join a lively discussion on diversity, and hear proactive steps that individuals and companies can take in order to make diversity and inclusion part of the organizational DNA.

One last thing!

Remember to download the Microsoft Ignite app to have your information handy on-the-go!

See you there!

Fun DataDive with DataKind UK

This weekend, I volunteered with DataKind UK on their Summer DataDive, which took place on the weekend of 28th and 29th July 2018 in the Pivotal London offices in Shoreditch. I had a fantastic, memorable weekend, mixing with around 200 other data scientists.

I’d like to thank the DataKind team for being so inspirational, giving, and kind with their time and skills. I’d like to emphasise my absolute admiration for the Data Ambassadors and the work that they do to lift everyone up.

Why did I do this? DataKind appealed to me since it meant that I could sharpen my data science skills by pitching in with experts. Newcomers to data science are welcome, and there were also people who had some experience of data and wanted to know more. There was room for everyone to contribute, so if you are a newbie, it’s a great way to join the conversations and learn from experts who love what they can achieve with data. Plus, it’s a great opportunity to mix with real data scientists. This isn’t Poundland data science, and it’s not pseudo data science either. This is the real thing: I spent two days immersed in real problems, using data science as a solution. I learned a lot, and I contributed as well. There is a saying that you are the average of your friends, and I wanted to get close to more data scientists so that I could build on my earlier experience of AI and bring it up to date.

I wanted to help a charity, by dedicating my time and skills, to support women and girls who need it. I understand that there are vulnerable men too, but this isn’t about whataboutism. Women and girls are disproportionately affected by issues such as domestic violence and sexual crimes, and I wanted to do something practical to help.

For my specific contribution, I worked with a team of 25 other data scientists on finding insights in data belonging to Lancashire Women’s Centre. The vision of Lancashire Women’s Centre is that all women and girls in Lancashire are valued and treated as equals. Their aim is to empower women and girls to transform their lives: bringing them together to find their voice, share experiences and understanding, develop their knowledge and skills, and challenge stereotypes and misconceptions about them, so that they can have choices in becoming the individuals they want to be. I share this conviction deeply, and I wanted to help.

You may well be thinking that the charity helps only a small number of women, but that’s not the case at all. They have a real impact in their community. Lancashire Women’s Centre has helped over 3,000 women in the last year, including 5,807 hours of therapeutic support accessed by 1,154 women and 78 men. Following therapy, 25% were no longer taking medication, 8% felt the support had helped them find and keep a job, and 12% continued to access LWC services to support their recovery.

So what did I do? I can’t share specific details because the data is confidential, and it obviously concerns some of the UK’s most vulnerable women and girls. I will say that we used CoCalc, R, Python, Excel, Tableau, and Power BI to work with the data.

DataKind™ brings high-impact organizations dedicated to solving the world’s biggest challenges together with leading data scientists to improve the quality of, access to, and understanding of data in the social sector. This leads to better decision-making and greater social impact. Launched in 2011, DataKind leads a community of passionate data scientists, visionary partners and mission-driven organizations with the talent, commitment and energy to use data science in the service of humanity. DataKind is headquartered in New York City and has chapters in Bangalore, Dublin, San Francisco, Singapore, the UK and Washington DC. More information on DataKind, their programs and their partners can be found on their website: www.datakind.org

Lancashire Women’s Centre

DataKind JenStirrup and Team

I’m the one on the right, wearing orange!

I’m looking forward to the next one!

Modelling your Data in Azure Data Lake

One of my project roles at the moment (I have a few!) is architecting a major Azure implementation for a global brand. I’m also helping with the longer-term ‘vision’ of how that might shape up. I love this part of my job and I’m living my best life doing this piece; I love seeing a project take shape until the end users, whether they are business people or more strategic C-level, get the benefit of the data. At Data Relish, I make your data work for organizations of every purse and every purpose, and I learn a lot from the variety of consulting pieces that I deliver.

If you’ve had even the slightest look at the Azure Portal, you will know that it has oodles of products that you can use in order to create an end-to-end solution. I selected Azure Data Lake for a number of reasons:

  • I have my eye on the Data Science ‘prize’ of doing advanced analytics later on, probably in Azure Databricks as well as Azure Data Lake. I want to make use of existing Apache Spark skills and Azure Data Lake is a neat solution that will facilitate this option.
  • I need a source that will cater for the shape of the data… or the lack of it.
  • I need a location where the data can be accessed globally, since the solution will be ingesting data from global locations.

In terms of tooling, there are the Azure Data Lake tools for Visual Studio; you can watch a video on this topic here. But how do you go about designing solutions for Azure Data Lake? There are many different approaches, and I have been implementing Kimball methodologies for years.


With this particular situation, I will be using the Data Vault methodology. I know that there are different schools of thought, but I’ve learned from Dan Lindstedt in particular, who has been very generous in sharing his expertise; you can find Dan’s website here. I have delivered this methodology previously for an organization with a turnover of billions of US dollars, and they are still using the system that I put in place; it was a particularly helpful approach for an acquisition scenario, for example.

Building a Data Vault starts with the modelling process, and this starts with a view of the existing data model of a transactional source system. The purpose of the Data Vault modelling lifecycle is to produce solutions for the business faster, at lower cost and with less risk, that also have a clearly supported afterlife once I’ve moved on to another project for another customer.

Data Vault is a database modelling technique where the data is considered to belong to one of three entity types: hubs, links, and satellites:

  • Hubs contain the key attributes of business entities (such as geography, products, and customers).
  • Links define the relationships between the hubs (for example, customer orders or product categories).
  • Satellites contain all other attributes related to hubs or links, including the full history of attribute changes.

The result is an Entity Relationship Diagram (ERD), which consists of Hubs, Links and Satellites. Once I’d settled on this methodology, I needed to hunt around for something to use.
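To make the three entity types concrete, here is a minimal sketch in Python with SQLite. The table and column names (hub_customer, link_customer_order, sat_customer, and so on) are illustrative, not from the actual project, and a real Data Vault load would add hash-collision handling, batching, and auditing:

```python
import hashlib
import sqlite3

def hash_key(*business_keys: str) -> str:
    """Deterministic surrogate key: a hash of the business key(s)."""
    return hashlib.md5("|".join(business_keys).encode()).hexdigest()

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hub: one row per business entity, keyed on the business key.
CREATE TABLE hub_customer (
    customer_hk   TEXT PRIMARY KEY,
    customer_bk   TEXT NOT NULL UNIQUE,
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL
);
CREATE TABLE hub_order (
    order_hk      TEXT PRIMARY KEY,
    order_bk      TEXT NOT NULL UNIQUE,
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL
);
-- Link: relates hubs to each other (a customer placed an order).
CREATE TABLE link_customer_order (
    link_hk       TEXT PRIMARY KEY,
    customer_hk   TEXT NOT NULL REFERENCES hub_customer,
    order_hk      TEXT NOT NULL REFERENCES hub_order,
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL
);
-- Satellite: descriptive attributes plus their change history.
CREATE TABLE sat_customer (
    customer_hk   TEXT NOT NULL REFERENCES hub_customer,
    load_date     TEXT NOT NULL,
    name          TEXT,
    city          TEXT,
    record_source TEXT NOT NULL,
    PRIMARY KEY (customer_hk, load_date)
);
""")

# Load one customer, one order, and the link between them.
src = "crm_extract"
cust_hk = hash_key("C-1001")
ord_hk = hash_key("O-5001")
conn.execute("INSERT INTO hub_customer VALUES (?, ?, ?, ?)",
             (cust_hk, "C-1001", "2018-05-01", src))
conn.execute("INSERT INTO hub_order VALUES (?, ?, ?, ?)",
             (ord_hk, "O-5001", "2018-05-01", src))
conn.execute("INSERT INTO link_customer_order VALUES (?, ?, ?, ?, ?)",
             (hash_key("C-1001", "O-5001"), cust_hk, ord_hk, "2018-05-01", src))
# Two satellite rows for the same hub key = attribute change history.
conn.execute("INSERT INTO sat_customer VALUES (?, ?, ?, ?, ?)",
             (cust_hk, "2018-05-01", "Ann Smith", "Preston", src))
conn.execute("INSERT INTO sat_customer VALUES (?, ?, ?, ?, ?)",
             (cust_hk, "2018-06-01", "Ann Smith", "Lancaster", src))

history = conn.execute(
    "SELECT load_date, city FROM sat_customer WHERE customer_hk = ? ORDER BY load_date",
    (cust_hk,)).fetchall()
print(history)  # each change is a new satellite row, never an update
```

Note how the second satellite row records the change of city without updating the first; that append-only history is part of what makes the methodology so useful in an acquisition scenario.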

How do you go about designing and using an ERD tool for a Data Vault? I found a few options. For the enterprise, there is WhereScape® Data Vault Express. That looked like a good option, but I had hoped to use something open source so other people could adopt it across the team. It wasn’t clear how much it would cost, and, in general, if I have to ask then I can’t afford it! So far, I’ve settled on SQL Power Architect so that I can get the ‘visuals’ across to the customer and the technical team, including my technical counterpart at the customer, who picks up when I’m at a conference. This week I’m at the Data and BI Summit in Dublin, so my counterpart is picking up activities during the day, and we are touching base during our virtual stand-ups.

So, I’m still joining the dots as I go along.

If you’re interested in getting started with Azure Data Lake, I hope this gives you some pointers on the design process.

I’ll go into more detail in future blogs, but I need to stop writing this post and do some work!

Useful Data Sources for Demos, Learning and Examples

One question that pops up from time to time is the question of sample datasets for use in self-learning, creating training materials, or just playing with data. I love this question: I learn by actively trying things out, too. I love the stories in the data, and this is a great way to find the stories that bring the data to life and offer real impact.


Since I deliver real projects with customer impact, I can’t demonstrate any real customer data during my presentations, as my projects are confidential. So I have three approaches:
  • I use sample data, and I have a signed NDA.
  • I ask the customer for their data, anonymised, and have a signed NDA.
  • I use their live data and have a signed NDA.
If the customer elects the first option, then I use sample data from the sources below.
To help you get started, I’ve cobbled together some pointers here, and I hope it’s useful. Please feel free to leave more ideas in the comments.

Entrepreneur

The latest edition of Entrepreneur has an insightful article on open source (frameworks vs libraries), and it has some good pointers to datasets at the bottom of the page: https://www.entrepreneur.com/article/310965
Bernard Marr has an updated list of datasets on Forbes. I’m not going to steal Marr’s list, so I recommend that you head over to his page, where you’ll find sixty-plus options.

Data Source Direct Connectivity to R, Python, Ruby and Stata

R has a number of packages that connect to public datasets, e.g. the World Bank’s open data, which allows connectivity from R, Python, Ruby, and Stata. I used this for my recent demos at the Power BI event in the Netherlands, and it worked sweetly. You write your script to call the package, embed it in Power BI, and it will go and get the data for you. I then create the chart in R and put it into the Power BI workbook.
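To give a flavour of the kind of script involved, here is a sketch in Python against the World Bank’s public REST API. The URL shape and the metadata-plus-records JSON layout follow the World Bank’s documented v2 API; the country and indicator codes are just examples, and the sample payload at the end is hand-made so the parsing can be shown without a network call:

```python
import json
from urllib.request import urlopen

# World Bank v2 API: returns [metadata, records] as JSON.
API = ("https://api.worldbank.org/v2/country/{country}"
       "/indicator/{indicator}?format=json&per_page=100")

def fetch_indicator(country: str, indicator: str) -> list:
    """Download one indicator series; requires network access."""
    with urlopen(API.format(country=country, indicator=indicator)) as resp:
        return json.load(resp)

def to_series(payload: list) -> dict:
    """Flatten the [metadata, records] payload to {year: value}."""
    _meta, records = payload
    return {int(r["date"]): r["value"] for r in records if r["value"] is not None}

# Offline demonstration with a response shaped like the live API's:
sample = [
    {"page": 1, "pages": 1, "per_page": 100, "total": 2},
    [
        {"date": "2017", "value": 66058859, "country": {"id": "GB", "value": "United Kingdom"}},
        {"date": "2016", "value": None,     "country": {"id": "GB", "value": "United Kingdom"}},
    ],
]
print(to_series(sample))  # {2017: 66058859}
```

In Power BI, the equivalent fetch-and-flatten script (in R or Python) goes into the data source or visual, and Power BI runs it to pull the data on refresh.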

Quandl

Quandl offers financial data, and it has a hook so that R can connect directly to it as well.

Kaggle

Kaggle is owned by Google, presumably so that Google can promote TensorFlow. Since people share code based on Kaggle datasets, it’s very easy to pick up code, copy it, change it, and see how it works. However, this can be an issue, since you can’t be sure that the code is correct.

Final Note

If you’re teaching or presenting using this data and/or sample code, you can be pretty sure that your training delegates have access to the Internet too, so you need to be sure that you credit people properly.
I don’t mostly do training, although I do deliver training now and again; I am a consultant first and foremost. I’m tracking my time with DeskTime and Trello to measure exactly how I spend it, and training does not account for a high percentage; project delivery takes the majority of my time.
Although I’m a guest lecturer for the MBA program at the University of Hertfordshire, and I’m going to be a guest lecturer on their MSc Business Analysis and Consultancy course, I do not consider myself a trainer. I am a consultant who sometimes does training as part of a larger project. I haven’t gone down the MCT route because I regard training as part of a bigger consultancy offering. I never stop learning, and I don’t expect anyone else to stop learning, either.


Data-Driven or Insights-Driven? Data Analytics vs Data Science

I had an interesting conversation with one of my customers. Through my company Data Relish, I have been leading their data science program for some time now, using the Team Data Science Process as a backbone. I feel I’m fighting the good fight for data, and I like to involve others throughout the process. It’s great to watch people grow, and to see real insights and digital transformation improvements based on those insights.


Data science projects are hard, though, and it’s all about expectations. In this case, my customer was curious to know why the current data science project was taking longer than expected, and asked whether they should just exclude the business understanding part of the data science journey. Couldn’t the data just clean itself, or couldn’t we just cut out every piece of data that was a problem?

Being data-driven is all very well, but we need to be open to the insights from business expertise, too.

When the conversation continued, it became clear that a different data organization had been involved in conversations at some point. Apparently, another organization had told my customer that they needed data analytics rather than data science, and that the two were mutually exclusive. Data analytics would give them the insights without involving much, if any, business knowledge, effort, or time. What my customer understood from them was that they didn’t need to match data, clean it, and so on; data analytics simply meant analysing columns and rows of data to see what relationships and patterns could be found. In essence, the customer should divorce business knowledge from the data, and the data should be analysed in isolation. The business and the data were regarded as mutually exclusive, and the business side should be silenced in order to let the data speak.

Due to these conversations, the customer was concerned about the length of time the project was taking, and wanted to go down the ‘data analytics’ route: mix up columns, skip data cleaning and matching sources, and he was absolutely certain that insights would fall out of the data. To summarise, there were a few things behind the conversation:

  • business people are concerned about the time taken to do a data science project. They are essentially misled by their experience of Excel; they believe a data science project should be as straightforward and quick as generating a chart.

  • business people can be easily misdirected by the findings of the data science process, without being critical of the results themselves. It seems to be enough that a data science project was done, but not that it was right; the fact that it is a data science project at all is somehow ‘good enough’.

  • business people can be easily swayed by the terminology. One person said that they were going into decision science, but couldn’t properly articulate what it was in comparison to data science. That’s another blog for another day, but it’s clear that the terminology is being bandied around and people are not always understanding, defining, or delineating what the terms actually mean.

  • business people can equate doing statistics with certainty; they may say that they don’t expect 100% certainty in the findings but, in practice, that can go out of the window once the project is underway.

The thing is, this isn’t the first time I’ve had this conversation. I think that being data-driven is somewhat misleading, although I do admit to using the term myself; it is very hashtaggable, after all. I think a better phrase is insights-driven. If we remove the business interpretation and just throw in data, we can’t be sure if the findings are reasonable. As I responded during this conversation, if we put garbage in, we get garbage out. This is a stock phrase in business intelligence and data warehousing, and it also applies to data science. There needs to be a balance, and we can be open to new ideas. Our business subject matter expertise can help to shortcut the project by working with the data, not against it, and it helps us avoid going down rabbit holes just because the data said so. The insights from the business can help to make the stories in the data clearer, whilst remaining open to new insights from the data.

In other words, data and the business should not be mutually exclusive.

How did it end? We proceeded as normal, as per the data science plan I’d put in place. Fortunately, there were strong voices from the business who wanted to be included at all stages. I think that we are getting farther, faster, as a unified team, all moving in the same direction. We need to question. Data science is like April Fools’ Day, every day: don’t believe everything you read. Otherwise, we will never see the wood for the trees.


Book Review: Grokking Algorithms: An Illustrated Guide For Programmers and Other Curious People

Grokking Algorithms: An Illustrated Guide for Programmers and Other Curious People by Aditya Y. Bhargava

My rating: 5 of 5 stars

I’ve just finished reading the Manning book Grokking Algorithms: An Illustrated Guide for Programmers and Other Curious People.

This is a very readable book, with great diagrams and a very visual style.
It’s an excellent book for the budding data scientist who wants to get past the bittiness of learning pieces of open-source or proprietary software here and there, and wants to learn what the algorithms actually mean in practice. It’s fairly easy to get away with looking like a real data scientist if you know bits of R or Python, I think, but when someone scratches the surface, it can become very apparent that the underlying theory and deeper understanding are missing. This book will help people bridge that gap.
I expect that people might be tempted to ‘pinch’ the diagrams, but I’d strongly suggest that they contact the author and credit him appropriately.
I’d recommend this book to anyone who wants to understand more about algorithms. Enjoy!

View all my reviews

Doing the Do: the best resource for learning new Business Intelligence and Data Science technology

As a consultant, I get parachuted into difficult problems every day. Often, I figure it out because I have to, and because I want to. Usually, nobody else can do it other than me; they are all keeping the fires lit. I get to tackle the thorny problems that are left burning quietly, and I love the challenge!

How do you get started? The online and offline courses, books, MOOCs, papers, blogs, and forums all help, of course. I regularly use several resources for learning, but my number one source is:

Doing the ‘do’ – working on practical projects, professional or private

Nothing beats hands-on experience. 

How do you get on the project ladder? Without experience, you can’t get on a project, and without a project, you can’t get experience. It’s a chicken-and-egg situation.

Volunteer your time in the workplace – or out of it. It could be a professional project, or a ‘citizen data science’ project that you care about. Your boss wants her data? Define the business need, and identify what she actually wants. If it helps, prototype to elicit the real need. Volunteer to try to get the data for her. Take a sample and just get started with descriptive statistics. Look at the simple things first.
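As a sketch of that first pass, Python’s standard library is enough for descriptive statistics; the order values below are made up purely for illustration:

```python
import statistics

# Starting simple: descriptive statistics on a small (made-up) sample of order values.
order_values = [120, 95, 240, 130, 95, 310, 150, 88, 175, 95]

summary = {
    "count": len(order_values),
    "mean": statistics.mean(order_values),      # average order value
    "median": statistics.median(order_values),  # robust to the 310 outlier
    "mode": statistics.mode(order_values),      # most frequent value
    "stdev": round(statistics.stdev(order_values), 1),  # spread of the sample
    "min": min(order_values),
    "max": max(order_values),
}
print(summary)
```

Even this much can spark the right conversation: the mean and the median disagree here, which is the kind of simple observation that leads to better business questions.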

Not sure of the business question? Try the AzureML Cheat Sheet for help.

machine-learning-algorithm-cheat-sheet-small_v_0_6-01

Working with data means that you will be challenged with real situations, and you will read and learn more, because you have to in order to deliver.

In my latest AzureML course with Opsgility, I take this practical, business-centred approach. I show you how to take data, difficult business questions, and practical problems, and create a successful outcome; even if that outcome is a failed model, it still makes you revise the fundamental business question. It’s a safe environment in which to get experience.

So, if this is you – what’s the sequence? There are a few sequences or frameworks to try:

  • TDSP (Microsoft)
  • KDD
  • CRISP-DM
  • SEMMA

The ‘headline’ of each framework is given below, as a reference point, so you can see for yourself that they are very different. The main thing is to simply get started.

Team Data Science Process (Microsoft)

tdsp-lifecycle

 

KDD

kdd

 

CRISP-DM

330px-crisp-dm_process_diagram

 

SEMMA

metodo-semma

It’s important not to get too wrapped up in comparing models; that’s analysis paralysis, and it won’t help.

I’d suggest you start with the TDSP because of the fine resources, and take it from there.

I’d be interested in your approaches, so please do leave comments below.

Good luck!