Want to learn how to light up Big Data Analytics using Apache Spark in Azure?

Businesses struggle with many different aspects of data and technology. It can be difficult to know what technology to choose, and hard to know where to turn when there are so many buzzwords in the mix: analytics, big data and open source. My session at PASS Summit tackles these questions, using Azure and Apache Spark as a backdrop.

Vendors tend to tell their version of events, as you might expect, so it becomes really hard to get advice on how to have a proper blueprint to get you up and running. In this session, I will examine strategies for using open source technologies to improve existing common Business Intelligence issues, using Apache Spark as our backdrop to delivering open source Big Data analytics.

Once we have looked at the strategies, we will look at your choices on how to make the most of the open source technology. For example, how can we make the most of the investment? How can we speed things up? How can we manipulate data?

These business questions are translated into technical terms. We will explore how we can parallelize your computations across the nodes of a Hadoop cluster, once your clusters are set up. We will look at combining SparkR for data manipulation with ScaleR for model development in Hadoop Spark. At the time of writing, this scenario requires that you maintain separate Spark sessions, running only one session at a time, and exchange data via CSV files. Hopefully, in the near future, we’ll see an R Server release in which SparkR and ScaleR can share a Spark session and so share Spark DataFrames. Hopefully that’s out prior to the session so we can see it but, either way, we will still look at how ScaleR works with Spark and how we can use sparklyr and SparkR within a ScaleR workflow.
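The CSV hand-off between two sessions can be pictured with a small sketch. It is written in plain Python rather than R, purely for illustration: `prepare_data` and `fit_model` are hypothetical stand-ins for the data manipulation session and the modelling session, the sample rows are made up, and no Spark or R Server API is used.

```python
import csv
import tempfile
from pathlib import Path


def prepare_data(path: Path) -> None:
    """Stand-in for the data manipulation session: write cleaned rows to CSV."""
    rows = [{"x": 1.0, "y": 2.1}, {"x": 2.0, "y": 3.9}, {"x": 3.0, "y": 6.2}]
    with path.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["x", "y"])
        writer.writeheader()
        writer.writerows(rows)


def fit_model(path: Path) -> float:
    """Stand-in for the modelling session: read the CSV and fit y = slope * x."""
    with path.open(newline="") as handle:
        data = [(float(r["x"]), float(r["y"])) for r in csv.DictReader(handle)]
    # Least-squares slope for a line through the origin.
    return sum(x * y for x, y in data) / sum(x * x for x, _ in data)


with tempfile.TemporaryDirectory() as tmp:
    handoff = Path(tmp) / "prepared.csv"
    prepare_data(handoff)       # the first session finishes and closes
    slope = fit_model(handoff)  # the second session starts from the file

print(round(slope, 3))
```

The point of the sketch is only the shape of the workflow: the two stages never share memory, so everything the second stage needs has to be written to the file by the first.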

Join my session at PASS Summit 2017 to learn more about open source with Azure for Business Intelligence, with a focus on Azure Spark.

An MVP for 7 years – what’s next?

Well, it’s been a hard year, for a number of reasons, but I appear to have come out the other side.

Looking forward, what comes next?

New things!

As some of you know, I care deeply about diversity in technology.

I have set up a Diversity Charter Slack channel to encourage user group leaders to talk about diversity and to think about these issues.

I have started an effort to create a Diversity Charter that user groups can use. I need help with things like logos, thoughts on a website and so on – so please do help if you can!

The Diversity Charter looks like this, so far:

We believe that all members of the technical community are equally important.
We are part of a tech community where we value a diverse network, and learn and share from one another:
regardless of age,
regardless of colour,
regardless of their ethnicity,
regardless of their religion or beliefs,
regardless of disability,
regardless of gender or sexual orientation,
regardless of their race,
regardless of their ability or lack of ability,
regardless of nationality or accent.
We are a diverse tech community where we are all individuals with differences, but we are all members and we can all learn from each other.

I look forward to your thoughts. Please do join my Slack channel diversitycharter.slack.com/ or ping me an email at diversity@datarelish.com in order to get an invite.

I will continue to help share my knowledge through blogging, writing, speaking and presenting, and to increase my online presence. At heart, I am a content producer. It’s what I do, and it’s what I love.

I will continue working hard on the PASS Board. I just attended a Board meeting, which took place over two nights during the week in the PST timezone. I am based in the GMT timezone, so I had a few very late nights or very early mornings, depending on your view. My recent focus has been in a ‘trusted advisor’ capacity, so I am helping to drive the new developer initiatives and business analytics initiatives in a strategic manner.

To keep the community fresh, I will continue to try to help to develop other community leaders. I have nominated a lot of people for the MVP Award this year, including David Moss, Tomaz Kastrun and other people that I won’t mention, because they weren’t successful this time.

What’s wrong with CRISP-DM, and is there an alternative?

Many people, including myself, have discussed CRISP-DM in detail. However, I didn’t feel totally comfortable with it, for a number of reasons which I list below. Now that I had raised a problem, I needed to find a solution, and that’s where the Microsoft Team Data Science Process comes in. Read on for more detail!

  • What is CRISP-DM?
  • What’s wrong with CRISP-DM?
  • How does technology impinge on CRISP-DM?
  • What comes after CRISP-DM? Enter the Team Data Science Process
  • What is the Team Data Science Process?

What is CRISP-DM?

One common methodology is CRISP-DM (The Modeling Agency). The Cross Industry Standard Process for Data Mining model, or CRISP-DM as it is known, is a process framework for designing, creating, building, testing, and deploying machine learning solutions. The process is arranged into six phases. The phases can be seen in the following diagram:

[Figure: CRISP-DM process diagram]

The phases are described below:

| Phase | Description |
| --- | --- |
| Business Understanding / Data Understanding | The first phase looks at the machine learning solution from the business standpoint, rather than a technical standpoint. Once the business concept is defined, the Data Understanding phase focuses on data familiarity and collation. |
| Data Preparation | In this stage, data will be cleansed and transformed, and it will be shaped ready for the Modeling phase. |
| Modeling | In the modeling phase, various techniques are applied to the data. The models are further tweaked and refined, and this may involve going back to the Data Preparation phase in order to correct any unexpected issues. |
| Evaluation | The models need to be tested and verified to ensure that they meet the business objectives that were defined initially in the Business Understanding phase. Otherwise, we may have built a model that does not answer the business question. |
| Deployment | The models are published so that the customer can make use of them. This is not the end of the story, however. |

Then, the CRISP-DM process restarts. We live in a world of ever-changing data, business requirements, customer needs, and environments, and the process will be repeated.

CRISP-DM is possibly the best-known framework for implementing machine learning projects specifically. It has a good focus on the business understanding piece.

What’s wrong with CRISP-DM?

The model no longer seems to be actively maintained. At the time of writing, the official site, CRISP-DM.org, is no longer being maintained. Further, the framework itself has not been updated to address working with new technologies, such as Big Data.

As a project leader, I want to keep up-to-date with the newest frameworks, and the newest technology. It’s true what they say: you won’t get a change until you make a change.

The methodology itself was conceived in 1996, 21 years ago. I’m not the only one to come out and say so: industry veteran Gregory Piatetsky of KDnuggets had the following to say:

CRISP-DM remains the most popular methodology for analytics, data mining, and data science projects, with 43% share in latest KDnuggets Poll, but a replacement for unmaintained CRISP-DM is long overdue.

Yes, people. Just because something’s popular, it doesn’t mean that it is automatically right. Since the title ‘data scientist’ is the new sexy, lots of inexperienced data scientists are rushing to use this model because it is the obvious one. I don’t think I’d be serving my customers well if I didn’t keep up-to-date, and that’s why I’m moving away from CRISP-DM to the Microsoft Team Data Science Process.

CRISP-DM also neglects aspects of decision making. James Taylor, a veteran of the PASS Business Analytics events, explains this issue in great detail in his blog series over at KDnuggets. If you haven’t read his work, I recommend you read his article now and learn from his wisdom.

How does technology impinge on CRISP-DM?

Big Data technologies mean that there can be additional effort spent in the Data Understanding phase, for example, as the business grapples with the additional complexities that are involved in the shape of Big Data sources.

What comes after CRISP-DM? Enter the Team Data Science Process

The next framework, Microsoft’s Team Data Science Process framework, is aimed at including Big Data as a data source. As previously stated, the Data Understanding phase can be more complex with Big Data.

Big Data and the Five Vs

There are debates about the number of Vs that apply to Big Data, but let’s go with Ray Wang’s definitions here. Our data can be subject to the five Vs, as follows:

[Figure: the five Vs of Big Data]

This means that our data becomes more confusing for business users to understand and process. This issue can easily distract the business team away from what they are trying to achieve. So, following the Microsoft Team Data Science Process can help us to ensure that we have taken our five Vs into account, whilst keeping things ticking along for the purpose of the business goal.

As we stated previously, CRISP-DM doesn’t seem to be actively maintained. With Microsoft dollars behind it, the Team Data Science Process isn’t going away anytime soon.

What is the Team Data Science Process?

The process is shown in this diagram, courtesy of Microsoft:

[Figure: Team Data Science Process lifecycle diagram, courtesy of Microsoft]

The Team Data Science Process is loosely divided into five main phases:

  • Business Understanding
  • Data Acquisition and Understanding
  • Modelling
  • Deployment
  • Customer Acceptance

| Phase | Description |
| --- | --- |
| Business Understanding | The Business Understanding process starts with a business idea, which is solved with a machine learning solution. A project plan is generated. |
| Data Acquisition and Understanding | This important phase focuses on fact-finding about the data. |
| Modelling | The model is created, built and verified against the original business question. The model’s results are evaluated against the key metrics. |
| Deployment | The models are published to production, once they are proven to be a fit solution to the original business question. |
| Customer Acceptance | This process is the customer sign-off point. It confirms that the pipeline, the model, and their deployment in a production environment are satisfying customer objectives. |

The TDSP process itself is not linear; the output of the Data Acquisition and Understanding phase can feed back to the Business Understanding phase, for example. When the essential technical pieces start to appear, such as connecting to data and integrating multiple data sources, there may be actions arising from this effort.

The TDSP is a cycle rather than a linear process, and it does not finish, even once the model is deployed. Keep testing and evaluating that model!

TDSP Next Steps

There are a lot of how-to guides and downloads over at the TDSP website, so you should head over and take a look.

The Data Science ‘unicorn’ does not exist. Thanks to Hortonworks for their image below:

[Figure: the Data Science ‘unicorn’, courtesy of Hortonworks]

To mitigate this lack of a Data Science unicorn, the Team Data Science Process is a team-oriented solution which emphasizes teamwork and collaboration throughout. It recognizes the importance of working as part of a team to deliver Data Science projects. It also offers useful information on the importance of having standardized source control and backups. It can include open source technology as well as Big Data technologies.

To summarise, the TDSP provides a clear structure for you to follow throughout the Data Science process, and facilitates teamwork and collaboration along the way.

Business Book Review: Start with Why by Simon Sinek

Perhaps a better subtitle might be: ‘Start with Why: How Great Leaders Inspire Others to Focus and Succeed’.

Great leaders and organizations are good at seeing what most people can’t see, which is the mindset of having a longer-term vision. Starting with clear focus and *why* is a great start, and presumably shows that you have really thought about it.
People, brands and organisations need to start with WHY, to give people a way to tell the outside world who they are and what they believe.

As an external consultant, I have found the ‘celery test’ extremely useful when advising customers. Here is Sinek himself, on the topic:

Although the subtitle is about inspiring great leaders to take action, I found that I took the idea of ‘focus’ away from the book. Sometimes, I see organisations acting like a start-up; trying to achieve a breadth and coverage quickly, and hoping that something will stick with customers.

Starting with WHY means that organisations can achieve integrity, because they will have success, and a story to go with success. Organisations, large and small – and even one-man-band consultants – need to think about their megaphone and what they are actually saying to customers.
Don’t be fooled into thinking that customers pay much attention. Instead, they ‘snapshot’ and you only have a short time to get your message across. Keeping your messaging simple means that there is less ‘noise’ to cause confusion.
I recommend you read it – I found it inspirational, and it helped me to get back to my ‘story’, and to think about my customers’ ‘stories’ as well. I had read that it was over-long, but I liked Sinek’s way of weaving story and ‘relatable’ anecdote with the points he was making. Sometimes I would find something different in the anecdote than he intended, so I was taking my understanding to a different level.

Favourite Quotes

For values or guiding principles to be truly effective they have to be verbs. It’s not “integrity,” it’s “always do the right thing.” It’s not “innovation,” it’s “look at the problem from a different angle.” Articulating our values as verbs gives us a clear idea – we have a clear idea of how to act in any situation.

Review: Start with Why: How Great Leaders Inspire Everyone to Take Action

Start with Why: How Great Leaders Inspire Everyone to Take Action by Simon Sinek
My rating: 5 of 5 stars


Microsoft Data Insights – Digital Transformation with Power BI for the CEO

I’m holding a series of training courses around the UK; more details will be published. In the first instance, on 15th September, I’ll be holding a day-long practical workshop on Working with Business Data for Busy Executives in SME Organisations in Hertfordshire, England. The cost will be £100 plus VAT, with food and workshop materials included, and you can also network and share experiences with other attendees who, like you, will be running businesses.

I don’t believe in a ‘stack ’em high’ approach, which doesn’t give a pleasant experience for learning. So, classes will be restricted to 12 people only, unless otherwise stated. This means that you will get a good amount of attention.

I’m doing the Executive MBA at the University of Hertfordshire Business School, and I’ve also been a NED (Non Executive Director) for PASS, who are based in the United States. As I’ve been spending time leading organisations, I’m keen to share this knowledge and expertise with the community from a data-driven, data leader perspective. The following blog post will give you a flavor of the workshop, along with some of my thoughts on Microsoft Data Insights Summit. If you have any questions, please pop them in the comments box and I’ll read them from there.

Here are my slides from Microsoft Data Insights Summit, combined with some of the slides from the keynote by James Phillips, held in June 2017.

For those of you who know me, you’ll know that I have extensive experience in Tableau as well as Power BI. However, most of my consulting data visualisation is in the Power BI suite of products. Why is that?

Tableau is wonderful at data visualisation, as is Power BI, of course. However, for enterprise customers, where I’m building a data warehouse, I prefer having analytics closer to the data source, perhaps in a data warehouse or data lake. I like to think about the overall business intelligence architecture. Tableau is superb at data visualisation and it also cleans and integrates data, but to a much lesser extent, which is why they partner so well with Alteryx. I don’t like cleaning data or doing repeatable analytics so close to the end reporting layer, and business people seem to want to do it there, without thinking of issues such as robustness, repeatability and longevity in the analytical formulas that they are creating. I prefer to hand off clean data and analytical formulas to the reporting tool as far as possible.

I’m not thinking about Business Intelligence in terms of a spot solution for data visualisation or reporting, for example. I’m looking at the whole canvas. I prefer to clean the data and have it all fixed closer to the source, so that I can get the same number for the same report, regardless of the reporting technology that I use. With Power BI, I can stay within the Microsoft playpen of technologies. I do note, however, that Tableau Server is in Azure and, if you are looking at analytics, that’s another option so that the analytics formulas aren’t contained in disparate workbooks. Instead, they are published to Tableau Server and people can download their workbooks from there, for ‘one version of the truth’.

As an external consultant, I work with Power BI because I think it has an astonishing reach technically as well as geographically. Some of my customers are global and I really need the certainty of global resiliency. Gone are the days when Microsoft had a lot of disparate reporting technologies that didn’t talk to one another very well and we had lots of different interfaces that used to overlap. Customers got really confused about what to use. For example, do you put your KPIs in Analysis Services, or in Reporting Services? Now:

The answer is always Power BI! Take a look:

[Slide: Microsoft Data Insights Summit]

Apart from Data Visualization, what is Power BI useful for?

Power BI is particularly useful for:

  • businesses that are acquiring other businesses and need somewhere to put the data, while keeping the business running in the meantime
  • cost savings
  • GDPR – if you don’t know what this is, you need to contact me to find out more. Microsoft are at the forefront of working with customers to make sure that they are compliant.

Do I still see Tableau?

Yes – some of my customers don’t need public cloud because they pop up their own data centres if and when and where they need them. So, for them, they tend to stick with what they know, and what works for them.

What Business Intelligence tools do I see less of?

I see QlikView less and less, as customers look to align their reporting and find that they can replicate their Qlik scripts in SQL Server and SSRS.

I also don’t see Pyramid Analytics appearing much, and I don’t get asked often about them. According to the Gartner report, 2017 may represent a critical period for the company and, rightly or wrongly, the Gartner Magic Quadrant does carry enormous weight when customers are looking for solutions. With many solutions, customers don’t use the full range of features contained in any technical solution, and Pyramid are going to have to work hard to explain how they compare / compete with the Power BI on-premises solution, which is going to go from strength to strength.

However, for others, particularly in the SME market, the Azure offering is extremely compelling. Power BI and Azure together mean that you can focus on the business, rather than working on the technology to support the business. I can also see that more and more data is going into the cloud, and I am part of projects where I am doing exactly that – cloud business intelligence. Cloud Business Intelligence is a real growth offering for me and I plan to keep being ahead of the curve.

Are people using Power BI or is it simply good Microsoft Marketing?

People are using it, yes. Here are the numbers, produced by James Phillips during the Power BI Keynote:

[Slide: Microsoft Data Insights Summit]

Power BI and the C-Suite

Given its reach within the organisation, Power BI can reach the C-suite level as well as the rest of us. Before continuing, it’s probably worth reading about linear vs exponential business growth models e.g. HBR.

You can watch the video below, or read on for some of the headlines:

  • Gross and Net Profit
  • Net Profit
  • Progress Towards Targets
  • Revenues and revenue growth rate
  • Expenses
  • Employee Engagement

Let’s get started!

Gross and Net Profit

Why do CEOs care? As part of the Digital Transformation process, the CEO must develop a guiding philosophy about how he or she can best add value whilst showing ongoing strategic assessment and planning. However, it is difficult for them to allocate time to the collection, cultivation and analysis of data. Instead, they need to focus on strategic decisions, and they need data to run their business, to understand how their customers behave and measure what really matters to the organization. Power BI can help to bring clarity and predictability to the CEO, and this session is aimed at CEOs, and those who support them with data, in order to see how they can be empowered by Power BI, and see it as a key asset within the organisation’s short- and long-term future.

Net Profit

This goes without saying, but keeping an eye on net profit at all times is essential for business leaders. This might be visualized as a line graph or quarterly chart. However you decide to represent the data, it needs to provide detailed, regularly updated information. You can get added value by allowing this data to be broken down.
With Power BI, you’d be able to tap your chart and see real-time data on profits by region, product type or team. First you calculate your gross profit, then your expenses, subtract expenses from gross profit, and you have net profit.

Calculate Gross Profit first:
Gross profit, also called gross margin, shows you how much money you made from selling a product.
It subtracts your wholesale cost from the selling price to calculate the difference. It does not take into account expenses from rent, personnel, supplies, taxes or interest. Gross profit is a required step toward calculating the company’s income or net profit.
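The arithmetic can be sketched in a few lines of Python. This is a minimal illustration rather than anything Power BI does internally: the function names and the sample figures are hypothetical, and gross profit is taken to be sales revenue minus the wholesale cost of the goods sold, before rent, personnel, supplies, taxes or interest.

```python
def gross_profit(revenue: float, cost_of_goods: float) -> float:
    """Sales revenue minus the wholesale cost of the goods sold."""
    return revenue - cost_of_goods


def net_profit(revenue: float, cost_of_goods: float, expenses: float) -> float:
    """Gross profit minus all other expenses (rent, personnel, taxes and so on)."""
    return gross_profit(revenue, cost_of_goods) - expenses


# Hypothetical figures for one quarter.
quarter_gross = gross_profit(revenue=120_000, cost_of_goods=70_000)
quarter_net = net_profit(revenue=120_000, cost_of_goods=70_000, expenses=30_000)
print(quarter_gross, quarter_net)
```

Keeping the two steps as separate functions mirrors the order described above: gross profit first, then expenses subtracted to reach net profit.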

Progress Towards Targets

You can use the EXPON.DIST function in Microsoft Excel to help measure progress towards your targets.
Use EXPON.DIST to model the time between events, such as how long it takes from order placement to actual delivery. For example, you can use EXPON.DIST to determine the probability that the process takes at most 1 minute.
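For readers who want to check the numbers outside Excel, the cumulative form of EXPON.DIST is the exponential distribution’s cumulative distribution function, 1 - e^(-λx). The Python sketch below mirrors Excel’s argument order (x, lambda, cumulative); the function name `expon_dist` and the rate of 0.5 events per minute are illustrative assumptions.

```python
import math


def expon_dist(x: float, lam: float, cumulative: bool = True) -> float:
    """Exponential distribution, following the argument order of EXPON.DIST."""
    if x < 0 or lam <= 0:
        raise ValueError("x must be >= 0 and lambda must be > 0")
    if cumulative:
        # Probability that the waiting time is at most x.
        return 1.0 - math.exp(-lam * x)
    # Probability density at x.
    return lam * math.exp(-lam * x)


# Hypothetical rate: on average one event every 2 minutes (lambda = 0.5 per minute).
p_within_one_minute = expon_dist(1.0, 0.5)
print(round(p_within_one_minute, 4))
```

With these assumed inputs, the probability that the process takes at most 1 minute comes out at roughly 39%.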

Revenues and revenue growth rate

By being able to instantly visualize how fast (or otherwise) your business is growing its revenues, it’s much easier to find out what’s going right and what’s going wrong. Need to lose some dead weight? Invest in a growing department? Respond to a new trend among consumers? Tracking your revenues closely is crucial and will help with those decisions. A line graph would again be particularly clear in this instance.
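As a rough sketch of the number behind such a line graph, revenue growth rate is usually taken as the change between two periods divided by the earlier period. The Python below assumes that definition; the function name and the quarterly figures are hypothetical.

```python
def growth_rate(previous: float, current: float) -> float:
    """Period-over-period growth as a fraction of the previous period."""
    if previous == 0:
        raise ValueError("growth rate is undefined when the previous period is zero")
    return (current - previous) / previous


# Hypothetical quarterly revenues.
revenues = [200_000, 220_000, 209_000]
rates = [growth_rate(prev, curr) for prev, curr in zip(revenues, revenues[1:])]
print([f"{rate:.1%}" for rate in rates])
```

A negative rate, as in the last quarter of the sample data, is the kind of signal that would prompt a closer look at what is going wrong.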

Think of a Rubik’s cube – people instinctively know how to use them, and to arrange the cube into colors. We also interact with colour and data in the same way; intuitively and quickly.

Expenses

Whether it’s staff, machinery, IT or property, your expenses are one of the biggest drains on your long term success. A dashboard can break these down instantly so you can see where your biggest outgoings are, and then make decisions about what’s costing too much.
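The breakdown a dashboard performs here can be sketched as follows. It is only an illustration in Python, with made-up categories and amounts: each expense is ranked from largest to smallest and shown with its share of the total.

```python
# Hypothetical monthly expenses by category.
expenses = {"staff": 52_000, "property": 18_000, "IT": 12_500, "machinery": 9_500}

total = sum(expenses.values())

# Largest outgoings first, each with its share of the total.
breakdown = sorted(
    ((category, amount, amount / total) for category, amount in expenses.items()),
    key=lambda row: row[1],
    reverse=True,
)

for category, amount, share in breakdown:
    print(f"{category:<10} {amount:>8,} {share:6.1%}")
```

Sorting by amount puts the biggest outgoings at the top, which is where the decisions about what is costing too much tend to start.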

Revenue per employee
Revenue per employee is a little like Return on Investment. Are your people actually making enough revenue to justify hiring them? Are they working at 100% capacity or is there room for them to work more, instead of employing new workers? A revenue per employee dashboard helps you make these choices rationally.
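The metric itself is a simple ratio, sketched below in Python under the assumption that it is total revenue divided by headcount. The team names and figures are hypothetical.

```python
def revenue_per_employee(revenue: float, employees: int) -> float:
    """Total revenue divided by the number of employees."""
    if employees <= 0:
        raise ValueError("employees must be a positive number")
    return revenue / employees


# Hypothetical annual revenue and headcount per team.
teams = {"sales": (900_000, 12), "support": (300_000, 8)}
per_head = {
    team: revenue_per_employee(revenue, headcount)
    for team, (revenue, headcount) in teams.items()
}
print(per_head)
```

Comparing the figure across teams, rather than looking at one company-wide number, is what makes it useful for hiring decisions.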

Employee Engagement

Measured by an anonymous survey, employee engagement is a key BI factor for any CEO. If your people are motivated, enthusiastic and giving their work 100%, you can be sure your company will grow. By contrast, unengaged colleagues will be a detriment to productivity. It’s essential to keep regular tabs on how employees are feeling about their work.

Summary

As part of the Digital Transformation process, the CEO must develop a guiding philosophy about how he or she can best add value whilst showing ongoing strategic assessment and planning. However, it is difficult for them to allocate time to the collection, cultivation and analysis of data. Instead, they need to focus on strategic decisions, and they need data to run their business, to understand how their customers behave and measure what really matters to the organization.
Power BI can help to bring clarity and predictability to the CEO, and this session is aimed at CEOs, and those who support them with data, in order to see how they can be empowered by Power BI, and see it as a key asset within the organisation’s short- and long-term future.