What’s wrong with CRISP-DM, and is there an alternative?

Many people, including myself, have discussed CRISP-DM in detail. However, I didn’t feel totally comfortable with it, for a number of reasons which I list below. Now that I had raised a problem, I needed to find a solution, and that’s where the Microsoft Team Data Science Process comes in. Read on for more detail!

  • What is CRISP-DM?
  • What’s wrong with CRISP-DM?
  • How does technology impinge on CRISP-DM?
  • What comes after CRISP-DM? Enter the Team Data Science Process
  • What is the Team Data Science Process?

 

What is CRISP-DM?

One common methodology is CRISP-DM (The Modeling Agency). The Cross Industry Standard Process for Data Mining, or CRISP-DM as it is known, is a process framework for designing, creating, building, testing, and deploying machine learning solutions. The process is arranged into six phases. The phases can be seen in the following diagram:

[Image: CRISP-DM process diagram]

The phases are described below:

Phase | Description
Business Understanding / Data Understanding | The first phase looks at the machine learning solution from the business standpoint, rather than a technical standpoint. Once the business concept is defined, the Data Understanding phase focuses on data familiarity and collation.
Data Preparation | In this stage, data is cleansed and transformed, and shaped ready for the Modeling phase.
Modeling | In the Modeling phase, various techniques are applied to the data. The models are further tweaked and refined, and this may involve going back to the Data Preparation phase in order to correct any unexpected issues.
Evaluation | The models need to be tested and verified to ensure that they meet the business objectives that were defined initially in the Business Understanding phase. Otherwise, we may have built a model that does not answer the business question.
Deployment | The models are published so that the customer can make use of them. This is not the end of the story, however.

Then, the CRISP-DM process restarts. We live in a world of ever-changing data, business requirements, customer needs, and environments, and the process will be repeated.

CRISP-DM is possibly the best-known framework for implementing machine learning projects specifically. It has a good focus on the business understanding piece.
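To make the Evaluation phase a little more concrete, here is a minimal sketch in Python of checking a model against targets agreed during Business Understanding. The metric names, the thresholds, and the choice to send a failing model back to the Modeling phase are all assumptions made for illustration; CRISP-DM itself does not prescribe them.

```python
# Hypothetical business targets agreed during Business Understanding.
# The metric names and thresholds are illustrative assumptions.
BUSINESS_TARGETS = {"precision": 0.80, "recall": 0.70}


def unmet_objectives(model_metrics, targets=BUSINESS_TARGETS):
    """Return the names of the targets the model does not meet, sorted."""
    return sorted(
        name
        for name, minimum in targets.items()
        if model_metrics.get(name, 0.0) < minimum
    )


def next_phase(model_metrics, targets=BUSINESS_TARGETS):
    """Deploy only when every target is met; otherwise go back (assumed: to Modeling)."""
    if unmet_objectives(model_metrics, targets):
        return "Modeling"
    return "Deployment"


candidate = {"precision": 0.85, "recall": 0.64}
print(unmet_objectives(candidate))  # ['recall']
print(next_phase(candidate))        # Modeling
```

The point is simply that a model only moves on to Deployment when it answers the business question, not when it merely looks good technically.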

What’s wrong with CRISP-DM?

The model no longer seems to be actively maintained: at the time of writing, the official site, CRISP-DM.org, is no longer being updated. Further, the framework itself has not been updated to address working with new technologies, such as Big Data.

As a project leader, I want to keep up-to-date with the newest frameworks, and the newest technology. It’s true what they say; you won’t get a change until you make a chance.

The methodology itself was conceived in 1996, 21 years ago. I’m not the only one to come out and say so: industry veteran Gregory Piatetsky of KDnuggets had the following to say:

“CRISP-DM remains the most popular methodology for analytics, data mining, and data science projects, with 43% share in latest KDnuggets Poll, but a replacement for unmaintained CRISP-DM is long overdue.”

Yes, people. Just because something’s popular, it doesn’t mean that it is automatically right. Since the title ‘data scientist’ is the new sexy, lots of inexperienced data scientists are rushing to use this model because it is the obvious one. I don’t think I’d be serving my customers well if I didn’t keep up-to-date, and that’s why I’m moving away from CRISP-DM to the Microsoft Team Data Science Process.

CRISP-DM also neglects aspects of decision making. James Taylor, a veteran of the PASS Business Analytics events, explains this issue in great detail in his blog series over at KDnuggets. If you haven’t read his work, I recommend you read his article now and learn from his wisdom.

How does technology impinge on CRISP-DM?

Big Data technologies mean that there can be additional effort spent in the Data Understanding phase, for example, as the business grapples with the additional complexities that are involved in the shape of Big Data sources.

What comes after CRISP-DM? Enter the Team Data Science Process

The next framework, Microsoft’s Team Data Science Process framework, is aimed at including Big Data as a data source. As previously stated, the Data Understanding phase can be more complex.

Big Data and the Five Vs

There are debates about the number of Vs that apply to Big Data, but let’s go with Ray Wang’s definitions here. Our data can be subject to the five Vs as follows:

[Image: the five Vs of Big Data]

This means that our data becomes more confusing for business users to understand and process. This issue can easily distract the business team away from what they are trying to achieve. So, following the Microsoft Team Data Science Process can help us to ensure that we have taken our five Vs into account, whilst keeping things ticking along for the purpose of the business goal.

As we stated previously, CRISP-DM doesn’t seem to be actively maintained. With Microsoft dollars behind it, the Team Data Science Process isn’t going away anytime soon.

What is the Team Data Science Process?

The process is shown in this diagram, courtesy of Microsoft:

[Image: TDSP lifecycle diagram]

The Team Data Science Process is loosely divided into five main phases:

  • Business Understanding
  • Data Acquisition and Understanding
  • Modelling
  • Deployment
  • Customer Acceptance

Phase | Description
Business Understanding | The Business Understanding process starts with a business idea, which is solved with a machine learning solution. A project plan is generated.
Data Acquisition and Understanding | This important phase focuses on fact-finding about the data.
Modelling | The model is created, built and verified against the original business question. The results are evaluated against the key metrics.
Deployment | The models are published to production, once they are proven to be a fit solution to the original business question.
Customer Acceptance | This process is the customer sign-off point. It confirms that the pipeline, the model, and their deployment in a production environment are satisfying customer objectives.

 

The TDSP process itself is not linear; the output of the Data Acquisition and Understanding phase can feed back to the Business Understanding phase, for example. When the essential technical pieces start to appear, such as connecting to data and integrating multiple data sources, there may be actions arising from this effort.

The TDSP process is a cycle rather than a linear process, and it does not finish, even if the model is deployed. Keep testing and evaluating that model!
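One way to picture this non-linear cycle is as a small map of allowed moves between phases. The Python sketch below is only an illustration: the forward order and the feedback from Data Acquisition and Understanding to Business Understanding come from the description above, while the other backward moves, and the step from Customer Acceptance back to Business Understanding that closes the loop, are assumptions.

```python
from enum import Enum


class Phase(Enum):
    BUSINESS_UNDERSTANDING = "Business Understanding"
    DATA_ACQUISITION = "Data Acquisition and Understanding"
    MODELLING = "Modelling"
    DEPLOYMENT = "Deployment"
    CUSTOMER_ACCEPTANCE = "Customer Acceptance"


# Allowed moves between phases. Only the forward order and the feedback from
# Data Acquisition to Business Understanding are taken from the text; the
# remaining backward edges and the loop-closing edge are assumptions.
TRANSITIONS = {
    Phase.BUSINESS_UNDERSTANDING: {Phase.DATA_ACQUISITION},
    Phase.DATA_ACQUISITION: {Phase.MODELLING, Phase.BUSINESS_UNDERSTANDING},
    Phase.MODELLING: {Phase.DEPLOYMENT, Phase.DATA_ACQUISITION},
    Phase.DEPLOYMENT: {Phase.CUSTOMER_ACCEPTANCE, Phase.MODELLING},
    Phase.CUSTOMER_ACCEPTANCE: {Phase.BUSINESS_UNDERSTANDING},
}


def is_valid_path(path):
    """Check that every consecutive pair of phases is an allowed move."""
    return all(nxt in TRANSITIONS[cur] for cur, nxt in zip(path, path[1:]))


# A project that revisits the business question after looking at the data.
project = [
    Phase.BUSINESS_UNDERSTANDING,
    Phase.DATA_ACQUISITION,
    Phase.BUSINESS_UNDERSTANDING,
    Phase.DATA_ACQUISITION,
    Phase.MODELLING,
]
print(is_valid_path(project))
```

Because no phase is a dead end in this map, the process never "finishes"; it simply moves on to the next turn of the cycle.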

TDSP Next Steps

There are a lot of how-to guides and downloads over at the TDSP website, so you should head over and take a look.

The Data Science ‘unicorn’ does not exist. Thanks to Hortonworks for their image below:

[Image: the Data Science unicorn, courtesy of Hortonworks]

To mitigate this lack of a Data Science unicorn, the Team Data Science Process is a team-oriented solution which emphasizes teamwork and collaboration throughout. It recognizes the importance of working as part of a team to deliver Data Science projects. It also offers useful information on the importance of having standardized source control and backups. It can include open source technology as well as Big Data technologies.

To summarise, the TDSP provides a clear structure for you to follow throughout the Data Science process, and facilitates teamwork and collaboration along the way.

Guess who is appearing in Joseph Sirosh’s PASS Keynote?

This girl! I am super excited and please allow me to have one little SQUUEEEEEEE! before I tell you what’s happening. Now, this is a lifetime achievement for me, and I cannot begin to tell you how absolutely and deeply honoured I am. I am still in shock!

I am working really hard on my demo and….. I am not going to tell you what it is. You’ll have to watch it. Ok, enough about me and all I’ll say is two things: it’s something that’s never been done at PASS Summit before and secondly, watch the keynote because there may be some discussion about….. I can’t tell you what… only that, it’s a must-watch, must-see, must do keynote event.

We are in a new world of Data and Joseph Sirosh and the team are leading the way. Watching the keynote will mean that you get the news as it happens, and it will help you to keep up with the changes. I do have some news about Dr David DeWitt’s Day Two keynote… so keep watching this space. Today I’d like to talk about the Day One keynote with the brilliant Joseph Sirosh, CVP of Microsoft’s Data Group.

Now, if you haven’t seen Joseph Sirosh present before, then you should. I’ve put some of his earlier sessions here and I recommend that you watch them.

Ignite Conference Session

MLDS Atlanta 2016 Keynote

I hear you asking… what am I doing in it? I’m keeping it a surprise! Well, if you read my earlier blog, you’ll know I transitioned from Artificial Intelligence into Business Intelligence and now I do a hybrid of AI and BI. As a Business Intelligence professional, my customers will ask me for advice when they can’t get the data that they want. Over the past few years, the ‘answer’ to their question has gone far, far beyond the usual on-premise SQL Server, Analysis Services, SSRS combo.

We are now in a new world of data. Join in the fun!

Customers sense that there is a new world of data. The ‘answer’ to the question ‘Can you please help me with my data?’ is complex and varied, and it’s very much aimed at cost sensitivities, too. Often, customers struggle with data because they now have a Big Data problem, or a storage problem, or a data visualisation access problem. Azure is very neat because it can cope with all of these issues. Now, my projects are Business Intelligence and Business Analytics projects… but they are also ‘move data to the cloud’ projects in disguise, and that’s in response to the customer need. So if you are a Business Intelligence professional, get enthusiastic about the cloud because it really empowers you with a new generation of exciting things you can do to please your users and data consumers.

As a BI or an analytics professional, cloud makes data more interesting and exciting. It means you can have a lot more data, in more shapes and sizes, and access it in different ways. It also means that you can focus on what you are good at, and make your data estate even more interesting by augmenting it with cool features in Azure. For example, you could add in more exciting things such as the Apache Tika library as a worker role in Azure to crack through PDFs and do interesting things with the data in there. If you bring it into SSIS, then you can spin it up and tear it down again when you don’t need it.

I’d go as far as to say that, if you are in Business Intelligence at the moment, you will need to learn about cloud sooner or later. Eventually, you’re going to run into Big Data issues. Alternatively, your end consumers are going to want their data on a mobile device, and you will want easy solutions to deliver it to them. Customers are interested in analytics and the new world of data and you will need to hop on the Azure bus to be a part of it.

The truth is, Joseph Sirosh’s keynotes always contain amazing demos. (No pressure, Jen, no pressure….. ) Now, it’s important to note that these demos are not ‘smoke and mirrors’….

The future is here, now. You can have this technology too.

It doesn’t take much to get started, and it’s not too far removed from what you have in your organisation. AzureML and Power BI have literally hundreds of examples. I learned AzureML looking at the following book by Wee-Hyong Tok and others, so why not download a free book sample?

https://read.amazon.co.uk/kp/card?asin=B00MBL261W&preview=inline&linkCode=kpe&ref_=cm_sw_r_kb_dp_c54ayb2VHWST4

How do you proceed? Well, why not try a little homespun POC with some of your own data to learn about it, and then show your boss? I don’t know about you but I learn by breaking things, and I break things all the time when I’m learning. You could download some Power BI workbooks, use the sample data and then try to recreate them, for example. Or, why not look at the community R Gallery and try to play with the scripts? You broke something? No problem! Just download a fresh copy and try again. You’ll get further next time.

I hope to see you at the PASS keynote! To register, click here: http://www.sqlpass.org/summit/2016/Sessions/Keynotes.aspx 

WPC Day One: Translating Digital Transformation into Solutions

I blogged over at my ‘official’ company blog about strategic considerations regarding Digital Transformation. There is a lot of messaging directed at sales, partners and CEO level conversations. For the techies, however, how does the strategy translate into a technical implementation that you can actually deliver, to facilitate Digital Transformation within the organisation? In other words, how do you make solutions that are sustainable and relevant?

Microsoft can help with modern, cloud-based tools and a cloud platform. Partners have the ability to use tools such as Office365, Power BI, Microsoft Flow and AzureML to reduce the integration cost and friction to deliver technical solutions. These partners can speak directly to the digital transformation, and lead it. These tools can form composable units or modules, which can be fitted together to meet business needs directly, thereby facilitating digital transformation.

What are these tools? During the WPC keynote, Ecolabs showed off their solution, which involved Power BI and Microsoft Flow. Here is the example Microsoft Power BI solution:
[Image: WPC Day 1 slides]
Microsoft Flow is a new tool, which was used to create some of the workflows to align the productivity processes with the resulting dashboard.

What is Microsoft Flow? Well, it’s a great little app and I think you should take a look. Microsoft Flow allows you to create automated workflows between your business or consumer applications and services. It connects them so that you can get notifications, synchronize files, collect data, and carry out other actions that might be useful to your business.

Why is that useful for a Business Intelligence implementation? Well, it can help to track where your data is going. As someone who often goes into organisations where people have ‘lost’ data or it is hiding somewhere that the business people can’t get it, I see Microsoft Flow as a way forward for Digital Transformation in the business by facilitating the flow of data around the organisation.

You can even create workflows on your mobile device. Here is the Ecolabs example from WPC:
[Image: WPC Day 1 slides]
Basically, a Flow connects your web services, files, and cloud-based data to save time and effort for everyone, every day.

It’s good to see that Microsoft are a much more open organisation these days; I think that Microsoft Flow is evidence of the open attitude towards other companies, organisations and methodologies that are outside of the Microsoft corporate boundary. In particular, I am a huge fan of Wunderlist and they mentioned it yesterday during the Day One keynote. I know that Wunderlist have been bought by Microsoft and I hope that Wunderlist will appear in Office365 soon, such as in Outlook.

How does Flow work? Well, you start with a template, which gives you a great head start. Why not give it a blast? If it means you get to use Wunderlist as well for all of your lists, and start to love it, then you can thank me!

 

You could even use Microsoft Flow for new GitHub issues, and send a notification to Slack. Or perhaps you could use Flow so that you retain Dropbox as your file storage system, integrated with Office365. The examples are endless, I think.
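In Flow you would set this up from a template rather than by writing code, but it can help to see what a "trigger, then action" workflow boils down to. The Python sketch below is not how Flow works internally; it just takes a GitHub "issue opened" webhook event (trigger) and builds the JSON body that a Slack incoming webhook expects (action). The function name and the sample event are made up for illustration, and nothing is actually sent anywhere.

```python
import json


def slack_message_for_issue(event):
    """Build a Slack incoming-webhook payload for a newly opened GitHub issue.

    Returns None for any other kind of event, so that nothing is posted.
    """
    if event.get("action") != "opened" or "issue" not in event:
        return None
    issue = event["issue"]
    repo = event.get("repository", {}).get("full_name", "unknown repository")
    text = f"New issue in {repo}: {issue['title']} ({issue['html_url']})"
    return {"text": text}


# Hypothetical event, trimmed to the fields the sketch uses.
sample_event = {
    "action": "opened",
    "issue": {
        "title": "Dashboard fails to refresh",
        "html_url": "https://github.com/example/repo/issues/1",
    },
    "repository": {"full_name": "example/repo"},
}

print(json.dumps(slack_message_for_issue(sample_event), indent=2))
```

Flow's value is that the connectors handle all of this plumbing for you, so a business user can wire up the same idea without touching webhooks or JSON.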

All this shows that the cloud is a great enabler, and a platform, which partners and companies can use in order to make their organisations more productive and collaborative. These are simple examples, and I’m sure that you can think of more! The integrations all happen in the cloud, and it is one way that the cloud can be used as a tool for Digital Transformation.

Any questions, send me an email at hello@datarelish.com.

Kind Regards,

Jen Stirrup


 

 

My handy toolkit for my Azure IoT Project – how the Microsoft Partner Network can help

In this series, I’m writing a bunch of very practical posts on helping you through an IoT project. There are plenty of other posts about the ‘why’ and the marketing buzz, but this is about the ‘do’.

If you are using Azure, the chances are that you might be a Microsoft Partner already. There are some useful goodies in there, and you may not be aware of these opportunities. The benefits of the Microsoft Partner Network can be found here. However, it can be hard to relate the list to actual projects, and this blog is aimed at translating these benefits into something tangible that can help you on your IoT project. Firstly, though, take a look at the Action Pack subscription video in order to get some background:

How can this help you to start on your IoT project? Well, if you are starting out on IoT and Azure, then the first thing you’ll need are some handy free Azure credits. Now, if you have an MSDN subscription, then you will also have free Azure credits. Did you know that you can get free credits as part of your Microsoft Partner Action Pack subscription as well? Members of the Microsoft Action Pack program receive monthly credits of £65 of Azure at no charge, and the terms and conditions can be found here.

In practice, this means that you can set up two subscriptions for your Azure account: one for MSDN, and the other for your Microsoft Partner Azure credits.

To help you start out on your Azure project with IoT, you can get five internal use licenses for Office365. This is extremely useful, because it means you can download the Office software. So, in my projects, I recommend my customers become a partner with the Action Pack subscription since they will get one of the following:

  • Microsoft Office 365: either five seats of Office on-premises and five seats of Microsoft Office 365, or 10 seats of Office on-premises. You can earn more seats of Office 365 after an additional cloud sale.
  • Microsoft Dynamics CRM: no Microsoft Dynamics CRM Online licenses are granted at the subscription point. These licenses are granted after you close one Microsoft Dynamics CRM Online deal or at least 50 seats of Office 365 in the previous 12 months.

For your IoT project, the first option is particularly useful in the following scenarios:

If you have taken on new team members to do an AzureML project, then you are going to need Office software such as Excel, in order to view data. If people are choosing a career in AzureML, then you can make a safe bet that these team members will want to use the latest and greatest technology. This means that giving them Excel 2007, for example, isn’t going to work. Happy team members produce better results, and it’s important to empower them with the tools that they need, and *want*, to do a job that they are proud of doing.

If you have Office365, then you can hook up your data nicely so that you can see and share it in Power BI.

What is your call to action?

  • Sign up for the Microsoft Partner Network, and enrol in the cloud programs
  • Sign up as an Action Pack Subscriber
  • Make sure to look at your benefits, and you’ll see the Azure subscription credits and your Office365 licence keys. To do this, go to Resources, and then look for ‘Access my software and cloud benefits’.

Using my Partner Azure credits and my MSDN credits means that I have two separate subscriptions for paying for Azure. In my case, I have a subscription for my own Virtual Machines for development, and then a different subscription for my Proof of Concept work and the portfolio I’m building for demonstrations. It helps me to keep an eye on how much credit I’m “spending” on Azure VMs for development work. At the moment, I have a few physical servers which I *used* to use for development, but I like the portability of having everything in the cloud. It will mean I don’t have to lug my heavy Dell mobile workstation around with me. For demonstrations, I can video my demos in advance in case I can’t access the cloud for some reason. If I find I’m using up my Azure credits and paying money, then I need to decide whether to purchase another physical machine, or stick with Azure. So far, Azure is winning on cost, and on factors such as performance, reliability, and ease of use. Running a small business and being on the PASS Board mean that I’m incredibly busy, and I need to be careful how I spend my time and effort. As you can understand, doing a lot of tech support may not be the best use of my time – even though I do enjoy it!
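If you want to keep the same kind of eye on your own credits, the arithmetic is simple enough to sketch in a few lines of Python. Only the £65 monthly Action Pack credit comes from the figures above; the MSDN credit and both usage amounts are made-up numbers, so swap in your own.

```python
from dataclasses import dataclass


@dataclass
class Subscription:
    name: str
    monthly_credit: float  # free credit, in GBP
    monthly_usage: float   # Azure consumption, in GBP

    @property
    def overage(self):
        """Amount actually paid once the free credit is used up."""
        return max(0.0, self.monthly_usage - self.monthly_credit)


def total_paid(subscriptions):
    """Total out-of-pocket cost across all subscriptions for the month."""
    return sum(s.overage for s in subscriptions)


subscriptions = [
    # 65.0 is the Action Pack credit; the usage figure is hypothetical.
    Subscription("Partner (Proof of Concept)", monthly_credit=65.0, monthly_usage=80.0),
    # Both figures here are hypothetical placeholders.
    Subscription("MSDN (development VMs)", monthly_credit=100.0, monthly_usage=40.0),
]

for s in subscriptions:
    print(f"{s.name}: paying {s.overage:.2f} over credit")
print(f"Total paid this month: {total_paid(subscriptions):.2f}")
```

Comparing that monthly total against the cost of a physical machine is a rough but honest way to make the "buy hardware or stick with Azure" decision.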

Now, you’re ready to move to the next step! There are a range of choices for architecting an IoT project, and I will talk about some of these issues in my next blog post.