What’s wrong with CRISP-DM, and is there an alternative?

Many people, including myself, have discussed CRISP-DM in detail. However, I didn’t feel totally comfortable with it, for a number of reasons which I list below. Now I had raised a problem, I needed to find a solution and that’s where the Microsoft Team Data Science Process comes in. Read on for more detail!

  • What is CRISP-DM?
  • What’s wrong with CRISP-DM?
  • How does technology impinge on CRISP-DM?
  • What comes after CRISP-DM? Enter the Team Data Science Process?
  • What is the Team Data Science Process?

 

What is CRISP-DM?

One common methodology is the CRISP-DM methodology (The Modeling Agency). The Cross Industry Standard Process for Data Mining or (CRISP-DM) model as it is known, is a process framework for designing, creating, building, testing, and deploying machine learning solutions. The process is arranged into six phases. The phases can be seen in the following diagram:

crisp-dm-300x293

The phases are described below

 Phase  Description
Business
Understanding / Data Understanding
The first phase looks at the machine learning
solution from the business standpoint, rather than a technical standpoint.
Once the business concept is defined, the Data Understanding phase focuses on
data familiarity and collation.
Data Preparation In this stage, data will be cleansed and transformed, and it will be
shaped ready for the Modeling phase.
CRISP-DM modeling phase In the modeling phase, various techniques are applied to the data. The
models are further tweaked and refined, and this may involve going back to
the Data Preparation phase in order to correct any unexpected issues.
CRISP-DM evaluation The models need to be tested and verified to ensure that it meets the
business objectives that were defined initially in the business understanding
phase. Otherwise, we may have built a model that does not answer the business
question.
CRISP-DM deployment The models are published so that the customer can make use of them. This
is not the end of the story, however.

Then, the CRISP-DM process restarts. We live in a world of ever-changing data, business requirements, customer needs, and environments, and the process will be repeated.

CRISP-DM is the possibly the most well-known framework for implementing machine learning projects specifically.  It has a good focus on the business understanding piece.

What’s wrong with CRISP-DM?

The model no longer seems to be actively maintained. At the time of writing, the official site, CRISP-DM.org, is no longer being maintained. Further, the framework itself has not been updated on issues on working with new technologies, such as Big Data.

As a project leader, I want to keep up-to-date with the newest frameworks, and the newest technology. It’s true what they say; you won’t get a change until you make a chance.

The methodology itself was conceived in 1996, 21 years ago. I’m not the only one to come out and say so: industry veteran Gregory Piatetsky of KDNuggets had the following to say:

CRISP-DM remains the most popular methodology for analytics, data mining, and data science projects, with 43% share in latest KDnuggets Poll, but a replacement for unmaintained CRISP-DM is long overdue.

Yes, people. Just because something’s popular, it doesn’t mean that it is automatically right. Since the title ‘data scientist’ is the new sexy, lots of inexperienced data scientists are rushing to use this model because it is the obvious one. I don’t think I’d be serving my customers well if I didn’t keep up-to-date, and that’s why I’m moving away from CRISP-DM to the Microsoft Team Data Science Process.

CRISP-DM also neglects aspects of decision making. James Taylor, a veteran of the PASS Business Analytics events, explains this issue in great detail in his blog series over at KDNuggets. If you haven’t read his work, or  I recommend you read his article now and learn from his wisdom.

How does technology impinge on CRISP-DM?

Big Data technologies mean that there can be additional effort spend in the Data Understanding phase, for example, as the business grapples with the additional complexities that are involved in the shape of Big Data sources.

What comes after CRISP-DM? Enter the Team Data Science Process

The next framework, Microsoft’s Team Data Science Process framework, is aimed at including Big Data as a data source. As previously stated, the Data Understanding can be more complex.

Big Data and the Five Vs

There are debates about the number of Vs that apply to Big Data, but let’s go with Ray Wang’s definitions here. Given that our data can be subject to the five Vs as follows:

screen-shot-2012-02-19-at-11-51-19-pm-600x394

This means that our data becomes more confusing for business users to understand and process. This issue can easily distract the business team away from what they are trying to achieve. So, following the Microsoft Team Data Science process can help us to ensure that we have taken our five Vs into account, whilst keep things ticking along for the purpose of the business goal.

As we stated previously, CRISP-DM doesn’t seem to be actively maintained. With Microsoft dollars behind it, the Team Data Science process isn’t going away anytime soon.

What is the Team Data Science Process?

The process is shown in this diagram, courtesy of Microsoft:

tdsp-lifecycle

The Team Data Science Process is loosely divided into five main phases:

  • Business Understanding
  • Data Acquisition and Understanding
  • Modelling
  • Deployment
  • Customer Acceptance
 Phase  Description
Business
Understanding
The Business Understanding process starts with a business idea, which is solved with a machine learning solution. A project plan is generated.
Data Acquisition and Understanding This important phase focuses fact-finding about the data.
Modelling The model is created, built and verified against the original business question. The metrics are evaluated against the key metrics.
Deployment The models are published to production, once they are proven to be a fit solution to the original business question
Customer Acceptance This process is the customer sign-off point. It confirm that the pipeline, the model, and their deployment in a production environment are satisfying customer objectives.

 

The TSDP process itself is not linear; the output of the Data Acquisition and Understanding phase can feed back to the Business Understanding phase, for example. When the essential technical pieces start to appear, such as connecting to data, and the integration of multiple data sources then there may be actions arising from this effort.

The TDSP process is cycle rather than a linear process, and it does not finish, even if the model is deployed. Keep testing and evaluating that model!

TSDP Next Steps

There are a lot of how-to guides and downloads over at the TSDP website, so you should head over and take a look.

The Data Science ‘unicorn’ does not exist. Thanks to Hortonworks for their image below:

unicorn

To mitigate this lack of Data Science unicorn, Team Data Science Summary is a team-oriented solutions which emphasize teamwork and collaboration throughout. It recognizes the importance of working as part of a team to deliver Data Science projects. It also offers useful information on the importance of having standardized source control and backups. It can include open source technology as well as Big Data technologies.

To summarise, the TSDP comprises of a clear structure for you to follow throughout the Data Science process, and facilitates teamwork and collaboration along the way.

Latest post on my Handy IoT Toolkit is released!

d3f07-spyI’ve started to update my IoT Toolkit blog post series.

You can get the latest post from here, which gives you some ideas on communication for your virtual blended team, and some pointers towards handy Visio stencils that might help. You can also navigate my other IoT posts by going to the IoT Toolkit menu that I’m trying to keep updated.

sMy handy toolkit for my Azure IoT Project – how the Microsoft Partner Network can help

In this series, I’m writing a bunch of very practical posts on helping you through an IoT project. There are plenty of other posts about the ‘why’ and the marketing buzz, but this is about the ‘do’.

If you are using Azure, the chances are that you might be a Microsoft Partner already. There are some useful goodies in there, and you may not be aware of these opportunities. The benefits of the Microsoft Partner network can be found here. However, it can be hard to relate the list to actual projects, and this blog is aimed at translating these benefits into something tangible that can help you on your IoT project. Firstly, though, take a look at the Action Pack subscription video in order to get some background:

How can this help you to start on your IoT project? Well, if you are starting out on IoT and Azure, then the first thing you’ll need are some handy free Azure credits. Now, if you have an MSDN subscription, then you will also have free Azure credits. Did you know that you can get free credits as part of your Microsoft Partner Action Pack subscription as well? Members of the Microsoft Action Pack program receive monthly credits of £65 of Azure at no charge, and the terms and conditions can be found here.

In practice, these means that you can set up two subscriptions for your Azure account; one for MSDN, and the other for your Microsoft Partner Azure credits.

To help you start out on your Azure project with IoT, you can get five internal use licenses for Office365. This is extremely useful, because it means you can download the Office software. So, in my projects, I recommend my customers become a partner with the Action Pack subscription since they will get one the following:

  • Microsoft Office 365—either five seats Office on-premises and five Microsoft Office 365, or 10 seats Office on-premises. You can earn more seats of Office 365 after an additional cloud sale.
  • Microsoft Dynamics CRM—no Microsoft Dynamics CRM Online licenses are granted at the subscription point. These licenses are granted after you close one Microsoft Dynamics CRM Online deal or at least 50 seats of Office 365 in the previous 12 months.

For your IoT project, the first option is particularly useful in the following scenarios:

If you have taken on new team members to do an AzureML project, then you are going to need Office software such as Excel, in order to view data. If people are choosing a career in AzureML, then you can make a safe bet that these team members will want to use the latest and greatest technology. This means that giving them Excel 2007, for example, isn’t going to work. Happy team members produce better results, and it’s important to empower them with the tools that they need, and *want*, to do the a job that they are proud of doing.

If you have Office365, then you can hook up your data nicely so that you can see and share it in Power BI.

  • What is your call to action?
  • Sign up for the Microsoft Partner, and enrol for the cloud Programs
  • Sign up as an Action Pack Subscriber
  • Make sure to look at your benefits, and you’ll see the Azure subscription credits and your Office365 licence keys. To do this, go to Resources, and then look for ‘Access my software and cloud benefits’.

Using my Partner Azure credits, and my MSDN credits means that I have two separate subscriptions for paying for Azure. In my case, I have a subscription for my own Virtual Machines for development, and then a different Subscription for my Proof of Concept work and the portfolio I’m building for demonstrations. It helps me to keep an eye on how much credit I’m “spending” on development work on Azure VMs for development work. At the moment, I have a few physical servers which I *used* to use for development, but I like the portability of having everything in the cloud. It will mean I don’t have to lug my heavy Dell mobile workstation around with me. For demonstrations, I can video my demos in advance in case I can’t access the cloud for some reason. If I find I’m incurring a lot of Azure credits and paying money, then I need to decide whether to purchase another physical machine, or stick with Azure. So far, Azure is winning on cost, and on factors such as performance and reliability, and ease of use. Running a small business and being on the PASS Board mean that I’m incredibly busy, and I need to be careful how I spend my time and effort. As you can understand, doing a lot of tech support may not be the best use of my time – even though I do enjoy it!

Now, you’re ready to move to the next step! There are a range of choices for architecting an IoT project, and I will talk about some of these issues in my next blog post.