Many people, including myself, have discussed CRISP-DM in detail. However, I didn’t feel totally comfortable with it, for a number of reasons which I list below. Having raised a problem, I needed to find a solution, and that’s where the Microsoft Team Data Science Process comes in. Read on for more detail!
- What is CRISP-DM?
- What’s wrong with CRISP-DM?
- How does technology impinge on CRISP-DM?
- What comes after CRISP-DM? Enter the Team Data Science Process
- What is the Team Data Science Process?
What is CRISP-DM?
One common methodology is CRISP-DM (The Modeling Agency). The Cross Industry Standard Process for Data Mining (CRISP-DM) is a process framework for designing, creating, building, testing, and deploying machine learning solutions. The process is arranged into six phases, which can be seen in the following diagram:
The phases are described below:
Phase | Description |
---|---|
Business Understanding / Data Understanding | The first phase looks at the machine learning solution from the business standpoint, rather than a technical standpoint. Once the business concept is defined, the Data Understanding phase focuses on data familiarity and collation. |
Data Preparation | In this stage, data will be cleansed and transformed, and it will be shaped ready for the Modeling phase. |
CRISP-DM modeling phase | In the modeling phase, various techniques are applied to the data. The models are further tweaked and refined, and this may involve going back to the Data Preparation phase in order to correct any unexpected issues. |
CRISP-DM evaluation | The models need to be tested and verified to ensure that they meet the business objectives that were defined initially in the Business Understanding phase. Otherwise, we may have built a model that does not answer the business question. |
CRISP-DM deployment | The models are published so that the customer can make use of them. This is not the end of the story, however. |
Then, the CRISP-DM process restarts. We live in a world of ever-changing data, business requirements, customer needs, and environments, and the process will be repeated.
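To make the cyclical nature of the process concrete, here is a minimal Python sketch of the phases as a loop with feedback steps. The names `Phase`, `FEEDBACK` and `next_phase` are my own illustrative inventions, not part of CRISP-DM itself. The step back from Modeling to Data Preparation comes from the table above; the step back from Evaluation to Business Understanding is an assumption of this sketch.

```python
from enum import Enum


class Phase(Enum):
    BUSINESS_UNDERSTANDING = "Business Understanding"
    DATA_UNDERSTANDING = "Data Understanding"
    DATA_PREPARATION = "Data Preparation"
    MODELING = "Modeling"
    EVALUATION = "Evaluation"
    DEPLOYMENT = "Deployment"


# Default forward order of the six phases (Enum members keep definition order).
FORWARD_ORDER = list(Phase)

# Backward steps. The first is described in the table above; the second is an
# assumption made for illustration only.
FEEDBACK = {
    Phase.MODELING: Phase.DATA_PREPARATION,
    Phase.EVALUATION: Phase.BUSINESS_UNDERSTANDING,
}


def next_phase(current: Phase, needs_rework: bool = False) -> Phase:
    """Return the phase that follows `current`.

    If rework is needed and a feedback step is defined, go back; otherwise
    move forward, wrapping from Deployment to Business Understanding because
    the process restarts after deployment.
    """
    if needs_rework and current in FEEDBACK:
        return FEEDBACK[current]
    idx = FORWARD_ORDER.index(current)
    return FORWARD_ORDER[(idx + 1) % len(FORWARD_ORDER)]
```

For example, `next_phase(Phase.DEPLOYMENT)` returns `Phase.BUSINESS_UNDERSTANDING`, which mirrors the restart described above.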
CRISP-DM is possibly the most well-known framework for implementing machine learning projects specifically. It has a good focus on the business understanding piece.
What’s wrong with CRISP-DM?
The model no longer seems to be actively maintained. At the time of writing, the official site, CRISP-DM.org, is no longer being kept up to date. Further, the framework itself has not been updated to address working with new technologies, such as Big Data.
As a project leader, I want to keep up-to-date with the newest frameworks, and the newest technology. It’s true what they say; you won’t get a change until you make a chance.
The methodology itself was conceived in 1996, 21 years ago. I’m not the only one to come out and say so: industry veteran Gregory Piatetsky of KDnuggets had the following to say:
CRISP-DM remains the most popular methodology for analytics, data mining, and data science projects, with 43% share in latest KDnuggets Poll, but a replacement for unmaintained CRISP-DM is long overdue.
Yes, people. Just because something’s popular, it doesn’t mean that it is automatically right. Since the title ‘data scientist’ is the new sexy, lots of inexperienced data scientists are rushing to use this model because it is the obvious one. I don’t think I’d be serving my customers well if I didn’t keep up-to-date, and that’s why I’m moving away from CRISP-DM to the Microsoft Team Data Science Process.
CRISP-DM also neglects aspects of decision making. James Taylor, a veteran of the PASS Business Analytics events, explains this issue in great detail in his blog series over at KDnuggets. If you haven’t read his work, I recommend you read his article now and learn from his wisdom.
How does technology impinge on CRISP-DM?
Big Data technologies mean that additional effort can be spent in the Data Understanding phase, for example, as the business grapples with the additional complexities involved in the shape of Big Data sources.
What comes after CRISP-DM? Enter the Team Data Science Process
The next framework, Microsoft’s Team Data Science Process, is aimed at including Big Data as a data source. As previously stated, the Data Understanding phase can be more complex with Big Data.
Big Data and the Five Vs
There are debates about the number of Vs that apply to Big Data, but let’s go with Ray Wang’s definitions here. Our data can be subject to the five Vs as follows:
This means that our data becomes more confusing for business users to understand and process. This issue can easily distract the business team away from what they are trying to achieve. So, following the Microsoft Team Data Science Process can help us to ensure that we have taken our five Vs into account, whilst keeping things ticking along towards the business goal.
As we stated previously, CRISP-DM doesn’t seem to be actively maintained. With Microsoft dollars behind it, the Team Data Science Process isn’t going away anytime soon.
What is the Team Data Science Process?
The process is shown in this diagram, courtesy of Microsoft:
The Team Data Science Process is loosely divided into five main phases:
- Business Understanding
- Data Acquisition and Understanding
- Modelling
- Deployment
- Customer Acceptance
Phase | Description |
---|---|
Business Understanding | The Business Understanding process starts with a business idea, which is solved with a machine learning solution. A project plan is generated. |
Data Acquisition and Understanding | This important phase focuses on fact-finding about the data. |
Modelling | The model is created, built and verified against the original business question. The model’s results are evaluated against the key metrics. |
Deployment | The models are published to production, once they are proven to be a fit solution to the original business question. |
Customer Acceptance | This process is the customer sign-off point. It confirms that the pipeline, the model, and their deployment in a production environment are satisfying customer objectives. |
The TDSP process itself is not linear; the output of the Data Acquisition and Understanding phase can feed back to the Business Understanding phase, for example. When the essential technical pieces start to appear, such as connecting to data and integrating multiple data sources, there may be actions arising from this effort.
The TDSP process is a cycle rather than a linear process, and it does not finish, even if the model is deployed. Keep testing and evaluating that model!
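As a rough illustration of what “keep testing and evaluating” might look like in practice, here is a small Python sketch that compares a deployed model’s current metrics against target values and suggests whether to loop back to the Modelling phase. The class `MetricCheck`, the function `review_deployed_model`, and the example numbers are all hypothetical; none of them come from Microsoft’s TDSP materials. The sketch also assumes that higher metric values are better.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MetricCheck:
    """One key metric for a deployed model (hypothetical structure)."""
    name: str
    value: float   # latest measured value
    target: float  # minimum acceptable value agreed with the business

    @property
    def passed(self) -> bool:
        # Assumes higher is better, e.g. accuracy or recall.
        return self.value >= self.target


def review_deployed_model(checks: List[MetricCheck]) -> str:
    """Suggest the next action based on which key metrics fall short."""
    failed = [check.name for check in checks if not check.passed]
    if failed:
        return "revisit Modelling: " + ", ".join(failed)
    return "keep monitoring"


# Example with made-up numbers.
checks = [
    MetricCheck("accuracy", value=0.91, target=0.90),
    MetricCheck("recall", value=0.72, target=0.80),
]
print(review_deployed_model(checks))  # prints "revisit Modelling: recall"
```

Running a check like this on a schedule is one simple way to keep the cycle going after deployment, rather than treating deployment as the finish line.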
TDSP Next Steps
There are a lot of how-to guides and downloads over at the TDSP website, so you should head over and take a look.
The Data Science ‘unicorn’ does not exist. Thanks to Hortonworks for their image below:
To mitigate this lack of a Data Science unicorn, the Team Data Science Process is a team-oriented solution which emphasizes teamwork and collaboration throughout. It recognizes the importance of working as part of a team to deliver Data Science projects. It also offers useful information on the importance of having standardized source control and backups. It can include open source technology as well as Big Data technologies.
To summarise, the TDSP provides a clear structure for you to follow throughout the Data Science process, and facilitates teamwork and collaboration along the way.
Sorry but Data Science Unicorns do exist. I know a lot of them personally. Also you are quite wrong about Taylor’s article, it merely states that a corrupted process is a problem. The system is flexible and completely extensible. As for needing something “up to date”, do you use logistic or linear regression? Because if you do those techniques are over two centuries old and are based on the work of Legendre and Gauss. Big data has no impact on Data Mining methods, and a lot of what is being done now is complete rubbish. Statistical validity always comes first.
I totally agree with Mark
Excellent post Mark really the content you have is very useful for me to understand in a better way
Erm, it’s TDSP, not TSDP!
Also 1 year prior to Microsoft, IBM released ASUM DM, an update of their original methodology CRISP-DM, which you seem to ignore completely!
Thank you, Maruf, for your careful reading. If you’ve written any blog posts about the IBM offering, I’d love to read them. Why don’t you pop them in the comments?
I don’t understand your arguments against CRISP-DM. Not having maintenance is not enough. If there is no development, it may be because it is already good enough. What are the main limitations in your opinion?
A few I’m trying like keeping lucky stone…and the like begets like.
There is some contradiction in this post. The CRISP-DM figure shown is from the once existing Presidion company. These SPSS consultants were the origin of CRISP-DM. SPSS is owned by IBM (2020). The Watson product has a “feedback” stage.
That block of monitoring after deployment is not in the original CRISP-DM figure.
https://developer.ibm.com/articles/introduction-watson-studio/