What’s wrong with CRISP-DM, and is there an alternative?

Many people, including myself, have discussed CRISP-DM in detail. However, I didn’t feel totally comfortable with it, for a number of reasons which I list below. Now that I had raised a problem, I needed to find a solution, and that’s where the Microsoft Team Data Science Process comes in. Read on for more detail!

  • What is CRISP-DM?
  • What’s wrong with CRISP-DM?
  • How does technology impinge on CRISP-DM?
  • What comes after CRISP-DM? Enter the Team Data Science Process
  • What is the Team Data Science Process?

 

What is CRISP-DM?

One common methodology is the CRISP-DM methodology (The Modeling Agency). The Cross Industry Standard Process for Data Mining (CRISP-DM) model, as it is known, is a process framework for designing, creating, building, testing, and deploying machine learning solutions. The process is arranged into six phases. The phases can be seen in the following diagram:

[Image: CRISP-DM process diagram]

The phases are described below:

Phase | Description
Business Understanding / Data Understanding | The first phase looks at the machine learning solution from the business standpoint, rather than a technical standpoint. Once the business concept is defined, the Data Understanding phase focuses on data familiarity and collation.
Data Preparation | In this stage, data will be cleansed and transformed, and it will be shaped ready for the Modeling phase.
Modeling | In the Modeling phase, various techniques are applied to the data. The models are further tweaked and refined, and this may involve going back to the Data Preparation phase in order to correct any unexpected issues.
Evaluation | The models need to be tested and verified to ensure that they meet the business objectives that were defined initially in the Business Understanding phase. Otherwise, we may have built a model that does not answer the business question.
Deployment | The models are published so that the customer can make use of them. This is not the end of the story, however.

Then, the CRISP-DM process restarts. We live in a world of ever-changing data, business requirements, customer needs, and environments, and the process will be repeated.

CRISP-DM is possibly the most well-known framework for implementing machine learning projects specifically. It has a good focus on the business understanding piece.
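To make that business focus concrete, here is a minimal Python sketch of the Evaluation phase as a gate: targets agreed during Business Understanding are checked against what the model actually achieved. Everything in it is an assumption for illustration. The BusinessObjective class, the metric names, the thresholds, and the choice to send a failing model back to Modeling are mine, not part of CRISP-DM itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BusinessObjective:
    """A target agreed during Business Understanding (hypothetical example)."""
    metric: str
    minimum: float


def evaluate(measured, objectives):
    """Return the names of the objectives that the model fails to meet."""
    failures = []
    for objective in objectives:
        value = measured.get(objective.metric)
        # A metric that was never measured counts as a failure.
        if value is None or value < objective.minimum:
            failures.append(objective.metric)
    return failures


# Illustrative numbers only; real targets come from the business.
objectives = [
    BusinessObjective(metric="recall", minimum=0.80),
    BusinessObjective(metric="precision", minimum=0.60),
]
measured = {"recall": 0.85, "precision": 0.55}

failed = evaluate(measured, objectives)
# Assumption: a failing model goes back to Modeling rather than on to Deployment.
next_phase = "Deployment" if not failed else "Modeling"
print(failed, next_phase)
```

With these made-up numbers, the precision target is missed, so the model is not ready for Deployment. The point is only that the pass/fail criteria are written down before any modeling starts, so the model is judged against the business question.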

What’s wrong with CRISP-DM?

The model no longer seems to be actively maintained: at the time of writing, the official site, CRISP-DM.org, is no longer being kept up to date. Further, the framework itself has not been updated to address working with new technologies, such as Big Data.

As a project leader, I want to keep up-to-date with the newest frameworks, and the newest technology. It’s true what they say; you won’t get a change until you make a chance.

The methodology itself was conceived in 1996, 21 years ago. I’m not the only one to come out and say so: industry veteran Gregory Piatetsky of KDNuggets had the following to say:

“CRISP-DM remains the most popular methodology for analytics, data mining, and data science projects, with 43% share in latest KDnuggets Poll, but a replacement for unmaintained CRISP-DM is long overdue.”

Yes, people. Just because something’s popular, it doesn’t mean that it is automatically right. Since the title ‘data scientist’ is the new sexy, lots of inexperienced data scientists are rushing to use this model because it is the obvious one. I don’t think I’d be serving my customers well if I didn’t keep up-to-date, and that’s why I’m moving away from CRISP-DM to the Microsoft Team Data Science Process.

CRISP-DM also neglects aspects of decision making. James Taylor, a veteran of the PASS Business Analytics events, explains this issue in great detail in his blog series over at KDNuggets. If you haven’t read his work, I recommend you read his article now and learn from his wisdom.

How does technology impinge on CRISP-DM?

Big Data technologies mean that there can be additional effort spent in the Data Understanding phase, for example, as the business grapples with the additional complexities that are involved in the shape of Big Data sources.

What comes after CRISP-DM? Enter the Team Data Science Process

The next framework, Microsoft’s Team Data Science Process framework, is aimed at including Big Data as a data source. As previously stated, the Data Understanding phase can be more complex.

Big Data and the Five Vs

There are debates about the number of Vs that apply to Big Data, but let’s go with Ray Wang’s definitions here. Our data can be subject to the five Vs as follows:

[Image: the five Vs of Big Data]

This means that our data becomes more confusing for business users to understand and process. This issue can easily distract the business team away from what they are trying to achieve. So, following the Microsoft Team Data Science Process can help us to ensure that we have taken our five Vs into account, whilst keeping things ticking along for the purpose of the business goal.

As we stated previously, CRISP-DM doesn’t seem to be actively maintained. With Microsoft dollars behind it, the Team Data Science Process isn’t going away anytime soon.

What is the Team Data Science Process?

The process is shown in this diagram, courtesy of Microsoft:

[Image: Team Data Science Process lifecycle diagram]

The Team Data Science Process is loosely divided into five main phases:

  • Business Understanding
  • Data Acquisition and Understanding
  • Modelling
  • Deployment
  • Customer Acceptance

Phase | Description
Business Understanding | The Business Understanding process starts with a business idea, which is solved with a machine learning solution. A project plan is generated.
Data Acquisition and Understanding | This important phase focuses on fact-finding about the data.
Modelling | The model is created, built and verified against the original business question. The results are evaluated against the key metrics.
Deployment | The models are published to production, once they are proven to be a fit solution to the original business question.
Customer Acceptance | This process is the customer sign-off point. It confirms that the pipeline, the model, and their deployment in a production environment are satisfying customer objectives.

 

The TDSP process itself is not linear; the output of the Data Acquisition and Understanding phase can feed back to the Business Understanding phase, for example. When the essential technical pieces start to appear, such as connecting to data and integrating multiple data sources, there may be actions arising from this effort.

The TDSP process is a cycle rather than a linear process, and it does not finish, even if the model is deployed. Keep testing and evaluating that model!
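One way to picture that cycle is as a small map of which phase can follow which. The Python sketch below is only an illustration. The phase names come from the list above, but the TRANSITIONS map and the can_move and walk helpers are hypothetical. In particular, the step back from Modelling to Data Acquisition and Understanding, and the loop from Customer Acceptance back to Business Understanding, are my assumptions about how a team might run the cycle, not Microsoft’s definition.

```python
# Phase names come from the post; the transition map is an illustrative assumption.
TRANSITIONS = {
    "Business Understanding": ["Data Acquisition and Understanding"],
    # What we learn about the data can send us back to the business question.
    "Data Acquisition and Understanding": ["Modelling", "Business Understanding"],
    # Assumption: modelling problems can send us back to the data.
    "Modelling": ["Deployment", "Data Acquisition and Understanding"],
    "Deployment": ["Customer Acceptance"],
    # Assumption: after sign-off, the cycle starts again.
    "Customer Acceptance": ["Business Understanding"],
}


def can_move(current, target):
    """Check whether the team may move from one phase to another."""
    return target in TRANSITIONS.get(current, [])


def walk(start, steps):
    """Follow the first listed (forward) transition a given number of times."""
    path = [start]
    for _ in range(steps):
        path.append(TRANSITIONS[path[-1]][0])
    return path


for phase in walk("Business Understanding", 5):
    print(phase)
```

Five forward steps from Business Understanding bring you back to Business Understanding, which is the cycle in miniature. The feedback step is allowed too: can_move("Data Acquisition and Understanding", "Business Understanding") is True.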

TDSP Next Steps

There are a lot of how-to guides and downloads over at the TDSP website, so you should head over and take a look.

The Data Science ‘unicorn’ does not exist. Thanks to Hortonworks for their image below:

[Image: the Data Science unicorn, courtesy of Hortonworks]

To mitigate this lack of a Data Science unicorn, the Team Data Science Process is a team-oriented solution which emphasizes teamwork and collaboration throughout. It recognizes the importance of working as part of a team to deliver Data Science projects. It also offers useful information on the importance of having standardized source control and backups. It can include open source technology as well as Big Data technologies.

To summarise, the TDSP provides a clear structure for you to follow throughout the Data Science process, and facilitates teamwork and collaboration along the way.

Guess who is appearing in Joseph Sirosh’s PASS Keynote?

This girl! I am super excited and please allow me to have one little SQUUEEEEEEE! before I tell you what’s happening. Now, this is a lifetime achievement for me, and I cannot begin to tell you how absolutely and deeply honoured I am. I am still in shock!

I am working really hard on my demo and….. I am not going to tell you what it is. You’ll have to watch it. Ok, enough about me and all I’ll say is two things: it’s something that’s never been done at PASS Summit before and secondly, watch the keynote because there may be some discussion about….. I can’t tell you what… only that, it’s a must-watch, must-see, must do keynote event.

We are in a new world of Data and Joseph Sirosh and the team are leading the way. Watching the keynote will mean that you get the news as it happens, and it will help you to keep up with the changes. I do have some news about Dr David DeWitt’s Day Two keynote… so keep watching this space. Today I’d like to talk about the Day One keynote with the brilliant Joseph Sirosh, CVP of Microsoft’s Data Group.

Now, if you haven’t seen Joseph Sirosh present before, then you should. I’ve put some of his earlier sessions here and I recommend that you watch them.

Ignite Conference Session

MLDS Atlanta 2016 Keynote

I hear you asking… what am I doing in it? I’m keeping it a surprise! Well, if you read my earlier blog, you’ll know I transitioned from Artificial Intelligence into Business Intelligence and now I do a hybrid of AI and BI. As a Business Intelligence professional, my customers will ask me for advice when they can’t get the data that they want. Over the past few years, the ‘answer’ to their question has gone far, far beyond the usual on-premise SQL Server, Analysis Services, SSRS combo.

We are now in a new world of data. Join in the fun!

Customers sense that there is a new world of data. The ‘answer’ to the question ‘Can you please help me with my data?’ is complex and varied, and it’s very much aimed at cost sensitivities, too. Often, customers struggle with data because they now have a Big Data problem, or a storage problem, or a data visualisation access problem. Azure is very neat because it can cope with all of these issues. Now, my projects are Business Intelligence and Business Analytics projects… but they are also ‘move data to the cloud’ projects in disguise, and that’s in response to the customer need. So if you are a Business Intelligence professional, get enthusiastic about the cloud because it really empowers you with a new generation of exciting things you can do to please your users and data consumers.

As a BI or an analytics professional, cloud makes data more interesting and exciting. It means you can have a lot more data, in more shapes and sizes, and access it in different ways. It also means that you can focus on what you are good at, and make your data estate even more interesting by augmenting it with cool features in Azure. For example, you could add in more exciting things such as the Apache Tika library as a worker role in Azure to crack through PDFs and do interesting things with the data in there. If you bring it into SSIS, then you can spin it up and tear it down again when you don’t need it.

I’d go as far as to say that, if you are in Business Intelligence at the moment, you will need to learn about cloud sooner or later. Eventually, you’re going to run into Big Data issues. Alternatively, your end consumers are going to want their data on a mobile device, and you will want easy solutions to deliver it to them. Customers are interested in analytics and the new world of data and you will need to hop on the Azure bus to be a part of it.

The truth is, Joseph Sirosh’s keynotes always contain amazing demos. (No pressure, Jen, no pressure….. ) Now, it’s important to note that these demos are not ‘smoke and mirrors’….

The future is here, now. You can have this technology too.

It doesn’t take much to get started, and it’s not too far removed from what you have in your organisation. AzureML and Power BI have literally hundreds of examples. I learned AzureML looking at the following book by Wee-Hyong Tok and others, so why not download a free book sample?

https://read.amazon.co.uk/kp/card?asin=B00MBL261W&preview=inline&linkCode=kpe&ref_=cm_sw_r_kb_dp_c54ayb2VHWST4

How do you proceed? Well, why not try a little homespun POC with some of your own data to learn about it, and then show your boss? I don’t know about you but I learn by breaking things, and I break things all the time when I’m learning. You could download some Power BI workbooks, use the sample data and then try to recreate them, for example. Or, why not look at the community R Gallery and try to play with the scripts? You broke something? No problem! Just download a fresh copy and try again. You’ll get further next time.

I hope to see you at the PASS keynote! To register, click here: http://www.sqlpass.org/summit/2016/Sessions/Keynotes.aspx 

Jen’s Diary: Overcoming the Power of Feathers through Action

I’m sorry I haven’t kept this diary up to date: I’ve been at SQLBits, SQLPass Nordic, Data Culture events, and other community gatherings. I’ve also had dental surgery, involving the removal of two teeth and the removal of some of my lower jaw. I haven’t been very happy, needless to say.

As always, I don’t represent PASS or any other organisation throughout this blog.

FYI I’ll be holding a Twitter Surgery Hour on Friday 20th March at 12pm GMT so please tweet me at @jenstirrup and ask me whatever you like! If you are not sure what time that is in your time zone, please check here.

[Image: never not broken]

I’m not Hindu, but I’m inspired by the Hindu goddess Akhilanda, whose name means essentially “never not broken.” In other words, The Always Broken Goddess. Sanskrit is a tricky but amazing language. Here, the double negative means that Akhilanda is broken right down to her name. The thing is, being broken is actually part of a renewal process, and it means that your broken parts can shine out more brightly than ever before – simply because you are broken, and moving, and reflect light out wide. It means you can pick up your pieces, and run without limits.

Outwardly, I am not an obvious leader. I realise I am a quiet person. I don’t party. I very rarely drink alcohol. I am not much of a dinner date – I am a forty something single mother, with no real hobbies or interests other than technology. I have never been someone to write home about. I am my own person.

People can know you by your reputation, but your actions can speak louder than words. Let me give you an example: a Jewish tale talks about a man, who went about the community telling gossip about the rabbi. Later, he realized the wrong he had done, and began to feel remorse. He went to the rabbi and wanted to apologise, saying he would do anything he could to make amends. The rabbi told the man, “Take a feather pillow, cut it open, and scatter the feathers to the winds.” When he returned to tell the rabbi that he had done it, the rabbi said, “Now, go and gather the feathers. Because you can no more make amends for the damage your words have done than you can recollect the feathers.”

Why do I write this blog? I want to give people something other than feathers: what is said about me, whether it is good or bad. Instead, I want to demonstrate real actions that show unequivocally that I work for the community, and work hard. I hope that this series of diary posts will overcome the power of feathers – whether they are good or bad – and shine out what I actually do for the community. Actions speak louder than words, and this is how we overcome the power of feathers – let your light shine more brightly than ever, even if you are not perfect but broken. For me, I work really hard for the community and hope that the actions demonstrate that.

I’m hoping that people will see how hard I’ve worked for the community, and they will hold onto that data – we are in a data-driven decade, right? – and they will see the data for what it is. So, what have I been doing?

Ongoing – PASS BA Conference. I attend conference calls for about three hours a week, some weeks up to seven hours, and these calls are held from 9pm my time onwards. Obviously, that takes out a few of my evenings a week. I obviously have work to do outside of this time, such as sponsorship, blogging on numerous occasions, and email.

Ongoing – I have been holding PASS Data Science Virtual Chapter webinars every Friday night, 9pm my time. This obviously takes out my Friday night, in addition to the BAC calls I mention above, and I have been doing this for four consecutive weeks now. This week is the last week, and I’m considering doing a five week introductory course to Python and/or AzureML after that. I’ve held lots of Virtual Chapter meetings before, particularly since I held that Portfolio for PASS last year. My most popular sessions were on data visualisation, personally, but I have noted that my audience has really increased over the last five weeks and I’d like to continue to offer that community support.

March 2015 – I tried to hold another DiTBits (Diversity in Technology) event at SQLBits, but, alas, this wasn’t to be. The event wasn’t marketed and it seemed a pity to try and shoehorn something in at the last moment. Next time, I hope.

March 2015 – I held a two hour R and Python for SQL and Business Intelligence Professionals session at SQLBits. Files and notes to follow (see earlier note about my surgery!). It was extremely well attended.

March 2015 – I held a general session on Data Visualisation, entitled Eye Vegetables and Eye Candy: How to Visualise your Data at SQLBits. Again, notes to follow (I’m still recovering from surgery…)

March 2015 – I did half of a joint session at SQLBits with Allan Mitchell, where I talked about Kibana. The topic was Building Scaleable Analytical Solutions on Azure and it was fun presenting with Allan.

March 2015 – I spoke at SQLRally Nordic on Pulling back the green curtain: Data Forensics, Power BI and Dataviz.

March 2015 – I held a Women in Technology event at SQLRally Nordic, talking about the importance of diverse teams.

March 2015 – I interviewed Mico Yuk as part of our ‘Meet the Experts’ series and you can listen to the podcast here. I’m really looking forward to meeting her!

Feb 2015 – Ongoing – I worked to help get the Data Science Virtual Chapter off the ground, led by Mark Tabladillo, who is the VC lead. This has involved groundwork, phone calls, and working with Microsoft to get speakers. You should really register for the Data Science VC. It’s great fun!

Feb 2015 – Data Culture – I was the keynote speaker for Microsoft, which was great fun! The slideshare is below. My section is about the middle third or so; credits are on the slides for the other speakers.

Feb 2015 – Ongoing – I kicked off my Intro to R series of webinars for the Data Science Virtual Chapter.

Feb 2015 – I ramped up my SQLSaturday Edinburgh in earnest, which has a Business Intelligence focus.

Feb 2015 – I held an Excel BI VC session on cubes and Excel. I also hosted another session as a mentor.

Feb 2015 – Techdays – I held a webinar on AzureML to over six thousand people. Nervous? Yes, bordering on terror. Great thanks to Andrew Fryer (I’m honoured to call him my friend) for all of his support. He’s an inspiration to me, and I was glad to be his presenting small person for the day.

Feb 2015 – Hants UG – on the same day that I held the Techdays session, I left and travelled to Hampshire to deliver a session on AzureML.

I hope it’s clear that, in the past few months, I have been doing a lot of work for the community as well as my PASS Board responsibilities.

So, if you catch a feather and want to ask me about it, my email is jen.stirrup@datarelish.com or tweet me during my Twitter Hour on the 20th March. I look forward to your input. Just go ahead and ask me!

Future Decoded – I’m speaking on Machine Learning

[Image: Future Decoded speaker banner, Brian Cox]

I’m speaking at Future Decoded, at the same event as Professor Brian Cox, Sir Nigel Shadbolt (Co-founder & Chairman, ODI Open Data Institute), Or Arbel (CEO, Yo), Michael Taylor (IT Director, Lotus F1 Team), Kenji Takeda (Microsoft Research), and my good friends Chris Webb and James Rowland-Jones.

To register, here’s the link: http://www.microsoft.com/en-gb/about/future-decoded-techday