What’s wrong with CRISP-DM, and is there an alternative?

Many people, including myself, have discussed CRISP-DM in detail. However, I didn’t feel totally comfortable with it, for a number of reasons which I list below. Having raised a problem, I needed to find a solution, and that’s where the Microsoft Team Data Science Process comes in. Read on for more detail!

  • What is CRISP-DM?
  • What’s wrong with CRISP-DM?
  • How does technology impinge on CRISP-DM?
  • What comes after CRISP-DM? Enter the Team Data Science Process
  • What is the Team Data Science Process?

 

What is CRISP-DM?

One common methodology is CRISP-DM (The Modeling Agency). The Cross Industry Standard Process for Data Mining model, or CRISP-DM as it is known, is a process framework for designing, creating, building, testing, and deploying machine learning solutions. The process is arranged into six phases, which can be seen in the following diagram:

[Figure: the CRISP-DM process diagram]

The phases are described below:

Phase | Description
Business Understanding / Data Understanding | The first phase looks at the machine learning solution from the business standpoint, rather than a technical standpoint. Once the business concept is defined, the Data Understanding phase focuses on data familiarity and collation.
Data Preparation | In this stage, data is cleansed, transformed and shaped ready for the Modeling phase.
Modeling | In the modeling phase, various techniques are applied to the data. The models are further tweaked and refined, and this may involve going back to the Data Preparation phase in order to correct any unexpected issues.
Evaluation | The models need to be tested and verified to ensure that they meet the business objectives that were defined initially in the Business Understanding phase. Otherwise, we may have built a model that does not answer the business question.
Deployment | The models are published so that the customer can make use of them. This is not the end of the story, however.

Then, the CRISP-DM process restarts. We live in a world of ever-changing data, business requirements, customer needs, and environments, and the process will be repeated.
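To make the idea of a restarting process concrete, here is a minimal Python sketch of the cycle. It is only an illustration: the names `PHASES` and `crisp_dm_cycle` are my own and are not part of CRISP-DM, and the sketch ignores the back-and-forth between Modeling and Data Preparation described above.

```python
from itertools import cycle, islice

# The six CRISP-DM phases, in order (names chosen for this sketch).
PHASES = [
    "Business Understanding",
    "Data Understanding",
    "Data Preparation",
    "Modeling",
    "Evaluation",
    "Deployment",
]


def crisp_dm_cycle(steps):
    """Return the first `steps` phases, wrapping around after Deployment."""
    return list(islice(cycle(PHASES), steps))


# Seven steps: one full pass through the phases, then the process restarts.
for phase in crisp_dm_cycle(7):
    print(phase)
```

The point of the wrap-around is that Deployment is followed by Business Understanding again, rather than by an end state.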

CRISP-DM is possibly the best-known framework for implementing machine learning projects specifically. It has a good focus on the business understanding piece.

What’s wrong with CRISP-DM?

The model no longer seems to be actively maintained. At the time of writing, the official site, CRISP-DM.org, is not being maintained. Further, the framework itself has not been updated to address working with new technologies, such as Big Data.

As a project leader, I want to keep up-to-date with the newest frameworks, and the newest technology. It’s true what they say; you won’t get a change until you make a chance.

The methodology itself was conceived in 1996, 21 years ago. I’m not the only one to come out and say so: industry veteran Gregory Piatetsky of KDnuggets had the following to say:

CRISP-DM remains the most popular methodology for analytics, data mining, and data science projects, with 43% share in latest KDnuggets Poll, but a replacement for unmaintained CRISP-DM is long overdue.

Yes, people. Just because something’s popular, it doesn’t mean that it is automatically right. Since the title ‘data scientist’ is the new sexy, lots of inexperienced data scientists are rushing to use this model because it is the obvious one. I don’t think I’d be serving my customers well if I didn’t keep up-to-date, and that’s why I’m moving away from CRISP-DM to the Microsoft Team Data Science Process.

CRISP-DM also neglects aspects of decision making. James Taylor, a veteran of the PASS Business Analytics events, explains this issue in great detail in his blog series over at KDnuggets. If you haven’t read his work, I recommend you read his article now and learn from his wisdom.

How does technology impinge on CRISP-DM?

Big Data technologies mean that there can be additional effort spent in the Data Understanding phase, for example, as the business grapples with the additional complexities that are involved in the shape of Big Data sources.

What comes after CRISP-DM? Enter the Team Data Science Process

The next framework, Microsoft’s Team Data Science Process framework, is aimed at including Big Data as a data source. As previously stated, the Data Understanding phase can be more complex.

Big Data and the Five Vs

There are debates about the number of Vs that apply to Big Data, but let’s go with Ray Wang’s definitions here. Our data can be subject to the five Vs as follows:

[Figure: the five Vs of Big Data]

This means that our data becomes more confusing for business users to understand and process. This issue can easily distract the business team away from what they are trying to achieve. So, following the Microsoft Team Data Science Process can help us to ensure that we have taken our five Vs into account, whilst keeping things ticking along for the purpose of the business goal.

As we stated previously, CRISP-DM doesn’t seem to be actively maintained. With Microsoft dollars behind it, the Team Data Science Process isn’t going away anytime soon.

What is the Team Data Science Process?

The process is shown in this diagram, courtesy of Microsoft:

[Figure: the Team Data Science Process lifecycle]

The Team Data Science Process is loosely divided into five main phases:

  • Business Understanding
  • Data Acquisition and Understanding
  • Modelling
  • Deployment
  • Customer Acceptance

Phase | Description
Business Understanding | The Business Understanding process starts with a business idea, which is solved with a machine learning solution. A project plan is generated.
Data Acquisition and Understanding | This important phase focuses on fact-finding about the data.
Modelling | The model is created, built and verified against the original business question. Its metrics are evaluated against the key metrics.
Deployment | The models are published to production, once they are proven to be a fit solution to the original business question.
Customer Acceptance | This process is the customer sign-off point. It confirms that the pipeline, the model, and their deployment in a production environment are satisfying customer objectives.

 

The TDSP process itself is not linear; the output of the Data Acquisition and Understanding phase can feed back to the Business Understanding phase, for example. When the essential technical pieces start to appear, such as connecting to data and integrating multiple data sources, there may be actions arising from this effort.

The TDSP process is a cycle rather than a linear process, and it does not finish, even if the model is deployed. Keep testing and evaluating that model!

TDSP Next Steps

There are a lot of how-to guides and downloads over at the TDSP website, so you should head over and take a look.

The Data Science ‘unicorn’ does not exist. Thanks to Hortonworks for their image below:

[Figure: the Data Science unicorn, courtesy of Hortonworks]

To mitigate this lack of a Data Science unicorn, the Team Data Science Process is a team-oriented solution which emphasizes teamwork and collaboration throughout. It recognizes the importance of working as part of a team to deliver Data Science projects. It also offers useful information on the importance of having standardized source control and backups. It can include open source technology as well as Big Data technologies.

To summarise, the TDSP provides a clear structure for you to follow throughout the Data Science process, and facilitates teamwork and collaboration along the way.

The Prodigal Developers Return: SQL Server 2016 SP1 brings consistent programming surface to Developers and ISVs

Big news from the Microsoft Connect() 2016 online developer conference: SQL Server 2016 Service Pack 1 is dropping. Download SQL Server 2016 SP1 here.

SQL Server 2016 SP1 means many more features for the lower editions. Most importantly, developers and partners can now build to a single application programming surface to create or upgrade new intelligent applications and use the edition which scales to the application’s needs.

The long version and my ‘take’ on this news:

I’m incredibly impressed with Microsoft right now. I think it’s incredibly smart, actually, because they are bringing developers and ISVs back into SQL Server Land again. So, developers, ISVs, go and grab yourself a coffee and let’s have a chat.

[Image credit: stocksnap.io]

SQL Server 2016 SP1 makes leading innovation available to any developer. Microsoft is making it easier for developers to benefit from the industry-leading innovations in SQL Server for more of their applications. With SQL Server 2016 SP1, Microsoft is making key innovations more accessible to customers across editions. Developers and partners can now build to a single application programming surface to create or upgrade new intelligent applications and use the edition which scales to the application’s needs. SQL Server Enterprise continues to offer the highest levels of scale, performance and availability for enterprise workloads. For more information, please see the full press announcement on the SQL Server Blog.

The Visual Studio Code extension for SQL and the updated connectors and tools are also exciting news, because they make it easier to develop with other languages, in a more streamlined fashion.

What problem are Microsoft trying to fix?

Previously, the issue with developing applications for SQL Server was that there was a disparity across editions, which could affect how your application ran. Until now, developers have used the SQL Server Developer edition, as it allows them to develop with features that are available on all of the production editions.

Now, the problem is solved – developers can take advantage of the programmability feature by using the same code base, and things are simpler because the customer chooses which edition they use.

The problem was evident when you used, say, an Enterprise-only feature in development but had only a Standard edition instance in production. You can see the full list of features and editions published by Microsoft here: ‘Features Supported by the Editions of SQL Server 2016’.

If you have an app that can manage Enterprise edition then it can, in principle, also manage every other edition. Now, however, the application scales to the customer’s edition, thereby streamlining the whole process.
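To make the pre-SP1 mismatch concrete, here is a minimal Python sketch. It assumes you already know which features your application uses and which features the target edition supports. The function name `missing_features` and the feature labels are placeholders of mine; they are not Microsoft’s actual feature matrix, which you should take from the published list.

```python
def missing_features(used_by_app, supported_by_edition):
    """Return, in sorted order, the features the app uses that the edition lacks."""
    return sorted(set(used_by_app) - set(supported_by_edition))


# Placeholder labels only; these are not real SQL Server feature names.
app_features = {"feature_a", "feature_b", "feature_c"}
production_edition_features = {"feature_a", "feature_c"}

gaps = missing_features(app_features, production_edition_features)
if gaps:
    print("Not available in production:", ", ".join(gaps))
else:
    print("Every feature the app uses is available in production.")
```

With a single programming surface across editions, the hope is that a check like this comes back empty for programmability features, and the choice of edition becomes a question of scale.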

New Tools for the Toolbox, No Pricing Changes


So, developers wouldn’t have to build complexity, but they’d have to create their app the right way. For example, there’s not always a need to scale out. Let’s take Stack Overflow, one of the top 50 busiest sites in the world.  Stack Overflow runs on Microsoft SQL Server.

Not many people know it, but there is a Stack Overflow Enterprise Edition. It means that companies like Stack Overflow can take advantage of the new programmability features, if they so wish. I wonder what ISVs will do?

Freedom from Constraints

Let’s examine the issue in more detail. Let’s take a look at the SQL Server editions that are available to us:

  • Azure database + Amazon RDS
  • Containerized version of any edition
  • Developer Edition
  • Express Edition
  • Enterprise Edition
  • LocalDb
  • Standard Edition
  • Web Edition

You can see why it starts to get confusing, and developers might start to look at MySQL or Postgres as alternatives.

How can you get SQL Server 2016 SP1?

I believe that this will be a primary driver for SQL Server 2016 Service Pack 1. Download SQL Server 2016 SP1 here.

Why are Microsoft doing this?

It’s a huge benefit for ISVs. It’s my opinion that Microsoft had lost their way with their partners. Customers started to look sideways at other vendors, such as Tableau, to fulfil their needs. In response, partners expanded their toolkits to include crème de la crème vendors such as Tableau in order to build solutions. I think that this move is a gesture to the ISVs, since it will remove friction when they choose to develop solutions.

Being pals with Open Source but better – you get what you pay for. With the advent of open source, developers have got  more choice than ever before. It’s good to bring them back to SQL Server. Postgres doesn’t have in-memory capability, for example – it has “running with scissors” mode whereby you switch off all the disk storage features. Sound scary? Yes… the clue is in the name. SQL Server brings this feature to the party, and more. ISVs can feel more confident developing on a robust solution.

Increased productivity – it removes an obstacle to development, support and deployment.

The Prodigal Developers Return

This solution means that Microsoft SQL Server is back on the table for many developers, who may have started eyeing MySQL and Postgres for this reason.

To summarise, I think that this is a smart move and I’m excited to see that the ‘voice of the developer’ has come back into SQL Server Land. It’s also a huge benefit for ISV partners, and let’s see how they democratize their data in new and exciting applications. Let’s look for more exciting things coming from Microsoft.

Four years an MVP: what I’ve been doing, and what’s next?

Four things a woman should know: How to look like a girl, How to act like a lady, How to think like a man, and how to work like a dog.

I attended an event recently at which I was speaking. I met a few nice attendees who were new to the SQL Server scene, and I introduced myself briefly and listened whilst the other attendees talked. We had a nice chat, and one of them said, “It’s funny you’re called Jen Stirrup and you’re in BI… There’s this MVP called Jen Stirrup that blogs about BI and she’s speaking here, and it would be great if I actually got to meet her.” I didn’t know what to say, or whether to own up to being one and the same. After a polite amount of time, I exited the conversation and moved gracefully on.

When I walked away, I was actually pleased. It seemed to me that the ‘actual’ Jen Stirrup was not the ‘perception’ of what these nice people thought that I should be: an MVP, a speaker, and perhaps not accessible because of these things.

Yes, I’m an MVP and a speaker. I’m delighted that I won the Microsoft MVP Award for the fourth consecutive year.

However…. Let’s get a sense of perspective. I don’t save babies. I don’t hold the hand of the dying. I don’t heal. I’m proud to say that members of my family do this, and when I think of what they do all day as part of their jobs, I’m immensely proud of them and I really do so little in comparison, and I am humbled by it. Their work makes mine look like nothing. 
I’m a volunteer for the community. I am not denigrating the Award; I am simply adding perspective. I’m happy (and extremely relieved) to have been awarded for another year. I spend all day on 1st July checking all my email accounts to see if the email has come in. It is the only email of the year that I call the email. It is a gift from Microsoft and I do value it immensely and consider myself extremely lucky.
I volunteer to serve the community. I don’t walk about like I own the place because I volunteer and do these things. I don’t want to be inaccessible. I recently held a Twitter surgery hour so folks could ask questions, and then Tom LaRock and Denise McInerny from the PASS Board took the baton so we were available for an extended period of time. Truth is, I have great fun doing all this. My work is my passion, so it doesn’t feel like work. I’m writing this at 1AM on a Saturday night when most people are with loved ones, socialising or whatever. My work is my social life, my passion, and my hobby. I love what I do, and I am grateful and blessed.

When you attend an event, you should definitely go up to speakers and chat. Speakers, by definition, like the sound of their own voices. We are not inaccessible and we love SQL Server, data, Excel and all sorts of techie stuff. So, what do I actually do, then?

Well, since being elected to the PASS Board, I have done the following things as part of a team (i.e. not just me! I’m not a fairy with a magic wand).

  • I’ve brought the first Business Intelligence edition of a SQLSaturday to Europe.
  • I’ve brought the first Business Analytics edition of a SQLSaturday to Europe (more on this soon!)
  • I’ve done lots of internal PASS work on planning, strategy and so on, which is pretty invisible.
  • I’ve taken over the helm looking after the Virtual Chapters from Denise McInerny. Denise left the portfolio in such great shape, so this was easier than I thought it would be. Specifically, I have helped set up Global Hebrew, In Memory, Global French, Excel BI and I have also streamlined a few Virtual Chapters into more solid VCs. I’ve also worked on rejuvenating the Azure VC – now rebadged and reworked as Cloud VC. The Oracle VC is getting the next treatment.
  • I’ve spoken at lots of different events in the past year. I’ve spoken or helped at presentations and conferences in Amsterdam, Exeter (UK), Vienna, Bulgaria, Germany, London, Budapest, Cambridge (UK), Charlotte BA Edition, Paris, and of course SQLPass Summit in the US.
  • I also run my own User Group in Hertfordshire. I gave it the cute name HUGSS – Hertfordshire User Group for SQL Server. Nice, right? We could all do with a SQL hug now and again.
  • I also help Julie Koesmarno with the BI VC. Julie is an exemplary VC lead and I love working with her.
  • I also helped run SQLRelay last year, organising an event in Hertfordshire.
  • I’ve also run Women in Technology sessions in Lisbon and Exeter, and we held our first one in Denmark this year.

I see the MVP Award as a gift from Microsoft. It is not something to be earned.

I do see it, from my perspective, as a pass which makes it easier for me to do good things for people in the technical community. It simply makes it easier for me to speak to Microsoft team members and other people in the community. I gladly accept the MVP Award, and I do something with it.

So I don’t rest on my laurels when I look at the MVP Award. Instead, I think of what I’m going to do next that will be a community good.

I thank the great team at Microsoft for giving me this gift. I hope that I repay their trust by taking this Award and trying to do good things for the SQLFamily.

That does not mean I always get things right. I know that I don’t.

In case you want to email me about any of these comments, or anything about PASS in particular, my email is jen.stirrup@sqlpass.org and I look forward to hearing from you.
 
Last but not least, I have a whole group of people to thank:
The PASS team: Lana, Vicki, Amy L, Elizabeth, Karla and Carmen – for keeping me right and having endless patience with me, and for being tremendously smart. You know you get email from people and you think to yourself “that’s really clever, why didn’t I think of that?” whilst giving yourself a facepalm? Well this great team does this to me every week, at least once. Usually much more than that!
The PASS Board: the whole team. They continue to inspire me with their leadership and commitment, and I remain overawed by how smart they are.
People who put up with me: Allan Mitchell, my business partner, for endlessly humouring me (although he puts the word ‘long-suffering’ in front of the title ‘business partner’ and I have no idea why); Mark Broadbent, who genuinely is long-suffering with all the patience he shows me, and I’m so very grateful for his help (he knows I’m terrible at delegating); and James Rowland-Jones for constant wisdom and a great ‘ear’.
And of course Microsoft. They really care about users loving their products and I hope I can help users to learn how to do stuff better. Oh, and have fun with data.
And my son. The most kind-hearted, gentle, precious boy who is turning into an upright, steadfast, responsible young man. He didn’t have the best start in life due to various illnesses, and there were times when I (and the doctors) didn’t think he’d make it through the night. But he did, and he deserves everything good I can give him.
I am sure that there are others! All these events and activities: I could not do these without lots of other people. Thank you for putting up with me.
What’s next? Well, a SQLSaturday Business Analytics edition. You will have to watch this space for more details.
 
 

 

Watch Microsoft Build 2014 Conference Keynote online

Microsoft’s developer conference, Build 2014, is sold out. But you can still watch it online and immerse yourself in what’s next for the Microsoft platform and tools. Streamed keynotes start at 8:30 A.M. PDT on Wednesday, April 2. Most sessions will also be available on demand within 24 hours.

Data Visualisation with Hadoop, Hive, Power BI and Excel 2013 – Slides from SQLPass Summit and SQLSaturday Bulgaria

I presented this session at SQLPass Summit 2013 and at SQLSaturday Bulgaria.

The session covers some data visualisation theory and an overview of Big Data, and finishes with the Microsoft distribution of Hadoop. I will try to record the demo as part of a PASS Business Intelligence Virtual Chapter online webinar at some point, so please watch this space.

I hope you enjoy it, and I look forward to your feedback.

Leadership Styles: My perspective on how to say no to ideas

Denise McInerny posed the following question on the PASS Election Discussion Board, and I have posted my answer here:

PASS has a lot of passionate and creative people with many good ideas. Like all organizations we have finite resources, which means we can’t do everything we want to do. One of the hardest things about being on the Board is saying “no” to a good idea. How would you approach that aspect of the job?

Let me give you a recent example where an email precipitated a huge and very heated community debate – the closure of the MCM program. Although I was not part of the decision-making at all, I was part of the process of the communication around the closure of the MCM Program because I chaired a conference call between Microsoft and the MCM community. For some reason, The Register obtained a copy of it without my knowledge, but it was supposed to be restricted to the MCM community.

In order to understand more about why the decision to close the MCM happened and to facilitate conversation and discussion between the community and Microsoft, I opened a Connect case, which ended up being the highest-voted SQL Server Connect case, with a staggering 800-plus upvotes.
By opening a Connect case, I opened a two-way conversation which, unfortunately, ended up turning sour as people vented a very personal series of criticisms of individual community members, which I will not deign to repeat here. Due to this, the Connect case was closed, since it was being dragged around by a tiny but extremely vocal minority who felt a Connect case was an appropriate forum to make personal and wholly unfounded criticisms of people who worked at Microsoft, or were attached to the Community in some way.

I then worked with Microsoft in order to host a conference call with the MCM community, whom I deeply respect. Despite the presence of the trolls on the Connect case, it was clear that there were a number of extremely smart, engaged people, who had great ideas about the way forward for the MCM program and for MSL in particular. This was in spite of their huge personal disappointment at the closure of the program, in which many had invested a lot of money, time and effort.

I chaired the call between the MCM community and Microsoft, collating questions over a number of days and distilling them into a number of common themes due to the repetition of some questions.

Although the call did not produce the outcome that many wanted, it was at least a way forward for facilitating communication between Microsoft and the MCM community in a more formal environment, which reduced the heat of the Connect case which had been hijacked by trolls. It at least gave a voice to the MCM people who really deserved it, and had great questions and comments about the MCM closure decision, and plans for the way forward.

To summarise, this is an example where I’ve played a part in trying to resolve a very heated community situation, through communication, active participation in the community, and an absolute belief that the good hearts and best minds in the community deserved a hearing, as well as allowing Microsoft to have a say. Incidentally, I’d like to thank Tim Sneath and his team for making the time and facilities available to make the communication happen. I also found a way forward to deal with the trolls who were hijacking the normal means of communication, i.e. comments fired at a Connect case.

It was one of these situations where people deserved more than an email, and I think it was right to make it happen. I think that a ‘copy and paste’ email misses the point somewhat, since it does not seem to echo the idea of listening to the individual(s), or taking them seriously. Getting a somewhat modified template answer just doesn’t seem to fit with the energy that people have put into bringing an idea to you.

Saying no can be hard, but if you can clarify ‘why not’, then it can help to reach a common ground between yourself and the community. Sometimes what you mean is ‘not yet’. Communication, and fair communication which isn’t one-sided (as an email is), is the way forward.

In my experience, it is too easy to email, and much harder to pick up the phone or meet in person, but the effort can be worth it. An email can even come across as disrespectful. Also, if it is a bad idea that morphs into a good idea after discussion, it is important to give credit where it is due.

I propose that sometimes picking up the phone, or a proper conference call, might be the way forward. It depends on lots of factors, such as the range of the idea, numbers affected, how the idea generators might take it, and so on.

Whilst it is important not to get dragged around by a vocal minority, sometimes a simple conversation is all that it takes, and in today’s connected world, there is no excuse not to do that.

MCM Connect: Please stay constructive. Let the clever people shine through.

Recently, I raised a Connect item about the MCM program.
 
Please, let’s continue to debate with professionalism, dignity, insights and, above all, grace.
 
The point of the Connect item was to have a constructive debate about the demise of the MCM program. I felt it was a good way for people to give feedback to Microsoft openly and freely about their views, either for or against the decision.

I believe that the MCM community has smart, clever dedicated people in it, and I hoped to consolidate that Community – plus the wider MCM community in SharePoint, Exchange etc – so the viewpoints could be made visible. By throwing it out to the open, I believed that the Community would have good ideas and insights about the Program that Microsoft may not have considered. I value the genius in the Community.

However, I have been utterly disappointed that my noble? naïve? venture started to descend into personal criticisms of individuals in the Community and of people who work for Microsoft. This was not my intention at all. This turn of events was not what I expected from people from a Community that I respect. I speak as an individual, and as someone who has been fortunate enough to hold the MVP Award for SQL Server for the past three years.

I would ask that the commentary remains professional and constructive. If it does not, I will speak to Microsoft about closing the Connect case, and you can thank the trolls who have forced me towards this course of action.

 
If people want to troll, my email address is jenstirrup [at] jenstirrup [dot] com and you can come and find me. I will absorb poison if it means that it is deflected away from my #SQLFamily.
 
I hope that the genuine people and great ideas and debate won’t be drowned out by trolls. They deserve better.