What’s wrong with CRISP-DM, and is there an alternative?

Many people, including myself, have discussed CRISP-DM in detail. However, I didn’t feel totally comfortable with it, for a number of reasons which I list below. Now that I had raised a problem, I needed to find a solution, and that’s where the Microsoft Team Data Science Process comes in. Read on for more detail!

  • What is CRISP-DM?
  • What’s wrong with CRISP-DM?
  • How does technology impinge on CRISP-DM?
  • What comes after CRISP-DM? Enter the Team Data Science Process
  • What is the Team Data Science Process?

 

What is CRISP-DM?

One common methodology is the CRISP-DM methodology (The Modeling Agency). The Cross Industry Standard Process for Data Mining, or CRISP-DM as it is known, is a process framework for designing, creating, building, testing, and deploying machine learning solutions. The process is arranged into six phases, which can be seen in the following diagram:

[Image: the CRISP-DM process diagram]

The phases are described below:

  • Business Understanding / Data Understanding – The first phase looks at the machine learning solution from the business standpoint, rather than a technical standpoint. Once the business concept is defined, the Data Understanding phase focuses on data familiarity and collation.
  • Data Preparation – In this stage, the data is cleansed and transformed, and it is shaped ready for the Modeling phase.
  • Modeling – In the Modeling phase, various techniques are applied to the data. The models are further tweaked and refined, and this may involve going back to the Data Preparation phase in order to correct any unexpected issues.
  • Evaluation – The models need to be tested and verified to ensure that they meet the business objectives that were defined initially in the Business Understanding phase. Otherwise, we may have built a model that does not answer the business question.
  • Deployment – The models are published so that the customer can make use of them. This is not the end of the story, however.

Then, the CRISP-DM process restarts. We live in a world of ever-changing data, business requirements, customer needs, and environments, and the process will be repeated.

CRISP-DM is possibly the most well-known framework for implementing machine learning projects specifically. It has a good focus on the business understanding piece.

What’s wrong with CRISP-DM?

The model no longer seems to be actively maintained. At the time of writing, the official site, CRISP-DM.org, is no longer being maintained. Further, the framework itself has not been updated to address working with new technologies, such as Big Data.

As a project leader, I want to keep up-to-date with the newest frameworks and the newest technology. It’s true what they say: you won’t get a change until you make a change.

The methodology itself was conceived in 1996, 21 years ago. I’m not the only one to come out and say so: industry veteran Gregory Piatetsky of KDNuggets had the following to say:

CRISP-DM remains the most popular methodology for analytics, data mining, and data science projects, with 43% share in latest KDnuggets Poll, but a replacement for unmaintained CRISP-DM is long overdue.

Yes, people. Just because something’s popular, it doesn’t mean that it is automatically right. Since the title ‘data scientist’ is the new sexy, lots of inexperienced data scientists are rushing to use this model because it is the obvious one. I don’t think I’d be serving my customers well if I didn’t keep up-to-date, and that’s why I’m moving away from CRISP-DM to the Microsoft Team Data Science Process.

CRISP-DM also neglects aspects of decision making. James Taylor, a veteran of the PASS Business Analytics events, explains this issue in great detail in his blog series over at KDNuggets. If you haven’t read his work, I recommend you read his article now and learn from his wisdom.

How does technology impinge on CRISP-DM?

Big Data technologies mean that there can be additional effort spent in the Data Understanding phase, for example, as the business grapples with the additional complexities involved in Big Data sources.

What comes after CRISP-DM? Enter the Team Data Science Process

The next framework, Microsoft’s Team Data Science Process, is aimed at including Big Data as a data source. As previously stated, the Data Understanding phase can be more complex in this situation.

Big Data and the Five Vs

There are debates about the number of Vs that apply to Big Data, but let’s go with Ray Wang’s definitions here. Our data can be subject to the five Vs, as follows:

[Image: Ray Wang’s five Vs of Big Data]

This means that our data becomes more confusing for business users to understand and process. This issue can easily distract the business team away from what they are trying to achieve. Following the Microsoft Team Data Science Process can help us to ensure that we have taken our five Vs into account, whilst keeping things ticking along for the purpose of the business goal.

As we stated previously, CRISP-DM doesn’t seem to be actively maintained. With Microsoft dollars behind it, the Team Data Science process isn’t going away anytime soon.

What is the Team Data Science Process?

The process is shown in this diagram, courtesy of Microsoft:

[Image: the Team Data Science Process lifecycle]

The Team Data Science Process is loosely divided into five main phases:

  • Business Understanding
  • Data Acquisition and Understanding
  • Modelling
  • Deployment
  • Customer Acceptance

Each phase is described in more detail below:

  • Business Understanding – The Business Understanding process starts with a business idea, which is to be solved with a machine learning solution. A project plan is generated.
  • Data Acquisition and Understanding – This important phase focuses on fact-finding about the data.
  • Modelling – The model is created, built and verified against the original business question, and its results are evaluated against the key metrics.
  • Deployment – The models are published to production, once they are proven to be a fit solution to the original business question.
  • Customer Acceptance – This process is the customer sign-off point. It confirms that the pipeline, the model, and their deployment in a production environment satisfy the customer’s objectives.

 

The TDSP process itself is not linear; the output of the Data Acquisition and Understanding phase can feed back to the Business Understanding phase, for example. When the essential technical pieces start to appear, such as connecting to data and integrating multiple data sources, there may be actions arising from this effort.

The TDSP is a cycle rather than a linear process, and it does not finish, even once the model is deployed. Keep testing and evaluating that model!

TDSP Next Steps

There are a lot of how-to guides and downloads over at the TDSP website, so you should head over and take a look.

The Data Science ‘unicorn’ does not exist. Thanks to Hortonworks for their image below:

[Image: the Data Science unicorn, courtesy of Hortonworks]

To mitigate this lack of a Data Science unicorn, the Team Data Science Process is a team-oriented solution which emphasizes teamwork and collaboration throughout. It recognizes the importance of working as part of a team to deliver Data Science projects. It also offers useful information on the importance of having standardized source control and backups. It can include open source technology as well as Big Data technologies.

To summarise, the TDSP provides a clear structure for you to follow throughout the Data Science process, and facilitates teamwork and collaboration along the way.

Note to Self: A roundup of the latest Azure blog posts and whitepapers on polybase, network security, cloud services, Hadoop and Virtual Machines

Here is a roundup of Azure blogs and whitepapers which I will be reading this month.

This is the latest as at June 2014, and there is a focus on cloud security in the latest whitepapers, which you can find below.

  • PolyBase in APS – Yet another SQL over Hadoop solution?
  • Desktop virtualization deployment overview
  • Microsoft updates its Hadoop cloud solution
  • LG CNS build a B2B virtual computer service in the cloud
  • Deploying desktop virtualization
  • Accessing desktop virtualization
  • The visualization that changed the world of data
  • Access and Information Protection: Setting up the environment
  • Access and Information Protection: Making resources available to users
  • Access and Information Protection: Simple registration for BYOD devices
  • Success with Hybrid Cloud webinar series
  • Power BI May round-up
  • Access and Information Protection: Syncing and protecting corporate information

Here are the latest whitepapers, which focus on security:

 
  • Windows Azure Security: Technical Insights. An update to the Security Overview whitepaper which provides a detailed description of security features and controls.
  • Security Best Practices for Windows Azure Solutions. Updated guidance on designing and developing secure solutions.
  • Windows Azure Network Security. Recommendations for securing network communications for applications deployed in Windows Azure.
  • Microsoft Antimalware for Azure Cloud Services and Virtual Machines. This paper details how to use Microsoft Antimalware to help identify and remove viruses, spyware, and other malicious software in Azure Cloud Services and Virtual Machines.

    Data Visualisation with Hadoop, Hive, Power BI and Excel 2013 – Slides from SQLPass Summit and SQLSaturday Bulgaria

    I presented this session at SQLPass Summit 2013 and at SQLSaturday Bulgaria.

    The topic focuses on some data visualisation theory, gives an overview of Big Data, and finishes with the Microsoft distribution of Hadoop. I will try to record the demo as part of a PASS Business Intelligence Virtual Chapter online webinar at some point, so please watch this space.

    I hope you enjoy and I look forward to your feedback.

    Hadoop Summit Europe 2014 Call for Abstracts is now open


    The Call for Abstracts for the EMEA Hadoop Summit is now officially open. If you are interested in registering, please click here. FYR, the closing date is 31st October 2013.
    Who should submit? If you are a developer, architect, administrator, data analyst, data scientist, IT or business leader or otherwise involved with Apache Hadoop and have a compelling topic that you would like to present at Hadoop Summit, the Hadoop folks will welcome your submission. 
    What do you get?
    Kudos.
    Being part of a fun Hadoop crowd! Being a speaker is all kinds of cool. (For me, this is the best bit!)
    All presenters receive a complimentary all-access pass to Hadoop Summit so you get to learn too. 
    What should you submit?
    The content selection committee is particularly interested in compelling use cases and success stories, best practices, cautionary tales and technology insights that help to advance the adoption of Apache Hadoop.
    Tracks this year include:
    • Committer – Speakers in this track are restricted to committers across all Hadoop-related Apache projects only and content will be curated by a group of senior committers
    • The Future of Apache Hadoop – Investigating the key projects, incubation projects and the industry initiatives driving Hadoop innovation.
    • Data Science & Hadoop – Discussing applications, tools, and algorithms, research and emerging applications that use and extend the Hadoop platform for data science
    • Hadoop Deployment & Operations – Focusing on deployment, operation and administration of Hadoop clusters at scale, with an emphasis on tips, tricks, best practices and war stories
    • Hadoop for Business Applications and Development – Presentation topics that discuss the languages, tools, techniques, and solutions for deriving business value from data.

    Submission deadline October 31, 2013.

    Good luck!

    Eating the elephant, one bite at a time: Loading data using Hive

    In the previous ‘Eating the elephant’ blogs, we’ve talked about tables and their implementation. Now, we will look at one of the ways to get data into a table. There are different ways to do this, but here we will look only at getting data into an external Hive table using HiveQL, which is the Hive query language. Sounds similar to SQL, doesn’t it? Well, that’s because a lot of it looks and smells similar. When you look at this example here, you will see what I mean!

    Here is the Hive syntax to create an external table. In this post, we will look at simple examples, and perhaps look at partitioned tables on another occasion.

    First things first; you need to create an external table to hold your data. Before you do, check the contents of the folder /user/hue/UNICEF, if it exists. If it doesn’t, don’t worry; the syntax below will create it for you.
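    If you’d like to check from the command line rather than the File Explorer, the Hive shell can list the folder for you. This is just a sketch; the dfs command is a feature of the Hive CLI and may not be available in every query editor:

    -- List the target folder on HDFS from inside the Hive shell
    dfs -ls /user/hue/UNICEF;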

    Create your table

    CREATE EXTERNAL TABLE IF NOT EXISTS UNICEFReportCard (
    CountryName     STRING,
    AvgRankPosition     FLOAT,
    MaterialWellBeing     FLOAT,
    HealthAndSafety     FLOAT,
    EducationalWellBeing     FLOAT,
    FamilyPeersRelationships     FLOAT,
    BehavioursAndRisks     FLOAT,
    SubjectiveWellBeing     FLOAT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/user/hue/UNICEF';
     
     
    Note that the directory /user/hue/UNICEF has now been created, if it was not there already.
    You may now see it in the File Explorer, and here is an example below:
      
     

    [Image: the new directory created with the table]



    The next step is to upload some data to folders. For the purpose of showing what Hive does when you load up data files, let’s place a file into a different folder, called /user/hue/IncomeInequality/, and call the file ‘UNICEFreportcard.csv’. The syntax to load the file statically is here:

    LOAD DATA INPATH '/user/hue/IncomeInequality/UNICEFreportcard.csv' OVERWRITE INTO TABLE `unicefreportcard`;

    You can see this here, and if it is not clear, you can click on the link to go to Flickr to see it in more detail:

    [Image: inserting the data into the Hive table]

    Once you have executed the query, you can check that the data is in the table. This is easy to do using the HUE interface. Simply go to the Beeswax UI, which you can see as the yellow icon at the top left-hand side of the HUE interface, and then select the ‘Tables’ menu item from the grey bar just under the green bar. You can select the UNICEFreportcard table, and then choose ‘Sample’. You should see data, as in the example screenshot below:

    [Image: the data inserted into the table]
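    If you prefer the query editor to the Sample tab, a quick HiveQL query does the same check. This is a minimal sketch, using the table created earlier in this post:

    -- Return a handful of rows to confirm that the load worked
    SELECT * FROM unicefreportcard LIMIT 10;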

    When the data is imported, where does it get stored for the external table?

    The UNICEFreportcard.csv file was originally in the IncomeInequality folder, but it is no longer there. Where has it gone? It has been moved to the UNICEF folder, which was specified as the location when we created the external table.

    When we drop the table, the data file is not deleted. We can still see the CSV file in the UNICEF folder. This is straightforward to test; simply drop the table using the button, and then use File Explorer to go to UNICEF folder and you will see the CSV. Here it is:

    [Image: the file location for external files]
     

    To test the separation of the data and the table metadata, simply re-run the query to create the table as before. Now, when you view a sample of the data, you see the data as you did before. When you dropped the table originally, you simply removed the table metadata; the data itself was preserved. This is because the table is an external table, and Hive maintains a separation between the data and the table metadata.
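    Here is that test written out as a short script. It is a sketch only, and it assumes the table definition and CSV file from earlier in this post:

    -- 1. Drop the table: only the metadata is removed; the CSV stays in /user/hue/UNICEF
    DROP TABLE IF EXISTS unicefreportcard;

    -- 2. Re-run the CREATE EXTERNAL TABLE statement from earlier
    --    (same columns, same LOCATION '/user/hue/UNICEF')

    -- 3. The data reappears without another LOAD DATA step
    SELECT * FROM unicefreportcard LIMIT 10;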

    If the table were an internal (managed) table, dropping the table would delete the data as well – so be careful!
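    For contrast, here is what a managed (internal) table looks like. This is a sketch with a made-up table name, simply to show that the key difference in the DDL is the missing EXTERNAL keyword (and, usually, the missing LOCATION):

    -- A managed (internal) table: Hive owns the data as well as the metadata
    CREATE TABLE IF NOT EXISTS UNICEFReportCardManaged (
        CountryName     STRING,
        AvgRankPosition FLOAT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

    -- Dropping it deletes the underlying files in the Hive warehouse too
    DROP TABLE UNICEFReportCardManaged;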

    To summarise, in this post we’ve looked at one way of getting data into Hive. Our next steps will be to do something with the data, now that it is in Hive. So let’s move on to do some simple querying before we proceed to visualise the data.

    I hope that helps!
    Jen

     
     

    Eating the elephant, one bite at a time: Partitioning in SQL Server 2012 and in Hive

    Hive and SQL Server offer you the facility to partition your tables, but their features differ slightly. This blog will highlight some of the main differences for you to look out for.
    What problems does partitioning solve? The problem arises from the size of the tables. When a database table grows to hundreds of gigabytes or more, it can become more difficult to complete actions such as:
    * load new data
    * remove old data
    * maintain indexes
    This is because of the size of the table, which simply means that the operations take longer since they must traverse more data. This does not only apply to reading data; it also applies to other operations such as INSERT or DELETE. The simplest solution is to divide the table up into smaller parts, which are called partitions.
    How does partitioning solve the problem? 
    In SQL Server, you have table partitioning features, which can make very large tables and indexes easier to manage, and improve the performance of appropriately filtered queries. If you’d like more information, see Joe Sack’s excellent presentation here.
    The SQL Server optimiser can send the query to the correct partition of the table, rather than to the whole table. Since the table is broken down into manageable chunks, it is easier for maintenance operations to manage a smaller number of partitions, rather than a massive lump of data in one table. SQL Server partitioning does mean that you need to use the partitioning key consistently when querying a partitioned table, so that the query can run more efficiently.
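    For comparison, here is a minimal T-SQL sketch of the moving parts on the SQL Server side – a partition function, a partition scheme, and a table created on that scheme. The object and column names are made up purely for illustration:

    -- Partition function: boundary values split the data by year
    CREATE PARTITION FUNCTION pfSalesByYear (int)
    AS RANGE RIGHT FOR VALUES (2011, 2012, 2013);

    -- Partition scheme: maps each partition to a filegroup (all to PRIMARY here)
    CREATE PARTITION SCHEME psSalesByYear
    AS PARTITION pfSalesByYear ALL TO ([PRIMARY]);

    -- The table is created on the scheme, partitioned by the SaleYear column
    CREATE TABLE dbo.Sales
    (
        SaleID   INT   NOT NULL,
        SaleYear INT   NOT NULL,
        Amount   MONEY NOT NULL
    ) ON psSalesByYear (SaleYear);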

    SQL Server partitioning can be complex because it’s up to the SQL Server person, with their fingers on the keyboard, to decide how partitions will be swapped in and out, for example. This means that, often, two people might do two completely different things when it comes to deciding how the partitioning should go. As someone who often goes onsite to deliver projects, this means that you can sometimes see different things, done different ways.

    On the other hand, Hive does not use sophisticated indexes in the way that many RDBMSs do, which is what allows them to answer queries in seconds. Hive queries tend to run for much longer, and are really intended for complex analytical queries.

    In SQL Server, what types of partitioning are available?
    There are two types of partitioning in SQL Server:
    * manually subdivide a large table’s data into multiple physical tables
    * use SQL Server’s table partitioning feature to partition a single table
    The first option is not automatic in SQL Server, and would require extensive DBA skills to implement.

    In Hive, it is possible to set up an automatic partition scheme at the point at which the table is created. As you can see from the example below, we create partitioned columns as part of the table creation. This creates a subdirectory for each value in the partition column. This means that queries with a WHERE clause on the partition column will go straight to the relevant subdirectory and scan only that subdirectory, therefore looking only at a subset of the data. The query returns its results faster, because the partition ‘directs’ the query quickly towards the relevant data. What’s great about this, for me, is the consistency: Hive looks after the partition maintenance and creation for us. Here is an example table script:

    CREATE EXTERNAL TABLE GNI (
        CountryName STRING,
        GNI INT
        )
    PARTITIONED BY (CountryID INT, Year INT);

    In Hive, partitioning key columns are specified as part of the table DDL, and we can see them quite clearly in the PARTITIONED BY clause. Hive uses the folder structure to generate folders which are married to the partitions. Here is an example of adding a partition using the command line:

    ALTER TABLE GNI ADD PARTITION(countryid = 1, year = 2008)
    LOCATION '/user/hue/IncomeInequality/2008/1';

    In Hive, partitioning data can help by making queries faster. If you work with data, you’ll know that users don’t like waiting for it, and it’s important to get it to them as quickly as you can. How does Hive do this? It uses the directory structure to zoom straight to the correct partition, based on the file structure. Here is the resulting file structure for our queries above:

    [Image: the partition location in Hive]
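    As an illustration of how the partition columns ‘direct’ a query, here is a sketch of a filtered query against the GNI table above. The filter values are hypothetical, and the WHERE clause matches the partition we added:

    -- Hive only needs to scan the subdirectory for countryid=1/year=2008,
    -- rather than the whole table
    SELECT CountryName, GNI
    FROM GNI
    WHERE countryid = 1 AND year = 2008;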

    You can also use the ‘SHOW PARTITIONS’ command to list the partitions in a given table; the command and its result in Hive are shown below.
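    For the GNI table created above, the command is simply:

    -- List the partitions that Hive knows about for this table
    SHOW PARTITIONS GNI;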

    [Image: the result of the SHOW PARTITIONS command]

    You can also use ‘DESCRIBE EXTENDED’ to provide partition information:

    DESCRIBE EXTENDED GNI;

    Here is the resulting output:

    [Image: the result of DESCRIBE EXTENDED, showing partition information]

    Incidentally, the naming structure is flexible, and you don’t need to locate partition directories next to one another. This is useful because it means that you could locate some data on cheaper storage options and keep other pieces of data elsewhere. It also means that you could clearly distinguish, by naming, your older, archived data from your current data.

    What are your options for storing data? In the Microsoft sphere, the Windows Azure HDInsight Service supports both the Hadoop Distributed File System (HDFS) and Azure Storage Vault (ASV) for storing data. Windows Azure Blob Storage, or ASV, is a Windows Azure storage solution which provides a full-featured HDFS file system interface for Blob Storage, enabling the full set of components in the Hadoop ecosystem to operate (by default) directly on the data managed by Blob Storage. Blob Storage is a relatively cheap solution, and storing data in Blob Storage enables the HDInsight clusters used for computation to be safely deleted without losing user data, because the storage is separate from the compute.

    For example, you could store archive data on HDFS and other data in Azure Storage, by specifying different locations for your partitions. It is also an option to scale horizontally by simply adding further servers to the cluster, which is a simple ‘hardware’-oriented approach that people understand, i.e. if you want more, buy a server; this seems to be a common route for solving problems, whether it is the correct one or not!
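    As a sketch of what that might look like, here are two partitions added at different locations. The storage account, container and paths are entirely hypothetical, and the asv:// URI scheme for Blob Storage has varied across HDInsight versions, so do check the current documentation:

    -- Recent data kept on the cluster's HDFS storage
    ALTER TABLE GNI ADD PARTITION (countryid = 1, year = 2013)
    LOCATION '/user/hue/IncomeInequality/2013/1';

    -- Older, archive data kept in Azure Blob Storage (ASV)
    ALTER TABLE GNI ADD PARTITION (countryid = 1, year = 2000)
    LOCATION 'asv://archive@mystorageaccount.blob.core.windows.net/IncomeInequality/2000/1';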

    In the next segment, we will start to look at HiveQL, which focuses on getting data into the filesystem and back out again. As a Business Intelligence person, for me it’s all about the data, and I hope you’ll join me for the next adventure.

    I hope that helps,
    Jen

    Eating the elephant one bite at a time: dropping databases

    In the last post, you learned how simple it is to create a database using Hive. The command is very similar to the way that we do this in SQL Server. As discussed, the underlying technology works differently, but ultimately they achieve the same end: database created.

    Now that you’ve created your database, how do you drop it again? Fortunately, this is very simple to do if you’re already a SQL Server aficionado.

    Here is an example:

    DROP DATABASE IncomeInequality;

    If you want to drop the database without an error message when the database doesn’t exist, then you can use:

    DROP DATABASE IF EXISTS IncomeInequality;
     

    Dropping a database in Hive isn’t always this straightforward, however. In SQL Server, when you drop a database, it removes the whole database and the disk files used by it. Hive, on the other hand, will not allow you to drop a database if it contains tables. To get around this default behaviour, you have to add the CASCADE keyword to the command.

    DROP DATABASE IF EXISTS IncomeInequality CASCADE;

    Using this command will delete the tables first and then drop the database. Then, as in SQL Server, its directory is deleted too.
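    If you would rather not rely on CASCADE, a more cautious sketch is to check what the database contains and drop the tables yourself first. The table name here is just an example:

    USE IncomeInequality;
    SHOW TABLES;                                -- see what would be lost
    DROP TABLE IF EXISTS unicefreportcard;      -- drop each table explicitly
    DROP DATABASE IF EXISTS IncomeInequality;   -- now the plain DROP succeeds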

    Once you’ve executed the command, you should double-check that the database has gone:

    SHOW DATABASES; 

    You can then see the result in Hue (more on this later! Let’s keep to small bites of the elephant for now!):

    [Image: the result of SHOW DATABASES in Hue]

    Here, we have dropped the database and we only have the default left.

    In the next post, we will look at dropping tables in Hive. This is more complex than it first seems.
    I hope that helps!
    Jen