Getting started in Machine Learning: Google vs Databricks vs AzureML vs R

Executive Summary

Machine learning is high on the agenda of cloud providers. From startups to global companies, technology decision makers are watching the achievements of Google and Amazon's Alexa with a view to implementing Machine Learning in their own organizations. In fact, as you read this article, it is highly likely that you have interacted with Machine Learning in some way today. Organizations such as Google, Netflix, Amazon, and Microsoft have Machine Learning as part of their services. Machine Learning has become the 'secret sauce' in business- and consumer-facing spheres, such as online retail, recommendation systems, fraud detection and even digital personal assistants such as Cortana, Siri and Amazon's Echo.

The goal of this paper is to give the reader the tools to choose wisely between the range of open source, hybrid and proprietary machine learning technologies, so that the technical choice delivers business benefit. The report will also offer support for devising a long-term machine learning strategy for existing and future implementations. Here, we compare the following technologies:

  • Google Tensorflow
  • R
  • Databricks
  • AzureML
  • Google Cloud Machine Learning

 

Introduction and Methodology

A major challenge facing most organizations today is the decision whether to go open-source, hybrid, or proprietary with their technology vision and selection process.

Machine learning refers to a series of techniques where a machine is trained to solve a problem. Machine Learning algorithms often do not need to be explicitly programmed; instead, they respond flexibly to the environment after receiving intensive training, and broader experience improves their efficiency and capabilities. Machine Learning is proving immensely useful in helping businesses cope with the sheer speed of results they require, along with the demand for more advanced techniques.

The decision on machine learning technology goes beyond a regular technology choice, since it involves a leap of faith that the technology will offer the promised insights. Machine Learning requires creating, collating, refining, modelling, training and evaluating models as an ongoing process. The choice is also determined by how the organization intends to use machine learning technology.

Clearly, organizations see Machine Learning as a growing asset for the future, and they are adding the capability now. Machine Learning adoption will increase in tandem with opportunities in related technologies, such as Big Data and cloud, and as open source becomes more trusted in organizations. The data takes the form of clickstreams and logs, sensor data from various machines, and images and videos. Business teams will want answers to deceptively simple business questions where the answer lies in Big Data sources; however, these data sources can be difficult to analyze. Using insights from this data, companies across various industries can improve business outcomes.

What opportunities are missed if it is not used? By adopting Machine Learning, enterprises are looking to improve their business, or even radically transform it. Organizations that are not working towards automation or machine learning in some way are potentially losing ground against competitors. They are also failing to make use of their existing historical data, and of the data they will collect going forward.

In terms of recent developments, Machine Learning has changed to adapt to the new types of structured and unstructured data sources in the industry. It is also being utilized in real-time deployments. Machine Learning has become more popular as organizations are now able to collect more data, including big data sources, cheaply through cloud computing environments.

In this Roadmap, we examine the options and opportunities available to businesses as they move into Machine Learning, focusing on the most important decision an organization can make: the choice of technology, and whether this should be open source, proprietary, or a hybrid architecture. The Roadmap also introduces a maturity map to investigate how the choice of machine learning technology can be influenced by the organization's overall maturity in delivering machine learning solutions.

Evolution of Machine Learning

Though machine learning has existed for a long time, it is the cloud that has made the technology accessible and usable for businesses of every size. The cloud offers a complete solution with everything that machine learning needs to run, such as tools, libraries, code, runtimes, models and data storage.

According to Google Trends, the term Machine Learning has increased in popularity six-fold since July 2012. In response to this interest, established Machine Learning organizations are leading the way by provisioning their technology through open source. For example, Google open sourced TensorFlow, its set of machine learning libraries, in 2015. Amazon has made its Deep Scalable Sparse Tensor Network Engine (DSSTNE – pronounced 'Destiny') library available on GitHub under the Apache 2.0 license. Technology innovator and industry legend Elon Musk has ventured out with OpenAI, which bills itself as a 'non-profit AI research company, discovering and enacting the path to safe artificial general intelligence.' The technical community has a great deal of Machine Learning energy, evidenced by Google's recent acquisition of Kaggle, the online data science community, which brings an established community of data scientists and potential employees, as well as one of the largest repositories of datasets for training the next generation of machine learning algorithms.

Evolution of Open Source

Why has Open Source achieved so much prevalence in Machine Learning in recent years? Open Source has been a part of Machine Learning right from its inception, but it has gained attention in recent years due to significant successes. For example, AlphaGo was produced using the open source Torch software. AlphaGo's victory over the human Go champion, Lee Sedol, wasn't simply an achievement for artificial intelligence; it was also a triumph for Open Source software.

Evolution of Hybrid and Proprietary Systems

What problems are hybrid and proprietary systems trying to solve? The overarching theme is that proprietary vendors are aiming at the democratization of data, or the democratization of artificial intelligence. This is fundamentally changing the Machine Learning industry, as it takes Machine Learning away from academic institutions and into the hands of business users to support business decision making.

Proprietary solutions can be deployed quickly, and they are designed to be scalable and to work at global scale. Machine Learning is de-coupled from the on-premises estate and moves to a solution that is easier to manage, administer and cost. Vendors must respond nimbly to these changes as data centers make the transition towards powerhouses for analytics and machine learning at scale.

In the battle for market share, innovation at the cloud level is focused on ensuring that standards in governance and compliance are met, with government bodies in mind. As the threat of cybercrime increases, standards of compliance and governance have become a flashpoint for competition.

How are these approaches distinguished? Open source refers to source code that is freely available on the internet, and is created and maintained by the technical community. A well-known example of open source software is Linux. Proprietary software, also known as closed-source software, is distributed and sold under strict licensing terms, with the associated copyright and intellectual property owned by a specific organization. Hybrid architectures are based on a mix of open source and proprietary software.

Methodology

For this analysis, we have identified and assessed the relative importance of some key differentiators: the key technologies and market forces that will redefine the sector, and in which the technologies will strive to gain an advantage. Technology decision makers can use the Disruption Vector analysis to choose the approach that best aligns with their own business requirements.

Here, I assigned a score from 1 to 5 to each company for each vector. The combination of these scores and the relative weighting and importance of the vectors drives the company index across all differentiators.
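To make the arithmetic concrete, here is a minimal sketch of how such a weighted index can be computed. The weights and scores shown are illustrative assumptions, not the figures used in this report:

```python
# Illustrative sketch of a weighted Disruption Vector index.
# The weights and scores below are made-up examples, not the report's data.
weights = {
    "Ease of Use": 0.25,
    "Core Architecture": 0.20,
    "Data Integration": 0.20,
    "Monitoring and Management": 0.15,
    "Sophisticated Modelling": 0.20,
}

scores = {
    "Databricks": {"Ease of Use": 4, "Core Architecture": 4,
                   "Data Integration": 5, "Monitoring and Management": 5,
                   "Sophisticated Modelling": 5},
}

def company_index(company):
    # Weighted sum of the 1-5 vector scores.
    return sum(weights[vector] * score
               for vector, score in scores[company].items())

print(company_index("Databricks"))  # 4.55 with these example numbers
```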

Usage Scenarios

Machine learning, regardless of approach, has several common usage scenarios.

Finance

Finance is one popular use case of machine learning. As in other areas of Machine Learning, the trend is perpetuated by more accessible computing power and by machine learning tools and open source packages dedicated to finance. Machine Learning is pervasive in Finance in both business and consumer solutions, used in activities such as approving loans, risk management, asset management, and currency forecasting.

The term 'robo-advisor' was unheard of five years ago.

Robo-advisors adjust a financial portfolio to the goals and risk tolerance of the consumer based on factors such as attitude to saving, age, income, and current financial assets. The robo-advisor then adapts to these factors to reach the user's financial goals.
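As a toy illustration of the idea – not any real robo-advisor's logic – a simple age-and-risk heuristic might look like this:

```python
# A toy sketch of robo-advisor-style allocation, assuming a simple
# age-and-risk heuristic. Real robo-advisors use far richer models.
def suggest_allocation(age, risk_tolerance):
    """Return (equity %, bond %) from age and a 0-1 risk tolerance."""
    base_equity = max(0, 100 - age)            # the classic "100 minus age" rule
    adjusted = base_equity * (0.5 + risk_tolerance / 2)
    equity = min(100, round(adjusted))
    return equity, 100 - equity

print(suggest_allocation(age=30, risk_tolerance=0.8))  # (63, 37)
```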

Bots are also used to provide customer service in the Finance industry; for example, they can interact with consumers using natural language processing and speech recognition. The bot capability for language, combined with the robo-advisor capability for predicting factors that meet financial goals, means that the Finance world is fundamentally impacted by machine learning from both the customer's and the finance professional's perspective, as millennials fuel the uptake of automated financial planning services.

Healthcare

Why should enterprises care about healthcare? Enterprises have an interest in keeping healthcare costs low as the hard cost of healthcare premiums increases, which can even impact an enterprise's ability to invest in itself. Employee sickness costs affect productivity, so it is in the interests of the enterprise to invest in employee health. U.S. health care costs were $3.2 trillion in 2015, making healthcare one of the country's largest industries, equal to 17.8 percent of US gross domestic product. Many rising healthcare issues, such as diabetes and heart disease, are driven by lifestyle factors.

Machine learning is one of the tools being deployed to reduce healthcare costs. As the use of health apps and smartphones increases, there is more data from Internet of Things technology. This is set to drive health virtual assistants, which can take advantage of IoT data, improved natural language processing and sophisticated healthcare algorithms to provide quick patient advice for current health ailments, as well as to monitor for future potential issues.

Machine Learning can assist healthcare in improving patient outcomes, preventative medicine and predicting diagnoses. In the healthcare industry, it is used for the reduction of patient harm events and hospital-acquired infections, right through to more strategic inputs such as revenue cycle management and patient care management. Machine Learning in healthcare specifically focuses on long-term and longitudinal questions, and helps to evaluate patient care through risk-adjusted comparisons. For the healthcare professional, it helps to have simple data visualizations which display the message of the data, and models deployed for repeated use to help patients long-term.

There are open source machine learning offerings which are aimed specifically at healthcare. For example, healthcare.ai is accessible to the thousands of healthcare professionals who do not have data science skills, but would like to use machine learning technology to help patients.

Marketing

Machine learning has a wide range of applications in marketing. These range from techniques to understand existing and potential customers, such as social media monitoring, search engine optimization and quality evaluation.

There are also opportunities to offer excellent customer service, such as tailored customer experiences, personalized recommendations, and improved cross-selling and up-selling opportunities.

There are a few open source marketing tools which use machine learning, such as the Datumbox Machine Learning Framework. Most machine learning tools aimed at marketing are proprietary, however, such as IBM Watson Marketing.

Key Differentiators

Machine learning decisions are crucial in developing a forward-thinking machine learning strategy that ensures success throughout the enterprise.

Well-known organizations have effectively used machine learning as a way of demonstrating prominence and technical excellence through flagship achievements. Enterprises will need a cohesive strategy that facilitates adoption of machine learning right across the enterprise, from delivering machine learning results to business users through to the technical foundations.

In this section, we discuss the five most important vectors that contribute to this disruption in the industry, which also correspond to factors that are crucial to consider when adopting machine learning as part of an enterprise strategy. The selected disruption vectors focus on an organization's transition towards a successful enterprise strategy for implementing and deploying machine learning technology.

The differentiators identified are the following:

  • Ease of Use
  • Core Architecture
  • Data Integration
  • Monitoring and Management
  • Sophisticated Modelling

Ease of Use

Technology name | Approach | Commentary | Score
R | Open Source | Data scientists | 1
Databricks | Hybrid | Data scientists, but it has interfaces for the business user | 4
Microsoft AzureML | Hybrid | Business analysts through to data scientists | 5
Google Cloud Machine Learning | Proprietary | Data scientists | 3
Google TensorFlow | Open Source | Data scientists | 3

Generally, Machine Learning technology is still primarily aimed at data scientists to deliver and productize the technology, but there is a recognition that other roles can play a part, such as data engineers, DevOps engineers and business experts.

In this disruption vector, one clear differentiator is that Microsoft places the non-technical business analyst front and center of the AzureML technology. AzureML provides a drag-and-drop interface with embeddable R, so more complex models can be built in R and loaded into AzureML. Databricks focuses on a more end-to-end solution, which utilizes different roles at different points in the workflow, with different tools for different parts of the process; the need for a data scientist is balanced by tools specifically targeted at the business analyst. Both AzureML and Databricks allow machine learning models to be consumed by the business user. Google Cloud Machine Learning Engine, Google TensorFlow and the open source R have firmly placed the development of machine learning models in the data scientist and developer spheres. As Google TensorFlow and R are both open source, this is to be expected.

Google's Cloud Machine Learning Engine combines the managed infrastructure of Google Cloud Platform with the power and flexibility of open source Google TensorFlow. It offers a clear roadmap for people starting off in R and open source TensorFlow, and then porting those models into Google Cloud Machine Learning. RStudio, an R IDE, allows TensorFlow to be used with R, so R models can be imported into Google TensorFlow, and then into Google Cloud Machine Learning Engine.

Business users can access their data through a variety of integrations with Google, including Informatica, Tableau, Looker, Qlik, SnapLogic and the Google Analytics 360 Suite. This means that Machine Learning is embedded in the user's choice of interface for the data.

The risk for enterprises is that the multiple Google integration points introduce multiple points of failure, and numerous points of complexity in putting the different pieces of the Google jigsaw together; this is further exacerbated by third-party integrations into the tools that users see. This scenario may change in the future, however, as Google puts Machine Learning at the heart of its data centers and user-oriented applications. The business user is seeing machine learning increasingly present in everyday activities, even including the creation and development of business documents. Google is aimed firmly at the data scientist and the developer, but it does offer its pre-built machine learning intelligence platform. That said, competition is heating up in this space: Google is now bringing machine learning to improve data visualization for business users so that they can make better use of their business data, and Microsoft is adding machine learning to Microsoft Word for Windows, so that it now offers intelligent grammar suggestions.

Core Architecture

Machine Learning solutions should be resilient and robust, with a clear separation between storage and compute so that models are portable.

There are variations in the core technology which differentiate open source technologies from the large vendors. Embarrassingly parallel workloads separate a technical problem into many parallel tasks; the tasks run in parallel, with no interdependencies between the tasks or data. R does not naturally handle embarrassingly parallel workloads well. Many scientific and data analysis tasks require parallel workloads, and packages such as snow, multicore, RHadoop and RHIPE can help R to provision them.
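To illustrate the concept (in Python purely for illustration; packages such as snow and multicore give R similar capabilities), an embarrassingly parallel workload can be split across cores with no coordination between tasks:

```python
# A minimal illustration of an embarrassingly parallel workload: each task
# is independent, so the work can be split across cores with no coordination.
from multiprocessing import Pool
import random

def simulate(seed):
    # A stand-in for an independent unit of work (e.g. one bootstrap sample).
    rng = random.Random(seed)
    return sum(rng.random() for _ in range(100000))

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        results = pool.map(simulate, range(8))  # eight independent tasks
    print(results)
```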

As an open source technology, R is not resilient or robust to failure. R works in-memory, and it is only able to hold the data that resides in memory. Its power comes from open source packages, but these can overwrite one another's functions, which can cause problems at the point of execution that are difficult to troubleshoot without real support.

On the other hand, proprietary cloud-based machine learning solutions offer the separation between storage and compute with ease. Google TensorFlow can use the Graphics Processing Units (GPUs) that Google has in place in its data centres, which Google is now rebranding as AI centres. To mitigate against technical debt, both Databricks and Google Cloud Machine Learning Engine have a clear trajectory from the open source technology of Google TensorFlow towards a full enterprise solution, which provides confidence for widespread enterprise adoption. As a further step, Databricks also allows porting models from Apache Spark ML and Keras, as well as from other languages such as R, Python or Java. As a further signal of the importance of the open source path in reducing technical debt, Google has released a neural network package, Sonnet, as a partner to TensorFlow, to reduce friction in model switching during model development.

Technology name | Approach | Commentary | Score
R | Open Source | Not resilient; liable to errors | 1
Databricks | Hybrid | Reduces technical debt by being open to most technologies | 4
Microsoft AzureML | Hybrid | R and Python are embedded; solution is robust, but not open to TensorFlow or other packages | 4
Google TensorFlow | Open Source | APIs are not yet stable | 2
Google Cloud Machine Learning Engine | Proprietary | Cloud architecture with a clear onboarding process for open source technology | 4

Data Integration

Data is more meaningful when it is put with other data. Here are the ways in which the technologies differ:

Technology name | Approach | Commentary | Score
R | Not Open | R is one of the spoke programming languages, but it is not a hub in itself | 1
Databricks | Highly Open | Facilitates SQL, R, Python, Scala and Java, plus machine learning frameworks/libraries such as scikit-learn, Apache Spark ML, TensorFlow and Keras | 5
Microsoft AzureML | Moderately Open | R and Python are embedded; solution is robust | 3
Google TensorFlow | Open Source | Offers a few APIs, but these are not yet stable; Python is considered stable, while C++, Java and Go are not | 3
Google Cloud Machine Learning Engine | Proprietary | APIs are offered through TensorFlow | 3

To leverage Machine Learning on the cloud without significant rework, solutions should support straightforward data import into machine learning systems. We need to see improved support for databases, time period definition, referential integrity and other enhancements.

Machine Learning models run in IaaS and PaaS environments, consuming the APIs and services exposed by the cloud vendors and producing data output which can be interpreted as the results. The cloud environment can prevent portability of workloads, and organizations are concerned about vendor lock-in of cloud platforms for machine learning.

The modelling process itself involves taking a substantial amount of inbound data, analyzing it, and determining the best model. The machine learning solution also needs to be robust and to recover gracefully from failures during processing, such as network failures.

The enterprise transition to the cloud for machine learning should not impact the optimization of the machine learning technology, and it should not impact the structures used or created by the machine learning process.

Machine learning solutions should be able to ingest data in different formats, such as JSON, XML, Avro and Parquet. The models themselves should be expressible in a portable format, such as PMML.
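As one example of model portability, a scikit-learn pipeline can be exported to PMML with the third-party sklearn2pmml package; this is a minimal sketch assuming that package and a Java runtime are installed:

```python
# A minimal sketch of exporting a model to the portable PMML format,
# assuming the third-party sklearn2pmml package and a Java runtime.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

X, y = load_iris(return_X_y=True)

pipeline = PMMLPipeline([("classifier", DecisionTreeClassifier())])
pipeline.fit(X, y)

# The resulting .pmml file can be consumed by any PMML-aware scoring engine.
sklearn2pmml(pipeline, "iris_tree.pmml")
```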

The range of modelling approaches within data science means that data scientists can approach a modelling problem in many ways. For example, data scientists can use scikit-learn, Apache Spark ML, TensorFlow or Keras. Data Scientists can also choose from a number of different languages: SQL, R, Python, Scala or Java, to name a few.

Of all the packages and frameworks, Databricks scored the best in terms of data integration. Depending on their skill set, data scientists can use scikit-learn, Apache Spark ML, TensorFlow or Keras, and can choose from several different languages: R, Python, Scala or Java. AzureML and Google Cloud Machine Learning Engine are more restrictive in their onboarding approach. AzureML integrates with the R and Python languages, and it will ingest data from Azure data sources and OData. Google Cloud Machine Learning Engine will accept data that is serialized in a format that TensorFlow can accept, normally CSV. Further, like AzureML, Google requires the data to be stored in a location where the service can access it, such as Google BigQuery or Cloud Storage.

AzureML's dependency on R may be a popular choice, but there is a risk of inheriting issues from R code that is not high quality. The sheer volume of R packages is often given as a reason to adopt R; however, volume does not mean that the packages are high quality. Some packages are small and rarely developed, yet they are still hosted on the R repository, CRAN, with no real distinction between them and the well-maintained, more robust packages. Google TensorFlow, on the other hand, has a large, well-maintained library, which is now extended by Sonnet.

Monitoring and Management

What you don’t monitor, you can’t manage.

Technology name | Score
R | 1
Databricks | 5
Microsoft AzureML | 5
Google TensorFlow | 2
Google Cloud Machine Learning Engine | 5

Machine Learning platforms provide a rich set of techniques to develop and optimise all aspects of the workflow, ranging from the onerous task of cleaning the inbound data to deploying the final model to production.

As Machine Learning becomes embedded in the enterprise data estate, it should also be robust. Machine learning models should not be allowed to run out of space; instead, machine learning solutions should accommodate the elasticity of cloud-based data storage. Since many queries will span clouds and on-premises sources as business requirements for expanded data sources increase, the machine learning solution needs to keep up with the 'data virtualisation' requirement.

By its nature, machine learning modelling and data processing can be time-consuming. If the machine learning solution incurs extensive delays in functions such as repartitioning data, scaling up or down, or copying large data sources around, this adds unnecessary delay to model processing. From the business perspective, a lengthy machine learning process will adversely impact user adoption, subsequently introducing an obstacle to business value. The more intervention a machine learning solution needs to support its operation, the less automated and less elastic it becomes. It will also mean that the business isn't taking advantage of the key benefits of cloud: systems that are built for the cloud should dynamically adapt to the business need.

If the machine learning solution is built for the cloud, then there should be an efficient separation of compute and storage, so that storage and compute can be managed and costed independently. In many cases, the data will be stored continuously, but the actual machine learning model creation and processing will require bursts of processing and model usage. This requirement is perfect for cloud, allowing customers to dynamically adapt the technology to meet the business need.

R has no ability to monitor itself, and issues are resolved by examining logs and restarting the R processes. There is no obvious way to identify where a process started or stopped processing data, so it is often best simply to start again, which means that time can be lost. Both AzureML and Google Cloud Machine Learning provide user-focused management and monitoring via a browser-based option, with Google providing a command line option too.
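As a sketch of the programmatic route on Google's side, Cloud ML Engine jobs can also be inspected through the service's REST API via its Python client; this assumes the google-api-python-client package, default credentials, and a hypothetical project name:

```python
# A sketch of programmatic job monitoring on Google Cloud ML Engine,
# assuming google-api-python-client is installed and default credentials
# are configured. The project name is a hypothetical placeholder.
from googleapiclient import discovery

ml = discovery.build("ml", "v1")
parent = "projects/my-example-project"  # hypothetical project

# List training jobs and print each job's identifier and current state.
response = ml.projects().jobs().list(parent=parent).execute()
for job in response.get("jobs", []):
    print(job["jobId"], job.get("state"))
```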

Sophisticated Modelling

Data modelling is still a fundamental aspect of handling data. The technologies differ in terms of their levels of sophistication in producing models of data.

Technology name | Approach | Commentary | Score
R | Open Source | Liable to errors | 1
Databricks | Hybrid | – | 5
Microsoft AzureML | Hybrid | R and Python are embedded; solution is robust | 4
Google TensorFlow | Open Source | – | 5
Google Cloud Machine Learning Engine | Proprietary | – | –

Google Cloud Machine Learning Engine allows users to create models with Tensorflow, which are then onboarded to produce cloud-based models. TensorFlow can be used via Python or C++ APIs, while its core functionality is provided by a C++ backend. The Tensorflow library comes with a range of in-built operations, such as matrix multiplications, convolutions, pooling and activation functions, loss functions and optimizers. Once a graph of computations has been defined, TensorFlow executes it efficiently across different platforms.
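To illustrate, here is a minimal sketch of defining and executing a computation graph using the TensorFlow 1.x-style Python API described above:

```python
# A minimal sketch of defining and running a TensorFlow computation graph,
# assuming the TensorFlow 1.x API (graphs are built first, then executed).
import tensorflow as tf

# Define the graph: two placeholders and a matrix multiplication.
a = tf.placeholder(tf.float32, shape=(2, 3), name="a")
b = tf.placeholder(tf.float32, shape=(3, 2), name="b")
product = tf.matmul(a, b)  # one of TensorFlow's built-in operations

# Execute the graph. TensorFlow decides how to run it efficiently on the
# available platform (CPU, GPU, or a distributed cluster).
with tf.Session() as sess:
    result = sess.run(product, feed_dict={
        a: [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
        b: [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    })
    print(result)
```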

The TensorFlow modelling facility is flexible, and it is aimed at the deep learning community. TensorFlow is written in the well-known languages of Python and C++, so there is a fast ramp-up of skill sets to reach a high level of TensorFlow model sophistication. TensorFlow allows data scientists to roll their own models, but they will need a deep understanding of machine learning and optimization in order to be effective.

Azure

Azure ML Studio is a framework for developing machine learning and Data Science applications. It has an easy-to-use graphical interface that allows you to quickly develop machine learning apps, and it saves a lot of time by making it easier to do tasks like data cleaning, feature engineering and testing different ML algorithms. It allows you to add scripts in Python and R, and it also includes deep learning.
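As an illustration, the Execute Python Script module in Azure ML Studio wraps user code in an entry-point function; a minimal sketch (with a hypothetical engineered feature) looks like this:

```python
# A minimal sketch of an Azure ML Studio "Execute Python Script" module.
# Studio passes up to two pandas DataFrames in and expects a sequence of
# DataFrames back. The engineered feature below is a hypothetical example.
import numpy as np

def azureml_main(dataframe1=None, dataframe2=None):
    # Simple feature engineering on the first input dataset.
    dataframe1["log_income"] = np.log(dataframe1["income"] + 1)
    return dataframe1,
```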

Further, AzureML comes ready-prepared with some pre-baked machine learning models to help the business analyst on their way. AzureML offers in-built models, but it is not always very clear how the models arrived at their results; as the data visualization capability grows in Power BI and in AzureML Studio, this is a challenge which will need to be handled in the future.

Company Analysis

In the recent history of the industry, Sophos acquired Invincea, Radware bought Seculert, and Hewlett Packard bought Niara.

R

R is possibly the most well-known open source data science tool. It was developed in academia, and it has had widespread adoption throughout the academic and business communities. R has a range of machine learning packages, downloadable from the CRAN repository; these include mice for imputing values, rpart and caret for classification and regression models, and party for partitioning models. It is completely open source, and it forms part of the other offerings discussed in this company analysis.

Google

Google is appealing to enterprises because it addresses many different enterprise technology needs, from infrastructure consolidation and data storage right through to business user-focused solutions in Google office technologies. As an added bonus, enterprise customers can leverage Google's own machine learning technology, which underpins functionality such as Google Photos.

TensorFlow is open source, and it can be used on its own. TensorFlow also appears in Google Cloud Machine Learning, a paid solution which chains together cloud APIs with machine learning and unstructured data sources such as video, images and speech.

Google's TensorFlow has packages aimed at verticals, too, such as Finance and cybercrime, and it is also used for social good projects. For example, Google made the headlines recently when a self-taught teenager used Google TensorFlow to help diagnose breast cancer.

Comparison

Now that Databricks is in Azure, it is well worth a look for streamlined data science and machine learning at scale. It will appeal more to coders, but this brings the benefit of customization. AzureML is a great place to get started, and many Business Intelligence professionals are moving into Data Science via this route.

Open source tools such as R and Google's TensorFlow are not enterprise tools. They are missing key features, such as the ability to connect with a wide range of APIs, and they do not have key enterprise features such as security, management, monitoring and optimization. It is projected that TensorFlow will gain more of these enterprise features in the future. Also, organisations do not always want open source used in their production systems.

Despite an emphasis on being proprietary, both IBM Watson and Microsoft offer free tiers of their solutions, which are limited by size. In this way, both organizations compete with the open source, free offerings, with the bonus of robust cloud infrastructure to support the efforts of data scientists. Databricks offers a free trial, which converts into a paid offering. The Databricks technology is based on Amazon Web Services, and it distinguishes between data engineering and data analytics workloads. The distinction may not always be clear to business people, however, as it is couched in terms of UIs, APIs and notebooks, terms which will be more meaningful to technical people.

Future Considerations

In the future, there will need to be more reference architectures for common scenarios. In addition, best practice design patterns will become more commonplace as the industry grows in experience.

How do containers impact machine learning? Kubernetes is an open source container cluster orchestration system. Containers allow for the automation of deployment, scaling and operations, and it's possible to envisage machine learning manifested in container clusters. Containers are built for a multi-cloud world: public, private and hybrid. Machine Learning is going through a data renaissance, and there may be more to come. From this perspective, the winners will be the organizations who are most open to changes in data, platforms and other systems, but this does not necessarily mean open source. Enterprises will be concerned about issues which are core to other software operations, such as robustness, mitigation of risk and other enterprise dependencies.

Conclusion

Machine learning is high on the agenda of cloud providers. From startups to global companies, technology decision makers are watching the achievements of Google and Amazon's Alexa with a view to implementing Machine Learning in their own organizations. In fact, as you read this article, it is highly likely that you have interacted with Machine Learning in some way today. Organizations such as Google, Netflix, Amazon, and Microsoft have Machine Learning as part of their services, and it has become the 'secret sauce' in business- and consumer-facing spheres, such as online retail, recommendation systems, fraud detection and even digital personal assistants such as Cortana, Siri and Amazon's Echo.

Since the start of this century, machine learning has grown from belonging to academia and the large organizations who could afford it, into a wide range of options from well-known and trusted vendors, aimed at small and large organizations alike. The vendors are responding to the driving forces behind the increasing demand for machine learning: organizations are inspired by the promise that Machine Learning offers, the accessibility offered by cloud computing, open source machine learning and big data technologies, as well as the perceived low cost and easily-accessible skills.

In this paper, we have aimed to provide the reader with the tools necessary to select wisely between the range of open source, hybrid and proprietary machine learning technologies to meet the technical needs for providing business benefit. The report has offered some insights into how the software vendors stack up against one another.

Dynamic Data Masking in Azure SQL Data Warehouse

I'm leading a project which is using Azure SQL Data Warehouse, and I'm pretty excited to be involved. I love watching the data take shape and, for the customer requirements, Azure SQL Data Warehouse is perfect.

Note that my customer details are confidential, and that's why I never give away details such as the customer name and so on. I gain – and retain – my customers based on trust, and, by giving me their data, they are entrusting me with detailed information about their business.

One question they raised was with respect to dynamic data masking (DDM), which is present in Azure SQL Database. How does it manifest itself in Azure SQL Data Warehouse? What are the options regarding the management of personally identifiable information?

As we move ever closer to the implementation of GDPR, more and more people will be asking these questions. With that in mind, I did some research and found there are a number of options, which are listed here. Thank you to the Microsoft people who helped me to come up with some options.

1. Create an Azure SQL Database spoke as part of a hub and spoke architecture.

The Azure SQL Database spoke can create external tables over Azure SQL Data Warehouse tables to move data into the spoke. One note of warning: it isn't possible to use DDM over an external table, so the data would have to be moved into Azure SQL Database itself.
2. Embed masking logic in views and restrict access.

This is achievable, but it is a manual process; a sketch of this approach follows the list below.
3. Mask the data through the ETL processes creating a second, masked, column.

This depends on the need to query the data. Here, you may need to limit access through stored procs.
On balance, the simplest method overall is to use views to restrict access to certain columns. That said, I am holding a workshop with the customer in the near future in order to see their preferred options. In the meantime, I thought that this might help someone else, and I hope that you find something here that will help you to manage your particular scenario.
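To make option 2 concrete, here is a minimal sketch of a masking view created from Python via pyodbc; the table, columns, DSN and role names are hypothetical placeholders, not the customer's schema:

```python
# A minimal sketch of option 2: masking logic embedded in a view, with access
# restricted to the view rather than the base table. The table, columns,
# DSN and role names are hypothetical placeholders.
import pyodbc

MASKING_VIEW_SQL = """
CREATE VIEW dbo.vCustomerMasked AS
SELECT
    CustomerId,
    -- keep the first character, mask the rest of the name
    LEFT(CustomerName, 1) + '****' AS CustomerName,
    -- expose only the last four digits of the phone number
    'XXX-XXX-' + RIGHT(PhoneNumber, 4) AS PhoneNumber
FROM dbo.Customer;
"""

conn = pyodbc.connect("DSN=MyAzureSqlDw")  # hypothetical DSN
cursor = conn.cursor()
cursor.execute(MASKING_VIEW_SQL)

# Restrict reporting users to the masked view only.
cursor.execute("GRANT SELECT ON dbo.vCustomerMasked TO ReportingUsers;")
cursor.execute("DENY SELECT ON dbo.Customer TO ReportingUsers;")
conn.commit()
```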

How do you know if your org is ready for Data Science? Starting your journey with Azure Databricks

Mentioning data science at your company may give you an air of expertise, but actually implementing enterprise-wide transformation with data science, artificial intelligence or deep learning is a business-wide transformation activity. It impacts your data and analytics infrastructure, engineering and business interactions, and even your organizational culture. In this post, we will look at a few high-level things to watch out for before you get started, along with a suggestion that you can try Azure Databricks as a great starting point for your cloud and data science journey.

Note: not all companies are ready for data science; many are still struggling with Excel. If that sounds familiar, this article is meant for you.

So how can you move forward?

1. Have a data amnesty

If you’re still struggling with Excel, then data science and AI can seem pretty far off. Have a data amnesty – ask everyone to identify their key data sources so you can back them up, protect them, and share them better where appropriate.

2. Determine the Data (Im)maturity in your organization.

Take a peep at the following table: where is your organization located?

[Table: Democratization of Data]

Note: this idea was inspired by Bernard Liautaud

Ideally, you're headed towards a data democracy, where IT are happy in their guardianship role and the users have their data. If this equilibrium isn't in place, then it could potentially derail your budding data science project. Working on these issues can help your success be sustainable over the longer term.

3. All that glitters isn’t data gold

This is the ‘shiny gadget’ syndrome. Don’t get distracted by the shiny stuff. You need your data vegetables before you can have your data candy.

Focus on the business problem you’re trying to solve, not the technology. You will need to think about the success criteria.

You should be using the technology to improve a business process, with clear goals and measurable success. Otherwise the effort can be disorganized, with the technology providing only a veil of organization.

4. Fail to plan, plan to fail

If you fail… that’s ok. You learned ten things you didn’t know before. Next time, plan better, scope better, do better.

How to get started?

The cloud is a great way to get started. It means that you're not purchasing a lot of technology and hardware that you don't need. Abraham Maslow was once quoted as saying, "If you only have a hammer, you tend to see every problem as a nail." Those words are truer than ever, as an increasingly complex and interconnected world makes selecting the right tools for the data estate harder. With that in mind, the remainder of this blog talks about Azure Databricks as a data science step for the new organization, in order to reduce risk, initial outlay and costs.

 

 

What is Microsoft Azure Databricks?

Azure Databricks was designed in collaboration between Microsoft and the creators of Apache Spark. It is designed for easy data science: one-click set up, streamlined workflows and an interactive workspace that enables collaboration between data scientists, data engineers, and business analysts. Each of these roles has a different style of interacting with, presenting, and sharing information, bearing in mind the varying skill sets of business users and IT.

So what is Apache Spark? According to Databricks, Apache Spark is the largest open source project in data processing. From the enterprise perspective, Apache Spark has seen rapid adoption by enterprises across a wide range of industries.

So what does Apache Spark give you? Apache Spark is a fast, in-memory data processing engine. For the serious data science organisation, it allows developers to use expressive APIs to work with data. Information and data workers gain the ability to execute streaming analytics, longer-term machine learning or SQL workloads – fast. Implemented in Azure, it means that business users can use Power BI to understand their data better.

Apache Spark consists of Spark Core and a set of libraries. The core is the distributed execution engine, and the Java, Scala, and Python APIs offer a platform for distributed ETL application development. Spark lets you quickly write applications in Java, Scala, or Python; it comes with a built-in set of over 80 high-level operators, and you can use it interactively to query data within the shell.
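As a hedged illustration of that high-level API, a minimal PySpark session might look like this (the file name is a hypothetical placeholder):

```python
# A minimal PySpark sketch, assuming a local Spark installation. It shows the
# interactive, high-level API: load a text file, filter it, and count lines.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("QuickStart").getOrCreate()

# Read a (hypothetical) log file into a DataFrame of lines.
lines = spark.read.text("logs.txt")

# Two of the many high-level operators: filter and count.
errors = lines.filter(lines.value.contains("ERROR"))
print("Error lines:", errors.count())

spark.stop()
```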

The Apache Spark functionality is incorporated in Azure Databricks.

In addition to Map and Reduce operations, Spark supports SQL queries, streaming data, machine learning and graph data processing. Developers can use these capabilities stand-alone or combine them to run in a single data pipeline.

It supports in-memory processing to boost the performance of big-data analytic applications, and it works with other Azure data stores such as Azure SQL Data Warehouse, Azure Cosmos DB, Azure Data Lake Store, Azure Blob storage, and Azure Event Hub.

What is so special about Apache Spark, anyway?

For the enterprise and data architects, it offers the opportunity to have everything in one place: streaming, ML libraries, sophisticated analytics and data visualization. It means that you can consolidate under one technological umbrella, while keeping your data in other data sources such as Azure SQL Data Warehouse, Azure Cosmos DB, Azure Data Lake Store, Azure Blob storage, and Azure Event Hub.

As an architect, I aim to reduce points of failure and points of complexity, so it is the neatness of the final streamlined technology solution that is appealing.

It is also fast, and people want their data fast. Spark enables applications in Hadoop clusters to run up to 100x faster in memory, and up to 10x faster even when running on disk. Spark makes this possible by reducing the number of reads and writes to disk: it stores intermediate processing data in memory, using the concept of the Resilient Distributed Dataset (RDD), which allows it to transparently keep data in memory and persist it to disk only when needed. This removes most of the disk reads and writes, which are the main time-consuming factors in data processing.
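A minimal sketch of that caching behaviour, assuming a local PySpark installation and a hypothetical input file:

```python
# A minimal sketch of RDD caching, assuming a local Spark installation.
from pyspark import SparkContext, StorageLevel

sc = SparkContext("local", "RDDCaching")

# A (hypothetical) large input file, parsed into numbers.
numbers = sc.textFile("numbers.txt").map(lambda line: int(line))

# Keep the parsed RDD in memory, spilling to disk only if it does not fit,
# so the two actions below do not re-read and re-parse the file.
numbers.persist(StorageLevel.MEMORY_AND_DISK)

print("count:", numbers.count())  # first action materializes and caches the RDD
print("sum:", numbers.sum())      # second action reuses the cached partitions

sc.stop()
```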

 

Data Visualization for the Business User with Azure Databricks as a basis

Azure Databricks brings notebooks that multiple users can edit in real time for data engineering and data science. It also enables dashboards with Power BI for accurate, efficient and accessible data visualization across the business.

Azure Databricks is backed by Azure Database and other technologies that enable highly concurrent access, fast performance and geo-replication, along with Azure security mechanisms.

Summary

Implementing enterprise-wide transformation with data science, artificial intelligence or deep learning is a business-wide transformation activity. In this post, I have suggested that you can try Azure Databricks as a great starting point for your cloud and data science journey, with some advice on getting a good grounding before you start.

Book Review: Human Resource Management for MBA and Business Masters

Human Resource Management for MBA and Business Masters
by Iain Henderson

My rating: 5 of 5 stars

The book Human Resource Management for MBA and Business Masters students offered an excellent introduction to the study of HRM, and I thoroughly recommend it as a way of cutting through the MBA material on HRM.
If you find the Marchington text, Human Resource Management at Work, a bit dense, then I'd recommend reading the Henderson book first. MBA students will be familiar with the Marchington book, which is a 'big brother' to the Henderson book I'm reviewing here.
Henderson's book distilled the main points of the debates, and it helped to cut through some of the noise before I attempted to go back and read the Marchington book. In other words, it was good to have the 'skinny' first before going back to the Marchington text.
I am a visual learner and I liked the fact that Henderson’s book had diagrams. There were also some case studies, which are useful for my particular learning style because I could remember the examples.
This book is published by the CIPD and I’m glad to say that they did a good job of making this topic accessible. Thank you to Iain Henderson and the CIPD team.

I’m hoping that this academic knowledge will come in useful for my customers, and also for myself when I look at hiring people again. So there is a practical application to acquiring this knowledge, and I am looking forward to using it in the future.

shield-1020318_640


Book Review: Grokking Algorithms: An Illustrated Guide For Programmers and Other Curious People

Grokking Algorithms: An Illustrated Guide for Programmers and Other Curious People by Aditya Y. Bhargava

My rating: 5 of 5 stars

I've just finished reading the Manning book Grokking Algorithms: An Illustrated Guide for Programmers and Other Curious People.

This is a very readable book, with great diagrams and a very visual style. I recommend this book for anyone who wants to understand more about algorithms.
This is an excellent book for the budding data scientist who wants to get past the bittiness of learning pieces of open source or proprietary software here and there, and wants to learn what the algorithms actually mean in practice. It's fairly easy to get away with looking like a real Data Scientist if you know bits of R or Python, I think, but when someone scratches the surface of that veneer, it can become very apparent that the whole theory and deeper understanding is missing. This book will help people to bridge that gap.
I'm expecting to find that people might 'pinch' the diagrams, but I'd strongly suggest that they contact the author and credit appropriately.
I'd recommend this book, for sure. Enjoy!

Open Source Decency Charter Proposal for Dealing with Harassment at Technical Events

If you're reading this, you are probably a decent person, and you shouldn't read this thinking that you will be putting yourself in danger if you attend a tech event: I can tell you that I normally feel pretty safe at these events. You can read my story here; I've talked about it publicly since I want to do something good with it. Note that I don't represent any other organization or body or person with this blog. It's another heartdump.

Most people are pretty decent but what do you do about the ones that are not? How do victims know what to do? How do you know how to help one of your friends?

The vast majority of people are decent and want to help, and that's why I'd like to propose the creation of an open source Decency Charter to support technical community events in handling harassment.

A Decency Charter would outline reasonable and decent expectations for participants within a technical community event, both online and in-person, as well as steps for reporting unacceptable behavior and concerns. It's fairly simple at heart: be decent to one another.

I think that it would be good to have something very clear in place that people can use as a template, so everyone can have a voice and feel safe. That's why I think an open source Decency Charter is a good suggestion, and I'd be interested in your thoughts.

This blog post is an attempt to bring a few strands together; namely diversity, harassment in the technical community, and a proposal for a way forward.

It's a shame that we have to encode decency into technical events. More and more workplaces are being embroiled in sexual harassment cases; according to the Trades Union Congress (TUC) in 2017, over 50% of workplaces have had an issue with sexual harassment. I think it would be good if people could adopt a Decency Charter, since it sounds more positive than a Code of Conduct. The inspiration came from Reid Hoffman, who talked about a Decency Pledge in his article The Human Rights of Women Entrepreneurs, where he discusses sexual harassment of women in the industry. I'm grateful to Reid Hoffman for his article, because it does help to have male voices in these discussions. Simply put, his voice will carry further than mine, and with far more credibility.

Followers of my blog will know that I'm trying to get support for a Diversity Charter to support diversity at events. As an add-on, I'd like to propose a Decency Charter as well, which gives people a template that they can use and amend to monitor their event as they see fit. I'd love your ideas, so please do email me at jen.stirrup@datarelish.com with your thoughts, or leave a comment on this blog.

I am going to list a few things here from the viewpoint of someone whose head is bloodied, but unbowed, and I want to use my voice. Everyone's experience is different, but I thought that this might help in shaping a Decency Charter that sits alongside a Diversity Charter. So, what do I actually want?

As a starter for ten:

I want to feel safe and comfortable – Make it easy. I shouldn't have to think too hard about what to do if something happens to me or one of my friends – I need something so easy that I don't have to look far to know what to do. I need to know what to do when something happens. I want to have a 'home' to go to if something happens – that can be a location, or a person to call. I want to talk to someone. I want a number to call that is very visible on my event pass or pack so I can find it easily. I don't want to google around for a form to fill in, because that introduces a delay before it reaches an organizer, and I am worried about putting my concerns about an individual or an event down in writing in case it gets into the wrong hands. A form won't secure my safety after the event, and that worries me, too. If I make a complaint, I can't be sure that it would be successfully resolved and all relevant data removed, or handled confidentially. Google forms are so easily digested and forwarded by email and, like feathers, the contents could spread. I just want to talk to someone, in my own time. So, before, during and after the event, I'd ideally like each event to have a named panel of people who will listen to my concerns and act upon them in a clearly documented way.

I want others to feel safe and comfortable – I expect people to be able to answer accusations made about them. I don’t want people to think that the Microsoft Data Platform community, for example, is some den where there is a lot of harassment. There isn’t, but I’d like to see a Decency Charter in place in case there is.

I want to have a voice – I don’t want my voice taken away from me. I don’t want other people to speak for me. It’s easy for people to propose things without asking victims what they want, it’s very easy to dictate an approach from a point of privilege.

I want other people to have a voice – because everyone should be allowed to speak for themselves.

I expect confidentiality. I don't expect people to repeat private details or rumours. At best, gossip immediately breeds distrust, and you will never earn that trust back; at worst, you can deeply impact someone's life by handling issues insensitively, and this cuts both ways. An accusation can't be a condemnation, and there also has to be a balance with protecting people at the same time. Gossip doesn't make me trust your processes for resolving things, and those processes have to be well thought out from all angles. People can see how people behave with one another, and it creates a halo effect.

I expect you not to judge.

I expect to be able to get help right now, and to have event organizers and volunteers who can support me if I need it. This simply means making sure that event volunteers are trained to know who to alert when something happens, responding thoughtfully and without judging, and, ultimately, staying centred on sensitivity.

I expect to be able to get help after the event, and to have event organizers and volunteers who can support me if I need it. I think that having an easily-available contact in place well after the event would be a good step. Event organizers usually have to clear things up well after an event, so this isn't an onerous issue at all.

So how could this shape up?

I'd like to propose that, along with the Diversity Charter, we roll out an accompanying Decency Charter, similar to the OpenCon Community Values or the PASS Anti-Harassment Policy. The PASS one is a good model, but it only affects PASS events, and I'd like this to be an 'open source' way forward for community events. If we offered a 'package' of a Diversity Charter plus an accompanying Decency Pledge, then the community would have a template of 'add-ons' that they can choose to flex and use for their own events. They are absolutely welcome to change and adapt it as they see fit. I think it would be great to get a version 1.1 out there for the community to review, and we can see what changes come back.

What problem does this solve?

People don’t know where to start so we can give them a hand up.

As part of the speaker selection process, speakers can submit their past speaking experience. Organisers can choose to follow up with those past events to see if there were any issues with the speakers; in any case, they should be doing their due diligence on speaker selection anyway, so it should not cost much effort just to ask if there were any issues that they should know about. Attendees are harder to police, and they can provide anonymous details at the point of registration. However, a robust Decency Pledge would send a signal before people turn up to the event, and they should agree to adhere to it as part of the event registration process.

It's much easier to talk facts through with someone, which is why I think organizers should offer contact details in case anyone wants to get in touch with them after the event.

Here are some resources to follow up:

PASS Summit Anti Harassment Policy

Enforcing a Code of Conduct

Responding to Reports of Harassment at Tech Events

I also want to add these resources in case this blog triggers anyone:

Male Rape and Sexual Abuse – because men can be victims, too.

Supporting a Survivor. 

I wanted to put this poem here, which is Invictus by William Ernest Henley:

Out of the night that covers me,
Black as the pit from pole to pole,
I thank whatever gods may be
For my unconquerable soul.

In the fell clutch of circumstance
I have not winced nor cried aloud.
Under the bludgeonings of chance
My head is bloody, but unbowed.

Beyond this place of wrath and tears
Looms but the Horror of the shade,
And yet the menace of the years
Finds, and shall find me, unafraid.

It matters not how strait the gate,
How charged with punishments the scroll,
I am the master of my fate:
I am the captain of my soul.

You’ve got this.

I’d love to know what you think. Please contact me at jen.stirrup@datarelish.com and I’ll be pleased to know your thoughts.

Roundup of 2017 Presentations, and what’s next for 2018?

I've listed out some of my key speaking engagements for 2017. I am sure that I've done more events, but this is a good start – often I am so busy that I drop things very quickly after I've ticked the box and done them. I've noticed that I'm speaking to larger audiences over the years, but I'm doing fewer speaking events overall, because I simply can't do all of the events that I'd like to do, so I've had to focus better.

I've also diversified the locations of my presentations. I was delighted to go to Dubai to present, and I will be doing more in the Gulf this year. I'm doing a session just before Christmas, and I'll release more details about that event shortly. I'm also lining up more events in the Gulf next year, because I had such a great experience speaking in Dubai recently. I spoke at a private event in Singapore earlier this year and I've been invited back for 2018; hopefully I can also do one of the data or SQL Server meetups. I've also been invited back to Jersey for a more in-depth session, and I'll be glad to do that, too.

For 2018, I’m hoping to do more large events and to do more online sessions as well. My recent Python webinar was very well received and I like the longevity of having sessions up on YouTube.

2017