Mentioning data science at your company may give you an air of expertise, but actually implementing data science, artificial intelligence or deep learning at enterprise scale is a business-wide transformation. It impacts your data and analytics infrastructure, your engineering and business interactions, and even your organizational culture. In this post, we will look at a few high-level things to watch out for before you get started, along with a suggestion that Azure Databricks can be a great starting point for your cloud and data science journey.
Note: not all companies are ready for data science; many are still struggling with Excel. If that sounds familiar, this article is for you.
So how can you move forward?
1. Have a data amnesty
If you’re still struggling with Excel, then data science and AI can seem pretty far off. Have a data amnesty – ask everyone to identify their key data sources so you can back them up, protect them, and share them better where appropriate.
2. Determine the data (im)maturity in your organization
Take a peep at the following table: where is your organization located?
Note: this idea was inspired by Bernard Liautaud.
Ideally, you’re headed towards a data democracy, where IT is happy in its guardianship role and users have access to the data they need. If this equilibrium isn’t in place, it could derail your budding data science project. Working on these issues helps make your success sustainable in the longer term.
3. All that glitters isn’t data gold
This is the ‘shiny gadget’ syndrome. Don’t get distracted by the shiny stuff. You need your data vegetables before you can have your data candy.
Focus on the business problem you’re trying to solve, not the technology, and think about your success criteria up front.
Use the technology to improve a business process, with clear goals and measurable success. Otherwise you risk a disorganized effort hidden behind a veneer of technological organization.
4. Fail to plan, plan to fail
If you fail… that’s ok. You learned ten things you didn’t know before. Next time, plan better, scope better, do better.
How to get started?
Starting in the cloud is a great way to begin: it means you’re not purchasing a lot of technology and hardware that you don’t need. Abraham Maslow was once quoted as saying, “If you only have a hammer, you tend to see every problem as a nail.” Those words are truer than ever, as an increasingly complex and interconnected world makes selecting the right tools for your data estate harder. With that in mind, the remainder of this blog looks at Azure Databricks as a first step into data science for organizations that want to reduce risk, initial outlay and cost.
What is Microsoft Azure Databricks?
Azure Databricks was designed in collaboration between Microsoft and Databricks, the creators of Apache Spark. It is designed for easy data science: one-click setup, streamlined workflows and an interactive workspace that enables collaboration between data scientists, data engineers, and business analysts. Each of these roles has a different style of interacting with, presenting, and sharing information, bearing in mind the varying skillsets of business users and IT.
So what is Apache Spark? According to Databricks, Apache Spark is the largest open source project in data processing, and it has seen rapid adoption by enterprises across a wide range of industries.
So what does Apache Spark give you? Apache Spark is a fast, in-memory data processing engine. For the serious data science organisation, it gives developers expressive APIs for working with data. For information and data workers, it means the ability to execute streaming analytics, long-running machine learning or SQL workloads – fast. Implemented in Azure, it means business users can use Power BI to understand their data better.
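To make that concrete, here is a minimal streaming sketch, assuming a Databricks (or any PySpark) environment where a SparkSession is already available as `spark`; it uses Spark’s built-in `rate` test source rather than a real event feed, so treat it as an illustration rather than a production pattern.

```python
from pyspark.sql import functions as F

# The built-in "rate" source emits (timestamp, value) rows -- handy for trying out streaming.
events = (spark.readStream
               .format("rate")
               .option("rowsPerSecond", 10)
               .load())

# A running count of events per 5-second window.
counts = events.groupBy(F.window("timestamp", "5 seconds")).count()

# Write the continuously updated aggregation to the console (driver log) for inspection.
query = (counts.writeStream
               .outputMode("complete")
               .format("console")
               .start())

# query.stop()  # stop the stream when you have finished experimenting
```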
Apache Spark consists of Spark Core and a set of libraries. The core is the distributed execution engine, and the Java, Scala, and Python APIs offer a platform for distributed ETL application development. Spark lets you quickly write applications in Java, Scala, or Python, comes with a built-in set of over 80 high-level operators, and can be used interactively to query data from the shell.
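As a small illustration of those high-level operators, here is a hedged sketch in PySpark, again assuming the `spark` session a Databricks notebook provides; the file path and column names are made up for the example.

```python
from pyspark.sql import functions as F

# Hypothetical CSV of sales data on mounted storage.
sales = (spark.read
              .option("header", "true")
              .option("inferSchema", "true")
              .csv("/mnt/demo/sales.csv"))

# Chain high-level operators instead of hand-writing low-level MapReduce code.
top_regions = (sales.filter(F.col("amount") > 0)
                    .groupBy("region")
                    .agg(F.sum("amount").alias("total_sales"))
                    .orderBy(F.desc("total_sales")))

top_regions.show(10)  # query the result interactively from the notebook or shell
```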
The Apache Spark functionality is incorporated in Azure Databricks.
In addition to Map and Reduce operations, it supports SQL queries, streaming data, machine learning and graph data processing. Developers can use these capabilities stand-alone or combine them in a single data pipeline.
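For instance, here is a hedged sketch of combining two of those capabilities – SQL and machine learning – in one pipeline; the tiny in-memory dataset and column names are invented purely for illustration.

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

# Invented sample data: customer age, number of visits, and whether they churned.
data = spark.createDataFrame(
    [(34, 1, 0.0), (52, 3, 1.0), (23, 0, 0.0), (61, 5, 1.0)],
    ["age", "visits", "churned"],
)
data.createOrReplaceTempView("customers")

# Step 1: shape the data with a SQL query.
features = spark.sql("SELECT age, visits, churned AS label FROM customers")

# Step 2: feed the SQL result straight into an MLlib model in the same pipeline.
assembled = VectorAssembler(inputCols=["age", "visits"],
                            outputCol="features").transform(features)
model = LogisticRegression(maxIter=10).fit(assembled)
model.transform(assembled).select("label", "prediction").show()
```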
It supports in-memory processing to boost the performance of big-data analytic applications, and it works with other Azure data stores such as Azure SQL Data Warehouse, Azure Cosmos DB, Azure Data Lake Store, Azure Blob storage, and Azure Event Hub.
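As one hedged example of that integration, the sketch below reads a Parquet dataset from Azure Blob storage using the standard `wasbs://` connector; the storage account, container, secret scope and path are placeholders, and it assumes the `dbutils` utilities available in a Databricks notebook.

```python
# Placeholders -- substitute your own storage account and container.
storage_account = "<your-storage-account>"
container = "<your-container>"

# Pull the account key from a Databricks secret scope rather than hard-coding it.
spark.conf.set(
    f"fs.azure.account.key.{storage_account}.blob.core.windows.net",
    dbutils.secrets.get(scope="demo-scope", key="storage-account-key"),
)

# Read a (hypothetical) Parquet dataset straight from Blob storage.
df = spark.read.parquet(
    f"wasbs://{container}@{storage_account}.blob.core.windows.net/events/"
)
df.printSchema()
```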
What is so special about Apache Spark, anyway?
For the enterprise and data architects, it offers the opportunity to have everything in one place: streaming, ML libraries, sophisticated analytics, data visualization. It means you can streamline everything under one technological umbrella while keeping your data in other sources such as Azure SQL Data Warehouse, Azure Cosmos DB, Azure Data Lake Store, Azure Blob storage, and Azure Event Hub.
As an architect, I aim to reduce points of failure and points of complexity, so it is the neatness of the final streamlined technology solution that is appealing.
It is also fast, and people want their data fast. Spark enables applications in Hadoop clusters to run up to 100x faster in memory, and up to 10x faster even when running on disk. Spark achieves this by reducing the number of reads and writes to disk: it stores intermediate processing data in memory. It uses the concept of the Resilient Distributed Dataset (RDD), which allows it to transparently keep data in memory and persist it to disk only when needed. This removes most of the disk reads and writes, which are the main time-consuming factors in data processing.
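A minimal sketch of that in-memory caching, shown here with the DataFrame API (the same idea applies to raw RDDs via `rdd.persist()`); the path is hypothetical.

```python
from pyspark import StorageLevel

# Hypothetical JSON clickstream data.
logs = spark.read.json("/mnt/demo/clickstream/")

# Keep the parsed data in memory, spilling to disk only if it does not fit,
# so repeated queries skip the expensive re-read and re-parse.
logs.persist(StorageLevel.MEMORY_AND_DISK)

logs.count()                             # the first action materialises the cache
logs.filter("country = 'UK'").count()    # later actions reuse the in-memory data

logs.unpersist()                         # release the memory when finished
```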
Data Visualization for the Business User with Azure Databricks as a basis
Azure Databricks brings collaborative notebooks for data engineering and data science that multiple users can edit in real time. It also enables dashboards with Power BI for accurate, efficient and accessible data visualization across the business.
Azure Databricks is backed by Azure database and storage technologies that enable highly concurrent access, fast performance and geo-replication, along with Azure security mechanisms.
Summary
Implementing data science, artificial intelligence or deep learning at enterprise scale is a business-wide transformation. This post has suggested Azure Databricks as a great starting point for your cloud and data science journey, along with some advice on getting good grounding before you start.