Recently, I gave a Make Your Data Work Monday webinar on the complexities of the data sources for data science in Azure, and I thought it important enough to turn into an actual post. If you prefer video, you can catch the video here.
How can you differentiate between the various opportunities to store your data in Azure?
Now, it may not sound that interesting, but it’s crucial: if you don’t have your data, then you don’t have your business. And many technologies, such as Artificial Intelligence and Data Science, absolutely must have the data in place and in the right format. So how do you choose which data source is right for your needs?
Organizations need to rethink how they store their data. Azure is more than just some giant USB stick in the sky that’s going to store all of your data; it offers a lot of services on top of that storage, such as Big Data analytics. To get the best out of technologies such as Artificial Intelligence and Data Science, you really must have your data in the right format, and in a good place.
In this topic, we’re going to focus specifically on Big Data technologies, such as HDInsight, Apache Spark and Databricks. HDInsight allows us to do Big Data processing: we can spin a cluster up, and then tear it back down again when we don’t need it. HDInsight enables us to do a range of processing on Hadoop technology, and now we have access to that technology in Azure. HDInsight gives us opportunities to deal with both batch processing and real-time processing. For example, we can use HBase, a NoSQL distributed database which offers random access to large amounts of data.
We can also use Apache Storm, which is aimed at streaming data workloads, to do real-time analysis of moving data streams, and Apache Spark for high-performance, complex workloads.
Azure HDInsight supports open-source frameworks including Apache Hadoop, Apache Spark, and Apache Kafka. Now, if you have a LinkedIn account, you probably use Apache Kafka without realizing it! LinkedIn uses Kafka as a way of processing vast amounts of data in real time, and we can imagine how many people are using LinkedIn at any one time, right? At the end of the day, it’s very cost-effective for enterprises to move forward with open source analytics. Cloud computing gives you the opportunities of open source technologies combined with the robustness of a proprietary platform such as Microsoft Azure to process large amounts of data. So you can store your data in Azure as if it were some sort of USB stick, but you can also use it to process and analyze large amounts of data as well.
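To make the real-time idea concrete, here is a minimal, framework-free sketch in plain Python of the kind of rolling aggregation a Kafka consumer might perform. This is not the Kafka API; the event stream and window size are illustrative assumptions, just to show how streaming jobs emit updated results as data flows through them.

```python
from collections import Counter, deque

def rolling_page_view_counts(events, window_size=3):
    """Count page views per user over a sliding window of events.

    `events` is any iterable of (user, page) tuples, standing in for
    messages arriving on a Kafka topic. Yields a snapshot of counts
    after each event, the way a streaming job emits updated results
    as data moves through it.
    """
    window = deque(maxlen=window_size)   # oldest events fall out automatically
    for event in events:
        window.append(event)
        counts = Counter(user for user, _page in window)
        yield dict(counts)

# Illustrative stand-in for a stream of LinkedIn-style activity events.
stream = [("alice", "home"), ("bob", "jobs"), ("alice", "feed"), ("carol", "home")]
snapshots = list(rolling_page_view_counts(stream, window_size=3))
print(snapshots[-1])  # counts over the last three events
```

In a real deployment, Kafka (or Storm, or Spark Structured Streaming) would handle the partitioning, fault tolerance and scale; the windowed-aggregation pattern itself stays the same.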
Using open-source analytics means you can spin up significant clusters as and when you need them. Organizations can scale up or scale down in terms of size, and you pay only for what you use. It’s a great way to get started with open source technology, with the secure knowledge that you’re using a well-established framework which supports analytical processing.
One common question concerns the security of data when we use cloud technologies; Microsoft Azure meets a wide range of industry and government compliance standards. Azure allows you to protect your enterprise data assets using Azure Active Directory and by setting up your virtual network. Other technologies, such as Azure Data Factory, can help you move and process large amounts of data in the cloud. You can also use Azure Data Lake Storage, which is optimized for high-performance analytics. So, Microsoft Azure HDInsight allows you to stay up to date with the latest open-source frameworks, including Apache Hadoop, Apache Spark, Apache Kafka, HBase and Apache Hive. It has native integration with other data sources, such as SQL Data Warehouse, Azure Cosmos DB, and even Azure Blob Storage as well.
So with Azure HDInsight in mind, what opportunities are there to use technologies in Microsoft Azure to steward our data? We can use Azure for many open source technologies. And we can also use Apache Spark, which underpins Databricks, but it’s important to understand what Apache Spark is before we start. So, let’s take a look!
Apache Spark is an open source software solution that allows you to work on data that’s held in memory. Apache Spark is what’s known as resilient, which means that models can be created and recreated on the fly from a known state. The data is also distributed, meaning the dataset is partitioned across multiple nodes, which we can use for increased scalability and parallelism. The basis for Apache Spark is what’s known as the RDD, or Resilient Distributed Dataset: creating models on the fly and partitioning the data across multiple nodes is what makes it so popular. It’s also popular because we can use it with Python, R and Scala, and with Jupyter Notebooks as well. Apache Spark also has native integration with Hive, HDFS, and any other Hadoop file system. Scala needs about three times less code than Java, so it can be easier to read; it can also be faster to write jobs, and it promotes code reuse as well.
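The partitioning idea behind RDDs can be sketched in plain Python, without Spark itself. In this toy version (the numbers and the four-way split are illustrative assumptions), a thread pool simulates cluster nodes working on their own partitions in parallel, and any partition could be recomputed from the original data if a worker failed, which is the resilience idea behind RDD lineage:

```python
from concurrent.futures import ThreadPoolExecutor

def partition(data, num_partitions):
    """Split a dataset into roughly equal partitions, as an RDD does across nodes."""
    return [data[i::num_partitions] for i in range(num_partitions)]

def mapped_sum_of_squares(parts):
    """Apply a transformation to each partition in parallel, then combine.

    If a 'node' (thread) failed, its partition could simply be recomputed
    from the original data: the resilience idea behind RDD lineage.
    """
    with ThreadPoolExecutor() as pool:
        partial_sums = list(pool.map(lambda p: sum(x * x for x in p), parts))
    return sum(partial_sums)

numbers = list(range(10))           # the 'known state' the lineage starts from
parts = partition(numbers, 4)       # spread across 4 simulated nodes
total = mapped_sum_of_squares(parts)
print(total)  # 285 = 0^2 + 1^2 + ... + 9^2
```

Real Spark adds lazy evaluation, shuffles, and fault tolerance on top of this map-over-partitions-then-combine pattern, but the shape of the computation is the same.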
To summarize, Apache Spark offers a unified platform for your data, for storing and for processing using multiple workloads. That includes very hot data sources, such as real-time processing. Apache Spark also allows you to do Machine Learning, streaming analytics, interactive querying, and data visualization as well. Apache Spark offers you in-memory processing for really Big Data, creating those models on the fly and using the data as the basis for your processing where you have Big Data sources. Apache Spark also has APIs for processing large datasets, and it’s up to 100 times faster than Hadoop. So, with that in mind, when we have Apache Spark, why would we need Databricks? Let’s explore Databricks for a moment.
Well, Databricks is an end-to-end managed Apache Spark platform which exists in Azure, so we have the opportunities of using open source technologies with the robustness of Microsoft Azure as a platform for doing Big Data analytics in the cloud. Databricks is also optimized very well for Apache Spark, and it can improve the performance of Apache Spark jobs by somewhere between 10 and 100 times. So, if Apache Spark is already increasing the speed of Hadoop, Databricks takes that speed and applies further optimization to it, so it’s even faster. It is also cost-efficient for executing large scale Spark workloads.
Often, when organizations are getting started with Big Data, they have a decision to make: whether to purchase a large amount of hardware or whether to get started in the cloud. Already there is some friction in getting started with the Data Science opportunities. With Azure Databricks, we can do many different things: we can do Data Science, and we can do Data Engineering, which involves making sure that the data is in a good, clean state. If you don’t have good quality data, then the Data Science will be ropey, and so will any Artificial Intelligence and Business Intelligence which is based on it. In the same way that a database offers you several things to help you get up to speed very quickly, Databricks gives us a Jupyter-style Notebook experience, which helps us in the same way. You also have the opportunity to implement security policies and secrets on that data. Azure Databricks connects to many different data sources; for example, we can use it with Blob Storage, which is a bit like a data attic: a place where you can store your data and just put it in the cloud.
Azure Data Lake Store
The Azure Data Lake Store is an optimized way of storing data, especially for analytics. It is aimed at providing high performance alongside Azure SQL Data Warehouse. If you are a retail outlet, for example, then it’s often very good to have a data warehouse in the cloud, because you can examine transactions while they are in flight.
Organizations can also use Hadoop storage, which is cheaper, but it can also be slower. Azure Databricks can be used with many applications, for example for Deep Learning, which is a way of taking neural net models and stringing them together. Organizations can do more complex analytics because they can use Databricks as a way of analyzing and sorting their data for them. For business users, the data is then accessible in Power BI.
Azure Databricks is a one-click deployment which has auto-scaling features. It offers an optimized Databricks runtime that can improve the performance of Spark jobs in the cloud by 10 to 100 times, which makes it very cost-effective to run very large-scale Spark workloads. Databricks is excellent for data engineers who need to make sure the data is clean and of good quality. Databricks offers you an interactive Notebook environment, and it has monitoring tools and security controls.
Data Science is an excellent opportunity for everyone throughout the business to understand data. Azure Databricks offers us the opportunity to visualize our data using Power BI, and Azure SQL Database also supports it. For example, you could put your results into an Azure SQL Database, and Power BI could then pick the data up from Databricks. This facilitates great collaboration between data scientists and business users, and that’s one reason that Azure Databricks is so popular.
Azure Data Lake Analytics
Both Azure Databricks and Data Lake Analytics can be used for batch processing, so how do we know which one to choose? Well, Azure Data Lake Analytics is a distributed computing resource, and it is excellent when you are using massive amounts of data. It uses Microsoft’s proprietary U-SQL language to allow data scientists to crunch through that data and to load it into downstream Azure systems and databases. It enables you to work with distributed processing and distributed data in a SQL-like language, so it is easier for data engineers to get started with it.
Azure Data Lake Analytics is a cloud-based system which is virtually unlimited in its size. With Data Lake Analytics, the job runs and the data remains afterwards, which is one differentiator between HDInsight and Data Lake Analytics. HDInsight is primarily used for processing data: when you tear down the HDInsight cluster, the data stored on the cluster is removed as well. HDInsight is an analytics service, a functional analytics cluster which gives you insights using cluster managers that include open source packages for analytics such as Hadoop, Spark, and so on. Data scientists can set up an HDInsight cluster to use Azure Data Lake Storage, so you can use the Data Lake to preserve the data and HDInsight to crunch and process through that data when you need it.
Looking at it from the database angle, we could have SQL Server on a virtual machine, or we could use Azure SQL Database. Azure SQL Database offers Database as a Service, and Microsoft looks after it for you. On the other hand, if you use Microsoft SQL Server on a virtual machine, then you are responsible for the maintenance and the security.
Azure Cosmos DB is a vast operational data store which can connect through several APIs, such as SQL, and it can use Gremlin as well. We can also use Azure Blob and Azure Table Storage. In reality, I don’t tend to see Table Storage that much, but when I do, it is often for financial data. For example, you could import currency data from one of the financial services providers who look after currency rates, crunch through that data file, and store it in an Azure Table Storage table. It tends to be transient data; a currency exchange rate, for example, may not be something you need to store for very long because it gets updated frequently anyway.
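As a sketch of how that currency data might be shaped for Table Storage: every entity needs a PartitionKey and a RowKey, and the layout below (partitioning by currency pair, using an inverted timestamp so the newest rate sorts first) is an illustrative design choice, not a prescribed schema. The actual upload would go through the azure-data-tables package rather than the plain dict shown here.

```python
from datetime import datetime, timezone

def currency_rate_entity(base, quote, rate, observed_at):
    """Shape a currency observation as an Azure Table Storage entity.

    Partitioning by currency pair keeps each pair's history together;
    an inverted timestamp RowKey makes the most recent rate sort first,
    since Table Storage orders rows by RowKey within a partition.
    """
    seconds = int(observed_at.timestamp())
    inverted = 10**10 - seconds  # smaller RowKey == more recent observation
    return {
        "PartitionKey": f"{base}-{quote}",
        "RowKey": f"{inverted:010d}",
        "Rate": rate,
        "ObservedAt": observed_at.isoformat(),
    }

entity = currency_rate_entity(
    "GBP", "USD", 1.2734,
    observed_at=datetime(2020, 1, 6, 9, 30, tzinfo=timezone.utc),
)
print(entity["PartitionKey"], entity["RowKey"])
# With azure-data-tables, a dict like this could be passed to
# TableClient.upsert_entity(entity). Rates this transient might also be
# cleaned up periodically rather than kept long-term.
```

The entity is just a dictionary, which is part of Table Storage's appeal for simple, high-volume, short-lived data like exchange rates.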
Note that SQL Server and Azure SQL Database are fully featured relational databases. Databricks is more focused on collaboration, with streaming and batch processing, and it provides a Notebook experience for the user. Databricks integrates well with Azure, and it has good Azure Active Directory authentication. On the other hand, HDInsight has Apache Kafka, Storm and Hive, which Databricks does not have in place, and it can be suitable for processing extensive datasets if you just want to ‘set it and forget it’. Sometimes, you will see both technologies used together. Databricks can be more friendly for data scientists where you are not that familiar with the data source. Databricks and HDInsight also offer different language experiences. HDInsight offers Java, Python, and .NET, while the clusters include Spark, HBase, Kafka and Hadoop. HDInsight supports languages that are supported in Hadoop, such as Apache Pig, HiveQL and Spark SQL. You should note, however, that components not installed by default need a script action to install them; but we have Python, SQL, R and Scala, which are very commonly used languages anyway.
Azure Machine Learning
Using these data stores as a basis, you can register them in Azure Machine Learning as datastores and datasets to consume data and mash it together in preparation for Azure Machine Learning runs. You can also interact with the data using the Azure Machine Learning SDK for Python.
Azure Machine Learning does a neat thing called Data Drift detection. You can set up dataset monitors in Azure Machine Learning to track how your data is changing over time and measure the impact on model performance. That’s handy if you have very big data flows, for example, because it’s not a trivial exercise to move and copy those data files. So, if the data sources are substantial, then it’s useful to connect to one source, rather than having numerous copies of the data everywhere; the same data is shared with different Data Science teams across various projects.
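Under the hood, drift detection boils down to comparing statistics of a baseline dataset against newer data. Here is a deliberately crude, hand-rolled sketch of the idea; the metric, the threshold, and the sample numbers are illustrative assumptions, not Azure ML's actual algorithm, which compares richer per-feature statistics between a baseline and a target dataset.

```python
from statistics import mean, stdev

def drift_score(baseline, current):
    """Crude drift measure: how many baseline standard deviations
    the current mean has shifted away from the baseline mean."""
    b_mean, b_std = mean(baseline), stdev(baseline)
    return abs(mean(current) - b_mean) / b_std

def has_drifted(baseline, current, threshold=2.0):
    """Flag drift when the mean shift exceeds the threshold.

    Azure ML's dataset monitors do something conceptually similar
    over time windows, raising alerts when drift crosses a level you set.
    """
    return drift_score(baseline, current) > threshold

# Illustrative feature values from two periods.
january_sales = [100, 102, 98, 101, 99, 100, 103, 97]
june_sales = [130, 128, 131, 127, 132, 129, 130, 128]

print(has_drifted(january_sales, june_sales))  # True: the distribution moved
```

The point of a managed monitor is that this comparison runs automatically against the single shared copy of the data, rather than each team re-running ad hoc checks on their own extracts.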
To summarize, there are plenty of data stores and sources to choose from in Azure, and then you can make the most of your data in Azure Machine Learning and Power BI. Hopefully, this forms the basis for using your data with technologies such as Azure Machine Learning to start your data science journey.