Azure Tools and Technologies Cheat Sheet

Don’t you think that the number of Big Data technologies in Azure is very confusing? I’ve distilled some information from the official Microsoft Azure blog so that it’s easier to read. Before we begin, here is a potted Hadoop history:

Potted Hadoop History

This cheat sheet contains high-level descriptions of the tools, APIs, SDKs, and technologies that you’ll see in Azure. They are used with big data solutions, and they include both proprietary Azure and open source technologies.

I hope that this cheat sheet will help you to more easily identify the tools and technologies you should investigate, depending on the function.

Function | Description | Tools
Data consumption | Extracting and consuming the results from Hadoop-based solutions. | Azure Intelligent Systems Service (ISS), Azure SQL Database, LINQ to Hive, Power BI, SQL Server Analysis Services (SSAS), SQL Server Database Engine, SQL Server Reporting Services (SSRS)
Data ingestion | Extracting data from data sources and loading it into Hadoop-based solutions. | Aspera, Avro, AZCopy, Azure Intelligent Systems Service (ISS), Azure Storage Client Libraries, Azure Storage Explorer, Casablanca, CloudBerry Explorer, CloudXplorer, Cross-platform Command Line Interface (X-plat CLI), FileCatalyst, Flume, Hadoop command line, HDInsight SDK and Microsoft .NET SDK for Hadoop, Kafka, PowerShell, Reactive Extensions (Rx), Signiant, SQL Server Data Quality Services (DQS), SQL Server Integration Services (SSIS), Sqoop, Storm, StreamInsight, Visual Studio Server Explorer
Data processing | Processing, querying, and transforming data in Hadoop-based solutions. | Azure Intelligent Systems Service (ISS), HCatalog, Hive, LINQ to Hive, Mahout, MapReduce, Phoenix, Pig, Reactive Extensions (Rx), Samza, Solr, SQL Server Data Quality Services (DQS), Storm, StreamInsight
Data transfer | Transferring data between Hadoop and other data stores, such as databases and cloud storage. | Falcon, SQL Server Integration Services (SSIS)
Data visualization | Visualizing and analyzing the results from Hadoop-based solutions. | Azure Intelligent Systems Service (ISS), D3.js, Microsoft Excel, Power BI, Power Map, Power Query, Power View, PowerPivot
Job submission | Submitting and processing jobs in Hadoop-based solutions. | HDInsight SDK and Microsoft .NET SDK for Hadoop
Management | Managing and monitoring Hadoop-based solutions. | Ambari, Azure Storage Client Libraries, Azure Storage Explorer, Cerebrata Azure Management Studio, Chef, Chukwa, CloudXplorer, Ganglia, Hadoop command line, Knox, Azure Management Portal, Azure SDK for Node.js, Puppet, Remote Desktop Connection, REST APIs, System Center management pack for HDInsight, Visual Studio Server Explorer
Workflow | Creating workflows and managing multi-step processing in Hadoop-based solutions. | Azkaban, Cascading, Hamake, Oozie, SQL Server Integration Services (SSIS)

If you have any questions, please get in touch at hello@datarelish.com.

SQL Server on Linux for the Business Intelligence Professional: Getting your Database on the VM

In this edition, we will look at getting the database up onto the Virtual Machine as a first step. Then, we will restore it using SSMS. I’m expecting that you will have completed the prerequisites that I laid out yesterday. You will also need to have connected to your database on the Azure Virtual Machine.

For this purpose, we will use the WideWorldImporters sample databases provided by Microsoft.

If you are doing this activity with your own database, make sure that the database backup type is Full. Also, make sure you are backing up to disk; the backup is most likely to be found in this location:

C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\Backup\
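If you still need to take that backup, here is a minimal sketch using sqlcmd, assuming a local default instance and the standard backup path above; the database and file names are illustrative, so adjust them to suit:

# take a full backup of the sample database to the default backup folder
sqlcmd -S localhost -Q "BACKUP DATABASE [WideWorldImporters] TO DISK = N'C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\Backup\WideWorldImporters-Full.bak' WITH INIT"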

Connecting via SSH to the Azure Virtual Machine

This is when we start to use SCP (Secure Copy) and SSH (Secure Shell) to copy our BAK file up to the Linux Virtual Machine.

We will use Git for Windows in this example. Here is how the commands look:

Bash Open Connection

I’m going to spell out the commands here, because seeing them in order will help you.

Firstly, you need to copy the Azure connection command from an earlier step. Mine looks like this:

ssh datarelish@13.68.23.71

Type yes to continue, and hit Return.

Enter the password that you created when you set up the Azure Virtual Machine.

Let’s execute the dir command so we can see what is up there.

From the last line, you can see that the directory contains one file: test_data.txt
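Putting the whole connection sequence together, it looks roughly like this; your username and IP address will differ:

# open the SSH connection to the VM; answer 'yes' to the host-authenticity
# prompt, then enter the password you set when you created the VM
ssh datarelish@13.68.23.71
# list the contents of the home directory (ls works here too)
dir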

Copying the file from your local Machine to the Azure Virtual Machine

I find it easier to have a second Git Bash window open here. This one will point at the local machine, whereas the other points at the Azure Linux VM.

In this new window, we are going to use SCP to copy the file from the local machine up to the remote Azure Virtual Machine.

Change your directory to the location where you have stored the WideWorldImporters database backup. My location is D:\773WorkingDirectory.

You can use commands such as dir and cd to help you to navigate.

To get to my D drive, I typed cd d:

I then used dir to read out the files and folders on the D drive

I then used cd 773* to get to the Directory I wanted, which is called 773WorkingDirectory

When I got there, I used the dir command to find the database backups I wanted.

To copy the file to the Azure Virtual Machine, I then executed the following command from the D:\773WorkingDirectory folder:

scp WideWorldImporters-DW-Full.bak datarelish@13.68.23.71:./

This landed the file in my /home/datarelish directory. I could then use the cp command to copy it, so that the backup file ended up in the /var/opt/mssql/data directory.
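Back on the VM in the SSH session, that copy looks something like the sketch below; the sudo is an assumption on my part, based on /var/opt/mssql/data being owned by the mssql user by default:

# copy the backup from the home directory into the SQL Server data directory
sudo cp ~/WideWorldImporters-DW-Full.bak /var/opt/mssql/data/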

In the next post, we will look at restoring the database onto SQL Server on Linux.


SQL Server on Linux for the Business Intelligence Professional: Connecting to your database on Azure Virtual Machine

What’s next for the Business Intelligence professional, as they learn about Linux? Here, we will connect SQL Server Management Studio to the Linux edition of SQL Server. I’m expecting you’ll have completed the prerequisites laid out previously.

Connect SQL Server Management Studio to SQL Server Linux

Add an inbound security rule that allows remote connections via TCP port 1433.
  • In the Azure Portal, navigate to the SQL Server on Linux Virtual Machine. Mine is called 773SQL because I created it to prepare for the 773 exam.
  • Select the item Network Interfaces on the left-hand side.
  • On the main panel, select the Network Interface.
  • Next, look for the Network Security Group, which you can see under ‘Essentials’ on the right-hand side.

Inbound Security Rules

  • Click Add, and give the rule a name; SSMSRemoteConnections will be fine if you can’t think of one
  • Under Service, select MS SQL
  • Make sure TCP is selected
  • Port Range should read 1433
  • Select Allow and click OK (or script the rule instead; see the sketch below)
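If you prefer to script this step, a rough equivalent using the Azure CLI is sketched below; the resource group and NSG names here are illustrative assumptions, so substitute your own:

# create an inbound rule on the VM's network security group to allow TCP/1433
az network nsg rule create \
  --resource-group 773ResourceGroup \
  --nsg-name 773SQL-nsg \
  --name SSMSRemoteConnections \
  --direction Inbound --access Allow --protocol Tcp \
  --priority 1010 \
  --destination-port-range 1433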

Your inbound Security Rule has been created. Now, let’s connect to your SQL Server Linux machine.

Get your Virtual Machine IP address from the Azure Portal

In this section, we will get the IP address of the Virtual Machine that’s running SQL Server on Linux.

In the Azure Portal, navigate to your Virtual Machine

If it isn’t up and running, you will need to start it. When it’s running successfully, click on the Connect button.

You will get a message that looks like the following:

IP Address for Linux Connection

Make sure that you keep a note of the IP address; you will need it to connect SSMS to the SQL Server Linux Virtual Machine. In this example, the IP is 13.68.23.71. I have it set to Dynamic, and the VM will most likely be gone by the time this blog is published, so don’t bother trying to connect to this IP address. You must use your own, obtained by following the previous step.

Connect SSMS to the SQL Server on Linux VM

Here is an example of how it should look – obviously you will need the password!

The IP address was obtained from the Azure Portal.

You will need to use SA as the login.

You will need to use the password that you used when you set up the Azure Virtual Machine.

Click Connect, and you should now see your SQL Server as normal.

SSMS Side Panel

Let’s double check, shall we?

Checking the Version

Open a New Query window

Enter the Query: SELECT @@VERSION

Execute the query, and take a look at the results:
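If you’d rather check from a shell instead, here is a quick sketch using sqlcmd; it assumes sqlcmd is installed on your local machine, and the IP address and password are placeholders for your own:

# query the server version remotely
sqlcmd -S 13.68.23.71 -U SA -P 'YourPasswordHere' -Q "SELECT @@VERSION"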

In the next blog in this series, let’s take a look at how we can get a database up to the SQL Server on Linux Virtual Machine, and then how we can restore it. We will also look at a potential pitfall or two.

The Previous Post is here

SQL Server on Linux for the Business Intelligence Professional: Prerequisites

In this blog series, I’ll talk about SQL Server on Linux for the Business Intelligence Professional. You’ll need some prerequisites, and this blog post is dedicated to helping you get set up.

It’s assumed that you have set up a SQL Server on Linux Virtual Machine on Azure.

To connect to it, you will need an SSH client. SSH stands for Secure Shell. SSH is used to log into a remote machine and execute commands. It also supports other features, such as tunneling, forwarding TCP ports and X11 connections. It can transfer files using the associated SSH file transfer (SFTP) or secure copy (SCP) protocols. SSH uses the client-server model.

I’ve put some instructions here so that you can get an SSH client installed. The options are below:

Installing an SSH Client

If you don’t have a client installed, you have a lot of options for SSH clients. For our purposes, there is no practical difference between them, although I recommend Chocolatey because it has some great package-management features. The choices are listed here:

  • Git for Windows: I will go through the examples using this client
  • Chocolatey: an SSH client is also available as a Chocolatey package
  • PuTTY: this client is featured in the Microsoft documentation

Git for Windows

You can install Git for Windows, which includes an SSH client. There are also a ton of front-end tools available for it.

Chocolatey

To install it, run a command console as Administrator. To do this, type CMD in the Cortana search window, and options will come up. Right-click on Command Prompt, and select the option Run as Administrator.

Copy and paste the following command, taken from the Chocolatey site:
@"%SystemRoot%\System32\WindowsPowerShell\v1.0\powershell.exe" -NoProfile -InputFormat None -ExecutionPolicy Bypass -Command "iex ((New-Object System.Net.WebClient).DownloadString('https://chocolatey.org/install.ps1'))" && SET "PATH=%PATH%;%ALLUSERSPROFILE%\chocolatey\bin"
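Once Chocolatey is in place, installing an SSH client is a single command; the openssh package name below is an assumption worth verifying on the Chocolatey site:

# install the OpenSSH client via Chocolatey
choco install openssh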
You can also install the Chocolatey GUI. The command to do this is as follows; again, credit to the Chocolatey team over at the website itself:

choco install chocolateygui
You can use PuTTY, as the Microsoft documentation shows. I like using Git for Windows simply because my customers usually have it installed already, so it isn’t an extra step to install. SSH is part of the Git for Windows bundle, so you probably already have it.
When your SSH client is installed, you can connect to your SQL Server Linux machine. I will cover this in the next post.

How do you evaluate the performance of a Neural Network? Focus on AzureML

I read the Microsoft blog post entitled ‘How to evaluate model performance in Azure Machine Learning’. It’s a nice piece of work, and it got me thinking: the post doesn’t cover neural network evaluation, so that topic is covered here.

How do you evaluate the performance of a neural network? This blog focuses on neural networks in AzureML, in order to help you understand what the evaluation results mean.

What are Neural Networks?

Would you like to know how to make predictions from a dataset? Alternatively, would you like to find exceptions, or outliers, that you need to watch out for? Neural networks are used in business to answer these kinds of questions. They are used to make predictions from a dataset, or to find unusual patterns. They are best used for regression or classification problems.

What are the different types of Neural Networks?

I’m going to credit the Asimov Institute with this amazing diagram:

Neural Network Types

In AzureML, we can review the output from a neural network experiment that we created previously. We can see the results by clicking on the Evaluate Model module, and then clicking on the Visualise option.

Once we click on Visualise, we can see a number of charts, which are described here:

  • Receiver Operating Characteristic (ROC) curve
  • Precision/Recall
  • Lift visualisation

The Receiver Operating Characteristic (ROC) Curve

Here is an example:

ROC Curve

In our example, we can see that the curve reaches well up into the top left-hand corner of the ROC chart. When we look at the precision and recall curve, we can see that precision and recall are both high, and this leads to a high F1 score. This means that the model is effective in terms of how precisely it classifies the data, and that it covers a good proportion of the cases that it should have classified correctly.

Precision and Recall

Precision and recall are very useful for assessing models in terms of business questions. They offer more detail and insights into the model’s performance. Here is an example:

Precision and Recall

Precision can be described as the fraction of the cases identified by the model that are classified correctly. It can be considered a measure of confirmation, and it indicates how often the model is correct. Recall is a measure of utility: it indicates how much the model finds of all that there is to find within the search space. The two scores combine to make the F1 score. If either precision or recall is small, then the F1 score will also be small.
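For reference, here are the standard definitions behind these charts, where TP, FP and FN stand for true positives, false positives and false negatives:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × (Precision × Recall) / (Precision + Recall)

For example, a precision of 0.9 combined with a recall of 0.6 gives F1 = 2 × 0.54 / 1.5 = 0.72; the weaker recall drags the combined score down.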

Lift Visualisation

A lift chart visually represents the improvement that a model provides when compared against a random guess; this improvement is called the lift score. With a lift chart, you can compare the accuracy of predictions for models that have the same predictable attribute.

Lift Visualisation
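As a rough working definition, the lift score compares the model’s hit rate against the baseline rate you would get by picking cases at random:

Lift = (response rate among the cases the model selects) / (response rate from random selection)

A lift of 1.0 means the model does no better than chance; the further above 1.0, the more useful the model.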

Summary

To summarise, we have examined various key metrics for evaluating a neural network in AzureML. These scores also apply in other technologies, such as R.

These criteria can help us to evaluate our models, which, in turn, can help us to evaluate our business questions. Understanding the numbers helps to drive the business forward, and visualizing them helps to convey that message.

In my next blog, I’ll talk a little about how we can make the neural network perform better.

Want to learn how to light up Big Data Analytics using Apache Spark in Azure?

Businesses struggle with many different aspects of data and technology. It can be difficult to know which technology to choose. It can also be hard to know where to turn when there are so many buzzwords in the mix: analytics, big data, and open source. My session at PASS Summit essentially talks about these things, using Azure and Apache Spark as a backdrop.

Vendors tend to tell their own version of events, as you might expect, so it becomes really hard to get impartial advice, or a proper blueprint to get you up and running. In this session, I will examine strategies for using open source technologies to improve existing, common Business Intelligence issues, using Apache Spark as our backdrop for delivering open source Big Data analytics.

Once we have looked at the strategies, we will look at your choices on how to make the most of the open source technology. For example, how can we make the most of the investment? How can we speed things up? How can we manipulate data?


These business questions translate into technical terms. We will explore how we can parallelize your computations across the nodes of a Hadoop cluster, once your clusters are set up. We will look at combining the use of SparkR for data manipulation with ScaleR for model development in Hadoop Spark. At the time of writing, this scenario requires that you maintain separate Spark sessions, running only one session at a time, and exchange data via CSV files. Hopefully, in the near future, we’ll see an R Server release in which SparkR and ScaleR can share a Spark session, and so share Spark DataFrames. Hopefully that’s out prior to the session so we can see it, but, nevertheless, we will still look at how ScaleR works with Spark, and how we can use sparklyr and SparkR within a ScaleR workflow.

Join my session at PASS Summit 2017 to learn more about open source with Azure for Business Intelligence, with a focus on Azure Spark.

An MVP for 7 years – what’s next?


Well, it’s been a hard year, for a number of reasons, but I appear to have come out the other side.

Looking forward, what comes next?

New things!


As some of you know, I care deeply about diversity in technology.

I have set up a Diversity Charter Slack channel to get user group leaders talking about diversity, and thinking about how we can make progress on these issues.

I have also started an effort to create a Diversity Charter that user groups can use. I need help with things like logos, thoughts on a website, and so on, so please do help if you can!

The Diversity Charter looks like this, so far:

We believe that all members of the technical community are equally important.
We are part of a tech community where we value a diverse network, and learn and share from one another:
regardless of age,
regardless of colour,
regardless of ethnicity,
regardless of religion or beliefs,
regardless of disability,
regardless of gender or sexual orientation,
regardless of race,
regardless of ability or lack of ability,
regardless of nationality or accent.
We are a diverse tech community where we are all individuals with differences, but we are all members and we can all learn from each other.

I look forward to your thoughts. Please do join my Slack channel diversitycharter.slack.com/ or ping me an email at diversity@datarelish.com in order to get an invite.

I will continue to share my knowledge through blogging, writing, speaking, and presenting, and to increase my online presence. At heart, I am a content producer. It’s what I do, and it’s what I love.

I will continue working hard on the PASS Board. I just attended a Board meeting, which took place over two nights during the week in the PST timezone. I am based in the GMT timezone, so I had a few very late nights, or very early mornings, depending on your point of view. My recent focus is in a ‘trusted advisor’ capacity, so I am helping to drive the new developer and business analytics initiatives in a strategic manner.

To keep the community fresh, I will continue to try to help develop other community leaders. I have nominated a lot of people for the MVP Award this year, including David Moss, Tomaz Kastrun, and other people that I won’t mention, because they weren’t successful this time.