Azure CosmosDB, Azure Data Lake Analytics and R sessions at Microsoft Data and BI Summit

I’m excited to be speaking three times at the Data & BI Summit in Dublin, 24th – 26th April. It’s extra special for me since it will be my first event as a Microsoft Regional Director, and also comes after having been named one of the top 20 women in Artificial Intelligence, Data Science, Machine Learning and Big Data by Big Data Made Simple, the team over at Crayon Data.

I’m speaking on the following topics:

  • R and Power BI
  • Azure CosmosDB and Power BI
  • Azure Data Lake Analytics with Power BI – details to be announced, as there have been a few logistical changes.

Here are the details:

PUGDV07 – R in Power BI for Absolute Beginners

When: Tuesday, April 24, 16:00 – 17:00

Where: Liffey Meeting Room 1 (it’s on the first floor)

In this session, we will start R right from the beginning: from installing R, through data transformation and integration, to visualizing data by using R in Power BI. Then we will move on to powerful but simple-to-use data types in R, such as data frames. We will also upgrade our data analysis skills by looking at R data transformation with a powerful set of tools that keeps things simple: the tidyverse. Next, we will integrate our R work into Power BI and visualize our data using beautiful visualizations built with R and Power BI. Finally, we will share our work by publishing our Power BI project, with our R code, to the Power BI service, and we will refresh our dataset so that our new dashboard shows refreshed data.

This session is aimed at getting beginners up to speed as gently and quickly as possible. Join this session if you are curious about R and want to know more. If you are already a Power BI expert, join to open up a whole new world of Power BI to add to your skill set. If you are new to Power BI, you will still get value from this session, since you’ll see a Power BI dashboard being built as part of an end-to-end solution.

PUGDV11 – Data Analytics with Azure Cosmos Schema-less Data and Power BI

When: Thursday, April 26 15:00 – 16:00

Where: Liffey Meeting Room 5 (it’s on the first floor)

Good news for developers and data analysts: it’s possible to have rapid application development and analytics with the same data source, using Azure Cosmos DB and Power BI.
Azure Cosmos DB is a schemaless database, so how is it possible to analyse and report on the data with analytics and Business Intelligence tools? A single Azure Cosmos DB database is great for rapid application development because it can contain JSON documents of various structures, but this calls for careful data visualization. In this session, we will analyse and create reports of Azure Cosmos DB data using Power BI, looking at the data from both the developer and the data analyst perspectives.
In this demo-rich session, you will learn about Azure Cosmos DB, understand its use cases, and see how to work with data in a schemaless Azure Cosmos DB database for Power BI.

Hope to see you there!

Issues and Resolutions in starting R and R Server on SQL Server 2017

I am helping some people learn Data Science and we are having a ton of fun! There are lots of things to remember. So I am noting things here, in case I forget!

We noted the following error message when we saw that R was not running on our SQL Server 2017 install:

'sp_execute_external_script' is disabled on this instance of SQL Server. Use sp_configure 'external scripts enabled' to enable it.

Here is the longer version:

Msg 39023, Level 16, State 1, Procedure sp_execute_external_script, Line 1 [Batch Start Line 3]

'sp_execute_external_script' is disabled on this instance of SQL Server. Use sp_configure 'external scripts enabled' to enable it.

Msg 11536, Level 16, State 1, Line 4

EXECUTE statement failed because its WITH RESULT SETS clause specified 1 result set(s), but the statement only sent 0 result set(s) at run time.

Grr! What’s happened here? We had installed R as part of the SQL installation, and we had run the command to enable it, too. In case you are wondering, here is the command:

EXEC sp_configure 'external scripts enabled', 1;
RECONFIGURE WITH OVERRIDE;

So what happens next? Initial things to check:

Is R Server installed properly along with SQL Server? Here are some guidelines to help you.

Is the Launchpad service running? One of my colleagues and friends, Tomaž Kaštrun, wrote a nice article on SQL Server Central. If the service is not running, this could be due to a lack of permissions to start it.

Did you restart the MSSQL service on the machine? This will restart the Launchpad service as well. If you didn’t restart the service, you will need to do so, since the configuration change only takes effect after a restart.
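Putting those checks together, here is a short T-SQL sketch to confirm the configuration actually took effect; the comments are my own notes rather than official guidance:

```sql
-- If SQL Server complains that the option 'does not exist, or it may be
-- an advanced option', surface advanced options first:
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

-- Enable external scripts, then compare config_value with run_value;
-- run_value only becomes 1 once the services have been restarted
EXEC sp_configure 'external scripts enabled', 1;
RECONFIGURE WITH OVERRIDE;
EXEC sp_configure 'external scripts enabled';
```

If run_value still shows 0 after a restart, revisit the Launchpad and permissions checks above.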

Once R is running properly, you can check it by using the following command, borrowed from the official installation guide over at Microsoft:

EXEC sp_execute_external_script @language = N'R',
@script = N'
OutputDataSet <- InputDataSet;
',
@input_data_1 = N'SELECT 1 AS RIsWorkingFine'
WITH RESULT SETS (([RIsWorkingFine] int not null));
GO

If that returns a 1, then you are all set! To prove it works properly, you can retrieve the world famous Iris dataset using the following command, borrowed from the official documentation on sp_execute_external_script:

DROP PROC IF EXISTS get_iris_dataset;
GO

CREATE PROC get_iris_dataset
AS BEGIN
EXEC sp_execute_external_script @language = N'R'
  , @script = N'iris_data <- iris;'
  , @input_data_1 = N''
  , @output_data_1_name = N'iris_data'
  WITH RESULT SETS (("Sepal.Length" float not null, "Sepal.Width" float not null,
    "Petal.Length" float not null, "Petal.Width" float not null, "Species" varchar(100)));
END;
GO

Once you’ve created the procedure, execute the following SQL command and you will see the iris dataset:

EXEC get_iris_dataset;

You’re all set! Enjoy R!

Learning pathway for SQL Server 2016 and R Part 2: Divided by a Common Language

http://whatculture.com/film/the-office-uk-vs-the-office-us.php

Britain has “really everything in common with America nowadays, except, of course, language,” said Oscar Wilde in The Canterville Ghost (1887), whilst George Bernard Shaw is quoted as saying that “The United States and Great Britain are two countries separated by a common language.”

There are similarities and differences between SQL and R, which might be confusing. However, I think it can be illuminating to understand these similarities and differences, since they tell you something about each language. I got this idea from one of the attendees at PASS Summit 2015, and my kudos and thanks go to her. I’m sorry I didn’t get her name, but if you see this you will know who you are, so please feel free to leave a comment so that I can give you a proper shout-out.

If you are looking for an intro to R from the Excel perspective, see this brilliant blog here. Here’s a list to get us started. If you can think of any more, please give me a shout and I will update it. It’s just an overview, and it’s there to help the novice get started on a path of self-guided research into both of these fascinating topics.

Here’s how the concepts line up, with the R side first and the SQL / BI background second:

R: A Factor has special properties: it can represent a categorical variable, which is used in linear regression, ANOVA and so on. It can also be used for grouping.
SQL / BI: A Dimension is a way of describing categorical variables. We see this in the Microsoft Business Intelligence stack.

R: In R, dim means that we can give a chunk of data dimensions or, in other words, give it a size. You could use dim to turn a list into a matrix, for example.
SQL / BI: Following the Kimball methodology, we tend to prefix tables with dim if they are dimension tables. Here, we mean ‘dimensions’ in the Kimball sense, where a ‘dimension’ is a way of describing data. If you take a report title, such as Sales by Geography, then ‘geography’ would be your dimension.

R: R memory management can be confusing. Read Matthew Keller’s excellent post here. If you use R to look at large data sets, you’ll need to know:
– how much memory an object is taking;
– 32-bit R vs 64-bit R;
– packages designed to store objects on disk, not RAM;
– gc() for memory garbage collection;
– how to reduce memory fragmentation.
SQL / BI: SQL Server 2016 CTP3 brings native in-database support for the open source R language. You can call R and RevoScaleR functions and scripts directly from within a SQL query. This circumvents the R memory issue, because SQL Server brings multi-threaded, multi-core in-database computations to the user.

R: A data frame is a way of storing data in tables: a tightly coupled collection of variables arranged in rows and columns, with mixed data types. It is a fundamental data structure in R.
SQL / BI: In SSRS, we would call this a data set. In T-SQL, it’s just a table: data formatted into rows and columns.

R: All columns in a matrix must have the same mode (numeric, character, and so on) and the same length.
SQL / BI: A matrix in SSRS is a way of displaying, grouping and summarizing data. It acts like a pivot table in Excel.

R: <tablename>$<columnname> is one way you can reference a table with a specific column name.
SQL / BI: <tablename>.<columnname> is how we do it in SQL, or you could just use the column name on its own.

R: To print something, type the variable name at the command prompt. Note that you can only print items one at a time, so use cat to combine multiple items into one printout. Alternatively, use the print function. One magic feature of R is that it knows how to format any R value for printing, e.g.

print(matrix(c(1,2,3,5),2,2))

SQL / BI: PRINT returns a user-defined message to the client. See the BOL entry here: https://msdn.microsoft.com/en-us/library/ms176047.aspx
CONCAT returns a string that is the result of concatenating two or more string values: https://msdn.microsoft.com/en-GB/library/hh231515.aspx

R: Variables allow you to store data temporarily during the execution of code. If you define a variable at the command prompt, it is contained in your workspace. It is held in memory, but it can be saved to disk. In R, variables are dynamically typed, so you can chop and change the type as you see fit.
SQL / BI: Variables are declared in the body of a batch or procedure with the DECLARE statement and are assigned values by using either a SET or SELECT statement. T-SQL variables are not dynamically typed, unlike R. For an in-depth look at variables, see Itzik Ben-Gan’s article here.

R: ls allows you to list the variables and functions in your workspace. You can use ls.str to list some additional information about each variable.
SQL / BI: SQL Server has tables, not arrays. It works differently, and you can find a great explanation over at Erland Sommarskog’s blog. For SQL Server 2016 specific information, please visit the Microsoft site.

R: A Vector is a key data structure in R, which has tons of flexibility and extras. Vectors can’t have a mix of data types, and they are created using the c(…) operator. If you give it a vector of vectors, R flattens them into a single vector.
SQL / BI: Batch-mode execution is sometimes known as vector-based, or vectorized, execution. It is a query processing method in which queries process multiple rows together. A popular feature in SQL Server 2016 is Columnstore Indexes, which use batch-mode execution. To dig into more detail, I’d recommend Niko Neugebauer’s excellent blog series here, or the Microsoft summary.
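To make some of the R-side concepts above concrete, here is a short sketch you can paste into an R console; it uses base R only, and the variable names are just illustrative:

```r
# A factor represents a categorical variable; levels are sorted alphabetically
f <- factor(c("low", "high", "low"))
levels(f)                      # "high" "low"

# dim gives a chunk of data a size: here, a 2x3 matrix built from a vector
m <- 1:6
dim(m) <- c(2, 3)

# A data frame: rows and columns with mixed data types, like a table
df <- data.frame(age = c(25, 40), name = c("a", "b"))
df$age                         # $ references a column, like table.column in SQL

# Vectors flatten: a vector of vectors becomes one vector
v <- c(1, c(2, 3))
length(v)                      # 3

# ls() lists workspace variables; object.size() shows the memory an object takes
ls()
object.size(df)
```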

There will be plenty of other examples, but I hope that helps for now.

Learning pathway for SQL Server 2016 and R Part 1: Installation and configuration

Jargogled is an archaic word for getting confused or mixed up. I’m aware that there are lots of SQL Server folks out there, who are desperate to learn R, but might be jaRgogled by R. Now that R is in SQL Server, it seems like the perfect opportunity to start a new blog series to help people to splice the two great technologies together. So here you are!

First up, what do you need to know about SQL Server installation with R? The installation sequence is well documented here. However, if you want to make sure that the R piece is installed, then you will need to make sure that you do one thing: tick the Advanced Analytics Extension box.

SQL Server 2016 R Feature Selection

You need to select ‘Advanced Analytics Extensions’, which you will find under ‘Instance Features’. Once you’ve done that, you are good to proceed with the rest of your installation.

Once SQL Server is installed, let’s get some data into a SQL Server database. Firstly, you’ll need to create a test database, if you don’t have one already. You can find some information on database creation in SQL Server 2016 over at this Microsoft blog. You can import some data very quickly and there are different ways of importing data. If you need more information on this, please read this Microsoft blog.

If you fancy taking some sample data, try out the UCI Machine Learning data repository. You can download some data from there, following the instructions on that site, and then pop it into SQL Server.
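If you go down that route, here is a hedged sketch of a staging table for the UCI ‘Adult’ (census income) data set; the table name and column types are my own assumptions, so adjust them to the file you actually download:

```sql
-- Hypothetical staging table for the UCI Adult data set;
-- names and types are suggestions, not part of the original post
CREATE TABLE dbo.AdultData (
    age INT,
    workclass VARCHAR(50),
    fnlwgt INT,
    education VARCHAR(50),
    education_num INT,
    marital_status VARCHAR(50),
    occupation VARCHAR(50),
    relationship VARCHAR(50),
    race VARCHAR(50),
    sex VARCHAR(20),
    capital_gain INT,
    capital_loss INT,
    hours_per_week INT,
    country VARCHAR(60),
    earning_level VARCHAR(10)
);
```

Once the table exists, the SSMS ‘import data’ wizard (or BULK INSERT) can load the downloaded file into it.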

If you have Office x64 installed on your machine, you might run into an issue:

The 'Microsoft.ACE.OLEDB.15.0' provider is not registered on the local machine

I ran into this issue when I tried to import some data into SQL Server using the quick and dirty ‘import data’ menu item in SSMS. After some fishing around, I got rid of it by doing the following:

There are other ways of importing data, of course, but I wanted to play with R and SQL Server, and not spend a whole chunk of time importing data.

In our next tutorial, we will look at some of the vocabulary for R and SQL Server which can look confusing for people from both disciplines. Once you learn the terminology, then you’ll see that you already know a lot of the concepts in R from your SQL Server and Business Intelligence expertise. That expertise will help you to springboard to R expertise, which is great for your career.

PASS Summit Notes for my AzureML, R and Power BI Presentation

I’m going to have fun with my AzureML session today at PASS Summit! More will follow on this post later; I am racing off to the keynote so I don’t have long 🙂

I heard some folks weren’t sure whether to attend my session or Chris Webb’s session. I’m honestly flattered but I’m not in the same league as Chris! I’ve posted my notes here so that folks can go off and attend Chris’ session, if they are stuck between the two.

Here is the order of things:

  • Slide Deck
  • How do you choose a machine learning algorithm?
  • How do you carry out an AzureML project?
  • AzureML Experiment 
  • R Code

So, the slide deck is here:

  • AzureML Experiment 

You can see this experiment in the AzureML Gallery. You may have to sign up for a Windows Live account to get a free AzureML studio account, and I recommend that you do.

  • How do you choose a machine learning algorithm?

Kudos to Microsoft – this is their cheatsheet and I recommend that you look at the original page.

Here is some more information on the topic from Microsoft, and I recommend that you follow it.

How do you carry out an AzureML project?

Try the CRISP-DM framework for a start.

See the Modelling Agency for the original source. https://the-modeling-agency.com/crisp-dm.pdf

“CRISP-DM Process Diagram” by Kenneth Jensen (own work), licensed under CC BY-SA 3.0 via Commons.

R Code

Here’s some sample R code. I know it is simple, and there are better ways of doing this. However, remember that this is for instructional purposes in front of around 500 people, so I want to be sure everyone has a grounding before we talk about more complicated things.

You may have to install the libraries first, if you haven’t done so.

# Load the libraries we need; install any that are missing with install.packages() first
library(data.table)
library(ggplot2)
library(xtable)
library(rpart)

# Read in the UCI Adult data set first; the file name assumes you have downloaded
# adult.data from the UCI repository into your working directory
adult.data <- read.csv("adult.data", header = FALSE, strip.white = TRUE)

summary(adult.data)
class(adult.data)

# Let’s rename the columns
names(adult.data)[1] <- "age"
names(adult.data)[2] <- "workclass"
names(adult.data)[3] <- "fnlwgt"
names(adult.data)[4] <- "education"
names(adult.data)[5] <- "education.num"
names(adult.data)[6] <- "marital.status"
names(adult.data)[7] <- "occupation"
names(adult.data)[8] <- "relationship"
names(adult.data)[9] <- "race"
names(adult.data)[10] <- "sex"
names(adult.data)[11] <- "capital.gain"
names(adult.data)[12] <- "capital.loss"
names(adult.data)[13] <- "hours.per.week"
names(adult.data)[14] <- "country"
names(adult.data)[15] <- "earning_level"

# Let’s see if the columns renamed well
# What is the maximum age of the adult?
# How much data is missing?
summary(adult.data)

# How many rows do we have?
# 32561 rows, 15 columns
dim(adult.data)

# There are lots of different ways to deal with missing data
# That would be a session in itself!
# For demo purposes, we are simply going to replace question marks, and remove rows which have anything missing.

# Note: "[?]" is a regular expression, so fixed = TRUE must not be used here
adult.data$workclass <- as.factor(gsub("[?]", NA, adult.data$workclass))
adult.data$education <- as.factor(gsub("[?]", NA, adult.data$education))
adult.data$marital.status <- as.factor(gsub("[?]", NA, adult.data$marital.status))
adult.data$occupation <- as.factor(gsub("[?]", NA, adult.data$occupation))
adult.data$relationship <- as.factor(gsub("[?]", NA, adult.data$relationship))
adult.data$race <- as.factor(gsub("[?]", NA, adult.data$race))
adult.data$sex <- as.factor(gsub("[?]", NA, adult.data$sex))
adult.data$country <- as.factor(gsub("[?]", NA, adult.data$country))

is.na(adult.data) <- adult.data == '?'
is.na(adult.data) <- adult.data == ' ?'
adult.tidydata <- na.omit(adult.data)

# Let’s check out our new data set, called adult.tidydata
summary(adult.tidydata)

# How many rows do we have now?
# Fewer than 32561, since rows with missing values have been removed; still 15 columns
dim(adult.tidydata)

# Let’s visualise the data
boxplot(adult.tidydata$education.num ~ adult.tidydata$earning_level, outline = FALSE,
        xlab = "Income Level", ylab = "Education Level", main = "Income Vs Education")

prop.table(table(adult.tidydata$earning_level, adult.tidydata$occupation), 2)

# Plot each factor column against earning level; note the brackets in
# 1:(ncol(adult.tidydata) - 2), which skip the last two columns
for (i in 1:(ncol(adult.tidydata) - 2)) {
  if (is.factor(adult.tidydata[, i])) {
    pl <- ggplot(adult.tidydata, aes_string(colnames(adult.tidydata)[i], fill = "earning_level")) +
      geom_bar(position = "dodge") +
      theme(axis.text.x = element_text(angle = 75))
    print(pl)
  }
}

evalq({
  plot <- ggplot(data = adult.tidydata, aes(x = hours.per.week, y = education.num,
    colour = hours.per.week))
  plot <- plot + geom_point(alpha = 1/10)
  plot <- plot + ggtitle("Hours per Week vs Level of Education")
  plot <- plot + stat_smooth(method = "lm", se = FALSE, colour = "red", size = 1)
  # x is hours worked and y is education level, so label the axes accordingly
  plot <- plot + xlab("Hours per Week worked") + ylab("Education Level")
  plot <- plot + theme(legend.position = "none")
  plot
})

That’s all for now! More later.

Jen xx