Day 6: The Data Analysts Toolkit: Why are Excel and R useful together, and how do we connect them?



https://www.flickr.com/photos/38953955@N07/14764024879/player/
Why is analytics interesting? Well, companies are starting to view it as profitable. For example, McKinsey showed analytics was worth 100Bn today, and estimated to be over 320Bn by 2020.
When I speak to customers, this is the ‘end goal’ – they want to use their data in order to analyse and predict what their customers are saying to them. However, it seems that folks can be a bit vague on what predictive modelling actually is.



I think that this is why Power BI and Excel are a good mix together. It makes concepts like Predictive Modelling accessible, after a bit of a learning curve. Excel is accessible and user-friendly, and we can enhance our stats delivery using R as well as Excel.

One area of interest is Predictive Modelling. This is the process of using a statistical or model to predict the value of a target variable. What does this actually mean?  Predictive modelling is where we work to the predict values in new data, rather than trying to explain an existing data set. To do this, we work with variables. By their nature, these vary; if they didn’t, they would be called a constant.

One pioneer was Francis Galton, who was a bit of an Indiana Jones in his day.  Although he wrote in the 19th century, his work is considered good and clear enough to read today. Therefore, this research has a long lineage, although it seems to be a new thing. We will start with the simplest: linear regression.

Linear regression compares two variables x and y to answer the question, “How does y change with x?” For predictive modelling, we start out with what are known as ‘predictor variables’; in terms of this question, this would be x. The result is called the target variable. In this question, this would be y. Why would we do this?

  • Machine Learning
  • Statistics
  • Programming with Software
  • Programming with Data 
  • Fun!

Why would businesses work with it at all?

  • to discover new knowledge and patterns in the data
  • to improve business results 
  • to deliver better customised services

If we have only one predictor variable and the response and the predictor variable have a linear relationship, the data can be analyzed with a simple linear model. When there is more than one predictor variable, we would use multiple regression. In this case, our question would be: , “How does y change with multiple x?” 

In fitting statistical models in which some variables are used to predict others, we want to find is that the x and y variables do not vary independently of each other, but that they tend to vary together. We hope to find that y is varying as a straight-line function of x.

If we were to visualise the data, we would hope to find a pleasing line chart which shows y and x  relating to each other in a straight line, with a minimal amount of ‘noise’ in the chart. Visualising the data means that the relationship is very clear; analysing the data means that the data itself is robust and it has been checked.

I think that’s why, in practice, Power BI, Excel and R work well together. R has got some great visualisations, but people are very comfortable with Excel for visualisations. All that loading packages stuff you have to do in R… it doesn’t work for everyone. So we use R and Excel, at a high level, as follows:

https://www.flickr.com/photos/38953955@N07/14764024879/player/
  • We cleanse and prepare data with Excel or Power Query
  • We use RODBC to load data into R
  • We analyse and verify the data in R
  • We build models in R
  • We load the data back into Excel using RODBC
  • We visualise the data for results


Excel is, after all, one of the world’s most successful software applications ever, with reputedly over one billion users. Using them both together means that you get the best of both words: R for analysis and model building: Excel is the ‘default’ for munging data around, and visualising it. I’m sure that one of the most popular buttons on software such as Tableau, QlikView et al is the ‘Export to Excel’ or ‘Export to CSV’ functionality. I’d be interested to know in what people think about that!

Building linear regression models in R is very simple; in our next session, we will look at how to do that, and then how to visualise it in Excel. Doing all this is easier than you think, and I will show you how.

The Data Analysts Toolkit Day 5: How do R and Power BI fit together?

How do R and Power BI fit together?
Technically, it is about munging data around between R and Excel and the Power BI components. 
You can use RODBC to connect to data between R and SQL Server, or R and Excel. Alternatively you can import data in.
Why else might you use R?
  • Pivot Tables are not always enough
  • Scaling Data (ScaleR)
  • R is very good at static data visualisation but Power BI and Excel are very good at dynamic data visualisation
  • You want to double check your results or do further analysis
They complement one another; they do not replace one another.
You may have heard my story about one organisation calculating the median incorrectly. 
The truth is, people don’t often check their data. I help design data warehouses all the time, and I don’t always hear people talk about reconciliation. I do hear about people racking up SSAS and diving straight in.
Just because something works technically, does not mean it is correct.
Upworthy and Facebook use R. A lot. So why not you? It is achievable.
Why R, and not some other package?
  • R most widely used data analysis software – used by 2M + data scientist, statisticians and analysts
  • Most powerful statistical programming language
  • used with RStudio, it can help you for the purposes of productivity
  • Create beautiful and unique data visualisations – as seen in New York Times, Twitter and Flowing Data
  • Thriving open-source community – leading edge of analytics research
  • Fills the talent gap – new graduates prefer R.
  • It’s fun!
Excel is used by an estimated 1.3 billion people on the planet. That sounds really impressive, until you think that many people are often using it wrong!
R just helps you to do that double check, the sanity check, to see if your data is correct.

The Data Analysts Toolkit Day 4: Getting to know R a bit better

So what actually is R? Let’s take a look at running some commands, and we can talk about it as we go along.
R is a powerful environment for statistical computing
It is like a calculator and its power is: 
… it lets you save results in variables. Let’s take a look:
> x
> y
> z = 4
> x + y + z
Ok, so we can run little commands and it gives us numbers back. What’s all the fuss about?
We can save data in files. They are saved as .RData files. R lets us import and export data in lots of different ways. If it sounds fun, it is because it is! Let’s take a look.
> someData
> save(someData, file = “D:/DataAnalystsToolkit/someData.Rdata”) 
> rm(someData) 
> load(“D:/DataAnalystsToolkit/someData.Rdata”) 
> print(someData)
Well, that is great if our data is in RData format. But normally we need to import data from different sources.
Import from CSV File
Read.csv()
MyDataFrame
print(MyDataFrame)
Alternatively we can load data in directly from RStudio. 
Go to Tools in RStudio, and select Import Dataset. 
Select the file CountryCodes.csv and select the Import button.
In RStudio, you will now see the data in the data pane.
The console window will show the following:
> #import dataset
> CountryCodes <- directory="" files="" header="F)
>   View(CountryCodes)
Once the data is imported, we can check the data.
dim(CountryCodes)
head(CountryCodes)
tail(CountryCodes) 
So how do we connect R to SQL Server?
The Package RODBC provides R with a connection to ODBC databases:
> library(RODBC) 
> myodbcConnect
myQuery
# or read query from file 
# myQuery <- ataanalyststoolkit="" myquery.sql="" nchars="99999) 
myData myodbcConnect

, myQuery, errors=TRUE) 

odbcCloseAll()
I hope that the rather bizarre colour coding will help you to see how the variables add up together to create a query.
RODBC also works for importing data from Excel files:
library(RODBC) 
filename
myxlsFile <- filename="" odbcconnectexcel="" readonly="FALSE) 
sqlSave(myxlsFile, a, rownames = FALSE) 
b
odbcCloseAll()
Next, we will look in more detail how R and PowerBI complement one another.