Simple explanation of a t-test and its meaning using Excel 2013

For those of you who say that stats are ‘dry’ – you are clearly being shown the wrong numbers! Statistics are everywhere, and it’s well worth understanding what you’re talking about when you discuss data, numbers, or languages such as R or Python for data and number crunching. Statistical knowledge is extremely useful, and it is accessible when you put your brain to it!

So, for example, what does a pint of Guinness teach us about statistics? On a visit to Ireland, Barack Obama declared that the Guinness tasted better in Ireland, and that the Irish keep the ‘good stuff’ to themselves.
Can we use science to identify whether there is a difference between the enjoyment of a pint of Guinness consumed in Ireland, and pints consumed outside of Ireland?
A Journal of Food Science investigation detailed research in which four investigators travelled around different countries to taste-test Guinness poured inside and outside of Ireland. The researchers compared the average score of pints poured inside Ireland versus pints poured outside of Ireland.

How did they do the comparison? They used a t-test, which was devised by William Sealy Gosset, who worked at the Guinness brewery as a scientist with the objective of using science to produce the perfect pint.
The t-test helps us to work out whether two sets of data are actually different.
It takes two sets of data, and calculates:

the count – the number of data points
the mean, also known as the average, i.e. the sum total of the data divided by the number of data points
the standard deviation – this tells you roughly how far, on average, each number in your list of data varies from the average value of the list itself.

The t-test is a more sophisticated test that tells us whether the means of those two groups of data are different.

In the video, I go ahead and try it, using Excel formulas:

COUNT – counts up the number of data points.
AVERAGE – calculates the mean, using Excel’s AVERAGE formula.
STDEV – another Excel calculation, which tells us the standard deviation.
TTEST – the Excel calculation for the t-test itself, which wants to know:

Array 1 – your first group of data
Array 2 – your second group of data
Tails – do you know whether the mean of the second group will definitely be higher OR lower than the mean of the first group, i.e. the difference is only likely to go in one direction? If so, use 1. If you are not sure whether it will be higher or lower, use 2.
Type –
if your data points are related to one another in each group, use 1
if the data points in each group are unrelated to each other, and the variances are equal, use 2
if the data points in each group are unrelated to each other, and you are unsure whether the variances are equal, use 3
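As a sketch of what COUNT, AVERAGE, STDEV and TTEST are doing under the hood, here is the equal-variances case (Type 2) in Python. The taste scores below are made up for illustration (not the study’s real data), and note that Excel’s TTEST goes one step further by converting the t statistic into a p value using the t-distribution:

```python
import math
import statistics

def two_sample_t(a, b):
    """Equal-variance two-sample t statistic (the 'Type 2' case in Excel's TTEST)."""
    n1, n2 = len(a), len(b)                            # COUNT
    m1, m2 = statistics.mean(a), statistics.mean(b)    # AVERAGE
    s1, s2 = statistics.stdev(a), statistics.stdev(b)  # STDEV
    # Pool the two sample variances, weighted by their degrees of freedom
    pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
    # The t statistic: difference in means, scaled by its standard error
    return (m1 - m2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))

# Hypothetical taste scores out of 100
ireland = [74, 81, 68, 90, 77]
abroad = [57, 62, 70, 49, 60]
print(round(two_sample_t(ireland, abroad), 2))
```

The bigger the t statistic (in absolute terms), the smaller the p value you get back from TTEST.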

And that’s your result. But how do you know what it means? The formula produces a number, called p, which is simply a probability: the probability of seeing a difference at least this large if there were really no difference between the groups.

The t-test is a simple way of establishing whether there are significant differences between two groups of data. It uses the null hypothesis: this is the devil’s advocate position, which says that there is no difference between the two groups. It uses the sample size, the mean, and the standard deviation to produce a p value.
The lower the p value, the more likely it is that there is a real difference between the two groups, i.e. that something happened to make a difference between them.

In social science, the threshold for p is usually set to 5%, i.e. we only accept a difference as real if it would arise by chance no more than 1 time in 20.
In the video, the first comparison does not have a difference ‘proven’, but the second comparison does.
So next time you have a good pint of Guinness, raise your glass to statistics!

Love,
Jen

Power BI and R and the role of Data Visualisation in Data Audits

In The Wizard of Oz, Toto pulls back the green curtain to expose the Wizard as a fraud. We can peep behind the ‘green curtain’ of data visualisation to learn how to ‘poke holes’ in the data that you are given, both in business and in everyday news headlines.

It is a little-known secret that there are hidden patterns in the data chaos that surrounds us every day, and we can use them to explode the myths in that data. Deviations from these patterns highlight invention, bias, anomalies and even deliberate fraud.

You can use both R and Power BI data visualisation, combined with timeless data analysis and patterns such as Benford’s Law, to reveal efforts to distort or conceal the numbers, and to question the veracity of the data.

You’ll need courage, heart and wisdom to analyse data, since truthful data doesn’t necessarily give easy answers! My slides are here, but they are pretty bare because I don’t read off slides, but I talk around them. So here is some context.

What is Benford’s Law? Here is the Wikipedia definition: Benford’s law, also called the first-digit law, states that in lists of numbers from many (but not all) real-life sources of data, the leading digit is distributed in a specific, non-uniform way. According to this law, the first digit is 1 about 30% of the time, and larger digits occur as the leading digit with lower and lower frequency, to the point where 9 as a first digit occurs less than 5% of the time.
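Those expected frequencies come straight from the formula P(d) = log10(1 + 1/d), which is quick to check for yourself – a minimal Python sketch:

```python
import math

# Benford's law: the probability that the leading digit is d
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

for digit, p in sorted(benford.items()):
    print(digit, f"{p:.1%}")   # digit 1 prints as 30.1%, digit 9 as 4.6%
```

The nine probabilities sum to 1, so any observed leading-digit tally can be compared directly against them.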

To detect manipulation or fraud in accounting data, people such as Mark Nigrini have successfully used Benford’s law as part of their fraud detection processes. It’s not foolproof, however: Bernard Madoff filed accounts that were consistent with Benford’s Law, for example. Still, it’s a start, and as with other statistics, we are dealing with probabilities rather than certainties.

It has come to light with respect to the European Union because countries need to meet the Stability and Growth Pact criteria before they can join the euro. Countries are therefore ‘incentivised’ to make sure that they look stable. The data regarding Greece looks strange indeed, particularly just before they joined the euro. You can see the FT article here.

Academic evidence has found that Greece, for example, shows the greatest deviation from Benford’s Law among all the European states. Again, probabilities rather than certainties. As always, the debate rages on.

How can you use R and Power BI to look at this? Well, first of all you need a data set. If you are interested in playing with economic data, you can download it here.

R comes with handy Benford’s Law packages for you to try as a demo:

BenfordTests

benford.analysis

You can visualise the data in R, of course. Alternatively, you could take the data as a CSV dump, and load it into PowerPivot.

Once the data is in PowerPivot, you simply need the leftmost digit. In my example, I took the Deficit Surplus data from Eurostat, and renamed the surplus column to Deficit Surplus. Then we get the left-hand digit. All this is straightforward enough:

=LEFT(ABS('Deficit Surplus'[Deficit Surplus]), 1)

Now, all you need to do is make this data into a Pivot table in Excel.
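If you’d rather do the left-digit extraction outside Excel, the same idea is a few lines in Python – a sketch with made-up figures standing in for the deficit/surplus column (not the Eurostat data):

```python
from collections import Counter

def leading_digit(x):
    """First significant digit of x, ignoring sign and leading zeros."""
    x = abs(x)
    if x == 0:
        raise ValueError("zero has no leading digit")
    while x >= 10:   # scale large numbers down
        x /= 10
    while x < 1:     # scale small fractions up
        x *= 10
    return int(x)

# Made-up example values standing in for a deficit/surplus column
values = [-3.2, 104.7, 0.018, 12.5, 9.9, -1.07]
counts = Counter(leading_digit(v) for v in values)
print(counts)
```

The resulting tally is exactly what the PivotTable gives you, ready to compare against the Benford frequencies.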

Overall Deficit Surplus versus Benford

You can then graph this data quite simply:
Overall Deficit Surplus versus Benford Chart
Here is the same chart in Power View:

Overall Deficit Surplus versus Benford Power View
As a next step, next time we will look at this data for different countries and years. I hope you’ll enjoy this journey with me. In the meantime, my slides are below.

The Data Analysts Toolkit Day 1: cleaning and importing your data using Power Query

What does the Data Analyst need for a Toolkit?
  • Power BI – Excel, Power Query, Power View
  • R

Power BI – cleanse, model, visualise
R – check, check, check, explore, statistics.
In this series, we will look at:
  • Power BI – cleansing data
  • some basic R scripts
  • analysing and visualising data in R
  • accessing data in the Hortonworks Sandbox for analysis with R
  • Power BI visualisation in Excel, such as the little People Graph, which I love, because it is simply fun!
So let’s get started!
Microsoft Power Query for Excel is a new add-in that provides a seamless experience for data discovery, data transformation and enrichment for Information Workers, BI professionals and other Excel users.

With Power Query you can:

  • Identify the data you care about from the sources you work with (e.g. relational databases, Excel, text and XML files, OData feeds, web pages, Hadoop HDFS, etc.).
  • Discover relevant data from inside and outside your organization using the search capabilities within Excel.
  • Combine data from multiple, disparate data sources and shape it in order to prepare the data for further analysis in tools like Excel and Power Pivot, or visualization in tools like Power View and Power Map.
  • Share the queries that you created with others within your organization, so they can be easily found via Search.
It is part of our toolkit because it helps us to easily clean and prepare the data for consumption by R. 
So let’s get to it.

Download and install Power Query

Before we proceed with Power Query, let’s download and install the Power Query Add-in.
Before we proceed to download, let’s check the supported operating systems for Power Query. At the time of writing, these are:
  • Windows 7, Windows 8, Windows 8.1
  • Windows Server 2008 R2, Windows Server 2012
  • Windows Vista (requires .NET 3.5 SP1)
  • Windows Server 2008 (requires .NET 3.5 SP1)
The following Office versions are supported:
  • Microsoft Office 2010 Professional Plus with Software Assurance
  • Microsoft Office 2013 Professional Plus, Office 365 ProPlus or Excel 2013 Standalone
Microsoft Power Query for Excel requires Internet Explorer 9 or greater.
Microsoft Power Query for Excel is available for 32-bit (x86) and 64-bit (x64) platforms; your selection must match the architecture of the installed version of Office.
Once you have checked these items, you can download the version of the Power Query add-in that matches the architecture (x86 or x64) of your Office installation.
  • Download the Power Query for Excel add-in.
  • Open the folder where you downloaded Microsoft Power Query for Excel.
  • Double-click the downloaded MSI installer file, and then follow the steps in the setup wizard.
  • After the installation is complete, click Finish.
  • Power Query will appear as a new tab in the Excel ribbon.

Enabling Power Query

  • Once Power Query is installed, you may need to enable it. To do this, go to File and then choose Options. The Excel Options dialogue box will open.
  • Go to the Add-Ins option on the left-hand side.
  • At the foot of the Excel Options box, you will see the Manage drop-down list. In the drop-down list, look for COM Add-ins and click Go.
  • Make sure that Power Query is checked.

Connecting to External Data Sources

We have options to connect to a wide range of data sources, both public and internal.

Connecting to Web Page data

  • To open Internet Explorer, press the Windows key.
  • Then, in Start, begin typing Internet Explorer.
  • When the Internet Explorer application appears, click the application tile.
  • Go to the Wikipedia Human Development Index page in order to see it before we import it.
  • We will now import the data from this page.

Add a Wikipedia page data source to Power Query

  • In the POWER QUERY ribbon tab, click From Web.
  • In the Web Content dialog box, in the URL text box, paste the Wikipedia URL (http://en.wikipedia.org/wiki/List_of_countries_by_Human_Development_Index).
  • Click OK.
  • On the right-hand side, you will see a list of tables. Click Table 2, and you will get a preview of the data.
  • Double-click on it, and you will see the data appear in the Excel window.
  • Click on it, and rename the query to HDI Very High.
  • Make sure to click on Add to the Data Model at the bottom right hand side.
  • Next, click on Apply and Close.
  • You will be returned to the Excel workbook.

We will re-run the query now for the other HDI levels: High, Medium, and Low. 
To do this, we will reuse the existing query by clicking on the Power Query tab again, and then selecting Recent Sources.
  • In the drop-down list, you will see a list of your recent sources. Select the Wikipedia page.
  • You will be returned to the Power Query Editor, and you will see the tables on the right hand side.
  • Select Table 3 by double-clicking it. Next, make sure you have clicked on Add to the Data Model at the bottom right-hand side.
  • For the Medium and Low Human Development Indexes, repeat these steps using Table 4 and Table 5 respectively.

Transforming the data
Go back to the first HDI Very High Human Development query, and we can double click on it to produce the Power Query Editor.
We can transform the HDI Very High Human Development data by removing the column Rank Change in rank between 2013 report to 2011 report[1], which is not required.
  • To do this, highlight the column and right-click it. Select Remove, and the column will be deleted.
  • Click on Row number 1. We will make this row the header row. To do this, select Use First Row as Headers and you will see that this row is now the header row.
  • Rename the HDI column to HDI 2012 by right clicking it, and selecting Rename.
  • We will still have a row that is text. To remove it, highlight the first row and select Remove Top Rows from the Home tab in the Power Query Editor.
  • Change the HDI column to a Decimal Number by highlighting the column, going to Data Type and selecting Decimal Number from the drop down list.
  • Let’s rename the HDI_1 column to Difference by right-clicking it and selecting Rename.
  • Let’s add in a custom column. To do this, right-click on the Difference column and select Insert Custom Column.
  • In the Insert Custom Column editor, double click the HDI 2012 column to put it into the text box.
  • Then, enter a minus sign after the HDI 2012 column
  • Finally, double click the Difference column to select it and put it into the text box.
  • Your query should look like this:
  • [HDI 2012] - [Difference]
  • Press the return key, and check your final query output.
  • Now click Apply and Close.

Now, you can see in Excel we have five columns:
  • Rank
  • Country
  • HDI 2012
  • Difference
  • HDI 2010

Two of the columns contain HDI data, split by date. It will be easier to visualise the data if all of the HDI values are in the same column, so we can unpivot the columns.
  • To do this, select the HDI 2010, HDI 2012 and the Difference columns.
  • Go to the Transform tab and go to the option Unpivot Columns.
  • You will now see that the HDI 2010, HDI 2012 and Difference values have gone into a single column, which is headed Value.
  • The attribute names have gone into a column called Attribute.
  • The other two columns, Rank and Country, have stayed the same.
  • We don’t actually need the Difference column now that we have obtained the data, so let’s filter it out.
  • To do this, go to the Attribute column heading and click the downward arrow next to it.
  • Deselect the Select All button, and select only the HDI 2010 and HDI 2012 options, and then click OK.
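Unpivot is just a reshape from wide to long: one output row per (row, value column) pair. If you want to see the mechanics outside Power Query, here is a plain-Python sketch – the countries and HDI figures are invented for illustration:

```python
# Wide rows, as they come out of the query (invented figures)
rows = [
    {"Rank": 1, "Country": "Freedonia", "HDI 2010": 0.901, "HDI 2012": 0.912, "Difference": 0.011},
    {"Rank": 2, "Country": "Borduria", "HDI 2010": 0.887, "HDI 2012": 0.895, "Difference": 0.008},
]

value_cols = ("HDI 2010", "HDI 2012", "Difference")

# Unpivot: each value column becomes an (Attribute, Value) pair per row
unpivoted = [
    {"Rank": r["Rank"], "Country": r["Country"], "Attribute": c, "Value": r[c]}
    for r in rows
    for c in value_cols
]

# Filtering the Attribute column, as in the steps above, keeps only the HDI years
kept = [u for u in unpivoted if u["Attribute"] != "Difference"]
print(len(unpivoted), len(kept))   # 6 4
```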

Now we can rename the attributes HDI 2010 and HDI 2012 so that they contain the year only. To do this, select the Attribute column and go to the Transform tab.
  • Go to the option Replace Values and the Replace Values box will appear.
  • In the Value to Find box, enter HDI 2012.
  • In the Replace With box, enter 2012 and then click OK.
  • Repeat these steps for HDI 2010, renaming it to 2010.
  • Let’s rename the Attribute column to Year. To do this, select the column and then right-click. In the pop-up menu that appears, select Rename, and rename the column to Year.
  • Now, let’s rename the Value column so that it is called HDI Value. To do this, select the column and right-click to get the pop-up menu again. Look for the Rename option, and rename the column to HDI Value.
  • If we examine the Rank column, we can see that some countries are jointly ranked. So, for example, Hong Kong and Iceland are jointly ranked 13th in the list.
  • Let’s add an index to our Power Query. We could use this later as a key in the table.
  • To do this, right click on HDI Data and click Edit.
  • The Power Query Editor will appear again. Go to the Add Column tab in the Power Query Editor and select the option Insert Index Column.
  • The Index column will appear at the right hand side. To move it, select the column and go to the Move option.
  • Look for the option ‘To the Beginning’ and select it. The Index column will move to the left hand side of the table.
  • Once you have done this, go to the Home Tab and click on the Apply and Close button.
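An index column is simply a running row number that you can use as a key. The same idea in Python, using hypothetical country names:

```python
# Hypothetical country rows in table order
countries = ["Freedonia", "Borduria", "Syldavia"]

# enumerate produces the running index, just like Insert Index Column
indexed = [{"Index": i, "Country": c} for i, c in enumerate(countries)]
print(indexed[0])
```

Note that Power Query’s index also starts from 0 by default; an option lets you start from 1 instead.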

Connecting to Online data

 Let’s do an online search. This is a fun part of Power Query.
Before we dive in to use the Editor, let’s take a look at some of the features in the Power Query Editor.
You can see that there are four main tabs: Home, Transform, Add Column and View.
The Home tab contains useful functionality that is common when we are cleansing data.
Sometimes data files contain unnecessary data when they are exported from other sources. Cleansing involves specifying to Power Query which data you want to keep, not which data you want to throw away.
One example of this unnecessary data might include headers, titles, dates, owner information and so on. We can remove these rows by removing the top number of rows from the data set.
  • Click on the Power Query tab.
  • Click on the Online Search button.
  • The Online Search box will appear on the right hand side.
  • Type past and future GDP into the search box and press return.
  • The search box will show two tabs: one is Organisation and the other is Public. If you click on Organisation, you will probably see no data. If you click on Public, there will be many data sources to choose from.

Extracting data from the Azure Data Market
In the real world, however, we need to be able to connect to different data sources, which may contain large amounts of data.
We will look at connecting to multiple data sources at a time. This is a useful way of enriching our data with additional attributes, where an attribute is another way of describing data: in this definition, it is a field that can be considered an independent variable, regardless of the data type.

To connect the online data and local data, we will connect to Windows Azure DataMarket using OData, which is a standardized protocol that provides CRUD (create, read, update, delete) access to a data source via a website. It is the data API for Microsoft Azure, but other organizations use it as well, such as eBay, SAP and IBM.

Before you start, you need to create a folder where you can download data for the purposes of running through the examples. You should pick a folder name that is meaningful for you. Also, be sure to select a location that has plenty of space. In this example, we will use the following location to store data:
D:\Data\DataAnalystToolkit

To connect to Windows Azure Datamarket, please sign up for a free account using a Windows Live ID. To do this, please visit https://datamarket.azure.com/ and follow the instructions. This may involve activating your account via a link, so please follow the instructions carefully.

Sign in to the Windows Azure Datamarket and navigate to the following URL:

About half way down the page, look for the Sign Up button and select it.

  • This will take you to a Terms and Conditions page. After you’ve read the Terms and Conditions, and, if you agree with them, tick the box to specify that you agree, and click on Sign Up.
  • This will take you to a Thank You page. Look for the link to Explore this Dataset on this page, and click it.



The Explore this Dataset link does not initially appear on the screen. You need to click the Subscribe button (even after logging in) to activate that section.

When you click ‘Explore this Dataset’, you will be able to see the data appear in the browser, which you can slice and dice.

In this example, we will load the data into a CSV file, rather than use the Data Explorer URL. To do this, we need the Primary Account Key. In Windows Azure DataMarket, this is easy to obtain: if you click on the ‘Show’ link next to ‘Primary Account Key’, your Primary Account Key will appear.
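Under the hood, a DataMarket feed is plain OData over HTTPS, authenticated with HTTP Basic auth where the Primary Account Key serves as the password (the user name can be anything). A sketch of building such a request in Python – the feed URL and key below are placeholders, not a real subscription:

```python
import base64
import urllib.request

def datamarket_request(feed_url, account_key):
    """Build an authenticated request for a Windows Azure DataMarket OData feed.
    Basic auth: any user name, with the Primary Account Key as the password."""
    token = base64.b64encode(f"user:{account_key}".encode()).decode()
    return urllib.request.Request(
        feed_url,
        headers={
            "Authorization": f"Basic {token}",
            "Accept": "application/atom+xml",  # OData v2 feeds default to Atom
        },
    )

# Placeholder URL and key - substitute your own feed and Primary Account Key
req = datamarket_request("https://api.datamarket.azure.com/SomePublisher/SomeFeed/", "YOUR-KEY")
# urllib.request.urlopen(req) would then fetch the feed (not run here)
```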

You can download the data, and import it into Power Query.

  • Open the Power Query editor, and go to the Power Query tab.
  • Look for the button From Other Sources.
  • Look for the option From Windows Azure Marketplace and click on it.
If you are signed in to Power Query already with a Power BI subscription, you will see your subscribed data feeds on the right-hand side. Clever, huh?
If not, selecting this option will give you the Data Explorer credential manager.

Once you are logged in, you can see your data sets and you can import them into Power Query.

I hope that you enjoyed this whizz tour of Power Query. On to the next step!