Simple explanation of a t-test and its meaning using Excel 2013

For those of us you say that stats are ‘dry’ – you are clearly being shown the wrong numbers! Statistics are everywhere and it’s well worth understanding what you’re talking about, when you talk about data, numbers or languages such as R or Python for data and number crunching. Statistics knowledge is extremely useful, and it is accessible when you put your brain to it!

So, for example, what does a pint of Guinness teaches us about statistics? In a visit to Ireland, Barack Obama declared that the Guinness tasted better in Ireland, and the Irish keep the ‘good stuff’ to themselves.
Can we use science to identify whether there is a difference between the enjoyment of a pint of Guinness consumed in Ireland, and pints consumed outside of Ireland?
A Journal of Food Science investigation detailed research where four investigators travelled around different countries in order to taste-test the Guinness poured inside and outside of Ireland. To do this, researchers compared the average score of pints poured inside of Ireland versus pints poured outside of Ireland.

How did they do the comparison? they used a t-test, which was devised by William Searly Gosset, who worked at the Guinness factory as a scientist, with the objective of using science to produce the perfect pint.
The t-test helps us to work out whether two sets of data are actually different.
It takes two sets of data, and calculates:

the count – the number of data points
the mean, also known as the average i.e. the sum total of the data, divided by the number of data points
The standard deviation – tells you roughly how far, on average, each number in your list of data varies from the average value of the list itself.

The t-test is more sophisticated test to tell us if those two means
of those two groups of data are different.

In the video, I go ahead and try it, using Excel formulas:

COUNT – count up the number of data points. The formula is simply COUNT, and we calculate this
AVERAGE – This is calculated using the Average Excel calculated formula.
STDEV – again, another Excel calculation, to tell us the standard deviation.
TTEST – the Excel calculation, which wants to know:

Array 1 – your first group of data
Array 2 – your second group of data
Tail – do you know if the mean of the second group will definitely by higher OR lower than the second group, and it’s only likely to go in that direction? If so, use 1. If you are not sure if it’s higher or lower, then use 2.
Type –
if your data points are related to one another in each group, use 1
if the data points in each group are unrelated to each other, and there is equal variances, use 2
if the data points in each group are unrelated to each other, and if you are unsure if there are equal variances, use 3

And that’s your result. But how do you know what it means? It produces a number, called p, which is simply the probability.

The t-test: simple way of establishing whether there are significant differences between two groups of data. It uses the null hypothesis: this is the devil’s advocate position, which says that there is no difference between the two groups It uses the sample size, the mean, and the standard deviation to produce a p value.
The lower the p value, the more likely that there is a difference in the two groups i.e. that something happened to make a difference in the two groups.

In social science, p is usually set to 5% i.e. only 1 time in 20, is the difference due to chance.
In the video, the first comparison does not have a difference ‘proven, but the second comparison does.
So next time you have a good pint of Guinness, raise your glass to statistics!

 


 

Love,
Jen

The Data Analysts Toolkit Day 3: Introduction to running and understanding commands in R

I have located a number of R Scripts over at my OneDrive account and you can head over there to download them. There are some other goodies over there too, so why not take a look?
how do you open files? To create a new file, you use the File -> New menu.
To open an existing file you use either the File -> Open menu.
Alternatively, you can use the Open Recent menu to select from recently opened files.
If you open several files within RStudio, you can see them as tabs to facilitate quick switching between open documents. 
If you have a large number of open documents, you can also navigate between them using the >> icon.
You could also use the tab bar to navigate between files, or the View -> Switch to Tab menu item.
Let’s open the Women file and take a look at the commands.
Comments in R are preceded with a hash symbol, so that is what we are using here.
# this loads the dataset
data(women)
# You can see what is in the dataset
women
# This allows you to see the column names
names(women)
# You can see the output of the height column here, in different ways
women$height
women[,1]
women[seq(1,2),]
women[1:5,1]
women[,2]
# we can start to have a little fun!
# we are going to tell R that we are going to build a model of the data
attach(women)
model
model
print(model)
mode(model)
predict(model, women, interval=”predict”)
newdata = data.frame(height=60)
newdata
predict(model, newdata, interval=”predict”)
women
r
summary(model)
Now we can see some strange output at the bottom of the page:
Call:
lm(formula = weight ~ height)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.7333 -1.1333 -0.3833  0.7417  3.1167 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -87.51667    5.93694  -14.74 1.71e-09 ***
height        3.45000    0.09114   37.85 1.09e-14 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.525 on 13 degrees of freedom
Multiple R-squared:  0.991, Adjusted R-squared:  0.9903 
F-statistic:  1433 on 1 and 13 DF,  p-value: 1.091e-14
What does this actually mean? As part of my course, I normally discuss this in detail. For those of you reading online, here is the potted summary:
Basically, you want big F, small p, at the 30,000 feet helicopter level. We will start from there, and move down as we go.
The stars are shorthand for significance levels, with the number of asterisks 
displayed according to the p-value computed. 
*** for high significance and * for low significance. 
In this case, *** indicates that there is likely to be a relationship.
Pro Tip
Is the model significant or insignificant? This is the purpose of the F statistic.
Check the F statistic first because if it is not significant, then the model doesn’t matter.
In the next Day, we will look at more commands in R.