Most students can draw a normal distribution and name a couple of its properties, but to really ‘get’ statistical inference you need a solid understanding of what a normal distribution actually is. Let’s start by talking about what we mean by the word ‘distribution’.
No one will tell you this in an introductory stats course, but when we say ‘the distribution of the data’ we could mean one of two things. We could be talking about a scatter plot or histogram of actual observations from a real experiment, like this histogram of chicken weights (data included with R):
These are the actual weights of real chickens that exist. Most chickens in this sample weigh somewhere between 200 and 350 grams, with a few weighing a lot less and a few weighing more, but there are no outliers. This is real data, so it is not perfectly symmetrical, and the most common value (‘the mode’) is not in the centre of the distribution.
But when we say “the distribution” we might be actually talking about an idealised distribution which is close enough to our data, and very useful statistically. This second kind of ‘distribution’ is a line defined by a mathematical equation. A straight line might represent a simple distribution, and we can specify a straight line with the mathematical equation y = mx + c. This just means that y (the vertical dimension) is equal to some number (m) times the horizontal dimension (x), plus some constant (c).
Let’s begin with some values for the horizontal dimension (x), below:
To make a line, we can produce the corresponding values for the vertical dimension (y) using an equation of the form y = mx + c. Let’s use y = 2x + 1 (3 = two times 1, plus 1; 5 = two times 2, plus 1; 7 = two times 3, plus 1; etc.):
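The same arithmetic can be sketched in a few lines of Python. The x values here are an assumption: the worked examples (3, 5, 7) imply that x starts at 1, 2, 3, and I have simply carried on to 5.

```python
# Assumed x values: the worked examples imply 1, 2, 3, ...; stopping at 5 is arbitrary.
xs = [1, 2, 3, 4, 5]

# Slope (m) and constant (c) from the equation y = 2x + 1.
m, c = 2, 1

# Apply the recipe y = m*x + c to every x to get the matching y.
ys = [m * x + c for x in xs]

for x, y in zip(xs, ys):
    print(f"x = {x}, y = {y}")
```

Each step of 1 along x adds exactly m = 2 to y, which is why the points fall on a straight line.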
Now we have our coordinates. Do they make a line?
They do! This line might represent the relationship between all kinds of things, like the number of cars your company owns (x) and total car insurance costs (y).
Straight lines are handy, but a lot of the time we want an idealised ‘normal distribution’, which looks like this:
This is not real data from an experiment. It is a fancy line determined by an equation; like y = mx + c but more bendy.
Key points about the normal distribution:
Just like the equation for a straight line, to get y we need to start with x and then do something to it. Hang onto your hats because here comes the equation defining the recipe for a normal distribution:
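In its usual textbook form the recipe reads as follows. Writing the left-hand side as f(x | µ, σ) is an assumption about the notation, chosen because it plays the role of y and names the two parameters the recipe depends on.

```latex
f(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}
```

Everything in it other than x, µ and σ (the 1, the 2, π and e) is a fixed constant.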
Do not panic. It is just a fancy line. I’m not going to get into the weeds manipulating this equation but I do want to identify the “ingredients” for a normal distribution, so that you know when you have one. These ‘ingredients’ are the “unknowns” in the equation above. Below I have crossed out all the constants.
We are left with only the squiggly thing and the u-looking thing. What are they? They are the parameters of the normal distribution! The squiggly one is pronounced ‘sigma’ – it is the standard deviation. The u-looking thing is pronounced ‘mu’ and it is the mean. The game is actually given away by the left hand side of the equation, which says (in math language) “here is a recipe relating x and y, which by the way depends on the values taken by the parameters mu and sigma.”
So what? This means the form of any normal distribution depends on just two things, the mean (µ, ‘mu’) and the standard deviation (σ, ‘sigma’). If you have those two parameters then you have the whole distribution, because they are all you need to define it as per the scary equation above. This means we don’t need to email each other pictures of our favourite distributions: we can just report the mean and standard deviation, and anyone today, or in 5 years, or in 500 years can use µ and σ to reconstruct the whole distribution. This is cool! And very convenient!
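To make ‘two numbers give you the whole curve’ concrete, here is a minimal sketch that turns any x into a y using only µ and σ. The function name `normal_pdf` and the example values µ = 0, σ = 1 are my own choices for illustration; the formula inside is the standard normal density.

```python
import math

def normal_pdf(x, mu, sigma):
    """Height (y) of the normal curve at x, given the mean mu and standard deviation sigma."""
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)

# Hypothetical parameters, chosen only for illustration.
mu, sigma = 0.0, 1.0

# Seven x values, from three standard deviations below the mean to three above.
xs = [mu + k * sigma for k in range(-3, 4)]
ys = [normal_pdf(x, mu, sigma) for x in xs]

for x, y in zip(xs, ys):
    print(f"x = {x:+.1f}, y = {y:.4f}")
```

The y values rise to a peak at x = µ and fall away symmetrically on either side. Change `mu` and the curve slides left or right; change `sigma` and it gets wider or narrower.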
But beware: once you report a summary of your data using the mean and standard deviation, you are dealing in idealised distributions and not actual data anymore. This is a subtlety that becomes quite important later.
Let’s construct a normal distribution using the chicken data above. The average weight of one of these chickens is 261 grams, and I have calculated the standard deviation as 78 grams:
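If you want to rebuild this idealised curve yourself, one possible sketch uses `NormalDist` from Python’s standard library. Only the two reported numbers (261 and 78) go in, not the raw chicken weights; the variable names are my own.

```python
from statistics import NormalDist

# The idealised chicken distribution, built from the reported mean and SD alone.
chickens = NormalDist(mu=261, sigma=78)

# Height of the idealised curve at the mean (its highest point).
height_at_mean = chickens.pdf(261)

# Share of the idealised curve lying between 200 and 350 grams.
share_200_to_350 = chickens.cdf(350) - chickens.cdf(200)

print(f"curve height at 261 g: {height_at_mean:.5f}")
print(f"share between 200 g and 350 g: {share_200_to_350:.2f}")
```

Note that this share describes the idealised curve, not the real sample, so it need not match a count of actual chickens in that range.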
Even though the real distribution is kind of ‘normal’ in shape, it is not perfectly normal. The idealised distribution (top one) is symmetrical, and the most common value is in the centre. So when we move to the idealised distribution we have changed the shape of the data a bit.
When do we ‘move to an idealised distribution’? It is very common in journal articles to report the mean and standard deviation of numerical data in a table, and not to display the full distribution, warts and all. Now that you’ve read this post you understand that you can use this mean and standard deviation to reconstitute the whole idealised distribution, which is super handy, but this is not the same as getting back to the raw data. Reporting the mean and standard deviation cannot tell your readers what the original distribution of observations looked like. The weights of chickens are not very important, but if we’re talking about the distribution of survival times for cancer patients we need to be more careful.
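Here is a small sketch of why the summary cannot recover the shape. It takes two made-up sets of numbers (not real chicken weights), one symmetric and one with a long right tail, and rescales both to have a mean of 261 and a standard deviation of 78. The helper `rescale` is hypothetical, written just for this example.

```python
from statistics import mean, median, stdev

def rescale(values, target_mean, target_sd):
    """Shift and stretch values so they have the target mean and sample SD."""
    m, s = mean(values), stdev(values)
    return [target_mean + target_sd * (v - m) / s for v in values]

# Two made-up shapes: one symmetric, one with a single large value in the right tail.
symmetric = rescale([1, 2, 3, 4, 5, 6, 7], 261, 78)
skewed = rescale([1, 1, 1, 2, 2, 3, 11], 261, 78)

for name, sample in [("symmetric", symmetric), ("skewed", skewed)]:
    print(
        f"{name}: mean = {mean(sample):.1f}, "
        f"sd = {stdev(sample):.1f}, median = {median(sample):.1f}"
    )
```

Both samples would appear in a journal table as the same mean and standard deviation, yet the skewed one has its median below the mean and one value far out to the right. A reader who only sees the two summary numbers has no way to tell them apart.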
The normal distribution has some other properties which are crucial for hypothesis testing and procedures of statistical inference. I’ll deal with that in the next post.
Good luck out there!