Sampling distributions are really important, but very few researchers understand them! In the sample size post I introduced the idea that we look at the results of an experiment by investigating what would happen if we did the same experiment lots of times. To do that we need to use something called the ‘sampling distribution’. Let’s go!
Hopefully you remember that we are in the business of estimating parameters. We use sample data to estimate something about a wider population of people or things. We are obliged to sample because we do not have infinite dollars, and because we are obliged to sample we are obliged to estimate. Say I am interested in average survival time measured in months for people with a particular cancer. The starting point is the distribution of survival time (in months) in the study population, which might look like this:
(All screengrabs are from the fantastic simulation at onlinestatbook.com)
This is a nasty distribution. It doesn’t look like any kind of distribution I know. The average survival time for this pretend cancer, “cancer X”, is around 9.5 months. In the real world we don’t know the true value of average survival time (our ‘population parameter’). If we did, we wouldn’t need to bother doing the study!
Our study is going to take one sample from the population pictured above. We will be picking some individuals from the population of “all individuals with cancer X” to provide an estimate of the parameter we care about, average survival time. How do we know if the estimate is any good? How do we guard against picking people right at the edge of the distribution, who are not really representative of everyone else?
In order to interpret the results of our experiment we need to be able to say something intelligent about how our estimate of this parameter (the sample average survival) would behave if we did our experiment thousands of times. Another way to say this is that when we use a particular statistic to estimate a parameter, we need to know something about the distribution of that statistic on repeated sampling. The sampling distribution is the distribution of a statistic calculated many times, each time from a different sample drawn from the same population.
Hugely important summary: Our goal is to estimate the value of the parameter “average survival time for patients with cancer X”. To do that, we need to think about the sampling distribution of the statistic we use to estimate it, the sample average. We were originally talking about the distribution of survival time in a population of patients with cancer; now we are talking about the distribution of the sample average if we were to take a bunch of samples from that population.
How can we get our heads around this? Thanks to computers we can simulate:
The top picture is our old friend, the distribution of survival time for patients with cancer X. The middle graph is a random sample of 5 observations taken from that distribution. The bottom graph plots the average of those 5 observations. This average becomes the first entry in the sampling distribution (which has just one observation in it right now).
If I keep repeating this process, after 20 samples the sampling distribution looks like this:
The real average survival time for the population is about 9.5 months. The mean of this distribution (of means) is just over 6 months. So far we are not doing an awesome job of estimating the population mean. What if I did 10,000 samples?
Take my word for it, this is very exciting. Once we take a heap of samples, the mean of the sampling distribution (9.56) gets extremely close to the true population mean (9.53). You should head over to onlinestatbook.com and verify this result yourself.
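If you would rather check this at your own keyboard, here is a minimal sketch of the same experiment using only the Python standard library. It is not the onlinestatbook.com simulation: the population below is an exponential distribution with a mean of about 9.5 months, which is my stand-in for the skewed "cancer X" picture, and the names `population` and `sampling_distribution` are made up for this example.

```python
import random
import statistics

# Assumption: a right-skewed stand-in population with a mean near 9.5 months.
# The original simulation uses a different (custom) shape; this is exponential.
rng = random.Random(1)
population = [rng.expovariate(1 / 9.5) for _ in range(100_000)]
population_mean = statistics.fmean(population)


def sampling_distribution(n, repeats):
    """Draw `repeats` random samples of size `n`; return the mean of each."""
    return [statistics.fmean(rng.sample(population, n)) for _ in range(repeats)]


# Same settings as the walkthrough: samples of 5, repeated 20 and 10,000 times.
after_20 = sampling_distribution(n=5, repeats=20)
after_10000 = sampling_distribution(n=5, repeats=10_000)

print(f"population mean:             {population_mean:.2f}")
print(f"mean of 20 sample means:     {statistics.fmean(after_20):.2f}")
print(f"mean of 10,000 sample means: {statistics.fmean(after_10000):.2f}")
```

With only 20 samples the mean of the sample means can land some distance from the population mean. With 10,000 it should sit very close to it, although the exact numbers depend on the seed and on the stand-in population.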
Another interesting thing has happened. The parent population (black one) is skewed, asymmetrical and weird. But the sampling distribution (blue one) is symmetrical and nicely bell-shaped. It is very close to a normal distribution. The normal distribution has a lot of very handy properties which allow us to learn things about our experimental results.
For this demonstration I took samples of just 5 patients, which is a really tiny sample. If you can take larger samples, the sampling distribution gets narrower around the true mean and looks more like a normal distribution.
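To see the effect of sample size, here is a rough sketch that repeats the experiment with samples of 5, 25 and 100. Again the exponential population is my assumption, and `skewness` and `results` are names invented for this example. For each sample size it reports the spread of the sample means, the value that the standard formula (population standard deviation divided by the square root of the sample size) predicts for that spread, and a simple skewness measure, which is 0 for a perfectly symmetrical distribution.

```python
import math
import random
import statistics

# Assumption: the same kind of skewed stand-in population (exponential, mean ~9.5).
rng = random.Random(42)
population = [rng.expovariate(1 / 9.5) for _ in range(50_000)]
sigma = statistics.pstdev(population)  # population standard deviation


def skewness(values):
    """Average cubed z-score: 0 means symmetrical, positive means right-skewed."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean([((v - m) / s) ** 3 for v in values])


results = {}
for n in (5, 25, 100):
    # 5,000 repeated samples of size n, keeping the mean of each one.
    means = [statistics.fmean(rng.sample(population, n)) for _ in range(5_000)]
    results[n] = {
        "spread": statistics.pstdev(means),  # observed SD of the sample means
        "predicted": sigma / math.sqrt(n),   # sigma / sqrt(n)
        "skew": skewness(means),
    }

for n, r in results.items():
    print(f"n={n:>3}  spread={r['spread']:.2f}  "
          f"predicted={r['predicted']:.2f}  skew={r['skew']:.2f}")
```

The spread should shrink as the sample size grows and track the predicted value closely, while the skewness should move toward 0.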
As we take more and more samples from a population, the average of the sample averages gets closer and closer to the population average.
As we take more and more samples, the shape of the sampling distribution fills in, and it looks more and more like the classic normal distribution.
Taking larger samples gets you a narrower, more normal-looking sampling distribution.
This is such an important phenomenon that statisticians have given it a special name: the Central Limit Theorem.
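For readers who like to see it written down, one common textbook form of the theorem is sketched below. The symbols are not used elsewhere in this post, so treat them as my notation: $X_1, \dots, X_n$ are independent observations from a population with mean $\mu$ and finite standard deviation $\sigma$, and $\bar{X}_n$ is the sample average.

```latex
\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i ,
\qquad
\frac{\sqrt{n}\,\left(\bar{X}_n - \mu\right)}{\sigma}
\;\xrightarrow{\;d\;}\; \mathcal{N}(0, 1)
\quad \text{as } n \to \infty .
```

In words: for large enough samples, the sample average behaves approximately like a normal distribution centred on $\mu$ with standard deviation $\sigma / \sqrt{n}$, whatever the shape of the parent population.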
Whether or not students understand statistical inference tends to hinge on whether or not they understand the content of this post. Take your time. These ideas are central to all statistical inference.
Soon we’re going to use the Central Limit Theorem to look at an experiment, and learn about the most confusing statistic of all time: the p value.
Until then try this easy quiz on sampling distributions.
Good luck out there!