Statistical basics – Normal distribution Part 1
(Can’t be bothered reading this whole post? Get the parameter estimation cheat sheet!)
Statistical textbooks often contain a paragraph like this somewhere toward the front:
“The fundamental problem in statistical estimation is to choose an appropriate value for some population parameter, θ. The true value of θ is unknown, as it is unlikely that we will be able to observe the entire population (an example of θ could be the population mean μ). Using a (random) sample from the population, we choose a method to calculate an approximation of the parameter of interest – this provides us with an estimator of the parameter. The result, based on the observed sample using this estimator, is known as the estimate of θ.”
It is quite possible to get through an introductory statistics unit and not understand parameter estimation at all. Often it is the boring first slide in a boring first lecture about sampling. But the truth is that parameter estimation is at the heart of all research. Everything we do in the lab, in a research hospital, in the field, is designed to give us the best estimate of something we are interested in, so that we might gain some insight into “the truth”. You might have heard about “bias” that can happen in research, especially medical studies where it is not easy to control differences. Bias is only a problem because it impairs our ability to estimate parameters accurately, but I’m getting a bit ahead of myself here.
For now, ‘parameter’ is just a fancy word that means ‘useful number’. Knowing the values taken by certain parameters helps us live our lives, improve our health, and do just about everything else we do. What is the weather forecast for tomorrow? That is a parameter which can take a range of reasonable values. Is it raining right now? That is a binary parameter which can take the values 0 (not raining) and 1 (raining, grab your umbrella). How much stronger is steel than iron? If I splice this gene into this mouse, what happens to the chance the mouse gets cancer? Almost everything we do in research is related to discovering the value of some parameter.
Parameters can be functions of other parameters, and this is where it gets confusing. What is the average life expectancy for patients with stage 4 melanoma? That is a parameter. What is the difference in life expectancy between patients with stage 3 and stage 4 melanoma? That is a parameter too! The first job of a statistician on a new project is often to get to the bottom of exactly what is being estimated – is it the average of one group? The difference between two groups? The slope of a regression line? These are all parameters.
When we say “this is an estimate” we usually mean “this might be wrong”. So why do we have to estimate things when we conduct research? Aren’t we collecting actual data and measurements? It is important to understand the answers to these questions. To do so, we need to take a big-picture view of what we are actually doing when we try to answer a question using observed data.
I like to tell my students that the “all-knowing statistical god” knows the values of all parameters. But we don’t. The job of the statistician is to estimate the unobservable. This is an exciting, and mysterious thing to do!
Say I am interested in the average height of 5-year-old children. What is important to grasp is that somewhere out there in the universe is a number which represents the real sum of the heights of all 5-year-old children that exist, divided by the number of children. The number answering our question exists, but it is known only to the ‘all-knowing statistical god’, and not to us. If we knew this number we wouldn’t be bothering to conduct the research! Unless we have access to unlimited research dollars (ha) it is not going to be possible to calculate this number directly. We cannot actually measure all the children that exist and sum their heights to calculate an average. And even if we did have the resources, by the time we got to the last child the first one would probably already be taller. So we have to choose a subset of children, measure them, and estimate the population average using data from our sample.
Key point: Because we have limited resources we are obliged to sample. Because we are obliged to sample we are obliged to estimate.
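You can watch this play out in a simulation. Below is a sketch in which we play the all-knowing statistical god ourselves: we generate a toy “population” of invented heights (the 110 cm mean and 5 cm spread are made-up numbers, not real child-height data), so we know the true mean exactly – and then we see what a sample of 100 children tells us about it.

```python
import random

random.seed(1)

# A toy "population": invented heights (cm) of one million 5-year-olds.
population = [random.gauss(110, 5) for _ in range(1_000_000)]

# Only the "all-knowing statistical god" gets to compute this directly.
true_mean = sum(population) / len(population)

# We mere mortals can only afford to measure a sample.
sample = random.sample(population, 100)
estimate = sum(sample) / len(sample)

print(round(true_mean, 2), round(estimate, 2))
# The estimate lands close to the true mean, but not exactly on it --
# that gap is sampling error.
```

Run it a few times with different seeds and the estimate jumps around the true mean – which is exactly why the next post is about how far off we should expect to be.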
The sample data is our observed data. It is not the same as the actual value of the parameter we are interested in. We might have sampled from a particular neighbourhood of giant children, and so our estimate will be way off. Understanding the relationship between your research sample and your research population is very important.
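The “neighbourhood of giant children” problem can also be simulated. In this sketch (same invented toy population as above, not real data) we deliberately sample only from the tall end of the population and watch the estimate come out wrong:

```python
import random

random.seed(42)

# Toy population of invented heights (cm) of 5-year-olds.
population = [random.gauss(110, 5) for _ in range(100_000)]
true_mean = sum(population) / len(population)

# A biased sampling scheme: we only ever measure children taller than
# 115 cm -- our "neighbourhood of giant children".
tall_only = [h for h in population if h > 115]
biased_sample = random.sample(tall_only, 100)
biased_estimate = sum(biased_sample) / len(biased_sample)

print(round(true_mean, 1), round(biased_estimate, 1))
# The biased estimate comes out several centimetres too high, and taking
# a bigger sample would not help: more data from the wrong population
# does not fix bias.
```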
Say I am interested in the average weight of Labrador dogs. What is my population?
Hopefully you agree that my population is all Labrador dogs. If I am only interested in dogs from my city, I need to specify this in the aims of my study. More importantly, when I get to the exciting part of declaring my estimate for the average weight of Labradors, I need to remind my readers that this estimate only applies to dogs from a certain place – it is a common mistake to forget about the population from which you have sampled. If you only sample women, your results only apply to women. If you only sample people aged 30 to 50, you cannot apply your results to the elderly. This is what people mean when they refer to the “generalisability” of results.
We are using data about a small group of people to estimate data about a very large group of people. Does this sound ideal? Or even sensible? Once you start to think about it, this whole parameter estimation idea seems a bit strange.
Thankfully, generations of statisticians before you have obsessed over the tendency of sample data to give you a decent estimate of population data. The things they have discovered will be the topic of the next post.
Good luck out there,
This post has been summarised into a handy A4 cheat sheet. You’re welcome!