Sample Size and Power Calculations Explained

This kind of question about sample size comes up a lot:

“I was told by my advisor to conduct a sample size / power analysis, because we want to state how many subjects we will need… I approached a statistical consultant at my university, and he told me that I needed to run a pilot study… I really don’t have the time or money to do that … how in the world is this done, ever? Can someone tell me how I should proceed? I really just want to know how many subjects to say I will need in my proposal.”

So how in the world is this done, ever? I think the best way to learn about statistical power and required sample size is to use an analogy about looking for things in your own house:

Imagine you are in your kitchen and you need a hammer for something. You have a 5-year-old son, so you say “Could you go to the basement and get the hammer for me?”

After a while he comes back and says “It’s not there”. As an adult you know that there are really two possibilities here – either the hammer is down there and he missed it, or the hammer really isn’t there. Statistical power is all about managing these two probabilities in your experiments,  because experiments in epidemiology are really just hunts in the basement for things like “differences between groups” or “evidence this exposure is related to that outcome”.

As a scientist you need to be able to unpack what is meant by the statement “It’s not there”. Did your experiment fail to detect something? Or is there nothing to detect? We shouldn’t be concerned about experiments which find no result because there is really no result. But how can we design an experiment that is safe-guarded against missing a result?

There are four ingredients to any power analysis, and they apply to looking for hammers, too:

1.  Effect size
2.  Sample size (N)
3.  Alpha significance criterion (α)
4.  Statistical power, or implied beta (β)

Effect Size: How big is this thing we’re looking for? And how messy are things?

If we’re searching for a giant novelty 10ft hammer then we don’t have much to worry about. Similarly, if we are expecting a treatment to produce a huge effect then we shouldn’t have to work too hard to find it with an experiment. “Huge treatment effect” translates to “big gap between the peaks” on the figure below.

If these red and blue distributions represent what is really going on, this is a nice tidy situation: there is fresh air between the means of the distributions. The groups don’t overlap much, so we won’t need to supercharge an experiment to detect a difference like the one above. Compare that with the figure below, which depicts a much more difficult situation.

It also matters how tidy your basement is, because looking for something when things are all on top of each other is not easy.

If we’re looking for a difference in a continuous measure (like weight, blood pressure, or some other thing) we also need to consider the degree of messiness going on. This relates to the variance of the thing you’re measuring.

In summary, if your experiment is looking for a difference between groups you need to have a sense of how much the distributions are overlapping. They might overlap because the effect size is small (looking for small thing: tiny hammer) or because the variance of your measurement is large (identifying things is hard: messy basement).

It is important to understand that in practice neither the effect size nor the shape of the relevant distributions is actually knowable by you in advance (otherwise why bother doing the experiment?). We have to take a guess at what the distributions are likely to look like, by talking to experts or looking for studies that have measured something similar. Another popular approach is to ask “what is the smallest difference I care about?” and acknowledge that any difference smaller than that will probably be missed by your study. For example, if it turns out that a treatment increases life expectancy by 1 day on average this might not be worth discovering, but we might want to pick up a difference of one month, or one year.
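To make “how much do the distributions overlap?” a bit more concrete, here is a minimal R sketch. The numbers (a 5-unit difference, standard deviations of 5 and 20) are made up for illustration, and the overlap formula assumes two normal distributions with the same standard deviation.

```r
# Hypothetical planning values (assumptions, not from any real study)
smallest_diff <- 5   # smallest difference in means worth detecting
sd_tidy  <- 5        # low variability: the "tidy basement"
sd_messy <- 20       # high variability: the "messy basement"

# Standardised effect size (Cohen's d) = difference in means / standard deviation
d_tidy  <- smallest_diff / sd_tidy
d_messy <- smallest_diff / sd_messy

# Shared area under two equal-SD normal curves whose means are d SDs apart
overlap <- function(d) 2 * pnorm(-abs(d) / 2)

c(tidy = overlap(d_tidy), messy = overlap(d_messy))
```

With these made-up numbers the tidy basement gives d = 1 and the messy one d = 0.25, so the same raw difference is much harder to see when the measurement is noisy.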

Sample size (N)

Thanks to the Central Limit Theorem (post on that coming up), the bigger the sample, the better job it does at revealing the truth to you. This is sort of analogous to the amount of time you’re spending in the basement looking for that hammer. If you just pop down for two minutes you’re less likely to find it than if you spend 3 hours down there. Having more people in your study is akin to a longer search of the basement. Usually sample size is the unknown in the equation: you need to know how many subjects to use in order to have a good chance of discovering a result. But not always. Sometimes sample size is fixed and we need to know how much statistical power we have access to. For example, if you have a tissue bank of 15 donor brains then you have 15 brains and that is it, and the question becomes “Can I get acceptable power given the sample size I have?”
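When the sample size is fixed, you can turn the question around and ask how much power you get. Here is a rough sketch using power.t.test() from R’s built-in stats package. It assumes a one-sample t-test on 15 measurements and a handful of hypothetical standardised effect sizes; neither assumption comes from a real study.

```r
# Assumed design: one-sample t-test, n = 15, two-sided alpha = 0.05
n_fixed <- 15
effect_sizes <- c(0.25, 0.5, 0.75, 1)   # hypothetical differences, in SD units

# Leaving `power` unspecified asks power.t.test() to solve for it
power_at_n <- sapply(effect_sizes, function(d) {
  power.t.test(n = n_fixed, delta = d, sd = 1,
               sig.level = 0.05, type = "one.sample")$power
})

data.frame(effect_size = effect_sizes, power = round(power_at_n, 2))
```

The table shows which effect sizes a sample of 15 has a realistic chance of picking up, and which ones it will mostly miss.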

Alpha significance criterion (α)

Ok this is where things get a bit metaphysical. In every study there is a chance that we’re going to conclude (incorrectly) that we found something. We’re going to say “I found the hammer!” but what I really found was something else, maybe a wrench. We’re going to say “this drug extends survival!” but really we managed to sample people (by chance) so that survival was worse in the placebo group. The drug actually does nothing.

‘Alpha’ is the proportion of experiments designed exactly like this one in which we will “find something” that isn’t really there. It is a long-run frequency. If there were truly no effect and we repeated this experiment 100 times with alpha = 0.05, we are accepting that in about 5 of those 100 experiments we find something that isn’t there. If we want more certainty (want to reduce alpha below 0.05), sample size will need to increase to keep the same power. Alpha is usually 0.05, but not always.
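You can watch this long-run frequency appear with a small simulation. This is only a sketch: the group size of 30, the 2000 repeats and the choice of a two-sample t-test are my own assumptions. The drug does nothing here, so every “significant” result is a false positive.

```r
set.seed(1)
n_experiments <- 2000   # how many times we repeat the whole experiment
n_per_group <- 30       # hypothetical group size

# Both groups come from the same distribution: there is no hammer to find
p_values <- replicate(n_experiments, {
  placebo <- rnorm(n_per_group, mean = 0, sd = 1)
  drug <- rnorm(n_per_group, mean = 0, sd = 1)
  t.test(drug, placebo)$p.value
})

# Proportion of experiments that "found something" at alpha = 0.05
false_positive_rate <- mean(p_values < 0.05)
false_positive_rate
```

The proportion should come out close to 0.05, which is exactly what setting alpha = 0.05 signs you up for.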

Statistical power, or implied beta (β)

Just as we can find a result that doesn’t really exist, we can miss a result that does exist (not find the hammer in the basement). Again imagining a scenario where we repeat the experiment 100 times, the proportion of times we fail to find a result that exists is beta (β). Therefore, by simple arithmetic, the proportion of times we correctly find a result that exists is 1 – β. We call (1 – β) “Statistical Power”. Power of 80% (beta = 0.20) is normal in most fields.
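The same kind of simulation works for power, except that now the hammer really is in the basement. In this sketch the true difference (0.5 standard deviations) and the group size (64) are assumptions I picked for illustration.

```r
set.seed(2)
n_experiments <- 2000
n_per_group <- 64
true_diff <- 0.5   # assumed real effect, in SD units

# This time the drug group really does have a higher mean
p_values <- replicate(n_experiments, {
  placebo <- rnorm(n_per_group, mean = 0, sd = 1)
  drug <- rnorm(n_per_group, mean = true_diff, sd = 1)
  t.test(drug, placebo, var.equal = TRUE)$p.value
})

power_sim <- mean(p_values < 0.05)   # proportion of experiments that find the effect
beta_sim <- 1 - power_sim            # proportion that miss it
c(power = power_sim, beta = beta_sim)
```

With 64 per group and a true difference of half a standard deviation, the simulated power should land close to 80%.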

Can’t we set alpha and beta to zero so that we never make a mistake?

Unfortunately there is a trade-off between preventing false positives and producing false negatives: for a given sample size and effect size, making alpha smaller makes beta bigger. Understanding why requires a reasonably solid understanding of how sampling distributions work under different hypotheses, so I’m going to leave that explanation for another time. But the intuition is pretty simple: the more time we spend in the basement looking for stuff (trying to avoid missing things that are really there), the more likely it becomes we’re going to find something that isn’t real. After a while everything down there starts to look like a hammer!
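One way to see the trade-off without the theory is to hold the design fixed and only change alpha. A small sketch, assuming a two-sample t-test with 30 subjects per group and a true difference of 0.5 standard deviations (both hypothetical):

```r
# Fixed, assumed design: 30 per group, true difference of 0.5 SD
alphas <- c(0.10, 0.05, 0.01, 0.001)

# beta = 1 - power, computed at each alpha with everything else unchanged
betas <- sapply(alphas, function(a) {
  1 - power.t.test(n = 30, delta = 0.5, sd = 1, sig.level = a)$power
})

data.frame(alpha = alphas, beta = round(betas, 2))
```

Each time alpha is tightened, beta goes up. The only ways to push both down together are a bigger sample or a bigger effect.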

So, to discover how many subjects to include, you need to decide the smallest effect size you want to detect, make an assessment of the variation in the thing you’re measuring, think about what kind of probability you can accept re: finding something that is not real, and the kind of probability you want re: missing something that is real. Then you need to plug all this into a calculator (see below),  statistical package, or human with statistical know-how. The exact formula will depend on the kind of study you are designing.
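If your statistical package is R, the plugging-in step for a simple two-group comparison of means might look like the sketch below. The inputs (a smallest difference of 5 units, a standard deviation of 10, alpha = 0.05, power = 80%) are hypothetical, and a two-sample t-test is only one of many possible designs.

```r
# Hypothetical inputs: replace with values that make sense for your study
result <- power.t.test(delta = 5,         # smallest difference worth detecting
                       sd = 10,           # guessed standard deviation of the outcome
                       sig.level = 0.05,  # alpha
                       power = 0.80)      # 1 - beta

# Leaving `n` unspecified asks for the sample size; round up to whole subjects
n_per_group <- ceiling(result$n)
n_per_group
```

Note that the answer is the number of subjects per group, so the total for this made-up two-group study is twice that.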

Good luck out there!

Taya

Calculator for difference of group means

Calculator for a single population mean

R code for the plots is here:

library(ggplot2)
# Two normal curves with the same spread and means 2 SDs apart
ggplot(data.frame(x = c(-7, 7)), aes(x)) +
  stat_function(fun = dnorm, colour = "cornflowerblue", args = list(mean = 2, sd = 1)) +
  stat_function(fun = dnorm, colour = "rosybrown", args = list(mean = 0, sd = 1)) +
  theme_classic()


6 Discussion to this post

1. Martijn says:

Really like the hammer analogy! But I’m having trouble with the analogy where the tradeoff between alpha and beta is concerned, though – the analogy that everything starts to look like a hammer when you’ve been searching long enough seems to refer to a large N (long search time) rather than a large beta (include things that might look like a hammer in dim basement lighting, but aren’t in fact).

• Taya says:

Thanks Martijn. It does get a bit tenuous at that point. Without directly analysing the sampling distributions under the null and alternative hypotheses it is not going to be clear how alpha and beta are connected, but I wanted to keep this post reasonably jargon free. Inspecting figures like this one shows what is going on with the probability densities. The dark shaded area is alpha and the light shaded area is beta, it is clear from this figure that the only way to decrease alpha is to move c to the right, which has the effect of increasing beta. Once I have a post on sampling distributions I’ll update this one with some more detail.

2. Kees Grashoff says:

Go for it, Taya!

3. Biddy says:

Fantastic post and very helpful. I’m just getting started and appreciate the bite sized explanations.

4. Katherine Barnett says:

I’m getting it. Slowly but surely, I’m getting it. Reading your posts after each chapter in my text is like turning on a light bulb! Do more soon! More! More! I love your examples!

• Taya says:

Thanks Katherine! Just for you I’ve published a post on writing hypotheses today. Please contact me if you need help clarifying anything.
