The post Statistical basics – Normal distribution Part 1 appeared first on Survive Statistics.

No one will tell you this in an introductory stats course, but when we say ‘the distribution of the data’ we could mean one of two things. We could be talking about a scatter plot or histogram of *actual observations from a real experiment*, like this histogram of chicken weights (data included with R):

These are the actual weights of real chickens that exist. Most chickens in this sample weigh somewhere between 200 & 350 grams, with a few weighing a lot less, and a few bigger chickens, but no outliers. This is real data so it is not perfectly symmetrical, and the most common value (‘the mode’) is not in the centre of the distribution.

But when we say “the distribution” we might actually be talking about an *idealised distribution* which is close enough to our data, and very useful statistically. This second kind of ‘distribution’ is **a line defined by a mathematical equation.** A straight line might represent a simple distribution, and we can specify a straight line with the mathematical equation y = mx + c. This just means that y (the vertical dimension) is equal to some number (m) times the horizontal dimension (x), plus some constant (c).

Let’s begin with some values for the horizontal dimension (x), below:

To make a line, we can produce the corresponding values for the vertical dimension (y) using an equation of the form y = mx + c. Let’s use y = 2x + 1: ( 3 = two times 1, plus 1; 5 = two times 2, plus 1 ; 7 = two times 3, plus 1, etc)

Now we have our coordinates. Do they make a line?

They do! This line might represent the relationship between all kinds of things, like the number of cars your company owns (x) and total car insurance costs (y).
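The recipe y = 2x + 1 can be sketched in a few lines of Python. This is only an illustration: the x values are an assumption (the worked examples above start at 1, 2, 3), and the function name `line` is made up for this sketch.

```python
# Hypothetical x values; the worked examples above use 1, 2, 3 (giving y = 3, 5, 7).
xs = [1, 2, 3, 4, 5]

def line(x, m=2, c=1):
    """Return y for the straight line y = m*x + c."""
    return m * x + c

# Apply the recipe to every x to get the matching y coordinate.
ys = [line(x) for x in xs]

for x, y in zip(xs, ys):
    print(f"x = {x}, y = {y}")
```

Changing `m` makes the line steeper or flatter, and changing `c` slides it up or down.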

Straight lines are handy, but a lot of the time we want an idealised ‘*normal distribution*‘, which looks like this:

This is not real data from an experiment. It is a fancy line determined by an equation; like y=mx+c but more bendy.

Key points about the normal distribution:

- It is perfectly symmetrical
- The mean, median (middle number) and mode (most common number) are the same number
- The mean/median/mode are in the centre of the distribution – this is why we often say that something is normally distributed “about” the mean. (As in ‘she put her arms *about* him’, not ‘this post is *about* statistics’.)

**This is thrilling – how do I make a normal distribution?**

Just like the equation for a straight line, to get y we need to start with x and then do something to it. Hang onto your hats because here comes the equation defining the recipe for a normal distribution:
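In standard notation the recipe reads as follows. The function name f and the bar notation on the left-hand side are assumptions about how the equation is usually typeset; the right-hand side is the usual normal density.

```latex
f(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \, \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)
```

Here x is the horizontal dimension and f(x | μ, σ) plays the role of y. The numbers 1, 2 and π are constants, so μ and σ are the only ingredients left to choose.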

Do not panic. It is just a fancy line. I’m not going to get into the weeds manipulating this equation but I do want to identify the “ingredients” for a normal distribution, so that you know when you have one. These ‘ingredients’ are the “unknowns” in the equation above. Below I have crossed out all the constants.

We are left with only the squiggly thing and the u looking thing. What are they? They are the parameters of the normal distribution! The squiggly one is pronounced ‘sigma’ – it is the standard deviation. The u thing is pronounced ‘mu’ and it is the mean. The game is actually given away by the left hand side of the equation, which says (in math language) “here is a recipe relating x and y, which by the way depends on the values taken by the parameters mu and sigma.”

So what? This means the form of any normal distribution depends on just two things, the mean (µ, ‘mu’) and the standard deviation (σ, ‘sigma’) – *if you have those two parameters then you have the whole distribution*, because they are all you need to define the entire distribution as per the scary equation above. This means we don’t need to email each other pictures of our favourite distributions; we can just report the mean and standard deviation and anyone today, or in 5 years, or in 500 years can use µ and σ to reconstruct the whole distribution. This is cool! And very convenient!

But beware: once you report a summary of your data using the mean and standard deviation, you are dealing in *idealised distributions and not actual data anymore*. This is a subtlety that becomes quite important later.

Let’s construct a normal distribution using the chicken data above. The average weight of one of these chickens is 261 grams, and I have calculated the standard deviation as 78 grams:

Even though the real distribution is kind of ‘normal’ in shape it is not perfectly normal. The idealised distribution (top one) is symmetrical, and the most common value is in the centre. So when we move to the idealised distribution we have changed the shape of the data a bit.
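As a rough sketch of how the idealised curve can be rebuilt from just the two reported numbers, the Python below evaluates the normal density for a mean of 261 g and a standard deviation of 78 g. The function name `normal_pdf`, the grid of weights and the text-based plot are all choices made for this illustration, not part of the original analysis.

```python
import math

MU = 261.0     # mean chicken weight in grams (from the post)
SIGMA = 78.0   # standard deviation in grams (from the post)

def normal_pdf(x, mu=MU, sigma=SIGMA):
    """Height of the normal curve at x for a given mean and standard deviation."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# Evaluate the curve on a grid of weights, 3 standard deviations either side of the mean.
weights = [MU + k * SIGMA / 2 for k in range(-6, 7)]
curve = [(w, normal_pdf(w)) for w in weights]

# Crude text plot: one '#' per 0.0001 of curve height.
for w, height in curve:
    print(f"{w:6.1f} g  {'#' * round(height * 10000)}")
```

The printed shape is perfectly symmetrical and peaks at 261 g, which is exactly what the real histogram does not quite do.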

When do we ‘move to an idealised distribution?’ It is very common in journal articles to report the mean and standard deviation of numerical data in a table, and not to display the full distribution warts and all. Now that you’ve read this post you understand that you can use this mean and standard deviation to reconstitute the whole distribution, which is super handy, but not the same as getting back to the raw data. Reporting the mean and standard deviation cannot tell your readers what the original distribution of observations looked like. The weights of chickens are not very important, but if we’re talking about the distribution of survival times for cancer patients we need to be more careful.

The normal distribution has some other properties which are crucial for hypothesis testing and procedures of statistical inference. I’ll deal with that in the next post.

Good luck out there!

Taya


The post Hypothesis Testing 1 – Writing a hypothesis appeared first on Survive Statistics.

Statisticians get very worked up about hypothesis testing. Things have to be done in a very particular way and there are a million opportunities for you to stuff it up. Get excited, because here comes my complete guide to hypothesis testing. You’re welcome! This first part is about writing hypotheses.

A ‘hypothesis’ is some statement about how the world works. I might have a hypothesis that “Women weigh less than men, on average”, or “Aliens from space are controlling my brain”. A hypothesis has to be a statement about how things actually *are*, or might be. “People should not murder each other” is not a hypothesis; it’s a statement about how someone thinks the world *should* be. My golden rule to keep you out of trouble is that **every hypothesis is a statement beginning with the word ‘that’**, like these:

- That giraffes are taller than monkeys, on average
- That elderly people sleep less than middle-aged people, on average
- That this drug I have improves average cancer survival times
- That this drug I have does not improve cancer survival times
- That this new drug is no different to this other drug

A hypothesis should only contain *one* ‘that’ – it might sound obvious, but if you have multiple ‘thats’ you have multiple hypotheses: ‘That my brother is older than me and that he likes to play golf’ is two hypotheses. Your hypothesis needs to be bite sized; able to be tested and digested in a single experiment.

Your hypotheses need to be **falsifiable.** What does this mean? It means that there has to be some possible way of unearthing evidence which debunks it, or proves it false. With this in mind, my hypothesis about aliens controlling my brain is no good – there is no experiment we can do (with current technology!) which will disprove this hypothesis. If you can’t disprove a statement, it doesn’t count as a scientific hypothesis.

Your hypotheses need to be **precise in two ways.** “That pizza tastes good” is simply not going to cut it. “Good” is not a thing we can test with an experiment. You need to use very clear language, so subjective terms like ‘good’ and ‘bad’ are out. Usually, you also need to say *who* you’re talking about. It is very unusual that your research population is all of humanity, so include the population in the hypothesis.

A better hypothesis about pizza is: “That pizza is preferred to hot dogs by middle aged, single Scottish men”. This hypothesis *suggests the experiment which could test it*, which is the hallmark of a very precise hypothesis.

You need to avoid **causal overtones.** Unless you are running a randomised, experimental study you are not allowed to suggest that some thing is *causing* some other thing. Ever. Students are often trying to make their hypotheses sound interesting, or important, and so unintentionally introduce language which suggests they are hunting for causal relationships. I have taken a lot of marks off students for this simple mistake, and it is very easy to make by accident. Here are some examples:

- That drinking coffee causes headaches among middle aged women
- That having a baby increases the chances you will not graduate from university
- That depression leads to poor performance on undergraduate exams

None of this language – or any other variation implying that something is the direct result of something else – is ok in any scientific discipline. Weed it out. Here are the improved versions:

- That coffee consumption is associated with higher frequency of headache among middle aged women
- That women who become mothers during undergraduate study graduate at lower rates than women without children
- That depression is associated with poor performance on undergraduate exams

Why do we do this hypothesis hose-down? Science is inherently *conservative* in its statements about how the world works. We know that things which are correlated aren’t always caused by each other, and so to avoid making errors we have set up a fancy list of things you need to do in order to be allowed to claim that the relationship you’ve found is causal. This skepticism is a big part of thinking like a scientist.

Ok, so you know what’s what with hypotheses – congratulations! In the next post we’ll meet the enemy of students everywhere – the “Null Hypothesis.”


The post Sampling Distribution & Central Limit Theorem appeared first on Survive Statistics.

Hopefully you remember that we are in the business of estimating parameters. We use sample data to estimate something about a wider population of people or things. We are obliged to sample because we do not have infinite dollars, and because we are obliged to sample we are obliged to estimate. Say I am interested in average survival time measured in months for people with a particular cancer. The starting point is the distribution of survival time (in months) in the study population, which might look like this:

(All screengrabs are from the fantastic simulation at onlinestatbook.com)

This is a nasty distribution; it doesn’t look like any kind of distribution I know. The average survival time for this pretend cancer, “cancer X”, is around 9.5 months. In the real world we don’t know the true value of average survival time (our ‘population parameter’). If we did, we wouldn’t need to bother doing the study!

Our study is going to take **one** sample from the population pictured above. We will be picking some individuals from the population of “all individuals with cancer X” to provide an *estimate* of the parameter we care about, average survival time. How do we know if the estimate is any good? How do we guard against picking people right at the edge of the distribution, who are not really representative of everyone else?

In order to interpret the results of our experiment we need to be able to say something intelligent about how our estimate of this parameter (the sample average survival) would look if we did our experiment **thousands of times**. Another way to say this is that when we are using a particular statistic as an estimate, we need to know something about the distribution of that statistic on repeated sampling. **The sampling distribution is a distribution of a statistic**, calculated multiple times from different samples from the same population.

Hugely important summary: Our goal is to estimate the value of the parameter “average survival time for patients with cancer X”. To do that, we need to think about the sampling distribution of the statistic we use to estimate it, the sample average. *We were originally talking about the average survival time in a population of patients with cancer, now we are talking about the distribution of the sample average if we were to take a bunch of samples from that population.*

How can we get our heads around this? Thanks to computers we can simulate:

The top picture is our old friend, the distribution of survival time for patients with cancer X. The middle graph is a random sample of 5 observations taken from that distribution. The bottom graph plots the average of those 5 observations, which becomes a part of the sampling distribution (it has just one observation in it right now).

If I keep repeating this process, after 20 samples the sampling distribution looks like this:

The real average survival time for the population is about 9.5 months. The mean of this distribution (of means) is just over 6 months. So far we are not doing an awesome job of estimating the population mean. What if I did 10,000 samples?

Take my word for it, this is very exciting. Once we take a heap of samples, the mean of the sampling distribution (9.56) gets extremely close to the true population mean (9.53). You should head over to onlinestatbook.com and verify this result yourself.

Another interesting thing has happened. The parent population (black one) is skewed, asymmetrical and weird. But the sampling distribution (blue one) is symmetrical and nicely bell-shaped – it is very close to a *normal distribution*. The normal distribution has a lot of very handy properties which allow us to learn things about our experimental results.

For this demonstration I took samples of just 5 patients, which is a really tiny sample. If you can take larger samples, the sampling distribution clusters more tightly around the true mean and looks more like a normal distribution.

- As we take more and more samples from a population, the *average of the sample averages* gets closer and closer to the population average.
- As we take more and more samples, the sampling distribution fills in and, provided each sample is big enough, looks more and more like the classic normal distribution.
- Taking larger samples gets you a better sampling distribution: narrower, and closer to normal.

This is such an important phenomenon that statisticians have given it a special name: the **Central Limit Theorem**.
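If you would rather run the experiment yourself than take my word for it, here is a minimal Python sketch of the same idea. It does not reproduce the simulation in the screengrabs: the skewed population is a made-up one (an exponential distribution with a mean of 9.5 months), and every function name is invented for this example.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

POP_MEAN = 9.5  # months; roughly the population mean quoted in the post

def draw_survival_time():
    """One observation from a skewed, made-up population (exponential, mean 9.5)."""
    return rng.expovariate(1 / POP_MEAN)

def sample_mean(sample_size):
    """Take one sample of the given size and return its average."""
    return statistics.fmean(draw_survival_time() for _ in range(sample_size))

def sampling_distribution(sample_size, n_samples):
    """Repeat the experiment n_samples times and collect the sample means."""
    return [sample_mean(sample_size) for _ in range(n_samples)]

few = sampling_distribution(sample_size=5, n_samples=20)
many = sampling_distribution(sample_size=5, n_samples=10_000)
bigger = sampling_distribution(sample_size=50, n_samples=10_000)

for label, means in [("20 samples of 5", few),
                     ("10,000 samples of 5", many),
                     ("10,000 samples of 50", bigger)]:
    print(f"{label:>22}: mean of means = {statistics.fmean(means):5.2f}, "
          f"spread (sd) = {statistics.stdev(means):4.2f}")
```

With many samples the mean of the sample means should sit very close to 9.5, and with samples of 50 the spread of the sampling distribution should be much smaller than with samples of 5.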

Whether or not students understand statistical inference tends to hinge on whether or not they understand the content of this post. Take your time, these ideas are central to all statistical inference.

Soon we’re going to use the central limit theorem to look at an experiment, and learn about the most confusing statistic of all time; the p value.

Until then try this easy quiz on sampling distributions.

Good luck out there!

Taya


The post Statistical Basics – What is a Random Variable? appeared first on Survive Statistics.

If I toss a normal coin there is a 50% chance it will land on ‘tails’. This underlying probability applies no matter how many times I toss the coin. Tossing coins is boring, but the important part to remember is that for variables like the outcome of a coin toss, what we call “chance” is fully defined mathematically as being 50% heads and 50% tails. When all the possibilities are defined neatly like this we say we have a Probability Distribution. The distribution for tossing a coin looks like this:

Note that the options sum up to 100%

Each time we toss the coin we are really drawing an observation from the above probability distribution. Statisticians like to call this a “trial”. Trial means “consult the distribution and see what we get”.

If I randomly select a person from the Australian population, there is a 50.2% chance that it is a female person. Just like coin tossing, each time I select a person I am going to the underlying probability distribution (50.2% Female person / 49.8% male person) and asking for it to spit out an “observation”. The data that ends up in my imaginary spreadsheet about gender is determined by the underlying probability distribution. Any variable that operates like this is called a “random variable”.
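To make “consult the distribution and see what we get” concrete, here is a small Python sketch. The 50.2% / 49.8% split comes from the paragraph above; the function name `trial`, the seed and the number of trials are choices made for this illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# The probability distribution from the post: outcome -> probability.
distribution = {"female": 0.502, "male": 0.498}

def trial():
    """One trial: consult the distribution and see what we get."""
    outcomes = list(distribution)
    weights = list(distribution.values())
    return rng.choices(outcomes, weights=weights, k=1)[0]

# Fill the imaginary spreadsheet with 10,000 observations.
observations = [trial() for _ in range(10_000)]
share_female = observations.count("female") / len(observations)
print(f"Share of 'female' in 10,000 trials: {share_female:.3f}")
```

No single observation can be predicted in advance, but the share of each outcome over many trials settles near the underlying probabilities.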

Understanding that what we observe in the world is the result of underlying probability distributions is a big moment in your stats journey, so let this sink in a bit.

When you line up at the bakery to get something yummy like a doughnut (I love doughnuts!) sometimes you have to take a ticket at the counter. If we asked a bunch of customers at the bakery “What is the number on your ticket?” there would be a different response for each customer, so this is a variable (numerical / discrete).

The number on the ticket is *not* drawn from a probability distribution. It is not random in any way. If we were to take 100 tickets in a row we are not consulting the probability distribution each time – we *know* which ticket is next because it is the value of the current ticket + 1, that’s the whole point of the ticket system! The hallmark of a non-random variable is that you know what you are getting before you get it.

Almost all variables we use in research can be considered “random” – but it is very important you spot the things that are determined entirely by some formula or equation, because they’re a different kettle of doughnuts.

Good luck out there,

Taya


The post Statistical Basics – Variables appeared first on Survive Statistics.

Here we have two variables, “Name” and “Age”, containing the answers to “What is your name?” and “What is your age?” Don’t forget that the name of a variable is not the whole story – “Age” could mean age today, age at the time of a cancer diagnosis or the age of your car. Make sure you understand the real question each variable is answering.

So; ask a question, get a variable? Not quite. If the question I asked was “What is *my* name?” the answer would be “Taya” no matter how many people I ask. My name is not a variable, it is a *constant*.

So whether something is a variable depends on the question you are asking. The answer to “In what order do the days of the week occur?” is a constant – thank goodness – Wednesday always comes after Tuesday. But the answer to “What day is it today?” is a variable (if the study goes for more than one day) because the answer will change depending on when the question is asked.

We have developed special names for certain kinds of variables to frustrate statistics students.

If the answer to your question is a number (“What is your shoe size?”) then the variable is numerical, it is made of numbers. If the answer is a word (“What kind of dog do you have?”) you have categorical data. Don’t stress about this terminology too much.

I cannot directly analyse “Today is Tuesday” or “Andrew has blue eyes” with statistical techniques. If it is not a number already, each answer to our question needs to be converted into a number in order for us to analyse it. This conversion process is called “coding”. How a variable is coded depends on the kind of data in it. Let’s look at a few variables to see how this works.

Categorical Data (Data that is not numbers)

**Ordinal Variable** – Annoying surveys often ask you to answer with the options “Strongly Disagree”, “Disagree”, “Neutral”, “Agree” or “Strongly agree”. This data has a special structure, which shows up when the answers are coded from 0 “Strongly Disagree” to 4 “Strongly agree”:

0 = Strongly Disagree

1 = Disagree

2 = Neutral

3 = Agree

4 = Strongly agree

The numerical coding reflects *a real hierarchy in the data*. 4 is greater than 3; “Strongly Agree” is greater than “Agree”. It’s beautiful! What we never do is something like this:

0 = Strongly Disagree

2 = Disagree

1 = Neutral

4 = Agree

5 = Strongly agree

This is no good. The numerical coding has destroyed the hierarchy present in the data.

**Nominal Variable** – Sometimes there is no hierarchy in categorical data. If eye colour was coded 0 “Blue” 1 “Green” 2 “Brown” we have to arbitrarily choose which option gets which number. It doesn’t matter whether Blue eyes is zero, or one, or two, because there is no hierarchy in eye colour.

Numerical Data (Data that is numbers)

**Continuous Variable** – This concept is tough, but you’re going to be fine because we’ve already set up the concept of a variable as *the answer to a question*.

Some questions have got *a lot* of answers. If you asked 100 people “What is the street number of your house?” you might get close to 100 different answers. This is a bit annoying but it’s not going to break your statistical software.

Some questions have got an *infinite* number of answers. Literally, no finite list of values can capture all the possibilities. It might surprise you to learn which variables fit into this category: height, weight and age, for example.

This is because 1.543 metres *is not the same* as 1.5429 metres. These are different numbers. 1.54299 is different again. So is 1.542999 metres. I think you see what I’m saying, there is an unlimited number of numbers available to us to represent someone’s height. The same is true for weight and age (and blood pressure, and a lot of other medical measurements). In practice we are stuck with a more limited number of options but this doesn’t change the fact that the variable itself has infinite possibilities. Please ask a question about this in the comments if you are confused. A good rule of thumb is that almost all measurements are continuous.

Who cares if a variable is continuous? You don’t need to care very often. But when we start to talk about the probability of particular weights and heights this theoretical detail becomes extremely important.

Continuous variables do not require coding as they are always numeric.

**Discrete Variable** – All continuous variables are numeric, but *not all numeric variables are continuous.*

“How many children do you have?” does not have an infinite number of answers between 2 and 3; the possible answers are separate, ‘discrete’ values. You cannot have 2.7 children, it’s either 2 or 3. You might be thinking “Ok – whole numbers means discrete variable” but this is a trap. What about the variable “Shoe size”? This is often a number like 6.5 or 10.5, but there are not an infinite number of shoe sizes that exist, that would be the end of shoe manufacturing as we know it!

Discrete variables do not need coding because they are numeric

**(Special Case) Binary Variable** – If a question only has two possible answers it is called a ‘binary’ variable. These can be numeric or categorical, if there are two answers it is binary. Categorical binary variables are coded as 1 and 0 for useful reasons we won’t go into today. The variable below “Do you have a pet?” is binary (you’ve either got a pet or you don’t).

So that wraps up our chat about variables. Once we start building models and doing fancy things you’ll realise how important this theory is. Test your understanding with the quick quiz below, all the answers are in this post.

Good luck out there,

Taya


The post Statistical basics – Mean & Median appeared first on Survive Statistics.

You probably already know what the mean and median are, but there is actually a lot to learn here. Confusion about mean and median can derail more advanced statistical topics, or worse, cause you to draw crazy conclusions about your data. Let’s go.

To find the mean (or ‘average’) of some numbers, add them up and divide by the number of items. So the average of 1, 5, 8, and 10 is ( 1 + 5 + 8 + 10) / 4 = 24/4 = 6. We divided by 4 because there are 4 numbers.

Whenever you encounter a mean in the wild, I want you to visualise it like this:

Not very impressive is it? The mean is a one-dimensional thing. It does not tell you anything about the spread of the data, it just gives you a suggestion about where the centre might be. Never forget that it is just a suggestion.

You can think about an average as a centre of mass:

Where is the centre of mass on this broom? Is it the blue line or the pink cross? Go and find a broom if you’re unsure. It is the pink cross. Why? Because the heavy end of the broom pulls the centre of mass away from the centre of the broom toward itself. It turns out the same is true for data…

(This section is based on this excellent post by Michelle Paret)

If I told you that the average starting salary for geography students at a particular university in the 1980s was over $80,000, you might think you’re doing the wrong degree. But let’s look at the data used to produce this average:

Major Lesson: The mean might represent nobody.

Michael Jordan is the heavy end of the broom here. The presence of Michael Jordan has nearly doubled the ‘average’ income (without him it is around $40,000). We call observations like this ‘outliers’. MJ’s income is so much larger than everyone else’s that there might be an argument for excluding him from analyses about the income of geography students.

Note: This is not me giving you free rein to start deleting data that doesn’t look the way you want. Pruning inconvenient observations from your data is *statistical skullduggery* and *not ok*, unless you are confident that either a) the data is incorrect (e.g. measurement error) or b) there is some factor producing the weird observations which is going to undermine the aim of your study. What looks like an ‘outlier’ might really be a hint that you do not have enough statistical power. So always tell your readers if any observations were excluded and why.

The median is also a one-dimensional measure of central tendency. Think of it as the ‘geometric centre’ of the data (not to be confused with the geometric mean, which is a different statistic). The median is the centre of the broom (the blue line):

To get the median of some data, arrange it from smallest to largest and pick the number in the middle. So the median of 1, 5, 6, 8, and 10 is 6. The interesting thing is that the median of 1, 5, 6, 8, and 100,000 is also 6, because *the median is insensitive to outliers.* Just like adding weight to the end of a broom doesn’t change its geometric centre, adding huge observations to the end of a distribution does not change the median. This makes it a very handy statistic.
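You can check the two lists above with Python’s built-in `statistics` module. This is just a quick sketch; the variable names are made up for the example.

```python
import statistics

ordinary = [1, 5, 6, 8, 10]
with_outlier = [1, 5, 6, 8, 100_000]

# Compare the two summaries on each list.
for label, data in [("ordinary", ordinary), ("with outlier", with_outlier)]:
    mean = statistics.mean(data)
    median = statistics.median(data)
    print(f"{label:>12}: mean = {mean}, median = {median}")
```

The median stays at 6 for both lists, while the mean jumps from 6 to 20,004 once the huge observation is added.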

Calculating the difference between the mean and median can tell you a lot about your data.

We don’t expect the price of diamonds to be a bell-shaped distribution. Most diamonds are kind of affordable, but a small number of diamonds are really expensive. These are not ‘outlier’ Michael Jordan diamonds because there is no gap between the ‘super expensive’ diamonds and other diamonds. We call this a ‘tail’ – see how the data on the right side of the histogram looks like a tail? Data that has a tail on the right like this is called ‘right-skewed’ data. Below is a right-skewed dinosaur.

A good hint that data is right-skewed is a mean that is bigger than the median. Why? Because if the mean is to the right of the median there must be at least one big number pulling the mean *up*. If the mean is to the left of the median there must be at least one tiny number pulling the mean *down*. We call this ‘left-skewed’.
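The mean-versus-median check is easy to sketch in Python. Treat it as a rough first look rather than a formal test of skewness. The prices below are made up for illustration and the function name `skew_hint` is hypothetical.

```python
import statistics

def skew_hint(data):
    """Rough first check only: compare the mean with the median."""
    mean = statistics.mean(data)
    median = statistics.median(data)
    if mean > median:
        return "mean > median: possible right skew"
    if mean < median:
        return "mean < median: possible left skew"
    return "mean == median: no skew suggested by this check"

# Made-up prices for illustration: mostly modest, with a long right tail.
prices = [300, 350, 400, 450, 500, 700, 1200, 2500, 6000]
print(skew_hint(prices))
```

A histogram is still the better way to judge the shape; this check only tells you which way the mean is being pulled.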

Who cares? Skewed data is inconvenient because it is not appropriate to do some kinds of analysis on skewed data. So checking out the mean and median is a great first step in any analysis plan.

In addition to diagnosing skewness, when we have some data we need a way to get a handle on and communicate the *distribution* of a thing. When was the last time you heard a news report about scientific research and the reporter started reciting raw data at you? Never, because that is a terrible way to talk about data. We need a way to quickly communicate our findings: ‘On average women live X years longer than men’ is something people can quickly understand. The gold standard for ‘getting a sense of the distribution’ is to visualise the data, but this is not always feasible and graphs take up lots of room in reports. So we often summarise a whole lot of data with a single summary statistic like the mean or median. If someone is reporting a median it is probably because their data is skewed.

It is really important that you understand the strengths and weaknesses of the mean and median, because if you don’t you can end up doing a geography degree because you want an NBA player’s salary.

Good luck out there!

Taya


The post Statistical Basics – Parameter Estimation appeared first on Survive Statistics.

Statistical textbooks often contain a paragraph like this somewhere toward the front:

“The fundamental problem in statistical estimation is to choose an appropriate value for some population parameter, theta. The true value of theta is unknown as it is unlikely that we will be able to observe the entire population (an example of theta could be the population mean μ). Using a (random) sample from the population, we choose a method to calculate an approximation of the parameter of interest – this provides us with an estimator of the parameter. The result, based on the observed sample using this estimator is known as the estimate for theta.”

It is quite possible to get through an introductory statistics unit and not understand parameter estimation at all. Often it is the boring first slide in a boring first lecture about sampling. But the truth is that **parameter estimation is at the heart of all research.** Everything we do in the lab, in a research hospital, in the field, is designed to give us the best estimate of something we are interested in, so that we might gain some insight into “the truth”. You might have heard about “bias” that can happen in research, especially medical studies where it is not easy to control differences. Bias is only a problem because it impairs our ability to estimate parameters accurately, but I’m getting a bit ahead of myself here.

For now, ‘parameter’ is just a fancy word that means ‘useful number’. Knowing the values taken by certain parameters helps us live our lives, improve our health, and do just about everything else we do. What is the weather forecast for tomorrow? That is a parameter which can take a range of reasonable values. Is it raining right now? That is a binary parameter which can take the values 0 (not raining) and 1 (raining, grab your umbrella). How much stronger is steel than iron? If I splice this gene into this mouse, what happens to the chance the mouse gets cancer? Almost everything we do in research is related to discovering the value of some parameter.

Parameters can be functions of other parameters, and this is where it gets confusing. What is the average life expectancy for patients with stage 4 melanoma? That is a parameter. What is the *difference* in life expectancy between patients with stage 3 and stage 4 melanoma? That is a parameter too! The first job of a statistician on a new project is often to get to the bottom of *exactly* what is being estimated – is it the average of one group? The difference between two groups? The slope of a regression line? They are all parameters.

When we say “this is an estimate” we usually mean “this might be wrong”. So why do we have to estimate things when we conduct research? Aren’t we collecting actual data and measurements? It is important to understand the answers to these questions. To do so, we need to take a big-picture view of what we are actually doing when we try to answer a question using observed data.

I like to tell my students that the “all-knowing statistical god” knows the values of all parameters. But we don’t. The job of the statistician is to estimate the unobservable. This is an exciting, and mysterious thing to do!

Say I am interested in the average height of 5-year-old children. What is important to grasp is that somewhere out there in the universe is a number which represents the real sum of the heights of all 5-year-old children that exist, divided by the number of children. The number answering our question exists, but it is known only to the ‘all-knowing statistical god’, and not to us. If we knew this number we wouldn’t be bothering to conduct the research! Unless we have access to unlimited research dollars (ha) it is not going to be possible to *calculate* this number directly. We cannot actually measure all the children that exist and sum their heights to calculate an average. And even if we did have the resources, by the time we got to the last child the first one would probably already be taller. So we have to choose a subset of children, measure them, and *estimate* the population average using data from our sample.

**Key point: Because we have limited resources we are obliged to sample. Because we are obliged to sample we are obliged to estimate.**

The sample data is our *observed* data. It is not the same as the actual value of the parameter we are interested in. We might have sampled from a particular neighbourhood of giant children, and so our estimate will be way off. Understanding the relationship between your research sample and your research population is very important.
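To make this concrete, here is a small simulation sketch. Everything in it is made up for illustration: the population of 100,000 children, the mean of 110 cm and the standard deviation of 5 cm are assumptions, not real data. It compares the "true" population average with an estimate from a random sample, and with an estimate from a sample drawn only from the tallest children.

```r
set.seed(1)  # make the simulation reproducible

# Hypothetical population: heights (cm) of 100,000 five-year-olds.
# The mean of 110 and sd of 5 are made-up values for illustration.
population <- rnorm(100000, mean = 110, sd = 5)

# The parameter: in real life this is known only to the statistical god
true_mean <- mean(population)

# A random sample of 50 children gives us an estimate of the parameter
random_sample <- sample(population, size = 50)
estimate <- mean(random_sample)

# A sample drawn only from the tallest 10% (the "giant children" neighbourhood)
giants <- population[population > quantile(population, 0.9)]
biased_estimate <- mean(sample(giants, size = 50))

round(c(truth = true_mean, random = estimate, biased = biased_estimate), 1)
```

The random sample should land close to the truth, while the sample of giants will be too high no matter how carefully we measure each child.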

I am interested in the average weight of Labrador dogs. My population is:

- Labrador dogs in my city
- All dogs
- All Labrador dogs
- Labrador dogs in my country
- Labrador dogs belonging to people I know who will let me do research on them

Hopefully you agree that my population is all Labrador dogs. If I am only interested in dogs from my city I need to specify this in the aims of my study. More importantly, when I get to the exciting part of declaring my estimate for the average weight of Labradors I need to remind my readers that *this estimate only applies to dogs from a certain place* – it is a common mistake to forget about the population from which you have sampled. If you only sample women, your results only apply to women. If you only sample people aged 30 – 50, you cannot apply these results to the elderly. This is what people mean when they refer to the “generalisability” of results.

We are using data about a small group of people to estimate data about a very large group of people. Does this sound ideal? Or even sensible? Once you start to think about it, this whole parameter estimation idea seems a bit strange.

Thankfully, generations of statisticians before you have literally obsessed over the tendency of sample data to give you a decent estimate of population data. The things they have discovered will be the topic of the next post.

Good luck out there,

Taya

This post has been summarised into a handy A4 cheat sheet. You’re welcome!

The post Statistical Basics – Parameter Estimation appeared first on Survive Statistics.

The post Sample Size and Power Calculations Explained appeared first on Survive Statistics.

*“I was told by my advisor to conduct a sample size / power analysis, because we want to state how many subjects we will need… I approached a statistical consultant at my university, and he told me that I needed to run a pilot study… I really don’t have the time or money to do that … how in the world is this done, ever? Can someone tell me how I should proceed? I really just want to know how many subjects to say I will need in my proposal.”*

So how in the world is this done, ever? I think the best way to learn about statistical power and required sample size is to use an analogy about looking for things in your own house:

Imagine you are in your kitchen and you need a hammer for something. You have a 5 year old son, so you say “Could you go to the basement and get the hammer for me?”

After a while he comes back and says **“It’s not there”**. As an adult you know that there are really two possibilities here – either *the hammer is down there and he missed it*, or *the hammer really isn’t there*. Statistical power is all about managing these two probabilities in your experiments, because experiments in epidemiology are really just hunts in the basement for things like “differences between groups” or “evidence this exposure is related to that outcome”.

As a scientist you need to be able to unpack what is meant by the statement “It’s not there”. Did your experiment fail to detect something? Or is there nothing to detect? We shouldn’t be concerned about experiments which find no result *because there is really no result*. But how can we design an experiment that is safeguarded against *missing* a result?

There are four ingredients to any power analysis, and they apply to looking for hammers, too:

- Effect size
- Sample size (N)
- Alpha significance criterion (α)
- Statistical power, or implied beta (β)

If we’re searching for a giant novelty 10ft hammer then we don’t have much to worry about. Similarly, if we are expecting a treatment to produce a huge effect then we shouldn’t have to work too hard to find it with an experiment. “Huge treatment effect” translates to “big gap between the peaks” on the figure below.

If these red and blue distributions represent *what is really going on*, this is a nice tidy situation: there is fresh air between the means of the distributions. The groups don’t overlap much, so we won’t need to supercharge an experiment in order to detect a difference like the one above. Compare this with the figure below, which depicts a much more difficult situation.

It also matters how tidy your basement is, because looking for something when things are all on top of each other is not easy.

If we’re looking for a difference in a continuous measure (like weight, blood pressure, or some other thing) we also need to consider the degree of messiness going on. This relates to the variance of the thing you’re measuring.

In summary, if your experiment is looking for a *difference between groups* you need to have a sense of how much the distributions are overlapping. They might overlap because the effect size is small (looking for small thing: tiny hammer) or because the variance of your measurement is large (identifying things is hard: messy basement).
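One rough way to put a number on "how much the distributions are overlapping" is sketched below. It assumes both groups are normally distributed with the same standard deviation, in which case the shared area under the two curves is 2 × Φ(−|difference| / (2 × sd)). The function name `overlap` and the example values are made up for illustration.

```r
# Overlap between two normal distributions that share the same sd.
# Returns a proportion: 1 means identical curves, values near 0 mean
# there is "fresh air" between them.
# Assumes equal sd in both groups; with unequal sds this formula does not apply.
overlap <- function(mean1, mean2, sd) {
  2 * pnorm(-abs(mean1 - mean2) / (2 * sd))
}

overlap(0, 2, sd = 1)    # big effect, tidy basement: little overlap
overlap(0, 0.5, sd = 1)  # small effect (tiny hammer): lots of overlap
overlap(0, 2, sd = 4)    # big effect, noisy measurement (messy basement): lots of overlap
```

The last two calls give the same answer, which is the point: a small effect and a noisy measurement make life difficult in exactly the same way.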

It is important to understand that in practice neither the effect size nor the shape of the relevant distributions is actually knowable by you (otherwise why bother doing the experiment?). We have to take a guess at what the distributions are likely to look like – by talking to experts, or looking for studies that have measured something similar. Another popular approach is to say “what is the smallest difference I care about?” and acknowledge that any difference smaller than that will probably be missed by your study. For example, if it turns out that a treatment increases life expectancy by 1 day on average this might not be worth discovering, but we might want to pick up a difference of one month, or one year.

Thanks to the Central Limit Theorem (post on that coming up), the bigger the sample, the better job it does at revealing the truth to you. This is sort of analogous to the amount of time you’re spending in the basement looking for that hammer. If you just pop down for two minutes you’re less likely to find it than if you spend 3 hours down there. Having more people in your study is akin to a longer search of the basement. Usually sample size is the unknown in the equation – you need to know how many subjects to use in order to have a good chance at discovering a result. But not always: sometimes sample size is fixed and we need to know how much statistical power we have access to. For example, if you have a tissue bank of 15 donor brains then you have 15 brains and that is it – the question becomes “Can I get acceptable power given the sample size I have?”
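A quick simulation sketch of "bigger sample, better estimate" is below. The population values (mean 100, sd 15) and the helper name `spread_of_estimates` are assumptions made up for illustration. It repeats a study many times and measures how much the sample average bounces around from one study to the next.

```r
set.seed(2)  # make the simulation reproducible

# Spread of the sample mean across many repeated studies of size n.
# Population values (mean 100, sd 15) are made up for illustration.
spread_of_estimates <- function(n, reps = 2000) {
  estimates <- replicate(reps, mean(rnorm(n, mean = 100, sd = 15)))
  sd(estimates)
}

small_study <- spread_of_estimates(n = 10)
large_study <- spread_of_estimates(n = 1000)

round(c(n10 = small_study, n1000 = large_study), 2)
```

The estimates from the large studies should cluster much more tightly around the truth than the estimates from the small ones.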

OK, this is where things get a bit metaphysical. In every study there is a chance that we’re going to conclude (incorrectly) that we found something. We’re going to say “I found the hammer!” but what we really found was something else, maybe a wrench. We’re going to say “this drug extends survival!” but really we managed to sample people (by chance) so that survival was worse in the placebo group. The drug actually does nothing.

‘Alpha’ is the proportion of times that an experiment designed exactly like this one will “find something” that isn’t really there. It is a long-run frequency. If alpha = 0.05 and there is truly nothing to find, then repeating the experiment 100 times means we are comfortable “finding” something that isn’t there about 5 times out of 100. If we want more certainty (that is, we want alpha below 0.05) while keeping the same power, sample size will need to increase. Alpha is usually 0.05, but not always.
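The long-run frequency idea can be checked with a simulation sketch like the one below. It assumes a two-group comparison analysed with a t-test, with 30 subjects per group; those choices and the function name `null_experiment` are made up for illustration. The drug does nothing in this simulated world, so every "significant" result is a false alarm.

```r
set.seed(3)  # make the simulation reproducible

# One experiment where the drug truly does nothing:
# both groups come from the same distribution.
null_experiment <- function(n_per_group = 30) {
  placebo <- rnorm(n_per_group, mean = 0, sd = 1)
  drug    <- rnorm(n_per_group, mean = 0, sd = 1)
  t.test(drug, placebo)$p.value
}

p_values <- replicate(5000, null_experiment())

# Proportion of experiments that "found the hammer" even though there was none
false_positive_rate <- mean(p_values < 0.05)
false_positive_rate
```

The proportion should come out close to 0.05, which is exactly what setting alpha = 0.05 promises.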

Just as we can find a result that doesn’t really exist, we can miss a result that does exist (not find the hammer in the basement). Again imagining a scenario where we repeat the experiment 100 times, the proportion of times we fail to find a result that exists is beta (β). Therefore, by simple arithmetic, the proportion of times we make a *correct conclusion* when a result really exists = 1 – β. We call (1 – β) “Statistical Power”. Power of 80% (beta = 0.20) is normal in most fields.
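Here is the matching simulation sketch for power, this time in a world where the hammer really is in the basement. The effect of half a standard deviation, the 64 subjects per group and the function name `real_effect_experiment` are assumptions chosen for illustration.

```r
set.seed(4)  # make the simulation reproducible

# One experiment where the effect is real:
# the treated group is shifted up by 0.5 standard deviations.
real_effect_experiment <- function(n_per_group = 64, effect = 0.5) {
  control <- rnorm(n_per_group, mean = 0, sd = 1)
  treated <- rnorm(n_per_group, mean = effect, sd = 1)
  t.test(treated, control)$p.value
}

p_values <- replicate(2000, real_effect_experiment())

power <- mean(p_values < 0.05)  # proportion of experiments that found the hammer
beta  <- 1 - power              # proportion that came back saying "it's not there"
c(power = power, beta = beta)
```

With these made-up numbers the simulated power should land somewhere near the conventional 80%.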

Unfortunately there is a ‘trade off’ between preventing false positives and producing false negatives. Understanding why requires a reasonably solid understanding of how sampling distributions work under different hypotheses, so I’m going to leave that explanation for another time. But the intuition is pretty simple; the more time we spend in the basement looking for stuff (trying to avoid missing things that are really there), the more likely it becomes we’re going to find something that isn’t real. After a while everything down there starts to look like a hammer!
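The trade-off can be seen numerically with R's built-in `power.t.test()`. In the sketch below the study design is held fixed (30 per group, a difference of 0.5 with sd of 1; all illustrative values, not recommendations) and only alpha changes. The helper name `power_at_alpha` is made up.

```r
# Same study design each time; only the alpha criterion changes.
# n, delta and sd are illustrative values, not recommendations.
power_at_alpha <- function(alpha) {
  power.t.test(n = 30, delta = 0.5, sd = 1, sig.level = alpha)$power
}

alphas <- c(0.10, 0.05, 0.01, 0.001)
powers <- sapply(alphas, power_at_alpha)

round(data.frame(alpha = alphas, power = powers, beta = 1 - powers), 3)
```

As alpha gets stricter (fewer false positives), power falls and beta rises (more false negatives), unless we add more subjects.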

So, to discover how many subjects to include, you need to decide the smallest effect size you want to detect, make an assessment of the variation in the thing you’re measuring, think about what kind of probability you can accept re: finding something that is not real, and the kind of probability you want re: missing something that *is* real. Then you need to plug all this into a calculator (see below), statistical package, or human with statistical know-how. The exact formula will depend on the kind of study you are designing.
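For the common case of comparing the means of two groups, the "plug it all in" step might look like the sketch below, using `power.t.test()` from base R. The difference of 5 units, the standard deviation of 10 and the 15 subjects per group are hypothetical numbers; other study designs need other formulas.

```r
# How many subjects per group? Leave n out and supply the other ingredients.
needed <- power.t.test(
  delta     = 5,     # smallest difference worth detecting (hypothetical units)
  sd        = 10,    # guess at the variation of the measurement
  sig.level = 0.05,  # alpha
  power     = 0.80   # 1 - beta
)
n_per_group <- ceiling(needed$n)  # round up to whole subjects

# Sample size fixed (say 15 per group)? Leave power out instead.
available <- power.t.test(n = 15, delta = 5, sd = 10, sig.level = 0.05)
power_with_15 <- available$power

n_per_group
round(power_with_15, 2)
```

Whichever ingredient you leave out is the one the function solves for, so the same call covers both "how many subjects do I need?" and "how much power do I have?".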

Good luck out there!

Taya

Calculator for difference of group means

Calculator for a single population mean

R code for the plots is here:

library(ggplot2)

ggplot(data.frame(x = c(-7, 7)), aes(x)) +

stat_function(fun = dnorm, colour = "cornflowerblue", args = list(mean = 2, sd = 1)) +

stat_function(fun = dnorm, colour = "rosybrown", args = list(mean = 0, sd = 1)) +

theme_classic()
