This post has been summarised into a handy cheat sheet – you’re welcome!
You probably already know what the mean and median are, but there is actually a lot to learn here. Confusion about mean and median can derail more advanced statistical topics, or worse — cause you to make crazy conclusions about your data. Let’s go.
To find the mean (or ‘average’) of some numbers, add them up and divide by the number of items. So the average of 1, 5, 8, and 10 is (1 + 5 + 8 + 10) / 4 = 24 / 4 = 6. We divided by 4 because there are 4 numbers.
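If you like to see arithmetic as code, here is a minimal Python sketch of the same calculation. The function name `mean` and the variable names are my own choices for illustration; Python's standard library already offers `statistics.mean`, which is used here only as a cross-check.

```python
from statistics import mean as stats_mean


def mean(values):
    """Add the numbers up and divide by how many there are."""
    return sum(values) / len(values)


numbers = [1, 5, 8, 10]

# (1 + 5 + 8 + 10) / 4 = 24 / 4
result = mean(numbers)
print(result)

# The standard-library version should agree with the hand-rolled one.
print(stats_mean(numbers) == result)
```

Note that the hand-rolled version fails on an empty list (division by zero), which is a fair reminder that a mean of nothing is not defined.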
Whenever you encounter a mean in the wild, I want you to visualise it like this:
Not very impressive, is it? The mean is a one-dimensional thing. It does not tell you anything about the spread of the data; it just gives you a suggestion about where the centre might be. Never forget that it is just a suggestion.
You can think about an average as a centre of mass:
Where is the centre of mass on this broom? Is it the blue line or the pink cross? Go and find a broom if you’re unsure. It is the pink cross. Why? Because the heavy end of the broom pulls the centre of mass away from the centre of the broom toward itself. It turns out the same is true for data…
(This section is based on this excellent post by Michelle Paret)
If I told you that the average starting salary for geography students at a particular university in the 1980s was over $80,000, you might think you’re doing the wrong degree. But let’s look at the data used to produce this average:
Major Lesson: The mean might represent nobody.
Michael Jordan is the heavy end of the broom here. His presence has nearly doubled the ‘average’ income* (without him it is around $40,000). We call observations like this ‘outliers’. Because MJ’s income is so much larger than everyone else’s, there might be an argument for excluding him from analyses about the income of geography students.
Note: This is not me giving you free rein to start deleting data that doesn’t look the way you want. Pruning inconvenient observations from your data is statistical skullduggery and not ok, unless you are confident that either a) the data is incorrect (e.g. measurement error) or b) there is some factor producing the weird observations which is going to undermine the aim of your study. What looks like an ‘outlier’ might really be a hint that you do not have enough statistical power. So always tell your readers if any observations were excluded and why.
The median is also a one-dimensional measure of central tendency. You can think of it as the geometric centre of the data (not to be confused with the ‘geometric mean’, which is a different statistic altogether). In broom terms, the median is the centre of the broom (the blue line):
To get the median of some data, arrange it from smallest to largest and pick the number in the middle (if there is an even number of items, take the average of the two middle numbers). So the median of 1, 5, 6, 8, and 10 is 6. The interesting thing is that the median of 1, 5, 6, 8, and 100,000 is also 6, because the median is insensitive to outliers. Just like adding weight to the end of a broom doesn’t change its geometric centre, adding huge observations to the end of a distribution does not change the median. This makes it a very handy statistic.
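To see the broom effect in numbers, here is a small Python sketch using the two lists from the paragraph above and the standard library's `statistics.mean` and `statistics.median`. The variable names are just illustrative.

```python
from statistics import mean, median

ordinary = [1, 5, 6, 8, 10]
with_outlier = [1, 5, 6, 8, 100_000]  # same list, but the last value is huge

# The median only cares about the middle position, so it does not move.
median_ordinary = median(ordinary)
median_outlier = median(with_outlier)

# The mean uses every value, so the huge observation drags it upward.
mean_ordinary = mean(ordinary)
mean_outlier = mean(with_outlier)

print("median:", median_ordinary, "->", median_outlier)
print("mean:  ", mean_ordinary, "->", mean_outlier)
```

The median stays at 6 in both cases, while the mean jumps from 6 to 20,004. One big number is all it takes.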
Calculating the difference between the mean and median can tell you a lot about your data.
We don’t expect the price of diamonds to be a bell-shaped distribution. Most diamonds are kind of affordable, but a small number of diamonds are really expensive. These are not ‘outlier’ Michael Jordan diamonds because there is no gap between the ‘super expensive’ diamonds and other diamonds. We call this a ‘tail’ – see how the data on the right side of the histogram looks like a tail? Data that has a tail on the right like this is called ‘right-skewed’ data. Below is a right-skewed dinosaur.
As a rule of thumb, data is right-skewed when the mean is bigger than the median. Why? Because if the mean is to the right of the median, there must be at least one big number pulling the mean up. If the mean is to the left of the median, there must be at least one tiny number pulling the mean down. We call this ‘left-skewed’.
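Here is a rough Python sketch of that mean-versus-median check. The function name `skew_hint` and the three small data sets are made up purely for illustration (they are not the diamond data), and comparing the mean with the median is only a quick first look, not a formal measure of skewness.

```python
from statistics import mean, median


def skew_hint(values):
    """Compare mean and median to guess which way the data leans."""
    centre_of_mass = mean(values)
    middle_value = median(values)
    if centre_of_mass > middle_value:
        return "right-skewed"   # big values are pulling the mean up
    if centre_of_mass < middle_value:
        return "left-skewed"    # tiny values are pulling the mean down
    return "no skew suggested"


# Hypothetical data, invented for this example only.
long_right_tail = [1, 2, 2, 3, 3, 4, 20]
long_left_tail = [1, 17, 18, 18, 19, 19, 20]
symmetric = [1, 2, 3, 4, 5]

for name, data in [
    ("long_right_tail", long_right_tail),
    ("long_left_tail", long_left_tail),
    ("symmetric", symmetric),
]:
    print(name, "->", skew_hint(data))
```

A check like this is cheap, so it makes sense to run it before anything fancier, and then look at a histogram to confirm what the numbers suggest.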
Who cares? Skewed data is inconvenient because some kinds of analysis (for example, those that assume the data is roughly bell-shaped) are not appropriate for it. So checking out the mean and median is a great first step in any analysis plan.
In addition to diagnosing skewness, when we have some data we need a way to get a handle on and communicate the distribution of a thing. When was the last time you heard a news report about scientific research and the reporter started reciting raw data at you? Never, because that is a terrible way to talk about data. We need a way to quickly communicate our findings: ‘On average women live X years longer than men’ is something people can quickly understand. The gold standard for ‘getting a sense of the distribution’ is to visualise the data, but this is not always feasible and graphs take up lots of room in reports. So we often summarise a whole lot of data with a single summary statistic like the mean or median. If someone is reporting a median it is probably because their data is skewed.
It is really important that you understand the strengths and weaknesses of the mean and median, because if you don’t you can end up doing a geography degree because you want an NBA player’s salary.
Good luck out there!