### Statistical basics – Normal distribution Part 1

This post has been summarised into a handy cheat sheet – you’re welcome!

You probably already know what the mean and median are, but there is actually a lot to learn here. Confusion about the mean and median can derail more advanced statistical topics or, worse, cause you to draw crazy conclusions about your data. Let’s go.

To find the mean (or ‘average’) of some numbers, add them up and divide by the number of items. So the average of 1, 5, 8, and 10 is ( 1 + 5 + 8 + 10) / 4 = 24/4 = 6. We divided by 4 because there are 4 numbers.
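If you would rather let a computer do the adding up, here is a minimal sketch in Python using the same four numbers. The variable names are just illustrative, and `fmean` assumes Python 3.8 or later.

```python
from statistics import fmean

numbers = [1, 5, 8, 10]

# Add the values up and divide by how many there are.
manual_mean = sum(numbers) / len(numbers)

# The standard library does the same job; fmean always returns a float.
library_mean = fmean(numbers)

print(manual_mean)   # 6.0
print(library_mean)  # 6.0
```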

Whenever you encounter a mean in the wild, I want you to visualise it like this:

Not very impressive, is it? The mean is a one-dimensional thing. It does not tell you anything about the spread of the data; it just gives you a suggestion about where the centre might be. Never forget that it is just a suggestion.

You can think about an average as a centre of mass:

Where is the centre of mass on this broom? Is it the blue line or the pink cross? Go and find a broom if you’re unsure. It is the pink cross. Why? Because the heavy end of the broom pulls the centre of mass away from the centre of the broom toward itself. It turns out the same is true for data…

(This section is based on this excellent post by Michelle Paret)

If I told you that the average starting salary for geography students at a particular university in the 1980s was over $80,000, you might think you’re doing the wrong degree. But let’s look at the data used to produce this average:

Major Lesson: The mean might represent nobody.

Michael Jordan is the heavy end of the broom here. His presence has nearly doubled the ‘average’ income (without him it is around $40,000). We call observations like this ‘outliers’. MJ’s income is so much larger than everyone else’s that there might be an argument for excluding him from analyses of the income of geography students.

Note: This is not me giving you free rein to start deleting data that doesn’t look the way you want. Pruning inconvenient observations from your data is *statistical skullduggery* and *not ok*, unless you are confident that either a) the data is incorrect (e.g. measurement error) or b) there is some factor producing the weird observations which is going to undermine the aim of your study. What looks like an ‘outlier’ might really be a hint that you do not have enough statistical power. So always tell your readers if any observations were excluded and why.

The median is also a one-dimensional measure of central tendency. Do not confuse it with the ‘geometric mean’, which is a different statistic altogether. The median is the geometric centre of the data, like the centre of the broom (the blue line):

To get the median of some data, arrange it from smallest to largest and pick the number in the middle. (If there is an even number of values, take the mean of the two middle ones.) So the median of 1, 5, 6, 8, and 10 is 6. The interesting thing is that the median of 1, 5, 6, 8, and 100,000 is also 6, because *the median is insensitive to outliers*. Just like adding weight to the end of a broom doesn’t change its geometric centre, adding huge observations to the end of a distribution does not change the median. This makes it a very handy statistic.
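You can check the broom logic yourself. The sketch below takes the two lists from the paragraph above and computes both statistics for each; `centre` is a made-up helper name, not a library function.

```python
from statistics import mean, median

def centre(data):
    """Return (mean, median) for a list of numbers."""
    return mean(data), median(data)

original = [1, 5, 6, 8, 10]
with_outlier = [1, 5, 6, 8, 100_000]

print(centre(original))      # (6, 6)
print(centre(with_outlier))  # (20004, 6)
```

Swapping 10 for 100,000 drags the mean from 6 all the way to 20,004, while the median stays put at 6.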

Calculating the difference between the mean and median can tell you a lot about your data.

We don’t expect the price of diamonds to be a bell-shaped distribution. Most diamonds are kind of affordable, but a small number of diamonds are really expensive. These are not ‘outlier’ Michael Jordan diamonds because there is no gap between the ‘super expensive’ diamonds and other diamonds. We call this a ‘tail’ – see how the data on the right side of the histogram looks like a tail? Data that has a tail on the right like this is called ‘right-skewed’ data. Below is a right-skewed dinosaur.

A good rule of thumb is that data is right-skewed when the mean is bigger than the median. Why? Because if the mean is to the right of the median, there must be at least one big number pulling the mean *up*. If the mean is to the left of the median, there must be at least one tiny number pulling the mean *down*. We call this ‘left-skewed’.

Who cares? Skewed data is inconvenient because it is not appropriate to do some kinds of analysis on skewed data. So checking out the mean and median is a great first step in any analysis plan.
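The mean-versus-median check described above can be sketched as a tiny Python function. Treat it as a rough first look, not a formal test of skewness: `skew_hint` is a hypothetical helper, and both example lists are invented purely for illustration (they are not real diamond prices or real exam marks).

```python
from statistics import mean, median

def skew_hint(data):
    """Compare mean and median and return a rough hint about skew."""
    m, med = mean(data), median(data)
    if m > med:
        return "right-skewed?"   # big values are pulling the mean up
    if m < med:
        return "left-skewed?"    # small values are pulling the mean down
    return "no obvious skew"

# Invented numbers: mostly modest values with a long tail on the right.
made_up_prices = [300, 400, 500, 600, 700, 900, 1500, 4000, 12000]

# Invented numbers: mostly high values with a tail on the left.
made_up_marks = [1, 90, 95, 98, 100]

print(skew_hint(made_up_prices))  # right-skewed?
print(skew_hint(made_up_marks))   # left-skewed?
```

The question marks in the output are deliberate. The comparison only gives you a hint, so look at a histogram before you decide anything.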

In addition to diagnosing skewness, when we have some data we need a way to get a handle on and communicate the *distribution* of a thing. When was the last time you heard a news report about scientific research and the reporter started reciting raw data at you? Never, because that is a terrible way to talk about data. We need a way to quickly communicate our findings: ‘On average women live X years longer than men’ is something people can quickly understand. The gold standard for ‘getting a sense of the distribution’ is to visualise the data, but this is not always feasible and graphs take up lots of room in reports. So we often summarise a whole lot of data with a single summary statistic like the mean or median. If someone is reporting a median it is probably because their data is skewed.

It is really important that you understand the strengths and weaknesses of the mean and median, because if you don’t you can end up doing a geography degree because you want an NBA player’s salary.

Good luck out there!

Taya

The broom: Very nice analogy, Taya. I may use that (with attribution of course …).

Of course, what a fantastic blog and instructive posts, I surely will bookmark your website.Have an awsome day!

I love the broom too! (I also love the dinosaur!). Thank you for your super amazing blog. And pretty please do some more soon! Ummmm…..hypothesis testing would be awesome…… Also, I’m going to post this link on the Facebook group page for my course, so expect a few more hits soon!