Statistical basics – Normal distribution Part 1
To do statistics we need data. But more than that, we need variables: things that can vary. “What is happening with this thing that varies?” is a pretty good summary of any statistical problem. I like to think of a variable as a question, a question with more than one possible answer. Visualise an Excel spreadsheet while you think about variables – a spreadsheet with some observations (rows) and variables (columns).
Here we have two variables, “Name” and “Age”, containing the answers to “What is your name?” and “What is your age?” Don’t forget that the name of a variable is not the whole story – “Age” could mean age today, age at the time of a cancer diagnosis or the age of your car. Make sure you understand the real question each variable is answering.
So, ask a question, get a variable? Not quite. If the question I asked was “What is my name?” the answer would be “Taya” no matter how many people I asked. My name is not a variable, it is a constant.
So whether something is a variable depends on the question you are asking. The answer to “In what order do the days of the week occur?” is a constant – thank goodness – Wednesday always comes after Tuesday. But the answer to “What day is it today?” is a variable (if the study goes for more than one day) because the answer will change depending on when the question is asked.
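If you like to see ideas as code, here is a minimal Python sketch of the same point. The tiny dataset and the helper name is_variable are made up for illustration; the only idea taken from the text is that a column counts as a variable when its answers can differ. Note that this only checks the sample we happen to have.

```python
# Hypothetical observations: each dict is one row of the spreadsheet.
observations = [
    {"name": "Andrew", "age": 34, "authors_name": "Taya"},
    {"name": "Priya", "age": 27, "authors_name": "Taya"},
    {"name": "Sam", "age": 34, "authors_name": "Taya"},
]


def is_variable(rows, column):
    """Return True if the column holds more than one distinct answer."""
    distinct_answers = {row[column] for row in rows}
    return len(distinct_answers) > 1


# "What is your name?" gets different answers, so it is a variable.
name_varies = is_variable(observations, "name")

# "What is the author's name?" gets the same answer every time: a constant.
author_varies = is_variable(observations, "authors_name")
```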
We have developed special names for certain kinds of variables to frustrate statistics students.
If the answer to your question is a number (“What is your shoe size?”) then the variable is numerical: it is made of numbers. If the answer is a word (“What kind of dog do you have?”) then the variable is categorical. Don’t stress about this terminology too much.
I cannot directly analyse “Today is Tuesday” or “Andrew has blue eyes” with statistical techniques. If it is not a number already, each answer to our question needs to be converted into a number in order for us to analyse it. This conversion process is called “coding”. How a variable is coded depends on the kind of data in it. Let’s look at a few variables to see how this works.
Categorical Data (Data that is not numbers)
Ordinal Variable – Annoying surveys often ask you to answer with the options “Strongly Disagree”, “Disagree”, “Neutral”, “Agree” or “Strongly agree”. This data has a special structure, which shows up when the options are coded from 0 (“Strongly Disagree”) to 4 (“Strongly agree”):
0 = Strongly Disagree
1 = Disagree
2 = Neutral
3 = Agree
4 = Strongly agree
The numerical coding reflects a real hierarchy in the data. 4 is greater than 3; “Strongly Agree” is greater than “Agree”. It’s beautiful! What we never do is something like this:
0 = Strongly Disagree
2 = Disagree
1 = Neutral
4 = Agree
5 = Strongly agree
This is no good. The numerical coding has destroyed the hierarchy present in the data.
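Here is a rough sketch of the good coding in Python. The survey responses are invented for the example and the dictionary name LIKERT_CODES is my own; the codes themselves are the 0 to 4 scheme listed earlier.

```python
# Coding scheme from the post: bigger number = stronger agreement.
LIKERT_CODES = {
    "Strongly Disagree": 0,
    "Disagree": 1,
    "Neutral": 2,
    "Agree": 3,
    "Strongly agree": 4,
}

# Hypothetical survey responses, invented for this example.
responses = ["Agree", "Neutral", "Strongly agree", "Disagree"]

# Coding: swap each answer for its number.
coded = [LIKERT_CODES[answer] for answer in responses]

# Because the codes respect the hierarchy, sorting the numbers
# sorts the answers from least to most agreement.
labels_by_code = {code: label for label, code in LIKERT_CODES.items()}
ordered = [labels_by_code[code] for code in sorted(coded)]
```

With the scrambled coding, sorting the numbers would put “Neutral” ahead of “Disagree”, which is exactly the damage described above.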
Nominal Variable – Sometimes there is no hierarchy in categorical data. If eye colour is coded 0 “Blue”, 1 “Green”, 2 “Brown”, the choice of which option gets which number is arbitrary. It doesn’t matter whether blue eyes is zero, or one, or two, because there is no hierarchy in eye colour.
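To convince yourself that the numbers really are just labels here, this small sketch codes the same answers two different ways and counts them. The answers, the two codings and the helper count_by_label are all made up for illustration.

```python
from collections import Counter

# Hypothetical eye-colour answers, invented for this example.
eye_colours = ["Blue", "Brown", "Brown", "Green", "Blue", "Brown"]

# Two equally valid codings: the numbers are labels, nothing more.
coding_a = {"Blue": 0, "Green": 1, "Brown": 2}
coding_b = {"Brown": 0, "Blue": 1, "Green": 2}


def count_by_label(answers, coding):
    """Code the answers as numbers, count them, then decode back to labels."""
    decode = {code: label for label, code in coding.items()}
    counts = Counter(coding[answer] for answer in answers)
    return {decode[code]: n for code, n in counts.items()}


counts_a = count_by_label(eye_colours, coding_a)
counts_b = count_by_label(eye_colours, coding_b)
```

Both codings give the same counts per colour, so nothing is lost by picking either one.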
Numerical Data (Data that is numbers)
Continuous Variable – This concept is tough, but you’re going to be fine because we’ve already set up the concept of a variable as the answer to a question.
Some questions have got a lot of answers. If you asked 100 people “What is the street number of your house?” you might get close to 100 different answers. This is a bit annoying but it’s not going to break your statistical software.
Some questions have an infinite number of possible answers. However finely you list the possibilities, there is always another value in between two of them. It might surprise you to learn which variables fit into this category: height, weight and age, for example.
This is because 1.543 metres is not the same as 1.5429 metres. These are different numbers. 1.54299 is different again. So is 1.542999 metres. I think you see what I’m saying: there are an unlimited number of numbers available to us to represent someone’s height. The same is true for weight and age (and blood pressure, and a lot of other medical measurements). In practice we are stuck with a more limited number of options, because our measuring tools only record so many decimal places, but this doesn’t change the fact that the variable itself has infinite possibilities. Please ask a question about this in the comments if you are confused. A good rule of thumb is that almost all measurements are continuous.
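If it helps, here is a small Python sketch of both halves of that argument. It uses the two heights from the text; the choice of five midpoints and of rounding to the nearest centimetre are my own assumptions for the example.

```python
from fractions import Fraction

# Two heights in metres, taken from the text above.
shorter = Fraction("1.5429")
taller = Fraction("1.543")

# Between any two different heights there is always another one:
# the midpoint. We can repeat this as often as we like.
in_between = []
low = shorter
for _ in range(5):
    low = (low + taller) / 2
    in_between.append(low)

# In practice a tape measure only reports so many decimal places.
# Rounding to the nearest centimetre collapses all of these
# different heights into a single recorded value.
recorded = {round(float(height), 2) for height in in_between}
```

Five genuinely different heights go in, and only one recorded value comes out. The variable is still continuous; it is the measurement that is limited.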
Who cares if a variable is continuous? You don’t need to care very often. But when we start to talk about the probability of particular weights and heights this theoretical detail becomes extremely important.
Continuous variables do not require coding as they are always numeric.
Discrete Variable – All continuous variables are numeric, but not all numeric variables are continuous.
“How many children do you have?” does not have an infinite number of answers; it has a finite or ‘discrete’ number. You cannot have 2.7 children, it’s either 2 or 3. You might be thinking “Ok – whole numbers means discrete variable” but this is a trap. What about the variable “Shoe size”? This is often a number like 6.5 or 10.5, but there is not an infinite number of shoe sizes. That would be the end of shoe manufacturing as we know it!
Discrete variables do not need coding because they are already numeric.
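The shoe size example can be sketched in a few lines of Python. The size range (4 to 13 in half steps) is an assumption I picked for illustration, not a real sizing standard, and is_valid_size is a made-up helper.

```python
# Assumption for illustration: sizes run from 4 to 13 in half steps.
SMALLEST, LARGEST, STEP = 4.0, 13.0, 0.5

# A discrete variable has a list of possible values we can write out in full.
number_of_sizes = int((LARGEST - SMALLEST) / STEP) + 1
shoe_sizes = [SMALLEST + i * STEP for i in range(number_of_sizes)]


def is_valid_size(size):
    """A size is valid only if it is one of the listed values."""
    return size in shoe_sizes
```

So 6.5 is a possible answer even though it is not a whole number, while 6.7 is not, because nothing sits between 6.5 and 7 on the list.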
(Special Case) Binary Variable – If a question has only two possible answers, the variable is called ‘binary’. These can be numeric or categorical; if there are two answers, it is binary. Categorical binary variables are coded as 1 and 0 for useful reasons we won’t go into today. The variable below “Do you have a pet?” is binary (you’ve either got a pet or you don’t).
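As a quick sketch, this is what 1 and 0 coding looks like in Python. The answers and the name PET_CODES are invented for the example.

```python
# Hypothetical answers to "Do you have a pet?", invented for this example.
answers = ["Yes", "No", "Yes", "Yes", "No"]

# Code the two categories as 1 and 0, as described above.
PET_CODES = {"Yes": 1, "No": 0}
has_pet = [PET_CODES[answer] for answer in answers]

# One small convenience of 1/0 coding: adding up the column
# counts the "Yes" answers.
pet_owners = sum(has_pet)
```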
So that wraps up our chat about variables. Once we start building models and doing fancy things you’ll realise how important this theory is. Test your understanding with the quick quiz below, all the answers are in this post.
Good luck out there,