So a little while back I saw a post on LinkedIn claiming that two pieces of data, which looked like they correlated, equated directly to income… I hung my head in shame.
I’ll go through some examples of when correlation does, and doesn’t, point to a real connection.
You do need to be careful when using a visual of two different things. There is often a third variable at play that affects both values, and this, ladies and gentlemen, is known as a confounder.
Bad Example Of Correlation and Causation
I’ve taken some statistics from Tyler Vigen’s site and recreated the effect of what people would consider to be well-correlated variables.
I’m sure we’d all agree before we saw this wonderful image that Arcade Revenue and Computer Science Doctorates had absolutely nothing to do with each other. In fact they don’t. But if it’s presented in a convincing kind of way, you could have someone believing that if Arcade Revenue goes up, so do Computer Science Doctorates.
Think of it this way: if I presented the idea that sheer volume of content on a website would directly affect income, the SEO guys would punch me in the face and stuff me in the fax machine.
This is where the confounder comes in.
In my face-punching example above, the confounder would stand for a whole pile of variables and elbow grease: keyword research, content creation, a dash of PR and plenty more. All these things contribute to a successful website.
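To make the confounder idea a bit more concrete, here is a minimal sketch using made-up numbers. The variable names (`effort`, `content_volume`, `income`) and the `residuals` helper are my own inventions for illustration, not real website data. Both measurements are generated from the same hidden `effort` value and never from each other, yet they still end up looking well correlated until the confounder is accounted for.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n = 10_000

# Hypothetical confounder: the overall effort put into the website.
effort = rng.normal(size=n)

# Both observed values depend on effort, not on each other.
content_volume = effort + rng.normal(scale=0.5, size=n)
income = effort + rng.normal(scale=0.5, size=n)


def residuals(values, confounder):
    """Remove the part of `values` that a straight-line fit on `confounder` explains."""
    slope, intercept = np.polyfit(confounder, values, deg=1)
    return values - (slope * confounder + intercept)


# Correlation when the confounder is ignored.
raw_r = np.corrcoef(content_volume, income)[0, 1]

# Correlation once the effect of the confounder has been subtracted from both.
adjusted_r = np.corrcoef(
    residuals(content_volume, effort), residuals(income, effort)
)[0, 1]

print(f"correlation ignoring the confounder: {raw_r:.2f}")
print(f"correlation after accounting for it: {adjusted_r:.2f}")
```

The first number comes out high and the second sits close to zero, which is the whole problem in miniature: the chart looks convincing, but the link runs through something that isn't on the chart.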
Now For A Good Example Of Correlation
It’d be natural to assume that on hot days people would buy more ice cream, right? Right? Well, I came across a dataset at Maths Is Fun that demonstrates the point of the post perfectly.
Well, my oh my, it seems as though you were correct, dear reader. It seems as though ice cream sales do indeed rise when the temperature does.
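This is known as a regression plot. It uses statistics to fit a straight line through the points, also known as the “line of best fit”, and the coefficient of that line is its slope: how much sales change for each extra degree.
How well the two things are correlated is measured by a separate value, the correlation coefficient, which can be anything between -1 and 1. The two extremes show a perfect negative or a perfect positive correlation, and values near 0 show little or no linear relationship.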
The image above does suggest that as temperature rises, so do ice cream sales. The 91% figure for this dataset measures how closely the points follow the line; it is not an accuracy, and on its own it doesn’t prove that the heat causes the sales. There would of course be some kind of peak: if it got to 40 degrees centigrade many people wouldn’t go outside to brave turning into a tomato, so the straight line shouldn’t be trusted beyond the temperatures in the data.
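If you want to see the difference between the slope of the line and the correlation coefficient for yourself, here is a small sketch. The readings below are invented for illustration and are not the Maths Is Fun dataset. The slope depends on the units you measure in, while the correlation coefficient always stays between -1 and 1.

```python
import numpy as np

# Hypothetical readings, invented for illustration only.
temperature_c = np.array([12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 24.0])
ice_cream_sales = np.array([190.0, 230.0, 310.0, 400.0, 420.0, 510.0, 590.0])

# Line of best fit: sales = slope * temperature + intercept.
slope, intercept = np.polyfit(temperature_c, ice_cream_sales, deg=1)

# Pearson correlation coefficient, always between -1 and 1.
r = np.corrcoef(temperature_c, ice_cream_sales)[0, 1]

# Record the same sales in a unit 100 times smaller.
rescaled_sales = ice_cream_sales * 100
rescaled_slope, _ = np.polyfit(temperature_c, rescaled_sales, deg=1)
rescaled_r = np.corrcoef(temperature_c, rescaled_sales)[0, 1]

print(f"slope: {slope:.2f}, after rescaling: {rescaled_slope:.2f}")
print(f"r: {r:.3f}, after rescaling: {rescaled_r:.3f}")
print(f"r squared: {r ** 2:.3f}")
```

Rescaling the sales multiplies the slope by 100 but leaves the correlation coefficient untouched, which is why the slope tells you how steep the line is and the coefficient tells you how tightly the points hug it.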
Let’s Talk Machine Learning
In the world of machine learning we have the concept of feature selection. A feature is basically a column of values in the dataset. If a feature is well correlated with the thing we want to predict, we can be fairly confident that it will help as a predictor. (But not always: features that are extremely highly correlated with each other can be removed to improve a machine learning model’s accuracy, but more on that in another post.)
I have a huge dataset relating to the housing market. It’s only 1500 rows long but it has a whopping 81 columns, or features. I used a pair grid to create small scatterplots to see where the correlations lie within the dataset.
The image is for illustration purposes only, and it’s impossible to tell which feature is which, but I’m trying to show the bigger picture.
Usually this would be done in batches so you could see with a bit more clarity. There is one clear line that runs diagonally from top left to bottom right. These look like perfectly correlated values, but unfortunately each one is just a feature plotted against itself, so it would naturally be perfectly correlated. From here I’d sift through the rest and select the best features for predicting the price of a house.
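As a taster of that "remove the extremely high correlations" idea, here is a rough sketch on a made-up table. The column names and the 0.95 threshold are assumptions of mine, picked for illustration. When two features carry almost exactly the same information, one of them can usually go.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)
n = 500

# Hypothetical features, invented for illustration only.
floor_area_m2 = rng.uniform(50, 250, size=n)
features = pd.DataFrame({
    "floor_area_m2": floor_area_m2,
    "floor_area_sqft": floor_area_m2 * 10.764,  # same information, different unit
    "garden_m2": rng.uniform(0, 400, size=n),
    "house_number": rng.integers(1, 200, size=n).astype(float),
})

# Absolute correlation between every pair of features.
feature_corr = features.corr().abs()

threshold = 0.95  # assumed cut-off; tune it for your own data
to_drop = set()
columns = list(feature_corr.columns)

# For each pair above the threshold, keep the first feature and drop the second.
for i, first in enumerate(columns):
    if first in to_drop:
        continue
    for second in columns[i + 1:]:
        if feature_corr.loc[first, second] > threshold:
            to_drop.add(second)

selected = [name for name in columns if name not in to_drop]
print("dropped:", sorted(to_drop))
print("kept:", selected)
```

Here the square-foot column is just the square-metre column in different units, so it gets dropped. Which of the two to keep is a judgement call, and this sketch simply keeps whichever comes first.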
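The same sifting can be done with numbers instead of squinting at tiny scatterplots. This is only a sketch on an invented table with hypothetical column names (`living_area`, `lot_size`, `year_sold`, `sale_price`), not my actual housing dataset. It shows why the diagonal is always perfect and how the remaining features could be ranked against the price.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=2)
n = 300

# Hypothetical housing data, invented for illustration only.
living_area = rng.uniform(50, 250, size=n)
houses = pd.DataFrame({
    "living_area": living_area,
    "lot_size": rng.uniform(100, 1000, size=n),
    "year_sold": rng.integers(2006, 2011, size=n).astype(float),
})
houses["sale_price"] = (
    1500 * living_area
    + 50 * houses["lot_size"]
    + rng.normal(scale=30_000, size=n)
)

# Correlation of every column with every other column.
corr = houses.corr()

# The diagonal compares each column with itself, so it is always 1.
diagonal = np.diag(corr.to_numpy())
print("diagonal:", diagonal)

# Rank the other columns by how strongly they correlate with the price.
ranking = (
    corr["sale_price"]
    .drop("sale_price")
    .abs()
    .sort_values(ascending=False)
)
print(ranking)
```

The ranking is a starting point for picking features rather than a final answer, since a strong correlation with the price still says nothing about what is actually driving it.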
This is how to understand correlation in a nutshell without using knives on the proverbial nutshell.