Objectives

You’ll almost always have datasets that contain more than one variable. It’s helpful to know how to compare variables in the dataset, and to estimate the extent to which one variable is related to another. You’ve already learned some great descriptive tools you can use to compare multiple variables in a dataset (box and whisker plots, bar charts, line plots, etc.). This week, we are going to learn how to compute, visualize, and interpret the correlation between two variables.

Let’s get started!


Set-up

library(tidyverse)

Correlation

So far, we’ve focused on visualizing the variation of a single variable using tools like violin plots and box and whisker plots. It is often very useful to explore how two variables covary. The covariance of two variables measures how they vary jointly about their respective means. To remove any influence from differences in the measurement units of the two variables, we often compute correlation, which is simply the covariance of the two variables divided by the product of their standard deviations, or:

\[ r_{xy} = \frac{cov_{xy}}{s_x s_y} \]

A correlation of +1 indicates a perfect positive linear relationship between two variables (as one variable increases, the other increases in exact proportion), while -1 indicates a perfect negative linear relationship (as one variable increases, the other decreases in exact proportion). Values near 0 indicate little or no linear relationship.
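Before we turn to real data, here’s a quick sanity check with two toy vectors (the numbers are made up purely for illustration). Any perfect positive linear relationship gives a correlation of exactly +1, and any perfect negative linear relationship gives exactly -1:

cor(1:10, 2 * (1:10) + 5)    # perfectly linear and increasing: +1
cor(1:10, -3 * (1:10) + 100) # perfectly linear and decreasing: -1

Now let’s load our 500 Cities Health dataset and look at the correlation between two real variables: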

health <- readRDS("./data/health.RDS")
glimpse(health)
## Observations: 500
## Variables: 10
## $ CityName         <fct> Abilene, Akron, Alameda, Albany, Albany, Albu...
## $ StateAbbr        <fct> TX, OH, CA, GA, NY, NM, VA, CA, TX, PA, TX, C...
## $ PopulationCount  <int> 117063, 199110, 73812, 77434, 97856, 545852, ...
## $ BingeDrinking    <dbl> 16.2, 14.8, 15.0, 10.9, 15.5, 14.5, 15.1, 12....
## $ Smoking          <dbl> 19.6, 26.8, 11.9, 23.7, 19.0, 18.8, 13.0, 12....
## $ MentalHealth     <dbl> 11.6, 15.3, 9.8, 16.2, 13.2, 11.6, 8.4, 10.1,...
## $ PhysicalActivity <dbl> 27.7, 31.0, 18.7, 33.1, 26.1, 20.4, 17.6, 24....
## $ Obesity          <dbl> 33.7, 37.3, 18.7, 40.4, 31.1, 25.5, 23.3, 18....
## $ PoorHealth       <dbl> 12.6, 15.5, 9.6, 17.4, 13.1, 12.1, 8.4, 11.4,...
## $ LowSleep         <dbl> 35.4, 44.1, 32.3, 46.9, 39.7, 32.8, 34.5, 38....

Let’s start by visualizing the two variables of interest, say LowSleep and Obesity:

ggplot(health) +
  geom_point(aes(x = LowSleep, y = Obesity)) +
  theme_bw()

Nice. It certainly looks like there’s a positive relationship between the two variables. geom_point() makes it easy for us to change the symbology of these points based on the groups they belong to or on other variables. We could, for example, change point size based on the population of the city.

ggplot(health) +
  geom_point(aes(x = LowSleep, y = Obesity, size = PopulationCount), alpha = 0.5) +
  theme_bw()

Now, finally, let’s compute the correlation:1

cor(health$LowSleep, health$Obesity)
## [1] 0.6315024
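As a quick sanity check on the formula above, we can recover the same number by dividing the covariance by the product of the two standard deviations, using base R’s cov() and sd():

cov(health$LowSleep, health$Obesity) / (sd(health$LowSleep) * sd(health$Obesity))

This returns the same value as cor().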

And let’s visualize a line with this slope on top of the scatter plot we just created:

ggplot(health) +
  geom_point(aes(x = LowSleep, y = Obesity)) +
  geom_abline(aes(slope = cor(health$LowSleep, health$Obesity), intercept = 6), color = "red") +
  theme_bw()

Wow, 0.6315024 is a very high correlation. This must mean that sleeping less causes obesity. Call the press! This is big news!

WRONG! CORRELATION DOES NOT IMPLY CAUSATION.

Let me repeat that:

CORRELATION DOES NOT IMPLY CAUSATION

Should I repeat again? Ok.

CORRELATION DOES NOT IMPLY CAUSATION

Hang on, one more time, coming from Neil deGrasse Tyson, to make sure you never, ever forget:

Visit this website for many examples of how two random and unrelated variables may covary across space or time (‘Age of Miss America’ and ‘Murders by steam, hot vapours and hot objects’, or ‘Films Nicolas Cage appeared in’ and ‘People who drowned by falling into a pool’), though I think there’s some truth to that last one, as Nicolas Cage generally makes me want to drown myself.
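If you’d like to convince yourself how easily this happens, here is a small simulation sketch (the seed and series length are arbitrary choices for illustration). Two independent random walks have, by construction, nothing to do with each other, yet because both trend over “time” their correlation is often surprisingly far from zero:

set.seed(1)
walk_a <- cumsum(rnorm(100)) # one random walk
walk_b <- cumsum(rnorm(100)) # a second, completely independent random walk
cor(walk_a, walk_b)          # frequently large in magnitude, purely by chance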

Correlation tells you nothing more than the direction and strength of the association between two variables. When LowSleep increases, does Obesity tend to increase or decrease? It tells you nothing about whether or how LowSleep causes Obesity. Establishing a causal link between two variables requires fairly sophisticated statistics. If you are interested in learning more about this, I highly recommend looking into the field of causal inference. I also recommend befriending an econometrician. The point here is that you, as data producers and consumers, should be very critical of any claims you read/produce establishing a causal link between one thing and another.

Another word of warning about correlation: classic measures of correlation (e.g. the Pearson correlation coefficient) capture only linear relationships between two variables. Imagine you have the following data:

If you computed the correlation, you’d find a value of -0.2958714, which, visually, would look something like this:

This also implies that there is a negative relationship between \(x\) and \(y\), which is certainly not the case when you look at your data. It’s always a good idea to look at your data before you collapse it to a single summary statistic. In this case, the best-fitting function is parabolic rather than linear - the relationship is far more complicated than correlation alone would suggest:
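You can see this for yourself with a quick sketch using simulated data (the numbers below are made up for illustration). A parabolic relationship is obvious in a scatter plot, yet the Pearson correlation stays close to zero:

set.seed(42)
x <- seq(-3, 3, length.out = 100)
y <- -x^2 + rnorm(100, sd = 0.5) # a clearly parabolic relationship

cor(x, y) # near zero, despite the strong (nonlinear) relationship

ggplot(data.frame(x, y)) +
  geom_point(aes(x = x, y = y)) +
  theme_bw()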

Visualizing relationships with scatterplots

Say we want to make a really awesome visualization that compares smoking rates and self-reported PoorHealth across all cities. And say we are particularly interested in the state of Utah as compared to other states in the U.S. First, let’s compare these two variables using a geom_point() scatter plot.

ggplot(data = health) +
  geom_point(aes(x = Smoking, y = PoorHealth))

There’s some obvious stuff we can change quickly:

ggplot(data = health) +
  geom_point(aes(x = Smoking, y = PoorHealth)) +
  xlab("Smoking rates") +
  ylab("Self-reported poor health") +
  labs(title = "The relationship between smoking and self-reported poor health", subtitle = "U.S. urban adults over the age of 18") +
  theme_classic()

Alright, now say we want to highlight a certain subset of the points (say Utah) by visualizing them in a different way than the rest of the data. The easiest way to do this is to divide the data into two chunks and apply different visualization parameters to each chunk, like so:

utah <- health %>% filter(StateAbbr == "UT")
not_utah <- health %>% filter(StateAbbr != "UT")

s <- ggplot(data = health) +
  geom_point(data = utah, aes(x = Smoking, y = PoorHealth), color = "red") +
  geom_point(data = not_utah, aes(x = Smoking, y = PoorHealth), color = "light gray", alpha = 0.5) +
  xlab("Smoking rates") +
  ylab("Self-reported poor health") +
  labs(title = "The relationship between smoking and self-reported poor health", subtitle = "U.S. urban adults over the age of 18") +
  theme_classic()

s

alpha sets the transparency of the points, with lower values being more transparent. What if we want to add a label to the Utah points? We could, say, add a label indicating the city name:

s +
  geom_text(aes(x = Smoking, y = PoorHealth, label=CityName))

AHHH!!! What happened!?! We added CityName labels to all of the points in the dataset. We really only want to apply this to the Utah set of points:

s +
  geom_text(data = utah, aes(x = Smoking, y = PoorHealth, label=CityName))

Better. Let’s work on the position of the labels relative to the red points. We can do this with hjust (horizontal movement, negative values to the right, positive values to the left) and vjust (vertical movement, negative values move up, positive values move down):

s +
  geom_text(data = utah, aes(x = Smoking, y = PoorHealth, label=CityName),
            hjust = -0.1, vjust=0.3)  # unfortunately this involves some fiddling

We could keep fiddling with hjust and vjust until things are perfect, OR, the easier way is to write this out as a .svg figure and drag and drop the labels in Inkscape:

ggsave(file="scatter.svg", plot=s, width=10, height=8)

You can read more about writing your figures to other types of files here.
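For example, the same ggsave() call writes to other formats simply by changing the file extension (the file names here are arbitrary):

ggsave(file="scatter.png", plot=s, width=10, height=8) # raster PNG
ggsave(file="scatter.pdf", plot=s, width=10, height=8) # vector PDF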


Additional Resources


  1. Note that if you have missing data in either variable, you’d want to set the use argument of cor() (e.g. use = "complete.obs") so that incomplete observations are dropped. Remember, though, that it’s always a good idea to know why you have missing data and to consider how exclusion of this data may bias results.