You’ll almost always have datasets that contain more than one variable. It’s helpful to know how to compare variables in the dataset, and to estimate the extent to which one variable is related to another. You’ve already learned some great descriptive tools you can use to compare multiple variables in a dataset (box and whisker plots, bar charts, line plots, etc.). This week, we are going to learn how to compute, visualize, and interpret the correlation between two variables.
Let’s get started!
library(tidyverse)
So far, we’ve focused on visualizing the variation of a single variable using tools like violin plots and box and whisker plots. It is often very useful to explore how two variables covary. The covariance of two variables measures their joint variation about their respective means: it is positive when the two variables tend to sit above (or below) their means together, and negative when one tends to be above its mean while the other is below. Because covariance depends on the measurement units of the two variables, we often compute the correlation instead, which is simply the covariance divided by the product of the two variables’ standard deviations, or:
\[ r_{xy} = \frac{\mathrm{cov}_{xy}}{s_x s_y} \]
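To make the formula concrete, here is a minimal sketch that computes the correlation by hand and compares it with R’s built-in cor(). The vectors x and y are made-up toy values (they are not from the 500 Cities dataset), and the names r_manual, r_builtin, and x_rescaled are just illustrative.

```r
# Hypothetical toy data, chosen only to illustrate the formula
x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 2, 5, 4)

# Correlation from the definition: covariance divided by the
# product of the two standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))

# Built-in Pearson correlation, for comparison
r_builtin <- cor(x, y)

# Change the units of x (multiply by 100): the covariance changes,
# but the correlation does not
x_rescaled <- x * 100

r_manual                 # 0.8
r_builtin                # 0.8
cov(x, y)                # 4
cov(x_rescaled, y)       # 400
cor(x_rescaled, y)       # 0.8
```

Rescaling x multiplies the covariance by 100 but leaves the correlation at 0.8, which is why correlation is the easier quantity to compare across variables measured in different units.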
A correlation of +1 indicates a perfect positive linear relationship between two variables (i.e. the points fall exactly on an upward-sloping straight line, so a one-unit increase in one variable always corresponds to the same fixed increase in the other), while -1 indicates a perfect negative, or inverse, linear relationship (i.e. the points fall exactly on a downward-sloping straight line). A correlation of 0 indicates no linear relationship. Let’s load our 500 Cities Health dataset and look at the correlation between two variables:
health <- readRDS("./data/health.RDS")
glimpse(health)
## Observations: 500
## Variables: 10
## $ CityName <fct> Abilene, Akron, Alameda, Albany, Albany, Albu...
## $ StateAbbr <fct> TX, OH, CA, GA, NY, NM, VA, CA, TX, PA, TX, C...
## $ PopulationCount <int> 117063, 199110, 73812, 77434, 97856, 545852, ...
## $ BingeDrinking <dbl> 16.2, 14.8, 15.0, 10.9, 15.5, 14.5, 15.1, 12....
## $ Smoking <dbl> 19.6, 26.8, 11.9, 23.7, 19.0, 18.8, 13.0, 12....
## $ MentalHealth <dbl> 11.6, 15.3, 9.8, 16.2, 13.2, 11.6, 8.4, 10.1,...
## $ PhysicalActivity <dbl> 27.7, 31.0, 18.7, 33.1, 26.1, 20.4, 17.6, 24....
## $ Obesity <dbl> 33.7, 37.3, 18.7, 40.4, 31.1, 25.5, 23.3, 18....
## $ PoorHealth <dbl> 12.6, 15.5, 9.6, 17.4, 13.1, 12.1, 8.4, 11.4,...
## $ LowSleep <dbl> 35.4, 44.1, 32.3, 46.9, 39.7, 32.8, 34.5, 38....
Let’s start by visualizing the two variables of interest, say LowSleep and Obesity:
ggplot(health) +
  geom_point(aes(x = LowSleep, y = Obesity)) +
  theme_bw()
Nice. It certainly looks like there’s a positive relationship between the two variables. geom_point() makes it easy for us to change the symbology of these points based on groups they belong to or other variables. We could, for example, scale point size by the population of the city.
ggplot(health) +
  geom_point(aes(x = LowSleep, y = Obesity, size = PopulationCount), alpha = 0.5) +
  theme_bw()
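The scatter plots suggest a positive relationship, and cor() lets us put a number on it. The sketch below is self-contained, so it uses only the first seven rows of LowSleep and Obesity as printed in the glimpse(health) output above. The data frame name health_head is hypothetical, and seven cities are not representative of all 500, so treat the result as an illustration of the call rather than the dataset’s actual correlation.

```r
# First seven cities only, copied from the glimpse(health) output;
# `health_head` is a hypothetical name used for this illustration
health_head <- data.frame(
  LowSleep = c(35.4, 44.1, 32.3, 46.9, 39.7, 32.8, 34.5),
  Obesity  = c(33.7, 37.3, 18.7, 40.4, 31.1, 25.5, 23.3)
)

# Pearson correlation between the two columns (the default method)
r_head <- cor(health_head$LowSleep, health_head$Obesity)
r_head

# With the full dataset loaded, the equivalent call would be:
# cor(health$LowSleep, health$Obesity)
```

A positive value here agrees with the upward trend in the scatter plot. Running the commented-out line on the full health data frame gives the correlation across all 500 cities.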