Objectives

You’ll almost always work with data that contain more than one variable. We’ve already learned about some tools for comparing data by group or value. Today, we’ll begin to learn about drawing inferences from a set of data, e.g. estimating the extent to which one variable is related to another.

  1. Learn how to compute and interpret the correlation between two variables.
  2. Visualize the correlation between many variables.
  3. Expand your ggplot data visualization powers.

Let’s get started!


Set-up

library(tidyverse)
library(corrplot) # useful for visualizing correlation across many variables

Correlation

A (linear) correlation coefficient, \(r\),1 measures the tendency of two numerical variables to covary, i.e. to change together along a line.2

\[ r = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_i (X_i - \bar{X})^2} \sqrt{\sum_i (Y_i - \bar{Y})^2}}\]

Here, the numerator is called the sum of products - it measures how deviations (remember, this is the difference between an observation \(X_i\) and the mean \(\bar{X}\)) in \(X\) and \(Y\) vary together. The denominator should look familiar (think back to computing standard deviation) - it includes the sums of squares for \(X\) and \(Y\). Correlation lies between -1 and 1 and has no units, so it is easy to interpret no matter what the variables are. A negative correlation means that as the value of one variable increases, the other decreases (they move in opposite directions); conversely, a positive correlation means that the two variables tend to increase or decrease together (they move in the same direction). A correlation of +1 indicates a perfect positive linear relationship between two variables (i.e. every increase in one variable is matched by a proportional increase in the other), while -1 indicates a perfect negative linear relationship (i.e. every increase in one variable is matched by a proportional decrease in the other).
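To make the formula concrete, here’s a minimal sketch using two made-up numeric vectors (not the corn data) that computes \(r\) “by hand” from the sum of products and the sums of squares, then checks the result against R’s built-in cor():

x <- c(2, 4, 6, 8, 10)   # made-up example data
y <- c(1, 3, 7, 9, 15)

dx <- x - mean(x)   # deviations from the mean of x
dy <- y - mean(y)   # deviations from the mean of y

sum_of_products <- sum(dx * dy)                        # numerator
root_sums_of_squares <- sqrt(sum(dx^2)) * sqrt(sum(dy^2))  # denominator

sum_of_products / root_sums_of_squares  # correlation "by hand"
cor(x, y)                               # built-in; the two values should match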

For this module, we’ll be working with a dataset near and dear to my heart… corn yield!

library(tidyverse)
corn <- readRDS("./data/corn.RDS") %>% filter(YEAR == 2017) # for this tutorial, we'll only work with one year of data
glimpse(corn)
Rows: 1,315
Columns: 18
$ GEOID              <fct> 01001, 01003, 01005, 01019, 01031, 01039, 01041,...
$ STATE_NAME         <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Ala...
$ COUNTY_NAME        <fct> Autauga, Baldwin, Barbour, Cherokee, Coffee, Cov...
$ YEAR               <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, ...
$ CORN_YIELD         <dbl> 170.8, 162.0, 192.3, 146.8, 120.2, 141.7, 145.0,...
$ GDD                <dbl> 4188.195, 4259.167, 4310.133, 3995.918, 4308.777...
$ TP                 <dbl> 1342.591, 1710.470, 1047.445, 1152.464, 1274.932...
$ SOIL_SUITABILITY   <dbl> 0.3266512, 0.2163734, 0.2608265, 0.3268577, 0.27...
$ AGE                <dbl> 57.2, 61.2, 61.0, 57.2, 61.4, 61.7, 61.1, 58.0, ...
$ PERC_FEMALE        <dbl> 25.59963, 23.77534, 23.33779, 25.17158, 20.36553...
$ FULL_OWNER         <dbl> 44.26066, 29.85876, 65.56943, 30.90442, 47.71895...
$ FERTILIZER_EXP     <dbl> 20.26741, 74.62687, 15.08367, 58.34178, 24.79390...
$ CHEMICAL_EXP       <dbl> 11.524603, 74.020469, 11.437138, 32.396536, 14.3...
$ PERC_IRRIGATED     <dbl> 10.680494, 23.625378, 15.125415, 14.892047, 16.2...
$ LABOR_EXPENSE      <dbl> 20.13494, 100.95937, 34.15429, 62.87334, 50.6260...
$ MACHINERY          <dbl> 316.7632, 657.4830, 283.7288, 548.0551, 396.4541...
$ GVT_PROGRAMS       <dbl> 9.440461, 36.132103, 19.659832, 17.673085, 31.17...
$ PERC_NATURAL_COVER <dbl> 0.8506672, 0.8270330, 0.8925271, 0.8367627, 0.78...

Let’s start by visualizing the two variables of interest, say CORN_YIELD and CHEMICAL_EXP:

ggplot(data = corn) +
  geom_point(aes(x = CHEMICAL_EXP, y = CORN_YIELD)) +
  theme_bw() +
  labs(x = "Chemical application ($/acre operated)",
       y = "Corn yield (bu/ac)", # I put yield on the y-axis since it is typically thought of as a response to chemical inputs
       title = "Chemical use and corn yields in the coterminous U.S.")

Nice. It certainly looks like there’s a positive relationship between the two variables. Note also that there appear to be a few counties with abnormally high chemical expenses.
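If you want to see which counties those are, one quick way is to filter on a cutoff and peek at the rows (the $100/acre cutoff here is just an illustrative choice):

corn %>%
  filter(CHEMICAL_EXP > 100) %>%   # arbitrary cutoff for "abnormally high"
  select(STATE_NAME, COUNTY_NAME, CHEMICAL_EXP, CORN_YIELD) %>%
  arrange(desc(CHEMICAL_EXP))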

geom_point() makes it easy for us to change the symbology of these points based on groups they belong to or other variables. We could, for example, change point size based on another variable, say fertilizer expenses.

ggplot(data = corn) +
  geom_point(aes(x = CHEMICAL_EXP, y = CORN_YIELD, size = FERTILIZER_EXP), alpha = 0.2) +
  theme_bw() +
  labs(x = "Chemical application ($/operated acre)",
       y = "Corn yields (bu/ac)",
       title = "Inputs and corn yields in the coterminous U.S.",
       size = "Fertilizer ($/operated acre)") +
  theme(legend.position = "bottom")

Or we could use faceting to show this scatterplot across multiple sub-categories! Here, we look at the relationship across a subset of states.

corn_sub <- corn %>% filter(STATE_NAME %in% c("Georgia", "Illinois", "Iowa", "Mississippi", "Pennsylvania", "New York")) # use %in% (not ==) to keep rows matching any of these states
ggplot(data = corn_sub) +
  geom_point(aes(x = CHEMICAL_EXP, y = CORN_YIELD, size = FERTILIZER_EXP), alpha = 0.2) +
  theme_bw() +
  labs(x = "Chemical application ($/operated acre)",
       y = "Corn yields (bu/ac)",
       title = "Inputs and corn yields in the coterminous U.S.",
       size = "Fertilizer ($/operated acre)") +
  facet_wrap(~ STATE_NAME, ncol = 3) +
  theme(legend.position = "bottom")

Notice, of course, that just like trimming outliers, the way you subset the data across space or time can drastically change your estimated correlation. Consider Iowa, for which the correlation between chemical application and yields was 0.73 in 2017, versus Mississippi, which had a correlation of 0.65 for the same variables over the same period.
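If you want to check state-level correlations like these yourself, here’s a quick sketch using group_by() and summarize() (the exact numbers you get depend on the year and any other filtering you’ve applied):

corn %>%
  filter(STATE_NAME %in% c("Iowa", "Mississippi")) %>%
  group_by(STATE_NAME) %>%
  summarize(r = cor(CORN_YIELD, CHEMICAL_EXP))  # correlation computed within each state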

Alright, now say we want to highlight a certain subset of the points (say the Midwestern U.S.) by visualizing them in a different way than the rest of the data. The easiest way to do this is to divide the data into two chunks and apply different visualization parameters to each chunk, like so:

corn_belt <- c("Indiana", "Illinois", "Iowa", "Missouri", "Nebraska", "Kansas")

cb <- corn %>% filter(STATE_NAME %in% corn_belt)
not_cb <- corn %>% filter(!STATE_NAME %in% corn_belt)

ggplot() +
  geom_point(data = cb, aes(x = CHEMICAL_EXP, y = CORN_YIELD), color = "red") +
  geom_point(data = not_cb, aes(x = CHEMICAL_EXP, y = CORN_YIELD), color = "light gray", alpha = 0.5) +
  labs(x = "Chemical application ($/operated acre)",
       y = "Corn yields (bu/ac)",
       title = "Inputs and corn yields in the coterminous U.S.",
       subtitle = "The Corn Belt compared to the rest of the U.S.") +
  theme_bw()

Now, finally, let’s compute the correlation:3

cor(corn$CORN_YIELD, corn$CHEMICAL_EXP) # note that if you have missing data, this will return NA
[1] 0.4058427

Note, importantly, how removing some of the high chemical application outliers from the data affects the correlation we compute:

clipped_corn <- corn %>%
  filter(CHEMICAL_EXP < 100)

cor(clipped_corn$CORN_YIELD, clipped_corn$CHEMICAL_EXP)
[1] 0.4370815

We’ll come back to this in a second, but remember, correlation assumes a LINEAR relationship between two variables… and this isn’t always the best assumption. Ok, so our two variables have a positive correlation, which means as one variable increases the other tends to increase. HOORAY! Solid scientific evidence that using more chemicals is great for yields!

WRONG! CORRELATION DOES NOT IMPLY CAUSATION.

Let me repeat that:

CORRELATION DOES NOT IMPLY CAUSATION

Should I repeat again? Ok.

CORRELATION DOES NOT IMPLY CAUSATION

Hang on, one more time, coming from Neil deGrasse Tyson, to make sure you never, ever forget:

Visit Tyler Vigen’s Spurious Correlations website for many examples of how two random and unrelated variables may covary across space or time (‘Age of Miss America’ and ‘Murders by steam, hot vapours and hot objects’), though I think there’s some truth to the one pairing Nicolas Cage films with pool drownings, as Nicolas Cage generally makes me want to drown myself:

Correlation tells you nothing more than the association between two variables, e.g. as one variable moves in one direction, the other tends to move in a particular direction (the same or the opposite). Correlation is a measure of association and tendency, NOT an indication of cause. Establishing a causal link between two variables requires fairly sophisticated statistics. If you are interested in learning more about this, I highly recommend looking into the field of causal inference (and befriending someone who works in it). The point here is that you, as data producers and consumers, should be very critical of any claims you read or produce establishing a causal link between one thing and another.

Another word of warning about correlation. Classic indicators of correlation (e.g. the Pearson correlation coefficient) detect only linear relationships between two variables. Imagine you have the following data:

If you computed the correlation, you’d find a value of -0.2958714, which, visually, would look something like this:

This also implies that there is a negative relationship between \(x\) and \(y\), which is certainly not the case when you look at your data. It’s always a good idea to look at your data before you collapse it to a single summary statistic. In this case, the best-fitting function is parabolic rather than linear - the relationship is far more complicated than correlation alone would suggest:
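If you want to reproduce this kind of pattern yourself, here’s a small simulation (made-up data, not the data shown above) in which \(y\) is a noisy parabolic function of \(x\): the scatterplot is clearly curved, yet the Pearson correlation comes out close to zero.

set.seed(1)                       # for reproducibility
x <- seq(-3, 3, length.out = 200)
y <- -x^2 + rnorm(200, sd = 1)    # a parabolic relationship plus noise

cor(x, y)  # near zero, even though x and y are strongly (but nonlinearly) related

ggplot(data = tibble(x = x, y = y)) +
  geom_point(aes(x = x, y = y)) +
  theme_bw()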

Correlation Matrices

Let’s return to our corny data. When you want to quickly assess correlations between multiple variables, compute a cross-correlation matrix. This is essentially a matrix of the pairwise correlations between the variables in a data set. First, we need to select() the subset of columns for which we want to compute correlations. Then we feed this set of variables to the cor() function:

corn_sub <- corn %>% select(-c(GEOID, STATE_NAME, COUNTY_NAME, YEAR)) # can't compute correlation between strings
cor(corn_sub)
                    CORN_YIELD          GDD           TP SOIL_SUITABILITY
CORN_YIELD          1.00000000 -0.242108207 -0.027370772       0.49240759
GDD                -0.24210821  1.000000000  0.420363057      -0.33867059
TP                 -0.02737077  0.420363057  1.000000000       0.02049419
SOIL_SUITABILITY    0.49240759 -0.338670589  0.020494190       1.00000000
AGE                -0.11339492  0.377457000  0.100259968      -0.11483379
PERC_FEMALE        -0.18557542  0.044456669  0.053153380      -0.27594464
FULL_OWNER         -0.22889678  0.269708432  0.382703130      -0.42375411
FERTILIZER_EXP      0.45766737 -0.194282235 -0.023114408       0.47911922
CHEMICAL_EXP        0.40584265  0.003894432 -0.033842460       0.30626866
PERC_IRRIGATED      0.30229180  0.118104649 -0.079750576      -0.03418146
LABOR_EXPENSE       0.02895160 -0.107448729  0.002623342      -0.05827532
MACHINERY           0.32946279 -0.397451650  0.143860701       0.40627051
GVT_PROGRAMS        0.39836057  0.085002023  0.032992443       0.28274385
PERC_NATURAL_COVER -0.45934623  0.350667828  0.337215361      -0.68248613
                            AGE PERC_FEMALE  FULL_OWNER FERTILIZER_EXP
CORN_YIELD         -0.113394918 -0.18557542 -0.22889678     0.45766737
GDD                 0.377457000  0.04445667  0.26970843    -0.19428224
TP                  0.100259968  0.05315338  0.38270313    -0.02311441
SOIL_SUITABILITY   -0.114833791 -0.27594464 -0.42375411     0.47911922
AGE                 1.000000000  0.08077815  0.15298357    -0.16068384
PERC_FEMALE         0.080778147  1.00000000  0.38090912    -0.36167885
FULL_OWNER          0.152983574  0.38090912  1.00000000    -0.47149027
FERTILIZER_EXP     -0.160683842 -0.36167885 -0.47149027     1.00000000
CHEMICAL_EXP       -0.060013517 -0.36461501 -0.51863080     0.72411732
PERC_IRRIGATED     -0.052109218 -0.06776050 -0.24188609     0.24575218
LABOR_EXPENSE      -0.088945711  0.03192498  0.03090534     0.46059476
MACHINERY          -0.285863903 -0.11344942 -0.07235056     0.56205379
GVT_PROGRAMS        0.004368868 -0.28397240 -0.41036465     0.46899600
PERC_NATURAL_COVER  0.191505887  0.38951646  0.76660683    -0.56933709
                   CHEMICAL_EXP PERC_IRRIGATED LABOR_EXPENSE    MACHINERY
CORN_YIELD          0.405842655    0.302291797   0.028951598  0.329462788
GDD                 0.003894432    0.118104649  -0.107448729 -0.397451650
TP                 -0.033842460   -0.079750576   0.002623342  0.143860701
SOIL_SUITABILITY    0.306268660   -0.034181459  -0.058275320  0.406270509
AGE                -0.060013517   -0.052109218  -0.088945711 -0.285863903
PERC_FEMALE        -0.364615007   -0.067760501   0.031924981 -0.113449416
FULL_OWNER         -0.518630796   -0.241886092   0.030905337 -0.072350559
FERTILIZER_EXP      0.724117322    0.245752175   0.460594762  0.562053789
CHEMICAL_EXP        1.000000000    0.460927483   0.324010182  0.341312149
PERC_IRRIGATED      0.460927483    1.000000000   0.061421855 -0.001785137
LABOR_EXPENSE       0.324010182    0.061421855   1.000000000  0.537345941
MACHINERY           0.341312149   -0.001785137   0.537345941  1.000000000
GVT_PROGRAMS        0.615159781    0.480334498  -0.043518997  0.139179662
PERC_NATURAL_COVER -0.543255125   -0.242653950   0.009850475 -0.287863682
                   GVT_PROGRAMS PERC_NATURAL_COVER
CORN_YIELD          0.398360568       -0.459346231
GDD                 0.085002023        0.350667828
TP                  0.032992443        0.337215361
SOIL_SUITABILITY    0.282743846       -0.682486125
AGE                 0.004368868        0.191505887
PERC_FEMALE        -0.283972400        0.389516464
FULL_OWNER         -0.410364651        0.766606833
FERTILIZER_EXP      0.468996003       -0.569337088
CHEMICAL_EXP        0.615159781       -0.543255125
PERC_IRRIGATED      0.480334498       -0.242653950
LABOR_EXPENSE      -0.043518997        0.009850475
MACHINERY           0.139179662       -0.287863682
GVT_PROGRAMS        1.000000000       -0.476533018
PERC_NATURAL_COVER -0.476533018        1.000000000

The correlation matrix contains the correlation for every column-row combination of variables. Notice that the diagonal of this matrix is 1.000: the correlation of a variable with itself is always perfect.

This is useful, but generally an ugly and difficult-to-interpret result, so let’s make this list of numbers visually appealing and easy to interpret. The corrplot() function from the corrplot package makes this very easy. All we have to do is feed the correlation matrix we just built to the corrplot() function, and VOILA:

corrplot::corrplot(cor(corn_sub), method = "number")

This is awful; we can do better.4 Notice that the upper and lower triangles of the matrix contain the same values, so you can simply display the upper triangle. And let’s drop the numbers and instead use circles whose size indicates magnitude and whose color indicates direction:

corrplot(cor(corn_sub), method = "circle", type = "upper")

Wow. Pro. Thanks corrplot.

Finally, here’s a bit of data wrangling that will let you extract the pairs of variables with a correlation above a certain threshold:

res <- as.data.frame(as.table(round(cor(corn_sub),2)))
res <- res %>% filter(Freq != 1) %>%  # remove correlation of variable with itself
  filter(Freq > 0.75) 

knitr::kable(res, caption = "Variable pairs with high correlation")
Table: Variable pairs with high correlation

Var1                 Var2                  Freq
-------------------  -------------------  -----
PERC_NATURAL_COVER   FULL_OWNER            0.77
FULL_OWNER           PERC_NATURAL_COVER    0.77
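Note that each pair shows up twice (once for each ordering of the variables), and that filtering on Freq > 0.75 alone would miss strong negative correlations. Here’s a sketch of one way to handle both, keeping a single row per pair and filtering on the absolute value:

res <- as.data.frame(as.table(round(cor(corn_sub), 2))) %>%
  filter(Freq != 1, abs(Freq) > 0.75) %>%             # keep strong correlations of either sign
  filter(as.character(Var1) < as.character(Var2))     # keep only one ordering of each pair

knitr::kable(res, caption = "Variable pairs with high absolute correlation")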

  1. The Greek letter \(\rho\), pronounced “row”, is used to represent the population correlation, while the letter \(r\) is used for the sample correlation.

  2. RMarkdown tip: To embed big formulas in RMarkdown, use two dollar signs and a space on each side of the formula ($$ formula $$), then use LaTeX to write out the formula.

  3. Note that if you have missing data in either variable, you’d want to add use = "complete.obs" to the cor() function so that incomplete observations are dropped. Remember, though, that it’s always a good idea to know why you have missing data and to consider how exclusion of this data may bias results.

  4. There are tons of options for your correlation plot, detailed in the corrplot package documentation.