Objectives


Helpful Readings


Packages

Make sure the tidyverse suite of packages are installed on your machine and loaded in your R session. The tidyverse includes both the dplyr and ggplot2 packages.

library(tidyverse)

You may also need:

library(RColorBrewer)
library(wesanderson)
library(viridis)
library(dichromat)
library(ggthemes)

Our data

Using your new dplyr skills to tidy your data prior to graphing will make the whole graphing process significantly easier. Make sure all of your variables live together peacefully in a tidy data.frame before you start plotting! Today, I’ve done the tidying for you, but in the future, you’ll need to be sure to clean things up before you dive into visualization.

This week we will work with the 500 Cities dataset. Here’s a brief introduction to this data:

“The 500 Cities project is a collaboration between CDC, the Robert Wood Johnson Foundation, and the CDC Foundation. The purpose of the 500 Cities Project is to provide city- and census tract-level small area estimates for chronic disease risk factors, health outcomes, and clinical preventive service use for the largest 500 cities in the United States. These small area estimates will allow cities and local health departments to better understand the burden and geographic distribution of health-related variables in their jurisdictions, and assist them in planning public health interventions.”

The project website has fantastic mapping and visualization tools, so go check it out. You can find specific information about each of the health variables in the dataset here. I’ve cleaned and subset the original large dataset, and this clean version is what we will work with today. First, let’s load the data. I’ve already tidyed the data and saved the data as a .RDS file using saveRDS(). RDS files are an R-specific file structure that stores objects just as they were in R - i.e., if the variable you write out is a data.frame, then the variable you read back in will also be a data.frame. Let’s load the .RDS file using readRDS() and use the glimpse() function from the tidyverse set of packages to inspect this dataset:

health <- readRDS("./data/health.RDS")
glimpse(health)
## Observations: 500
## Variables: 10
## $ CityName         <fct> Abilene, Akron, Alameda, Albany, Albany, Albu...
## $ StateAbbr        <fct> TX, OH, CA, GA, NY, NM, VA, CA, TX, PA, TX, C...
## $ PopulationCount  <int> 117063, 199110, 73812, 77434, 97856, 545852, ...
## $ BingeDrinking    <dbl> 16.2, 14.8, 15.0, 10.9, 15.5, 14.5, 15.1, 12....
## $ Smoking          <dbl> 19.6, 26.8, 11.9, 23.7, 19.0, 18.8, 13.0, 12....
## $ MentalHealth     <dbl> 11.6, 15.3, 9.8, 16.2, 13.2, 11.6, 8.4, 10.1,...
## $ PhysicalActivity <dbl> 27.7, 31.0, 18.7, 33.1, 26.1, 20.4, 17.6, 24....
## $ Obesity          <dbl> 33.7, 37.3, 18.7, 40.4, 31.1, 25.5, 23.3, 18....
## $ PoorHealth       <dbl> 12.6, 15.5, 9.6, 17.4, 13.1, 12.1, 8.4, 11.4,...
## $ LowSleep         <dbl> 35.4, 44.1, 32.3, 46.9, 39.7, 32.8, 34.5, 38....

Here’s some additional information about each of the variables in the dataset:


Statistics Reminder

Center

Before we dive into visualizing the “center” of variables using ggplot2, let’s review how we compute statistics of center in R. Say we want to compute the mean, median, and mode of the vector x I have created below:

x <- c(1, 2, 2, 3, 4, 5, 5, 5, 6)

R makes finding a vector’s mean, median, and mode easy:

mean(x)
## [1] 3.666667
median(x)
## [1] 4
mode(x)
## [1] "numeric"

Wait, why does the mode() function return "numeric"? Type ?mode in the console. The mode() function tells us the storage mode of an object (i.e. "interger", "double", "character"). Let’s make our own function to compute the mode of a vector:

compute_mode <- function(v) {
   uniq <- unique(v) # identifies unique values of x
   uniq[which.max(tabulate(match(v, uniq)))]  # counts the number of instances of each value
}

compute_mode(x)
## [1] 5

That’s better! Note that if a variable contains missing data…

x_NA <- c(1, 2, 2, 3, 4, 5, 5, 5, 6, NA)
mean(x_NA)
## [1] NA
median(x_NA)
## [1] NA

…then mean() and median() will return NA. This tells us that our variable contains missing data. We can tell R to compute these statistics while ignoring the missing data by adding the argument na.rm=T to the mean() and median()` function.

mean(x, na.rm=T)
## [1] 3.666667
median(x, na.rm=T)
## [1] 4

Our compute_mode() function will only return NA when NA is the most frequently occurring value in our variable:

compute_mode(x_NA)
## [1] 5
x_MEGA_NA <- c(1, 2, 2, 3, 4, NA, NA, NA, 6)
compute_mode(x_MEGA_NA)
## [1] NA

Let’s take a second to discuss when and why you should use each of these statistics. Mean is a non-resistant statistic, which means it is easily influenced by extreme values, or outliers. Median is a resistant statistic, so it is a better choice when you have extreme values in your data. First, let’s calculate mean and median without extreme values:

x <- c(1, 2, 2, 3, 4, 5, 5, 5, 6)
print(paste("Here's the mean with no outliers:", mean(x)))
## [1] "Here's the mean with no outliers: 3.66666666666667"
print(paste("Here's the median with no outliers:", median(x)))
## [1] "Here's the median with no outliers: 4"

Mean and median aren’t that different. And now with an extreme value of 60 added to the vector:

x <- c(1, 2, 2, 3, 4, 5, 5, 5, 60)
print(paste("Here's the mean with an outlier:", mean(x)))
## [1] "Here's the mean with an outlier: 9.66666666666667"
print(paste("Here's the median with an outlier:", median(x)))
## [1] "Here's the median with an outlier: 4"

The high value of 60 pulls up the mean of the dataset, however the mode is unchanged.

We can use these same functions to take the mean, median, and mode or larger variables, such as the columns in our health dataset. Say, for example, you want to know the mean, median, and mode of populations in the 500 cities in the dataset:

print(paste("Mean: ", mean(health$PopulationCount)))
## [1] "Mean:  206041.616"
print(paste("Median: ", median(health$PopulationCount)))
## [1] "Median:  106106"
print(paste("Mode: ", compute_mode(health$PopulationCount)))
## [1] "Mode:  106433"

Note that the mean is much higher than the median and mode. This is because of several high population outliers in our dataset. We can quickly spot these outliers by looking at the top cities in terms of population:

health %>% 
  arrange(desc(PopulationCount)) %>% 
  slice(1:10)  # this extracts the first 10 rows
##        CityName StateAbbr PopulationCount BingeDrinking Smoking
## 1      New York        NY         8175133          15.5    16.8
## 2   Los Angeles        CA         3792621          15.1    15.6
## 3       Chicago        IL         2695598          18.7    19.1
## 4       Houston        TX         2099451          14.9    16.4
## 5  Philadelphia        PA         1526006          17.5    25.5
## 6       Phoenix        AZ         1445632          14.5    19.6
## 7   San Antonio        TX         1327407          15.7    15.2
## 8     San Diego        CA         1307402          17.3    13.1
## 9        Dallas        TX         1197816          15.2    17.9
## 10     Honolulu        HI          953207          19.0    15.5
##    MentalHealth PhysicalActivity Obesity PoorHealth LowSleep
## 1          12.6             28.8    26.3       13.1     41.0
## 2          13.0             25.1    26.7       14.7     37.5
## 3          12.1             27.6    31.3       13.5     37.4
## 4          11.4             29.7    33.9       12.9     36.0
## 5          15.4             29.0    33.9       15.2     44.3
## 6          13.1             24.1    30.1       13.8     35.6
## 7          10.8             29.5    32.9       12.9     36.8
## 8          11.0             21.0    22.2       11.0     34.6
## 9          11.9             30.5    34.3       13.8     34.9
## 10          8.9             22.2    22.1        8.8     48.0

High population cities like New York pull up the mean of the PopulationCount variable.

We can create visualizations that, say, plot the mean of key variables across states, but this collapses multiple observations (i.e. several cities within a state) to a single value. It does not visualize the variability (or spread) of the observations within each state. Let’s look at some cool ways to visualize both the center and the spread of variables…


Spread

Once we know the central tendency of each variable, it’s good to explore how far the other values of the variable are from the central value. This is known as the spread of the data. The simplest indicator of the spread of a variable is the range, or the difference between the maximum and minimum values of the variable. In the example above with our vector x, the range would be the highest value, 6 minus the lowest value, 1, or 5.

Let’s plot the range of the variable LowSleep in Utah, New York, California, and Tennessee against the average value of LowSleep across all 500 cities:

health %>% 
  filter(StateAbbr %in% c("UT", "NY", "CA", "TN", "WY", "FL")) %>%  
  ggplot() +  # pipe filtered data into ggplot() function
  geom_point(aes(x = StateAbbr, y = LowSleep), size = 2, alpha = 0.4) +  # add point geom
  geom_hline(yintercept=mean(health$LowSleep), color = "red") +  # add horizontal line geom
  ylab("% adults sleeping less than 7 hours") +  # change y-axis label
  xlab("") +  # change x-axis label
  ggtitle("500 Cities Health Outcomes")  # add a title

This figure indicates that for all of the cities located in the state of New York, the percent of sampled adults sleeping less than 7 hours was well above the average of the sample for the entire country. It also shows us that we only have one observation for the state of Wyoming (Cheyenne).

A frequently-used indicator of a variable’s spread is the variance. The variance of a sampled variable is computed as follows:

\[ \sigma^2 = \frac{\Sigma (X - \bar{X})^2}{n-1} \]

…where \(X\) is the observed value, \(\bar{X}\) is the mean of the sample, and n is the number of observations in the sample. This statistic gives us a sense of how much a variable deviates from its mean (\(X - \bar{X}\)). The square root of the variance is the standard deviation, or:

\[ \sigma= \sqrt{\frac{\Sigma (X - \bar{X})^2}{n-1}} \]

Like the variance, the standard deviation tells us how far observations in a dataset are from the mean value. A low standard deviation indicates that most of the observations are close to the average value; a high standard deviation indicates that the numbers are spread out away from the average value of the variable. Let’s visualize the mean and standard deviation of the variable BingeDrinking in a subset of states:

health %>% 
  filter(StateAbbr %in% c("UT", "NY", "CA", "TN", "WY", "FL")) %>% 
  group_by(StateAbbr) %>%  # group data by state and compute mn/sd
  summarize(meanBD = mean(BingeDrinking), sdBD = sd(BingeDrinking)) %>%  
  ggplot() + # pipe data into ggplot()
  geom_point(aes(x = StateAbbr, y = meanBD)) +  # add point at meanBD
  geom_errorbar(aes(x = StateAbbr, 
                    ymin = meanBD - sdBD, 
                    ymax = meanBD + sdBD), width = 0.2) +  # add error bars representing sd
  xlab("") +
  ylab("Binge drinking rates") +
  ggtitle("500 Cities Health Outcomes")

Here, we manually compute the mean (mean()) and standard deviation (sd()) of a grouped subset of our data and then use a combination of ggplot2 tools to create a visualization of the mean (point) and spread (lines) of our data. We are unable to compute a standard deviation for Wyoming since there’s only one observation in the state.

Remember that estimates based on small populations tend to show more variance than estimates based on large populations. Let’s create a funnel plot of our population versus values of BingeDrinking to illustrate this:

ggplot(health) +
  geom_point(aes(x = PopulationCount, y = BingeDrinking), alpha = 0.3) +
  xlab("Population") +
  ylab("Rate of binge drinking") +
  ggtitle("500 Cities Health Outcomes") 

Notice how as population increases, rates of binge drinking stabilize between 14 and 17 percent. Cities with populations less than 250,000 have much more variability in binge drinking rates.

Now that you’ve reviewed basic statistics of center and spread, let’s move on to some cool visualization tools that best help us see amounts, distributions, and proportions. In future classes, we’ll also focus on visualizing spatial patterns, change through time, and associations between variables.


Visualizing Spread

The easiest way to visualize the spread of many variables at once is to create a box and whisker plot. These plots visualize the median, the first and third quartiles (the 25th and 75th percentiles), and any outliers that fall beyond the 95th percentile of the data in one elegant plot. They also automatically group data and compute summary statistics, so you can skip that step in your code!

health %>% 
  filter(StateAbbr %in% c("UT", "NY", "CA", "TN", "WY", "FL")) %>% 
  ggplot() +
  geom_boxplot(aes(x = StateAbbr, y = BingeDrinking, fill = StateAbbr)) +
  xlab("") +
  ylab("Rate of binge drinking among adults") 

The median of each distribution is shown with the horizontal dark line. The box around the median represents the 25% to 75% quartiles in the distribution, and the thin vertical lines represent the area describing 95% of the distribution. See that dot in the Florida column? That’s an outlier, or a city whose rate of binge drinking falls outside of the 95% distribution for the dataset. This city has abnormally low rates of binge drinking based on the distribution of city observations for the state of Florida. We can figure out which city this is using our dplyr skills:

health %>% 
  filter(StateAbbr == "FL") %>% 
  filter(BingeDrinking == min(BingeDrinking))
##        CityName StateAbbr PopulationCount BingeDrinking Smoking
## 1 Miami Gardens        FL          107167          12.9    21.9
##   MentalHealth PhysicalActivity Obesity PoorHealth LowSleep
## 1         15.3             31.4    35.5       15.2     43.7

Visualizing a Single Distribution

A very quick way to visualize the shape of a variable is to plot a histogram of that variable. You can do this using either the base R hist() function or using ggplot().

hist(health$LowSleep, breaks = 100, xlab = "Low sleep rates", ylab = "Frequency", main = "Histogram of low sleep rates")

Changing the breaks() argument changes the number of bins into which the variable values are placed. Try it!
You can also build a histogram using ggplot() as follows:

ggplot(data = health) +
  geom_histogram(aes(x = LowSleep), bins = 50) +
  geom_vline(xintercept=mean(health$LowSleep), color = "red") +
  geom_vline(xintercept=median(health$LowSleep), color = "blue") +
    xlab("Rate of respondents sleeping less than 7 hours") +
  ylab("Frequency") +
  ggtitle("500 Cities Health Outcomes") 

Using geom_vline(), I added a vertical red line showing the mean value of LowSleep and a blue line showing the median value. Does it make sense to you that the median value is slightly lower than the mean?


Visualizing Many Distributions

Histograms are cool and all, but what if you want to visualize MULTIPLE distributions at once? Enter the violin plot:

health %>% 
  filter(StateAbbr %in% c("FL", "NY", "CA", "AR")) %>%
  ggplot(aes(x = StateAbbr, y = BingeDrinking)) +  
  geom_violin(aes(fill = StateAbbr)) +
  geom_boxplot(width = 0.1) +
  xlab("") +
  ylab("Rate of binge drinking") +
  ggtitle("500 Cities Health Outcomes") +
  theme(legend.title = element_blank())  # this removes the StateAbbr title on the legend, try building the plot without it!

This slices data into quartiles like a box and whisker plot, but also shows us the shape of the distribution! Cool! Note too that you can put all of the aesthetic information in the ggplot() function if it will be the same across multiple geometries (geom_violin() and geom_boxplot()). This way you don’t have to type the aes(...) for each geometry.


Making an awesome plot

Let’s go back to our box and whisker plot and start playing with color and aesthetics to make our plot truly awesome:

bwplot <- health %>% 
  filter(StateAbbr %in% c("UT", "NY", "CA", "TN", "FL")) %>% 
  ggplot() +
  geom_boxplot(aes(x = StateAbbr, y = BingeDrinking, fill = StateAbbr)) +
  xlab("") +
  ylab("Rate of binge drinking among adults") 

bwplot

We know how to add titles (ggtitle()) and axis labels (xlab() and ylab()) to our plots. What if we also want to do the following:

Let’s work through each of these aesthetic changes one-by-one. Note that the best way to figure out how to change some specific ggplot aesthetic is to ask our friend the internet. To make changing the aesthetic easier, I’ve dumped our plot into a variable called bwplot. I can continue adding things to this plot using + to change the plot contents. See below!

Legends

Changing the legend title is as easy as:

bwplot +
  guides(fill=guide_legend(title="U.S. States"))

What if we want to remove the legend altogether?

bwplot + theme(legend.position="none")

You can read everything you ever wanted to know about ggplot legends here. This includes more on how to move the legend, change the background of the legend, and alter the elements in the legend.

Axes

You can change everything about the axis labels, from font type to font size. This website provides a great overview of altering axes.

The basic formula for changing axis ticks is:

# x axis tick mark labels
bwplor + theme(axis.text.x= element_text(family, face, colour, size))
# y axis tick mark labels
bwplot + theme(axis.text.y = element_text(family, face, colour, size))

Where:

  • family : font family
  • face : font face. Possible values are “plain”, “italica”, “bold”, and “bold.italica”
  • colour : text color
  • size : text size in pts
  • angle : angle (in [0, 360])

So imagine we want to increase the size of the numbers on the axis and tilt these numbers at 45 degrees. Let’s also make them blue using the hex-code for aqua.

bwplot + theme(axis.text.x= element_text(face = "bold", colour = "#00FFFF", size = 14, angle = 45))

Ugly, but if we wanted to do the same for the y-axis, we’d just change axis.text.x to axis.text.y. We can also hide axis tick marks as follows:

bwplot + theme(axis.text.x= element_blank())

If you review the material on this website you can also learn how to change axis lines, tick resolution, tick mark labels, and the order of items in your plot.

Title and labels

Here’s a full overview of how to edit the title and labels in ggplot. I’ll overview some of the “best of” below. The basic formula for altering these elements of your plot is:

# main title
bwplot + theme(plot.title = element_text(family, face, colour, size))
# x axis title 
bwplot + theme(axis.title.x = element_text(family, face, colour, size))
# y axis title
bwplot + theme(axis.title.y = element_text(family, face, colour, size))

Where:

  • family : font family
  • face : font face. Possible values are “plain”, “italica”, “bold”, and “bold.italica”
  • colour : text color
  • size : text size in pts
  • hjust : horizontal justification (in [0, 1])
  • vjust : vertical justification (in [0, 1])
  • lineheight : line height. In multi-line text, the lineheight argument is used to change the spacing between lines.
  • color : an alias for colour

Here’s an example in which we make the title crazy:

bwplot + 
  ggtitle("This. Plot. Is. On. Fireeee.") +
  theme(plot.title = element_text(color = "orange", size = 20, face = "bold.italic", vjust = 10))

You can remove axis labels as follows:

bwplot + theme(axis.title.y =  element_blank())

Reordering elements

If we want to reorder the box and whiskers from high to low drinking rates, we need to go back into the dplyr part of our plot and use the reorder() function that takes the category to reorder (StateAbbr) and the other variable by which to reorder (BingeDrinking):

health %>% 
  filter(StateAbbr %in% c("UT", "NY", "CA", "TN", "FL")) %>% 
  ggplot() +
  geom_boxplot(aes(x = reorder(StateAbbr, desc(BingeDrinking)), y = BingeDrinking, fill = StateAbbr)) +
  xlab("") +
  ylab("Rate of binge drinking among adults") 

Color

A great overview of color in R can be found here. We’ll focus mostly on playing with color using ggplot2, but if you want to learn more about color manipulation using base R, check out this post.

What if we want to override the default color palette assigned to our box and whiskers by ggplot? We can do this manually by selecting the appropriate hex-codes for colors in the plot. I like this website for finding colors:

bwplot +
  scale_fill_manual(values = c("blue", "red", "orange", "#6C443B", "#A93FD3")) 

# note for points you need to use scale_color_manual

We can also change the hue of the color by altering the lightness (l) and the chroma (c, intensity of the color):

bwplot + scale_fill_hue(l = 40, c = 35)

# note for points you need to use scale_color_manual

If you’re like me and not very good at manually selecting colors, you can rely on a color palette already built in R. One of the go-to packages for color manipulation in R is the RColorBrewer package. Be sure it’s installed on your machine before you proceed!

library(RColorBrewer)
display.brewer.all()

The RColorBrewer package contains three general types of palettes:

  • Sequential: sequences of numbers that run from high to low
  • Qualitative: ideal for non-ordered categorical things (think factors)
  • Diverging: great for variables that are centered around zero, where negative and positive values have different interpretations.

Once you find a palette you like, you can visualize it as follows:

display.brewer.pal(n=8, name="Dark2")

Here, the n attribute is the number of discrete colors you want in the palette.

Let’s add a palette we dig to our plot:

bwplot + scale_fill_brewer(palette = "Dark2")

If you’re a big Wes Anderson fan, then try this:

library(wesanderson)
names(wes_palettes)
##  [1] "BottleRocket1"  "BottleRocket2"  "Rushmore1"      "Rushmore"      
##  [5] "Royal1"         "Royal2"         "Zissou1"        "Darjeeling1"   
##  [9] "Darjeeling2"    "Chevalier1"     "FantasticFox1"  "Moonrise1"     
## [13] "Moonrise2"      "Moonrise3"      "Cavalcanti1"    "GrandBudapest1"
## [17] "GrandBudapest2" "IsleofDogs1"    "IsleofDogs2"
wes_palette("Zissou1")

wes_palette("Darjeeling1")

bwplot + scale_fill_manual(values = wes_palette(n=5, name="Darjeeling1"))

If you’re a bit more boring and really want a gray-scale plot, that’s easy too:

bwplot + scale_fill_grey()

Color blind friendly palettes

In recent years, there has been growing awareness about using palettes that are both color blind friendly and that transfer well to gray-scale (i.e. when converted to black and white).

library(viridis)
bwplot + scale_fill_viridis(discrete=T, option="magma") 

Color options include magma, plasma, viridis, inferno, and cividis.

The dichromat package also contains color schemes suitable to folks who are color blind:

library(dichromat)
names(dichromat::colorschemes)
##  [1] "BrowntoBlue.10"         "BrowntoBlue.12"        
##  [3] "BluetoDarkOrange.12"    "BluetoDarkOrange.18"   
##  [5] "DarkRedtoBlue.12"       "DarkRedtoBlue.18"      
##  [7] "BluetoGreen.14"         "BluetoGray.8"          
##  [9] "BluetoOrangeRed.14"     "BluetoOrange.10"       
## [11] "BluetoOrange.12"        "BluetoOrange.8"        
## [13] "LightBluetoDarkBlue.10" "LightBluetoDarkBlue.7" 
## [15] "Categorical.12"         "GreentoMagenta.16"     
## [17] "SteppedSequential.5"

Themes in ggplot

Everything you ever wanted to know about themes found here but let’s play with a few quickly.

theme_bw() remove the gray background:

bwplot +
  theme_bw()

bwplot + 
  theme_minimal()

bwplot +
  theme_dark()

If you want to go theme-crazy, install the ggthemes package:

library(ggthemes)
bwplot +
  theme_tufte()

# The Economist
bwplot + 
  theme_economist()

# Wall Street Journal
bwplot +
  theme_wsj()

You can see all ggthemes here. You can also explore the many many ways you can amp up your ggplotting skills (animations anyone?) here.


Our final plot

health %>% 
  filter(StateAbbr %in% c("UT", "NY", "CA", "TN", "FL")) %>% 
  ggplot() +
  geom_boxplot(aes(x = reorder(StateAbbr, desc(BingeDrinking)), y = BingeDrinking, fill = StateAbbr)) +
  labs(title = "Rates of binge drinking among adults", subtitle = "500 Cities Dataset") +
  guides(fill=guide_legend(title="U.S. States")) +
  ylab("Drinking rates") +
  xlab("") +
  scale_fill_manual(values = wes_palette(n=5, name="Darjeeling1")) +
  theme_bw()


Saving your figures

When working with ggplot2, save figures using the ggsave() function. Note that to use this function, you’ll need to dump your ggplot into a variable:

my_sweet_plot <- ggplot(data = my_data) +
  geom_point(aes(x = YEAR, y = VALUE))

ggsave(my_sweet_plot, "./myfigure.png")

You can read more about writing your figures to a other types of files here.


Additional Resources