Objectives


Packages

Make sure the tidyverse suite of packages is installed on your machine and loaded in your R session. The tidyverse includes both the dplyr and ggplot2 packages.

library(tidyverse)

Our data

We’ll work with the 500 Cities data again this week. You can read more about the data (and play around with the full dataset!) here.

health <- readRDS("./data/health.RDS")
glimpse(health)
## Observations: 500
## Variables: 10
## $ CityName         <fct> Abilene, Akron, Alameda, Albany, Albany, Albu...
## $ StateAbbr        <fct> TX, OH, CA, GA, NY, NM, VA, CA, TX, PA, TX, C...
## $ PopulationCount  <int> 117063, 199110, 73812, 77434, 97856, 545852, ...
## $ BingeDrinking    <dbl> 16.2, 14.8, 15.0, 10.9, 15.5, 14.5, 15.1, 12....
## $ Smoking          <dbl> 19.6, 26.8, 11.9, 23.7, 19.0, 18.8, 13.0, 12....
## $ MentalHealth     <dbl> 11.6, 15.3, 9.8, 16.2, 13.2, 11.6, 8.4, 10.1,...
## $ PhysicalActivity <dbl> 27.7, 31.0, 18.7, 33.1, 26.1, 20.4, 17.6, 24....
## $ Obesity          <dbl> 33.7, 37.3, 18.7, 40.4, 31.1, 25.5, 23.3, 18....
## $ PoorHealth       <dbl> 12.6, 15.5, 9.6, 17.4, 13.1, 12.1, 8.4, 11.4,...
## $ LowSleep         <dbl> 35.4, 44.1, 32.3, 46.9, 39.7, 32.8, 34.5, 38....

Here’s some additional information about each of the variables in the dataset:


One classy violin plot

This week, we’ll continue working on becoming professional-ggplotters by creating a really awesome violin plot. Violin plots are like box-and-whisker plots in that they let us quickly visually compare a single variable across groups. They are even cooler than box-and-whisker plots (I know, I know, how is it POSSIBLE to be cooler than our friends B&W!?) because not only do they show the center and spread of a variable, they also show us the shape of the distributions.
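The “shape” a violin shows is a smoothed (kernel) density estimate of the variable, mirrored left and right. Here is a minimal sketch of that idea using base R’s density(). The smoking rates below are made up for illustration and are not taken from the 500 Cities data:

```r
# Made-up smoking rates for a handful of hypothetical cities (illustrative only)
toy_rates <- c(11.9, 12.4, 13.0, 14.8, 15.5, 18.8, 19.0, 19.6, 23.7, 26.8)

# density() computes the smoothed curve that a violin mirrors left and right
d <- density(toy_rates)

# d$x holds smoking-rate values, d$y the estimated density at each value.
# Plot the curve sideways, leaving room on the left for its mirror image.
plot(d$y, d$x, type = "l",
     xlim = c(-max(d$y), max(d$y)),
     xlab = "Density", ylab = "Smoking rate")
lines(-d$y, d$x)  # the mirrored curve completes the violin outline
```

Wide parts of the outline mark smoking rates shared by many cities, and narrow parts mark rates that are rare.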

Say we’re working for the CDC and are interested in allocating lots of money to US states to address smoking. Let’s pick four states to compare using filter() and visualize the distributions of smoking rates among cities in each state using a violin plot (geom_violin()):

p <- health %>%
  filter(StateAbbr %in% c("UT", "FL", "NY", "ID")) %>%
  ggplot() +  
  geom_violin(aes(x = StateAbbr, y = Smoking))  
p
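If the filter() step with %in% is new to you, here is a small sketch of what it does. The toy_health table is a hypothetical stand-in with made-up values, not the real health data:

```r
library(dplyr)

# Toy stand-in for the health data (values are made up for illustration)
toy_health <- tibble(
  CityName  = c("A", "B", "C", "D", "E"),
  StateAbbr = c("UT", "FL", "TX", "FL", "NY"),
  Smoking   = c(9.5, 18.2, 19.6, 20.1, 15.0)
)

# %in% is TRUE for rows whose state appears in the vector,
# so filter() keeps only those rows (the TX city is dropped)
toy_subset <- toy_health %>%
  filter(StateAbbr %in% c("UT", "FL", "NY", "ID"))

# count() shows how many cities remain in each state
toy_subset %>% count(StateAbbr)
```

The violin plot above applies the same kind of filter to the full health data before plotting.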

This plot shows us a bit more information than a box-and-whisker plot, since it also gives us a sense of the shape of the city-level distribution in each state. We’re missing, however, the visual cues about the median and spread (quartiles) of each state’s smoking rates that box-and-whisker plots provide by default. Let’s add those to our plot object p:

p + geom_violin(aes(x = StateAbbr, y = Smoking), draw_quantiles = c(0.25, 0.5, 0.75))  

Nice! Another way to do this is:

p + geom_boxplot(aes(x = StateAbbr, y = Smoking), width = 0.2)
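Both versions mark the quartiles of each state. To see the numbers behind lines like these, you can compute the quartiles per group yourself. This is a sketch on made-up values (toy_health is hypothetical). Note that geom_violin() takes its quantile lines from the density estimate, so they can differ slightly from the sample quartiles computed here:

```r
library(dplyr)

# Made-up smoking rates for two states (illustrative only)
toy_health <- tibble(
  StateAbbr = rep(c("FL", "NY"), each = 5),
  Smoking   = c(15, 17, 18, 20, 22, 12, 14, 15, 16, 19)
)

# One row per state with its 25th, 50th and 75th percentiles
toy_quartiles <- toy_health %>%
  group_by(StateAbbr) %>%
  summarise(
    q25    = quantile(Smoking, 0.25, names = FALSE),
    median = quantile(Smoking, 0.50, names = FALSE),
    q75    = quantile(Smoking, 0.75, names = FALSE)
  )
toy_quartiles
```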

I’m going to stick with the first version because I think it looks much better. Ok, let’s update the colors of the violin plots. I’ll also update the legend title, reorder the violins from highest to lowest average smoking rate, and update the axis labels and title (see last week’s tutorial for an explanation of this):

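# Build from the data rather than from p, so that p's plain violins are not
# drawn underneath and the x-axis order comes from reorder()
p2 <- health %>%
  filter(StateAbbr %in% c("UT", "FL", "NY", "ID")) %>%
  ggplot() +
  geom_violin(aes(x = reorder(StateAbbr, desc(Smoking)), y = Smoking, fill = StateAbbr), draw_quantiles = c(0.25, 0.5, 0.75)) +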
  guides(fill=guide_legend(title="U.S. States")) +
  xlab("") +
  ylab("Smoking rates") +
  labs(title = "Rates of smoking among adults over 18", subtitle = "500 Cities Dataset")
p2
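A quick note on reorder(): it sorts the levels of a factor by a summary of a second variable, using the mean of each group by default. Here is a base R sketch with made-up values. desc() from dplyr negates numeric values, so a plain minus sign stands in for it here:

```r
# Made-up smoking rates for six hypothetical cities in three states
state   <- factor(c("FL", "FL", "NY", "NY", "UT", "UT"))
smoking <- c(15, 17, 20, 22, 8, 10)

# Default factor levels are alphabetical
levels(state)

# reorder() sorts levels by the mean of the second argument within each level;
# negating the values puts the state with the highest mean first
by_smoking <- reorder(state, -smoking)
levels(by_smoking)
```

In this toy example NY has the highest average (21), so it moves to the front, followed by FL (16) and UT (9).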

This looks OK, but I’d like to remove the gray background:

p3 <- p2 + theme_bw()
p3

Finally, I’d like to move the legend to the bottom:

p4 <- p3 + theme(legend.position = "bottom")
p4

Here’s all the code needed to reproduce this figure:

final_violin <- health %>%
  filter(StateAbbr %in% c("UT", "FL", "NY", "ID")) %>%
  ggplot() +  
  geom_violin(aes(x = reorder(StateAbbr, desc(Smoking)), y = Smoking, fill = StateAbbr), draw_quantiles = c(0.25, 0.5, 0.75)) +
  guides(fill=guide_legend(title="U.S. States")) +
  xlab("") +
  ylab("Smoking rates") +
  labs(title = "Rates of smoking among adults over 18", subtitle = "500 Cities Dataset") +
  theme_bw() +
  theme(legend.position = "bottom") 

final_violin

Now wouldn’t it ALSO be really really cool to add the actual data points to this plot?? Let’s do it!

final_violin +
  geom_point(aes(x = reorder(StateAbbr, desc(Smoking)), y = Smoking))

In Florida, there are so many cities that it’s a bit hard to distinguish between points. Let’s “jitter” the points: geom_jitter() adds a small amount of random noise to each point’s position to prevent over-plotting. In this case, it spreads the data points to the left and right so we can better see distinct cities.

final_violin +
  geom_jitter(aes(x = reorder(StateAbbr, desc(Smoking)), y = Smoking))

That’s a bit too much jitter. We can reduce the width of the jitter like so:

final_violin +
  geom_jitter(aes(x = reorder(StateAbbr, desc(Smoking)), y = Smoking), width=0.05)
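Because jitter is random, the points land in slightly different places every time you redraw the plot. If you want the same picture each time, one option is to pass position_jitter() with a seed to geom_point(). This is a sketch on made-up data (toy_health is hypothetical, and the seed value 42 is arbitrary):

```r
library(ggplot2)

# Made-up smoking rates for two states (illustrative only)
toy_health <- data.frame(
  StateAbbr = rep(c("FL", "NY"), each = 5),
  Smoking   = c(15, 17, 18, 20, 22, 12, 14, 15, 16, 19)
)

toy_plot <- ggplot(toy_health, aes(x = StateAbbr, y = Smoking)) +
  geom_violin() +
  # width controls the left-right spread; height = 0 leaves the smoking
  # rates themselves untouched; seed makes the random offsets repeatable
  geom_point(position = position_jitter(width = 0.05, height = 0, seed = 42))
toy_plot
```

Setting height = 0 is a design choice here: the vertical position carries the actual smoking rate, so only the horizontal position is jittered.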

Saving your figures

When working with ggplot2, save figures using the ggsave() function. By default, ggsave() saves the last plot you displayed, but it’s safer to store your ggplot in a variable and pass it in explicitly:

my_sweet_plot <- ggplot(data = my_data) +
  geom_point(aes(x = YEAR, y = VALUE))

# The file name comes first; the plot is passed through the plot argument
ggsave("./myfigure.png", plot = my_sweet_plot)

You can read more about writing your figures to other types of files here.
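As a quick sketch of a few commonly used ggsave() options: the file extension picks the output format, and width, height and dpi control the size and resolution. The toy data, file names and dimensions below are made up for illustration:

```r
library(ggplot2)

# Made-up data and plot (illustrative only)
toy_data <- data.frame(YEAR = 2010:2014, VALUE = c(3, 5, 4, 6, 7))
toy_plot <- ggplot(toy_data) +
  geom_point(aes(x = YEAR, y = VALUE))

# tempdir() keeps this sketch from cluttering your project; use your own path
png_path <- file.path(tempdir(), "toy_figure.png")
pdf_path <- file.path(tempdir(), "toy_figure.pdf")

# width and height are in inches by default; dpi sets the resolution
# of raster formats such as PNG
ggsave(png_path, plot = toy_plot, width = 6, height = 4, dpi = 300)
ggsave(pdf_path, plot = toy_plot, width = 6, height = 4)
```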


Additional Resources