Objectives

  • Learn how to use ggplot to visualize your data.
  • Continue to ponder the shocking inequality among countries using the gapminder dataset.
  • Make your first plots with ggplot!

Assumptions

I’m assuming you’re already familiar with the dplyr tools discussed in the previous chapter. I’m also assuming you’ll be really impressed by ggplot by the end of this tutorial.


Packages and Data

We’ll be working with the gapminder dataset again in this chapter, so make sure you’ve installed the dslabs package and loaded it in your library. We’ll also need the tidyverse, which contains our old friend dplyr and our soon-to-be friend ggplot2:

library(tidyverse)
library(dslabs)
data(gapminder)

Remember that our gapminder dataset includes the following variables:

  • country
  • year
  • infant_mortality or infant deaths per 1000
  • life_expectancy or life expectancy
  • fertility or the average number of children per woman
  • population or country population
  • gdp or total GDP according to the World Bank
  • continent
  • region or geographical region

Let’s imagine that we are a team of influential data scientists who have been asked by the United Nations to create visual summaries of global differences in population, life expectancy, and GDP per capita. The UN wants to use this information to highlight global inequality and to justify important funding decisions that will ultimately impact people’s lives. Let’s work together to create a couple of powerful visualizations that illustrate important differences through time and across space in these important indicators of well-being.
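
One wrinkle: the dataset stores total GDP in the gdp column, not GDP per capita. Here’s a minimal sketch of one way to derive it, dividing gdp by population. The names gdp_per_capita and gapminder_pc are my own labels for illustration, not part of the dataset:

```r
library(dplyr)
library(dslabs)
data(gapminder)

# gdp_per_capita and gapminder_pc are hypothetical names chosen for this sketch
gapminder_pc <- gapminder %>%
  mutate(gdp_per_capita = gdp / population)

# peek at the new column for one country
gapminder_pc %>%
  filter(country == "United States") %>%
  select(country, year, gdp, population, gdp_per_capita) %>%
  head()
```

Rows with a missing gdp or population will simply get NA for the new column.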


Introduction to ggplot2

Manipulating data.frames with dplyr is all fine and good, but the fun (yes, I’m telling you, this stuff can be fun!) really starts when you visualize your data. Yet again, the tidyverse dominates with a powerful package called ggplot2. ggplot2 makes it easy for you to create beautiful data visualizations. Check out the ggplot gallery if you don’t believe me! This lab will give you a very short introduction to ggplot2. We’ll use this package quite a bit and eventually learn how to plot spatial data with ggplot2, so pay attention!

Let’s start by plotting data from a single country. Let’s say we’re interested in visualizing how life expectancy has changed in the United States since 1960, the first year in this dataset. Well, with dplyr it’s now easy for us to pull out data for the U.S. from our larger data.frame and assign it to a new, smaller data.frame called us.

us <- gapminder %>%
  filter(country == "United States")  # keep only the rows where country is the US

Easy! Now to plot this. When plotting with ggplot you start by calling the ggplot() function. This creates a blank plot to which you can add data. The first argument of the ggplot() function is the dataset you want to plot. In our case, this is the us data.frame we just built:

ggplot(data=us)

For now the plot is blank because we haven’t told ggplot2 how to deal with the data, i.e. what to put on the x and y axis, what type of graph to create (point, line, bar, etc). We can add additional layers to the plot by using the + symbol. Each layer provides more information about how we’d like to plot the data. Say we want to plot points indicating life expectancy through time. We can use the geom_point function to plot points:

ggplot(data=us) +
  geom_point(mapping = aes(x = year, y = life_expectancy))

You can get a full list of the types of plots at this website under the Layer:geoms section. We’ll work with quite a few in this class. The geom_point() geometry function takes a mapping argument in which we specify the aesthetics (aes) and indicate which variable we’d like to plot on the x axis (year) and which we’d like to plot on the y axis (life_expectancy). I know, this isn’t the most elegant way to do this, but once you get past the mapping/aesthetic specifications, adding additional detail is very easy. Unlike dplyr, which uses pipes (%>%), ggplot uses layering, symbolized with a plus sign (+). Let’s add a few new details to our plot to see how layering works. It could use better axis labels and a clear title. It could also be nice to change the color of the points to make them stand out a bit more:

ggplot(data=us) +
  geom_point(mapping = aes(x = year, y = life_expectancy), color = "blue") +
  labs(x = "Year",
       y = "Life expectancy",
       title = "Life expectancy in the US")

Much better. Depending on what we want to plot, we can change the geometry object we use. If we want a smoothed line rather than dots, we could replace geom_point() with geom_smooth():

ggplot(data=us) +
  geom_smooth(mapping = aes(x = year, y = life_expectancy), color = "blue") +
  labs(x = "Year",
       y = "Life expectancy",
       title = "Life expectancy in the US")
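
Keep in mind that geom_smooth() draws a fitted trend with a shaded confidence band around it, not the raw values. Here’s a small sketch of two variations you might try: se = FALSE drops the band, and geom_line() connects the observed values directly. The object names smooth_plot and line_plot are just labels I picked for this example:

```r
library(tidyverse)
library(dslabs)
data(gapminder)

us <- gapminder %>%
  filter(country == "United States")

# se = FALSE drops the shaded confidence band around the smoothed line
smooth_plot <- ggplot(data = us) +
  geom_smooth(mapping = aes(x = year, y = life_expectancy),
              color = "blue", se = FALSE)

# geom_line() connects the observed values, with no smoothing at all
line_plot <- ggplot(data = us) +
  geom_line(mapping = aes(x = year, y = life_expectancy), color = "blue")

smooth_plot
line_plot
```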

Another cool trick is that we can pipe %>% our dplyr manipulations straight into a ggplot(). Let’s try this with the U.S.

gapminder %>%
  filter(country == "United States") %>%
  ggplot() + # switch to layering with the plus sign once you call the ggplot() function
  geom_smooth(mapping = aes(x = year, y = life_expectancy), color = "blue") +
  labs(x = "Year",
       y = "Life expectancy",
       title = "Life expectancy in the US")

If we wanted to make a similar plot for another country, we’d only need to change that small part of the code. Let’s make a life expectancy plot for Sierra Leone using this approach:

gapminder %>%
  filter(country == "Sierra Leone") %>% # <-- just change country name!
  ggplot() +
  geom_smooth(mapping = aes(x = year, y = life_expectancy), color = "blue") +
  labs(x = "Year",
       y = "Life expectancy",
       title = "Life expectancy")

What if we wanted to compare life expectancy in the United States with life expectancy in Sierra Leone in a single plot?

gapminder %>%
  filter(country %in% c("Sierra Leone", "United States")) %>%
  ggplot() +
  geom_smooth(mapping = aes(x = year, y = life_expectancy, color = country)) +
  labs(x = "Year",
       y = "Life expectancy",
       title = "Life expectancy")

First, look at the data and react to what this means in the real world. Life expectancy in the US has consistently remained 20 to 30 years higher than in Sierra Leone over the last half-century.

Second, remember, if you want to keep rows that match any of several values, use %in% rather than ==. This selects all rows where country matches one of the countries in the vector we created using c(). Since our data.frame contains information from two countries, we can add an argument inside aes() in the geom_smooth() function that tells ggplot() to group observations by country and to symbolize them using two different colors (color = country). Alternatively, we could use group = country instead, which would create two separate lines of the same color. Try it!
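
Why does %in% matter so much? Here’s a tiny sketch using a toy data.frame I made up for illustration (it is not the gapminder data). With ==, R recycles the two-element vector down the column and compares position by position, which silently drops rows you wanted:

```r
library(dplyr)

# toy data, invented for this example
toy <- tibble(
  country = c("A", "B", "A", "B", "C", "C"),
  value   = 1:6
)

# %in% asks: is each country one of these values?
with_in <- toy %>% filter(country %in% c("B", "A"))

# == compares row 1 to "B", row 2 to "A", row 3 to "B", and so on
with_eq <- toy %>% filter(country == c("B", "A"))

nrow(with_in)  # 4
nrow(with_eq)  # 0
```

No error, no warning, just the wrong answer. That’s what makes this mistake so easy to miss.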

What if we wanted to add our original points to this visualization? We could just add back the geom_point() geometry we used earlier:

gapminder %>%
  filter(country %in% c("Sierra Leone", "United States")) %>%
  ggplot() +
  geom_smooth(mapping = aes(x = year, y = life_expectancy, color = country)) +
  geom_point(mapping = aes(x = year, y = life_expectancy)) +
  labs(x = "Year",
       y = "Life expectancy",
       title = "Life expectancy")
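
Notice that we typed the same x and y mapping twice. One way to avoid the repetition, sketched below, is to put the shared mapping in the ggplot() call, where every layer inherits it. Each geometry then only adds what is specific to it. The name shared is just a label for this example:

```r
library(tidyverse)
library(dslabs)
data(gapminder)

shared <- gapminder %>%
  filter(country %in% c("Sierra Leone", "United States")) %>%
  ggplot(mapping = aes(x = year, y = life_expectancy)) +  # inherited by every layer
  geom_smooth(mapping = aes(color = country)) +           # adds color on top of x and y
  geom_point() +                                          # uses the inherited x and y
  labs(x = "Year",
       y = "Life expectancy",
       title = "Life expectancy")

shared
```

This should draw the same picture as the code above. I’ll keep writing the mapping inside each geometry in this chapter so each layer is explicit.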

And what if we wanted to do something really cool like, say, change the size of the points to reflect the country’s population in each year?

gapminder %>%
  filter(country %in% c("Sierra Leone", "United States")) %>%
  ggplot() +
  geom_smooth(mapping = aes(x = year, y = life_expectancy, color = country)) +
  geom_point(mapping = aes(x = year, y = life_expectancy, size = population)) +
  labs(x = "Year",
       y = "Life expectancy",
       title = "Life expectancy")

Every ggplot you will ever make (including maps) follows a similar logic. Start by calling the ggplot() function. Tell ggplot() which dataset you want to visualize. Then slowly add (+) the geometries and other information you want to visualize.


Improving our visualization

Go ahead and say it, that last viz was U-G-L-Y. In what follows, I’m going to show you some of my favorite tricks to improve the quality of visualizations made in ggplot. Let’s revisit one of our plots above. I’m going to save this plot in an object (p). This means if/as we want to change the plot, we simply have to call the plot and add additional layers! Much easier than re-copying the code a million times!

p <- gapminder %>%
  filter(country %in% c("Sierra Leone", "United States")) %>%
  ggplot() +
  geom_smooth(mapping = aes(x = year, y = life_expectancy, color = country)) +
  geom_point(mapping = aes(x = year, y = life_expectancy)) +
  labs(x = "Year",
       y = "Life expectancy",
       title = "Life expectancy")

p

Ok, there are a few things that could be improved here:

  • The legend is weird.
  • We can do better than the default ggplot gray square background.
  • Let’s change the color of the lines and points.
  • Lots more…?

Themes

One of my favorite things about ggplot is the ability to use themes. A theme sets the grid marks, grid color, axis label text and size, and other parameters. The default theme in ggplot is OK, but I’m not a fan of the big gray background. Changing your theme is a very easy way to change your plot. Everything you ever wanted to know about themes can be found here, but let’s play with a few quickly.

theme_bw() removes the gray background:

p +
  theme_bw()

p + 
  theme_minimal()

p +
  theme_dark()
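
The built-in themes are functions, and they take arguments. A handy one is base_size, which scales all of the text in the plot at once. A quick sketch, rebuilding p so the block runs on its own (big_text is a name I made up):

```r
library(tidyverse)
library(dslabs)
data(gapminder)

# rebuild the p object from above so this block is self-contained
p <- gapminder %>%
  filter(country %in% c("Sierra Leone", "United States")) %>%
  ggplot() +
  geom_smooth(mapping = aes(x = year, y = life_expectancy, color = country)) +
  geom_point(mapping = aes(x = year, y = life_expectancy)) +
  labs(x = "Year", y = "Life expectancy", title = "Life expectancy")

# base_size sets the base font size (in pts) for all text in the theme
big_text <- p + theme_minimal(base_size = 16)
big_text
```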

If you want to go theme-crazy (YES), install the ggthemes package:

library(ggthemes)
p +
  theme_tufte()

I love me some Tufte. To read more about his work, including his minimalist redesign of the box and whisker plot we’ll learn about next week, check this out.

# The Economist
p + 
  theme_economist()

You can see all ggthemes here. You can also explore the many many ways you can amp up your ggplotting skills (animations anyone?) here.


Legends

Moving forward, I’ll go with the theme_bw(). Let’s now shift to altering our ugly legend. Changing the legend title is as easy as:

p +
  theme_bw() +
  labs(color = "Country")

Note that if your legend refers to something other than color (size, alpha, fill), then you’d just use that aesthetic instead of color (so in the labs() function, add an argument like fill = "NAME OF FILL VARIABLE"). What if we just want to drop the legend title, since it’s fairly clear that we’re talking about countries here:

p +
  theme_bw() +
  theme(legend.title = element_blank())

Finally, if you’re particular about your visualizations, you can drop the gray background in the legend with the following:

p +
  theme_bw() +
  theme(legend.title = element_blank()) + 
  guides(color=guide_legend(override.aes=list(fill=NA)))

Nice. Notice that if you look at ALL of the code used to make p, it can be very overwhelming. If, however, you think of this as a layered graphic and approach things (with some help from StackExchange) one layer at a time, you can easily make fairly complex graphics.

Now what if we want to remove the legend altogether?

p + 
  theme_bw() +
  theme(legend.position="none")

You can read everything you ever wanted to know about ggplot legends here and also here. This includes more on how to move the legend, change the background of the legend, and alter the elements in the legend.
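
As a quick taste of moving the legend, here’s a sketch that puts it under the plot. legend.position also accepts "top", "left", and "right". I rebuild p so the block runs on its own, and legend_bottom and legend_reset are names I made up:

```r
library(tidyverse)
library(dslabs)
data(gapminder)

# rebuild the p object from above so this block is self-contained
p <- gapminder %>%
  filter(country %in% c("Sierra Leone", "United States")) %>%
  ggplot() +
  geom_smooth(mapping = aes(x = year, y = life_expectancy, color = country)) +
  geom_point(mapping = aes(x = year, y = life_expectancy)) +
  labs(x = "Year", y = "Life expectancy", title = "Life expectancy")

# theme() tweaks go AFTER the complete theme
legend_bottom <- p +
  theme_bw() +
  theme(legend.position = "bottom")

# reversed order: theme_bw() is a complete theme and resets the legend position
legend_reset <- p +
  theme(legend.position = "bottom") +
  theme_bw()

legend_bottom
legend_reset
```

Order matters here. A complete theme like theme_bw() replaces any theme() settings added before it, so always add your theme() tweaks last.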


Axes

Ok, let’s update our plot with the new legend that drops the “Country” title:

# this replaces the original p object with a new object with our legend edits
p <- p +
  theme_bw() +
  theme(legend.title = element_blank()) + 
  guides(color=guide_legend(override.aes=list(fill=NA)))

You can change everything about the axis labels, from font type to font size. This website provides a great overview of altering axes.

The basic formula for changing axis tick labels is:

# x axis tick mark labels
p + theme(axis.text.x= element_text(family, face, colour, size))
# y axis tick mark labels
p + theme(axis.text.y = element_text(family, face, colour, size))

Where:

  • family : font family
  • face : font face. Possible values are “plain”, “italic”, “bold”, and “bold.italic”
  • colour : text color (color also works)
  • size : text size in pts
  • angle : rotation angle in degrees (in [0, 360])

So imagine we want to increase the size of the numbers on the axis and tilt these numbers at 45 degrees. Let’s also make them blue using the hex-code for aqua.

p + theme(axis.text.x= element_text(face = "bold", colour = "#00FFFF", size = 14, angle = 45))

Ugly, but if we wanted to do the same for the y-axis, we’d just change axis.text.x to axis.text.y. We can also hide the axis tick labels as follows:

p + theme(axis.text.x= element_blank())

If you review the material on this website you can also learn how to change axis lines, tick resolution, tick mark labels, and the order of items in your plot.
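
For example, tick resolution is controlled by the scale functions rather than by theme(). Here’s a minimal sketch that puts a tick on the x axis every ten years. The break values are my own choice for this dataset, and decade_ticks is a made-up name:

```r
library(tidyverse)
library(dslabs)
data(gapminder)

# rebuild the p object from above so this block is self-contained
p <- gapminder %>%
  filter(country %in% c("Sierra Leone", "United States")) %>%
  ggplot() +
  geom_smooth(mapping = aes(x = year, y = life_expectancy, color = country)) +
  geom_point(mapping = aes(x = year, y = life_expectancy)) +
  labs(x = "Year", y = "Life expectancy", title = "Life expectancy")

# breaks sets where the ticks (and their labels) appear on the x axis
decade_ticks <- p +
  scale_x_continuous(breaks = seq(1960, 2010, by = 10))

decade_ticks
```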


Title and labels

Here’s a full overview of how to edit the title and labels in ggplot. I’ll overview some of the “best of” below. The basic formula for altering these elements of your plot is:

# main title
p + theme(plot.title = element_text(family, face, colour, size))
# x axis title 
p + theme(axis.title.x = element_text(family, face, colour, size))
# y axis title
p + theme(axis.title.y = element_text(family, face, colour, size))

Where:

  • family : font family
  • face : font face. Possible values are “plain”, “italic”, “bold”, and “bold.italic”
  • colour : text color
  • size : text size in pts
  • hjust : horizontal justification (in [0, 1])
  • vjust : vertical justification (in [0, 1])
  • lineheight : line height. In multi-line text, the lineheight argument is used to change the spacing between lines.
  • color : an alias for colour

Here’s an example in which we make the title crazy:

p + 
  labs(title = "This. Plot. Is. On. Fireeee.") +
  theme(plot.title = element_text(color = "orange", size = 20, face = "bold.italic"))
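
A more everyday use of these arguments is centering the title with hjust. In this sketch, 0 means left-aligned, 0.5 centered, and 1 right-aligned. I rebuild p so the block runs on its own, and centered is a made-up name:

```r
library(tidyverse)
library(dslabs)
data(gapminder)

# rebuild the p object from above so this block is self-contained
p <- gapminder %>%
  filter(country %in% c("Sierra Leone", "United States")) %>%
  ggplot() +
  geom_smooth(mapping = aes(x = year, y = life_expectancy, color = country)) +
  geom_point(mapping = aes(x = year, y = life_expectancy)) +
  labs(x = "Year", y = "Life expectancy", title = "Life expectancy")

# hjust = 0.5 centers the title horizontally
centered <- p +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

centered
```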

You can remove axis labels as follows:

p + theme(axis.title.y = element_blank(),
          axis.title.x = element_blank())

Now, to actually edit the text content of titles and axis labels, I use the labs function:

p +
  labs(x = "XAXIS LABEL HERE",
       y = "YAXIS LABEL HERE",
       title = "TITLE HERE",
       subtitle = "Nice, subtitle here",
       caption = "Oh wow, caption here!") # this is also where you can add legend titles for size, color, fill etc...


Color

A great overview of color in R can be found here. We’ll focus mostly on playing with color using ggplot2, but if you want to learn more about color manipulation using base R, check out this post.

What if we want to override the default color palette that ggplot assigned to our lines? We can do this manually by selecting the appropriate hex-codes for colors in the plot:

p +
  scale_color_manual(values = c("#6C443B", "#A93FD3")) 

What’s that crazy #6C...? That’s a HEX code. It’s a code that tells your computer the exact color you want to use. Of course, you can also use basic color words like “orange” or “yellow”, but sometimes HEX codes are more fun. I like this website for finding colors.

If you’re like me and not very good at manually selecting colors, you can rely on a color palette already built in R. One of the go-to packages for color manipulation in R is the RColorBrewer package. Be sure it’s installed on your machine before you proceed!

library(RColorBrewer)
display.brewer.all()

The RColorBrewer package contains three general types of palettes:

  • Sequential: ideal for ordered data that run from low to high
  • Qualitative: ideal for non-ordered categorical things (think factors)
  • Diverging: great for variables that are centered around zero, where negative and positive values have different interpretations.

Once you find a palette you like, you can visualize it as follows:

display.brewer.pal(n=8, name="Dark2")

Here, the n argument is the number of discrete colors you want in the palette.
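
If you’d rather see the actual HEX codes behind a palette, brewer.pal() returns them as a character vector. Here’s a small sketch (dark2_three is a name I made up). Note that brewer.pal() needs n to be at least 3:

```r
library(RColorBrewer)

# ask for the first three colors of the Dark2 palette as HEX codes
dark2_three <- brewer.pal(n = 3, name = "Dark2")

dark2_three
```

You can pass a vector like this to the values argument of scale_color_manual(), which is handy when you want to pick specific colors out of a palette.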

Let’s add a palette we dig to our plot:

p + 
  scale_color_brewer(palette = "Dark2")

This applies the first two colors of that palette to our two groups. The sky is the limit when it comes to colors and ggplot. For example, if you’re a big Wes Anderson fan, then try this:

library(wesanderson)
names(wes_palettes)
##  [1] "BottleRocket1"  "BottleRocket2"  "Rushmore1"      "Rushmore"      
##  [5] "Royal1"         "Royal2"         "Zissou1"        "Darjeeling1"   
##  [9] "Darjeeling2"    "Chevalier1"     "FantasticFox1"  "Moonrise1"     
## [13] "Moonrise2"      "Moonrise3"      "Cavalcanti1"    "GrandBudapest1"
## [17] "GrandBudapest2" "IsleofDogs1"    "IsleofDogs2"
wes_palette("Darjeeling1")

If you’re a bit more boring and really want a gray-scale plot, that’s easy too:

p + 
  scale_color_grey()


Colorblind-friendly palettes

In recent years, there has been growing awareness about using palettes that are both color blind friendly and that transfer well to gray-scale (i.e. when converted to black and white). The viridis package provides palettes designed with both goals in mind:

library(viridis)
p + 
  scale_color_viridis(discrete = TRUE, option = "magma") 

Color options include magma, plasma, viridis, inferno, and cividis.
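
If you don’t want to install an extra package, ggplot2 ships its own viridis scales. Here’s a sketch using scale_color_viridis_d(), where the _d stands for discrete. I rebuild p so the block runs on its own, and magma_plot is a made-up name:

```r
library(tidyverse)
library(dslabs)
data(gapminder)

# rebuild the p object from above so this block is self-contained
p <- gapminder %>%
  filter(country %in% c("Sierra Leone", "United States")) %>%
  ggplot() +
  geom_smooth(mapping = aes(x = year, y = life_expectancy, color = country)) +
  geom_point(mapping = aes(x = year, y = life_expectancy)) +
  labs(x = "Year", y = "Life expectancy", title = "Life expectancy")

# built-in discrete viridis scale; option picks the palette
magma_plot <- p +
  scale_color_viridis_d(option = "magma")

magma_plot
```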

The dichromat package also contains color schemes suitable to folks who are color blind:

library(dichromat)
names(dichromat::colorschemes)
##  [1] "BrowntoBlue.10"         "BrowntoBlue.12"         "BluetoDarkOrange.12"   
##  [4] "BluetoDarkOrange.18"    "DarkRedtoBlue.12"       "DarkRedtoBlue.18"      
##  [7] "BluetoGreen.14"         "BluetoGray.8"           "BluetoOrangeRed.14"    
## [10] "BluetoOrange.10"        "BluetoOrange.12"        "BluetoOrange.8"        
## [13] "LightBluetoDarkBlue.10" "LightBluetoDarkBlue.7"  "Categorical.12"        
## [16] "GreentoMagenta.16"      "SteppedSequential.5"

I’ve found this color-blindness simulator helpful when thinking about palette choices.


Our final plot

Let’s wrap this all up in one lovely plot that includes some additional countries:

gapminder %>%
  filter(country %in% c("Sierra Leone", "United States", "Italy", "Nigeria", "India")) %>%
  ggplot() +
  geom_smooth(mapping = aes(x = year, y = life_expectancy, color = country)) +
  geom_point(mapping = aes(x = year, y = life_expectancy), alpha = 0.2) + # alpha makes points transparent
  labs(x = "",
       y = "Life expectancy",
       title = "Life expectancy",
       caption = "Source: gapminder dataset (dslabs package)") +
  theme_minimal() +
  theme(legend.title = element_blank()) + 
  guides(color=guide_legend(override.aes=list(fill=NA))) +
  scale_color_manual(values = wes_palette(n=5, name="Darjeeling1"))


Saving plots

When working with ggplot, save figures using the ggsave() function. By default ggsave() saves the last plot you displayed, but it’s safer to store your ggplot in a variable and pass it explicitly with the plot argument. Note that the file name comes first:

my_sweet_plot <- ggplot(data = my_data) +
  geom_point(aes(x = YEAR, y = VALUE))

ggsave("./myfigure.png", plot = my_sweet_plot)

You can read more about writing your figures to other types of files here.
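
Here’s a slightly fuller sketch that controls the size and resolution of the saved figure. ggsave() picks the file type from the extension, so switching to PDF only means changing the file name. The toy data and the file names are made up for illustration, and I write to a temporary folder so nothing lands in your project:

```r
library(ggplot2)

# toy data, invented for this example
my_data <- data.frame(YEAR = 2000:2004, VALUE = c(3, 5, 4, 6, 8))

my_sweet_plot <- ggplot(data = my_data) +
  geom_point(aes(x = YEAR, y = VALUE))

# hypothetical output paths inside R's temporary directory
out_png <- file.path(tempdir(), "myfigure.png")
out_pdf <- file.path(tempdir(), "myfigure.pdf")

# width and height are in the given units; dpi sets the resolution of raster files
ggsave(out_png, plot = my_sweet_plot, width = 6, height = 4, units = "in", dpi = 300)
ggsave(out_pdf, plot = my_sweet_plot, width = 6, height = 4, units = "in")
```

Setting width and height explicitly means the figure comes out the same size no matter how big your plot window happens to be.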


Critical Reflection Questions

  1. What makes a visualization a good visualization? Work together to come up with a set of criteria for our visualizations. To do this, each of you will go out and search the InterWebs for a data visualization you find particularly awful or particularly epic. Then do the following:
  • Share the visualization with classmates and list the factors that make this visualization epic or awful.
  • React to another classmate’s visualization and add factors that you think make this visualization epic or awful.
  • Then, we’ll discuss your examples in class as we work towards a set of “visualization best practice” rules.

DIY Questions

Get as close as you can to re-creating this visualization from the gapminder dataset in the dslabs package:


Additional Resources