Objectives

  • Learn how to use dplyr to clean and wrangle your data.
  • Learn how to compute some basic descriptive statistics in R.
  • Ponder the shocking inequality among countries using the gapminder dataset.

Assumptions

  • I am now assuming that you’ve taken the time to familiarize yourself with the basics of the R language. I will no longer explain basic concepts like data class, data mode, or variables versus observations.

Packages and Data

If you haven’t already installed the dslabs package, type install.packages("dslabs") in the Console to download the package from the CRAN repository onto your computer. Once the package is installed, you can load it into an R session with the library() command. This package contains a number of cool datasets. Today, we’ll play with the gapminder dataset, which describes the demographic and socioeconomic attributes of countries around the world. You can learn more about the Gapminder project at www.gapminder.org.

library(dslabs) # load the dslabs package
data(gapminder) # pull the gapminder data into your RStudio session

The data() function pulls the gapminder dataset that was downloaded with the dslabs package into your Global Environment (window to the right in RStudio). We’ll learn how to import other types of data in future tutorials.

You’ll also want to install and load the tidyverse package. This is actually a suite of packages that includes dplyr for cleaning data (this chapter) and ggplot2 for visualizing data (next chapter). You can read more about the cool tools of the tidyverse at www.tidyverse.org.

library(tidyverse) # this loads a suite of packages including dplyr and ggplot2

Talking to your data with dplyr

Let’s inspect the new gapminder dataset.

One of the first things I tend to do when I load a new dataset into my RStudio session is to use the head() function to look at the first few rows of the data. This lets me confirm that the data was, in fact, loaded correctly. Remember that functions are the verbs of the R programming language. Functions tend to take arguments, or inputs that give the verb a bit more information about how it should act. In our case, in order for the head() function to “return the first or last parts of a vector, matrix, table, data frame or function”, it needs to know which data it should inspect:

head(gapminder)
##               country year infant_mortality life_expectancy fertility
## 1             Albania 1960           115.40           62.87      6.19
## 2             Algeria 1960           148.20           47.50      7.65
## 3              Angola 1960           208.00           35.98      7.32
## 4 Antigua and Barbuda 1960               NA           62.97      4.43
## 5           Argentina 1960            59.87           65.39      3.11
## 6             Armenia 1960               NA           66.86      4.55
##   population          gdp continent          region
## 1    1636054           NA    Europe Southern Europe
## 2   11124892  13828152297    Africa Northern Africa
## 3    5270844           NA    Africa   Middle Africa
## 4      54681           NA  Americas       Caribbean
## 5   20619075 108322326649  Americas   South America
## 6    1867396           NA      Asia    Western Asia

The output of this line of code is simply the first six rows of the gapminder dataset. Note that you can also type View(gapminder) in the Console to see the full dataset in a new window (kinda’ like what you’d expect in Excel). You can also look at the dataset by clicking on the blue arrow to the left of the dataset in the Global Environment window.

Visually inspecting the data is useful because it lets us quickly confirm that our data was loaded correctly. For example, we can now see that our gapminder dataset includes the following columns:

  • country
  • year
  • infant_mortality or infant deaths per 1000
  • life_expectancy or life expectancy
  • fertility or the average number of children per woman
  • population or country population
  • gdp or GDP according to the World Bank
  • continent
  • region or geographical region

Each of these columns represents a different variable, in this case information describing the year and attributes of different countries. We refer to each row in the table as an observation, or a discrete country-year instance in the data. We can ask R programmatically how many rows and columns are in the dataset using the following functions:

dim(gapminder) # dimensions, returned in row, column format
## [1] 10545     9
nrow(gapminder) # number of rows
## [1] 10545
ncol(gapminder) # number of columns
## [1] 9

Now let’s determine the class of each variable. We could do this individually for each variable by typing class(gapminder$variable_name), or we could use the sapply() function to apply the class() function across all variables in the gapminder dataset. Remember, if you’re unfamiliar with any function, e.g. sapply(), you can ask R how it works using a question mark, e.g. ?sapply.

sapply(gapminder, class)
##          country             year infant_mortality  life_expectancy 
##         "factor"        "integer"        "numeric"        "numeric" 
##        fertility       population              gdp        continent 
##        "numeric"        "numeric"        "numeric"         "factor" 
##           region 
##         "factor"

Our categorical variables (country, continent, and region), or variables that take on a limited number of possible values, are coded as "factor" variables. Our other variables are coded as numeric (continuous numerical data) or integer (discrete-valued numerical data, like year).

We can extract information for each variable in the dataset using $. For example, if we wanted to determine the range of years in the dataset we can simply type:

range(gapminder$year)
## [1] 1960 2016

What if we wanted to generate a list of the unique regions in the dataset? Simply printing out the vector gapminder$region would be insufficient since regions are repeated over multiple years and countries. Type gapminder$region in the Console to see what I mean! One of the functions I use the most when checking data is the unique() function. This function returns a vector of unique values in a larger vector. Check out this example:

x <- c(1, 1, 1, 2, 3, 4, 5)
unique(x)
## [1] 1 2 3 4 5

In the case of our gapminder data, this function could be used to generate a list of the unique regions:

unique(gapminder$region)
##  [1] Southern Europe           Northern Africa          
##  [3] Middle Africa             Caribbean                
##  [5] South America             Western Asia             
##  [7] Australia and New Zealand Western Europe           
##  [9] Southern Asia             Eastern Europe           
## [11] Central America           Western Africa           
## [13] Southern Africa           South-Eastern Asia       
## [15] Eastern Africa            Northern America         
## [17] Eastern Asia              Northern Europe          
## [19] Melanesia                 Polynesia                
## [21] Central Asia              Micronesia               
## 22 Levels: Australia and New Zealand Caribbean Central America ... Western Europe

Nice! Another thing we often want to do with our data is compute descriptive statistics like the mean, median, mode, or standard deviation. This isn’t hard in R. For example, what if we want to know the average life expectancy across time and across all countries?

mean(gapminder$life_expectancy)
## [1] 64.81162

This means that from 1960 to 2016, averaged across every country-year observation in the dataset, life expectancy is about 64.8 years.

How about the mean rate of infant mortality?

mean(gapminder$infant_mortality)
## [1] NA

Wutttt!?

NA means “Not Available” and is typically used to encode missing data, or data that, for whatever reason, is not available in your dataset. It is very important to understand why data is missing and to think through the implications for any visualizations, descriptive statistics, or analyses you conduct with the data. For example, imagine that in several countries, a severe drought causes widespread famine. During this crisis, the countries are unable to report national health statistics. The national data you are analyzing may just list NA for these countries, essentially dropping them from any analyses you conduct or visualizations you create. The reality, however, is that humans continued to live in these countries and experienced very real health outcomes, likely worse than global averages due to the famine. As a data scientist, it is imperative that you locate and understand the implications of missing data in your dataset, so that as you transform your data into information to inform decision-making, you can do it in a way that is honest.

With that important caveat stated, I’m going to show you how to programmatically ignore missing data so that you can still compute descriptive statistics on the data you do have. Let’s start with a simple example. Say you have a simple vector x with the following values:

x <- c(1, NA, 3, 4, 5)

For whatever reason, the second element of this vector is missing, or NA. If I try to compute the mean of this vector using the mean() function, here’s what happens:

mean(x)
## [1] NA

Anytime a vector contains missing data, most R functions will return NA. This can be annoying, yes, but it’s actually R’s helpful reminder that “hey man, don’t forget you’ve got missing data there!” Let’s assume you totally understand why the second element is missing and the implications of this missing element for your analysis. Nice. In that case, you can go ahead and force R to ignore the missing data by adding the na.rm = T argument:

mean(x, na.rm = T)
## [1] 3.25

This second argument to the mean() function overrides the default value of na.rm = F (here T stands for TRUE and F stands for FALSE). By turning this argument on, you’re essentially saying “R, please remove the NA values and then compute the mean.”
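Before dropping missing values, it helps to know how many there are. Here is a minimal sketch using the same toy vector x, in base R only. The names n_missing and share_missing are just illustrative.

```r
x <- c(1, NA, 3, 4, 5)

# is.na() flags each element as missing (TRUE) or observed (FALSE)
n_missing <- sum(is.na(x))       # TRUE counts as 1, so this counts the NAs
share_missing <- mean(is.na(x))  # proportion of elements that are missing

n_missing
share_missing

# TRUE is a safer spelling than T, because T is an ordinary variable
# that can be reassigned, while TRUE is a reserved word
mean(x, na.rm = TRUE)
```

Checking the count first is a quick way to decide whether ignoring the missing values is reasonable for your question.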

OK, so how does all of this apply to our gapminder example? We can use the same argument to tell the mean() function to ignore the NA values in our infant_mortality vector:

mean(gapminder$infant_mortality, na.rm = T)
## [1] 55.30862

This means that the global average rate of infant mortality from 1960 to 2016 (after dropping missing data) is about 55.3 deaths per 1,000 births. Two things might stand out here. One is that we are talking about infant mortality… this number has extremely real implications in the real world. This is why I always want you to take a step back and think about what the numbers and visualizations you generate actually mean for people and planet. Your second reaction might be, huh, this is interesting, but it would be more interesting if I could zoom in and look at differences across countries and through time. Well my friend, get ready for the tidyverse:


dplyr

The dplyr package is one of a number of packages in the tidyverse that make data wrangling, indexing, and plotting much, much easier than with base R tools (and, dare I say, fun?).

The dplyr package contains several functions (think: verbs) that make querying your data much simpler:

  • select(): select columns
  • filter(): select rows
  • arrange(): order or arrange rows
  • mutate(): create new columns
  • summarize(): summarize values (for the Brits in the room, you can also use summarise())
  • group_by(): group observations

What I love most about dplyr is that the functions are so intuitive, it often feels like I’m having a conversation with my data:

“Hey dplyr, tell me what you know about infant mortality in Sri Lanka in 2000!”

# you got it
gapminder %>% 
  filter(country == "Sri Lanka", year == 2000) %>%
  select(infant_mortality)
##   infant_mortality
## 1               14

“You know what, I changed my mind, I’d rather know about the United States in the same year…”

# easy game
gapminder %>% 
  filter(country == "United States", year == 2000) %>%
  select(infant_mortality)
##   infant_mortality
## 1              7.1

“Scratch that, I actually want to know about Belgium, France, Morocco, and Nigeria… all at once please!”

# you're starting to get on my nerves, but sure, fine
gapminder %>%
  filter(country %in% c("Belgium", "France", "Morocco", "Nigeria"), year == 2000) %>%
  select(country, infant_mortality)
##   country infant_mortality
## 1 Belgium              4.8
## 2  France              4.4
## 3 Morocco             42.2
## 4 Nigeria            112.0

A few things are happening here. First, you’re probably wondering what that crazy %>% thing is. This is called a pipe. It allows you to “pipe” the output from one function into another function. In the first example, we start with the full gapminder dataset and feed it into the filter() function. filter() keeps only the rows in which some condition is true, in our case the rows where the country is Sri Lanka (country == "Sri Lanka") and the year is 2000 (year == 2000). Why the double equals sign? This is important. In R, one equals sign (=) assigns a value, as in:

x = 1
print(x)
## [1] 1

A double equals sign tests whether something is true, so:

x == 1
## [1] TRUE
x == 2
## [1] FALSE

In our filter() function, we want to keep only the rows where country == "Sri Lanka" and year == 2000 (see note 1 at the end of this tutorial). Once we filter down our dataset, we can use select() to pull out the column of interest to us (infant_mortality).
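To make the pipe a little less mysterious, here is a minimal sketch on a made-up data frame. The data frame toy and its values are invented for illustration and are not taken from gapminder. It shows that a piped chain and the equivalent nested function calls should give the same result.

```r
library(dplyr)

# Hypothetical toy data, invented for this example
toy <- data.frame(
  country = c("A", "A", "B", "B"),
  year    = c(2000L, 2001L, 2000L, 2001L),
  value   = c(10, 11, 20, 21)
)

# Piped version: read top to bottom
piped <- toy %>%
  filter(country == "A", year == 2000) %>%
  select(value)

# Nested version: the same steps, read from the inside out
nested <- select(filter(toy, country == "A", year == 2000), value)

# Both approaches should produce the same one-row data frame
identical(piped, nested)
```

The pipe does not add new capabilities. It only lets you write the steps in the order you think about them.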

How about that last example with four countries? Here, all I did was say, hey filter(), find all rows where country is %in% this list of countries. I could also do the opposite and keep only the rows where a condition is not true:

gapminder %>%
  filter(country != "Sri Lanka")

Or all countries that are not in a list:

gapminder %>%
  filter(!country %in% c("Belgium", "France", "Morocco", "Nigeria"))

The exclamation point here reads like the word “not”, so it’s like saying “hey dplyr, filter countries NOT in this list or NOT equal to this.” We’ll come back to this in a sec…
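The difference between == and %in% is easy to trip over, so here is a minimal base R sketch with a made-up character vector. The names countries, recycled, and in_set are illustrative only.

```r
countries <- c("France", "Belgium", "Morocco", "France")

# Comparing against a single value checks every element against that value
countries == "France"

# Comparing against a longer vector with == recycles it position by position,
# so this checks France vs Belgium, Belgium vs France, Morocco vs Belgium,
# and France vs France, which is rarely what we want
recycled <- countries == c("Belgium", "France")

# %in% asks whether each element appears anywhere in the set
in_set <- countries %in% c("Belgium", "France")

recycled
in_set
!in_set  # negation: TRUE for elements that are NOT in the set
```

This is why the four-country example uses %in% rather than == when matching against a list of countries.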

Not convinced yet that dplyr trumps base R? OK, say you want to know the average, maximum, and minimum GDP for Sri Lanka across all the years in the dataset. No problem:

gapminder %>%
  filter(country == "Sri Lanka") %>%
  select(year, gdp) %>%
  summarize(avg_gdp = mean(gdp), 
            max_gdp = max(gdp), 
            min_gdp = min(gdp))
##   avg_gdp max_gdp min_gdp
## 1      NA      NA      NA

Whoops! This means there’s missing data. We can fix this using the same na.rm = T argument:

gapminder %>%
  filter(country == "Sri Lanka") %>%
  select(year, gdp) %>%
  summarize(avg_gdp = mean(gdp, na.rm = T), 
            max_gdp = max(gdp, na.rm = T), 
            min_gdp = min(gdp, na.rm = T))
##       avg_gdp     max_gdp    min_gdp
## 1 10425011328 29260877188 2708601390

The summarize() function takes all rows in a column and collapses them into a single value using the function you supply. Here, mean(gdp, na.rm = T) takes the mean of all non-missing observations of gdp for Sri Lanka and returns the average, stored as the new variable avg_gdp.

Feeling confused? Great! Let’s go through each of the main dplyr functions in a bit more detail to make sure you understand how they work.


filter()

What if we’re interested in some of the countries in Europe in which people speak French (oui oui)? We can no longer use the format we used above for one country, Sri Lanka, where we simply used filter(country == "Sri Lanka"). Now we need to keep the rows that belong to a list of countries. We can do this as follows:

francophone <- gapminder %>%
  filter(country %in% c("France", "Belgium", "Switzerland")) # ok, I forgot a few small countries
head(francophone)
##       country year infant_mortality life_expectancy fertility population
## 1     Belgium 1960             29.5           69.59      2.60    9140563
## 2      France 1960             23.7           70.49      2.77   45865699
## 3 Switzerland 1960             21.6           71.46      2.52    5296120
## 4     Belgium 1961             28.1           70.46      2.63    9200393
## 5      France 1961             22.4           71.07      2.80   46471083
## 6 Switzerland 1961             21.2           71.79      2.55    5393411
##            gdp continent         region
## 1  68236665814    Europe Western Europe
## 2 349778187326    Europe Western Europe
## 3           NA    Europe Western Europe
## 4  71634993490    Europe Western Europe
## 5 369037927246    Europe Western Europe
## 6           NA    Europe Western Europe

Nice! You could confirm your filter() worked as planned using good ol’ unique():

unique(francophone$country)
## [1] Belgium     France      Switzerland
## 185 Levels: Albania Algeria Angola Antigua and Barbuda Argentina ... Zimbabwe

Nailed it. If we want to expand La Francophonie, we simply add more countries to our list:

francophone <- gapminder %>%
  filter(country %in% c("Belgium", "France", "Switzerland", "Morocco", "Madagascar"))
unique(francophone$country)
## [1] Belgium     France      Madagascar  Morocco     Switzerland
## 185 Levels: Albania Algeria Angola Antigua and Barbuda Argentina ... Zimbabwe

Another nice trick is to know how to filter by removing a country, as we saw above. Say, for example, we want to create a dataset with all countries except for Brazil (desculpa), we could do the following:

not_brazil <- gapminder %>% 
  filter(country != "Brazil")

The != symbols mean “not equal to.” What if we want to filter out countries in a longer list?

not_some_countries <- gapminder %>% 
  filter(!country %in% c("Brazil", "Spain", "Mexico"))

To do this, you put an ! in front of the variable on which you are filtering. This reads something like “not country in Brazil, Spain, and Mexico.” The R for Data Science text has a fantastic summary of how you can use logical operators to subset and filter your data.

[Figure from R for Data Science summarizing the logical operators; not reproduced here.]

So for example, you could keep only the rows where both condition 1 and condition 2 are true using &:

gapminder %>%
  filter(year == 2016) %>%
  filter(region == "Western Asia" & life_expectancy > 75) # note that you can also just use a comma here instead of &

Here we have the list of countries in Western Asia in 2016 where life expectancy was above 75 years.

We can also keep rows where either condition 1 or condition 2 is true using |:

gapminder %>%
  filter(year == 2016) %>%
  filter(region == "Western Asia" | life_expectancy > 75)

This is a weird result, as it shows countries that are either in Western Asia or that have a life expectancy above 75.
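The logical operators are easiest to see on two short logical vectors. This is a minimal base R sketch. The vectors in_western_asia and lives_past_75 are hypothetical stand-ins for the two conditions, one element per row.

```r
# Hypothetical results of two conditions for four rows
in_western_asia <- c(TRUE, TRUE, FALSE, FALSE)
lives_past_75   <- c(TRUE, FALSE, TRUE, FALSE)

both    <- in_western_asia & lives_past_75     # TRUE only where both are TRUE
either  <- in_western_asia | lives_past_75     # TRUE where at least one is TRUE
neither <- !(in_western_asia | lives_past_75)  # TRUE where neither is TRUE
only_one <- xor(in_western_asia, lives_past_75) # TRUE where exactly one is TRUE

both
either
neither
only_one
```

filter() keeps the rows where the combined condition is TRUE, so & narrows the result while | widens it.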

Finally, we can also use dplyr to remove missing data from our data.frame:

gapminder %>%
  filter(!is.na(life_expectancy))

Here filter() kept only the rows where life_expectancy is not (shown by the !) NA, dropping every row where it is missing. The is.na() function is a good way to test for missing data. It returns a set of logical TRUE, FALSE indicators of whether each observation is flagged as NA:

x <- c(1, 2, 3)
is.na(x)
## [1] FALSE FALSE FALSE
y <- c(NA, 2, 3)
is.na(y)
## [1]  TRUE FALSE FALSE
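You might wonder why we use is.na() rather than == NA. This base R sketch, reusing the vector y from above, shows that comparing anything with NA returns NA rather than TRUE or FALSE. The names eq_na and observed are illustrative.

```r
y <- c(NA, 2, 3)

# Comparing with NA asks "is this equal to an unknown value?",
# and the answer is itself unknown (NA) for every element
eq_na <- y == NA
eq_na

# is.na() is the reliable test for missingness
is.na(y)

# Subsetting with the negated test keeps only the observed values
observed <- y[!is.na(y)]
observed
```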

select()

select() simply selects the columns you’re interested in working with. This can be useful when you’re working with big datasets with lots of variables (columns). If you wanted to create a new dataset with only the variables life_expectancy, year, and country, you would do the following:

le <- gapminder %>% 
  select(life_expectancy, year, country)
head(le)
##   life_expectancy year             country
## 1           62.87 1960             Albania
## 2           47.50 1960             Algeria
## 3           35.98 1960              Angola
## 4           62.97 1960 Antigua and Barbuda
## 5           65.39 1960           Argentina
## 6           66.86 1960             Armenia

mutate()

What if we want to add a new column that divides gdp by population to generate an estimate of GDP per capita? mutate() can be used to create new variables (columns) that are the result of operations (+, *, -, etc) on columns already in the dataset:

gapminder %>%
  mutate(gdp_pc = gdp/population)

Inside mutate(), which receives the data.frame you want to modify through the pipe (gapminder in our case), you name the new variable (gdp_pc) and then describe the operation that computes it (division here). The result is a copy of the gapminder data.frame with a new column called gdp_pc. Look over in your Environment (top right). Does the gapminder variable include this new column? Nope! To keep the new column, you either need to overwrite your existing gapminder dataset (careful, this replaces the original) or create a new data.frame with the new gdp_pc column:

# Option 1: overwrite gapminder (this replaces the original data.frame, so be careful!)
gapminder <- gapminder %>% 
  mutate(gdp_pc = gdp/population)
# Option 2: store the result in a new data.frame (better than overwriting the original!)
new_gm <- gapminder %>% 
  mutate(gdp_pc = gdp/population)
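If it is still unclear why the column seems to vanish, this minimal sketch on a made-up two-row data frame may help. The data frame toy and its numbers are invented. It shows that mutate() returns a modified copy and leaves the original untouched unless you assign the result.

```r
library(dplyr)

# Hypothetical toy data, invented for this example
toy <- data.frame(gdp = c(100, 200), population = c(10, 40))

# Without assignment, the modified copy is printed and then discarded
toy %>% mutate(gdp_pc = gdp / population)
names(toy)  # still only gdp and population

# Assigning the result to a new name keeps the new column
toy_pc <- toy %>% mutate(gdp_pc = gdp / population)
names(toy_pc)
```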

If you only want to keep the new variables, use transmute():

transmute_example <- gapminder %>% transmute(gdp_pc = gdp/population)
head(transmute_example)
##     gdp_pc
## 1       NA
## 2 1242.992
## 3       NA
## 4       NA
## 5 5253.501
## 6       NA

How about a more complicated example? What if we want to add a new variable to our dataset, say an indicator of whether or not a country is located in the continent of Africa? Here’s where mutate() really shines:

africa <- gapminder %>%
  mutate(africa = ifelse(continent == "Africa", 1, 0))

glimpse(africa)
## Rows: 10,545
## Columns: 10
## $ country          <fct> Albania, Algeria, Angola, Antigua and Barbuda, Arg...
## $ year             <int> 1960, 1960, 1960, 1960, 1960, 1960, 1960, 1960, 19...
## $ infant_mortality <dbl> 115.40, 148.20, 208.00, NA, 59.87, NA, NA, 20.30, ...
## $ life_expectancy  <dbl> 62.87, 47.50, 35.98, 62.97, 65.39, 66.86, 65.66, 7...
## $ fertility        <dbl> 6.19, 7.65, 7.32, 4.43, 3.11, 4.55, 4.82, 3.45, 2....
## $ population       <dbl> 1636054, 11124892, 5270844, 54681, 20619075, 18673...
## $ gdp              <dbl> NA, 13828152297, NA, NA, 108322326649, NA, NA, 966...
## $ continent        <fct> Europe, Africa, Africa, Americas, Americas, Asia, ...
## $ region           <fct> Southern Europe, Northern Africa, Middle Africa, C...
## $ africa           <dbl> 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...

This creates a new data.frame called africa to which I’ve added a new variable called africa that is equal to 1 if the observation is located in the continent of Africa and 0 if it is not. The ifelse() function is quite useful; check it out using ?ifelse. Also, note glimpse(). This is the tidyverse response to head()… I actually like the glimpse() function much more and tend to use it when I inspect data. It lets you see all of the column names in one shot, as well as other important info like variable class and dimensions.
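Here is a small base R sketch of how ifelse() works on its own, outside of mutate(). The vector continent below is made up, and includes one NA to show what happens to missing values.

```r
# Hypothetical vector of continents, with one missing value
continent <- c("Europe", "Africa", "Africa", "Americas", NA)

# ifelse(test, yes, no) is vectorized: it evaluates the test for each
# element and returns "yes" where it is TRUE and "no" where it is FALSE
africa <- ifelse(continent == "Africa", 1, 0)
africa

# Where the test is NA, the result is NA rather than 0
is.na(africa)
```

The last line is worth remembering: an indicator built with ifelse() inherits any missing values in the column it is built from.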


arrange()

What if we wanted to sort the countries in our gapminder dataset in ascending alphabetical order, starting with Albania? arrange() can do that! By default, arrange() sorts text alphabetically and numbers in increasing order.

gapminder %>% 
  arrange(country)

If you want to sort in descending alphabetical order (so start with Zimbabwe), you have to add desc():

gapminder %>%
  arrange(desc(country))
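arrange() can also sort by more than one column at a time. A minimal sketch on a made-up data frame follows. The data frame toy is invented for illustration. Later columns break ties in earlier ones, and desc() can be applied to any individual column.

```r
library(dplyr)

# Hypothetical toy data, invented for this example
toy <- data.frame(
  country = c("B", "A", "B", "A"),
  year    = c(2000L, 2001L, 2001L, 2000L)
)

# Sort by country from A to Z, then within each country
# by year from newest to oldest
sorted <- toy %>%
  arrange(country, desc(year))
sorted
```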

group_by() and summarize()

What if you want to boil a column down to a single number, such as the average life expectancy across the whole dataset? summarize() allows you to apply functions like mean() to a column (or to groups of data, as we’ll see below) and return a single value:

gapminder %>%
  summarize(mean(life_expectancy))
##   mean(life_expectancy)
## 1              64.81162

What if we want to return average life expectancy for each country? We can use the group_by() function to group data by country, then the summarize() function to summarize the average life expectancy for each group of country data. To do this, we need to use multiple dplyr functions in one go. We can do this using a pipe, which looks like this: %>%

mle <- gapminder %>% 
  group_by(country) %>% 
  summarize(mean(life_expectancy))

head(mle)
## # A tibble: 6 x 2
##   country             `mean(life_expectancy)`
##   <fct>                                 <dbl>
## 1 Albania                                72.3
## 2 Algeria                                65.0
## 3 Angola                                 48.4
## 4 Antigua and Barbuda                    71.5
## 5 Argentina                              71.4
## 6 Armenia                                71.3

This returns a new data.frame that lists each country and the mean(life_expectancy) for each country. If we want a nicer column name for this new average life expectancy data, we can specify this in the summarize() function:

mle2 <- gapminder %>% 
  group_by(country) %>% 
  summarize(mean_le = mean(life_expectancy), n = n())

head(mle2)
## # A tibble: 6 x 3
##   country             mean_le     n
##   <fct>                 <dbl> <int>
## 1 Albania                72.3    57
## 2 Algeria                65.0    57
## 3 Angola                 48.4    57
## 4 Antigua and Barbuda    71.5    57
## 5 Argentina              71.4    57
## 6 Armenia                71.3    57

I also added the n() function that just returns the number of observations in each group (in our case, the number of years observed for each country).

Did you notice that we never had to name gapminder inside the summarize() function? This is because the pipe (%>%) feeds the data produced by group_by() into summarize(). This way, we can write really long pipes that feed data through a complex process without having to specify the dataset at each step. Let’s work through some examples.
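To see exactly what group_by() and summarize() are doing, it can help to run them on a dataset small enough to check by hand. This is a minimal sketch with invented numbers. The data frame toy and the column n_missing are illustrative and are not part of gapminder.

```r
library(dplyr)

# Hypothetical toy data: two countries, one missing value
toy <- data.frame(
  country = c("A", "A", "B", "B", "B"),
  life_expectancy = c(70, 72, 50, NA, 54)
)

by_country <- toy %>%
  group_by(country) %>%
  summarize(
    mean_le   = mean(life_expectancy, na.rm = TRUE), # mean of observed values
    n         = n(),                                 # rows per group, NAs included
    n_missing = sum(is.na(life_expectancy))          # missing values per group
  )
by_country
```

Note that n() counts every row in the group, including rows where life_expectancy is missing, so reporting n_missing alongside it keeps the summary honest.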


Examples

As with any language, once you know a few core verbs, you can get pretty far. Now that you are familiar with filter(), select(), mutate(), arrange(), group_by(), and summarize(), let’s try to have some basic conversations with the gapminder dataset that will help us understand global human development.

  1. From 2000 to 2016, which country has had, on average, the highest gdp? Which has had the lowest?
# highest GDP
gapminder %>%
  filter(year %in% 2000:2016) %>% # 2000:2016 generates a vector of numbers from 2000 to 2016
  group_by(country) %>%
  summarize(avg_gdp = mean(gdp, na.rm = T)) %>% 
  arrange(desc(avg_gdp)) %>%
  filter(row_number() == 1) # cool little trick that lets you pull out the first row, in our case, highest GDP
## # A tibble: 1 x 2
##   country       avg_gdp
##   <fct>           <dbl>
## 1 United States 1.10e13
# lowest GDP, note that to do this, I just dropped the desc() function in arrange()
gapminder %>%
  filter(year %in% 2000:2016) %>% 
  group_by(country) %>%
  summarize(avg_gdp = mean(gdp, na.rm = T)) %>% 
  arrange(avg_gdp) %>%
  filter(row_number() == 1) 
## # A tibble: 1 x 2
##   country    avg_gdp
##   <fct>        <dbl>
## 1 Kiribati 73446284.
  2. Right, GDP is interesting but depends on big differences in population (e.g. Kiribati versus the United States). Let’s re-do this, but now with a new measure of GDP per capita (GDP divided by population).

(Programming note: What’s cool here is that I can just copy-paste the code from above and add a mutate() function that creates our new gdp_pc variable)

# highest GDP per capita
gapminder %>%
  mutate(gdp_pc = gdp/population) %>%
  filter(year %in% 2000:2016) %>% 
  group_by(country) %>%
  summarize(avg_gdp_pc = mean(gdp_pc, na.rm = T)) %>% 
  arrange(desc(avg_gdp_pc)) %>%
  filter(row_number() == 1) 
## # A tibble: 1 x 2
##   country    avg_gdp_pc
##   <fct>           <dbl>
## 1 Luxembourg     51576.
# lowest GDP per capita
gapminder %>%
  mutate(gdp_pc = gdp/population) %>%
  filter(year %in% 2000:2016) %>%
  group_by(country) %>%
  summarize(avg_gdp_pc = mean(gdp_pc, na.rm = T)) %>% 
  arrange(avg_gdp_pc) %>%
  filter(row_number() == 1) 
## # A tibble: 1 x 2
##   country          avg_gdp_pc
##   <fct>                 <dbl>
## 1 Congo, Dem. Rep.       95.7

Take a second to react to those very different numbers.

  3. Which region has the highest average life expectancy since 2000? The lowest?
gapminder %>%
  filter(year >= 2000) %>% # greater than or equal to
  group_by(region) %>%
  summarize(mle = mean(life_expectancy, na.rm = T)) %>%
  arrange(desc(mle)) %>%
  filter(row_number() == 1)
## # A tibble: 1 x 2
##   region                      mle
##   <fct>                     <dbl>
## 1 Australia and New Zealand  80.8
gapminder %>%
  filter(year >= 2000) %>% # greater than or equal to
  group_by(region) %>%
  summarize(mle = mean(life_expectancy, na.rm = T)) %>%
  arrange(mle) %>%
  filter(row_number() == 1)
## # A tibble: 1 x 2
##   region            mle
##   <fct>           <dbl>
## 1 Southern Africa  52.0
  4. How have global average fertility rates changed since 2000?
gapminder %>%
  filter(year > 2000) %>% # strictly greater than 2000, so the first year shown is 2001
  group_by(year) %>%
  summarize(mfr = mean(fertility, na.rm = T))
## # A tibble: 16 x 2
##     year    mfr
##    <int>  <dbl>
##  1  2001   3.18
##  2  2002   3.14
##  3  2003   3.09
##  4  2004   3.06
##  5  2005   3.02
##  6  2006   2.99
##  7  2007   2.97
##  8  2008   2.94
##  9  2009   2.91
## 10  2010   2.89
## 11  2011   2.85
## 12  2012   2.82
## 13  2013   2.80
## 14  2014   2.77
## 15  2015   2.74
## 16  2016 NaN
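The NaN (“not a number”) in the last row deserves a note. It suggests that no fertility values were available for 2016, because this is what mean() returns when na.rm = T leaves nothing to average. A minimal base R sketch with a made-up vector named all_missing:

```r
# Hypothetical vector in which every value is missing
all_missing <- c(NA_real_, NA_real_, NA_real_)

# Without na.rm, the result is NA as usual
mean(all_missing)

# With na.rm = TRUE, every value is removed and there is nothing
# left to average, so the result is NaN
empty_mean <- mean(all_missing, na.rm = TRUE)
empty_mean
is.nan(empty_mean)
```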

I hope you’re starting to get a sense of how powerful these tools are. Well, get ready… looking at tables of numbers is fun and all, but it’s nothing like visualizing data with graphs. A tidyverse-generated visualization, for example, could help us understand question 4.

[Figure: tidyverse-generated visualization for question 4; not reproduced here.]

Critical Reflection Questions

  • Thinking critically about what lives in your data
  • Bias, sampling
  • The challenge of cleaning data (not really covered here)

DIY Questions

Additional Resources

  • In this lab, I’ve reviewed how to subset and wrangle your data using dplyr. You can also do this in base R, and it’s often quite useful to know how. I recommend the learnR tutorial or these tutorials on data subsetting and data manipulation. Make sure you’re familiar with how to index and wrangle data in base R before we proceed!
  • For more complex data merges and joins, check out this overview.
  • For more examples of data wrangling and manipulation with dplyr, I recommend this post as well as the pre-assignment readings written by the dplyr creator Hadley Wickham.
  • Read the R for Data Science chapter on Data Transformation.
  • Visit the Gapminder website to learn more about the data we are working with this week.

  1. Note that since country is a factor whose values are text labels, “Sri Lanka” is in quotes. year, on the other hand, is an integer vector, so we can just write out the number without quotes.