Pre-lab assignment

My assumptions

base R functions

If you’re new to R, look at the following functions before beginning the tutorial:

Set up

For this lab, you’ll need to make sure the following packages are installed (install.pakages()) and loaded (library()) into your RStudio session.


Our data

This week we’ll be playing with the 500 Cities data set. Here’s a brief introduction to our data:

The 500 Cities project is a collaboration between CDC, the Robert Wood Johnson Foundation, and the CDC Foundation. The purpose of the 500 Cities Project is to provide city- and census tract-level small area estimates for chronic disease risk factors, health outcomes, and clinical preventive service use for the largest 500 cities in the United States. These small area estimates will allow cities and local health departments to better understand the burden and geographic distribution of health-related variables in their jurisdictions, and assist them in planning public health interventions.

The project website has tons of fantastic mapping and visualization tools, so go check it out. You can find specific information about each of the variables (columns) in the data set here. Columns to identify the data include State, CityName, GeographicLocation, and Year. You can organize the data topically using the Category column or by data source with DataSource. The actual variables being measured are found in the Measure column and the units for each of these in the Data_Value_Unit column. The actual value for the Measure is found in the Data_Value column. Notice that this data set is not tidy. Ideally, we would split each different Measure into a separate column. For now, however, we’ll work with the data set as is. We’ll fix things below when we work through correlation.

Crucially, in addition to the Data_Value, low and high limit confidence intervals are listed in the Low_Confidence_Limit and High_Confidence_Limit columns. Don’t know what a confidence limit is? We’ll get there, but this might be a gentle nudge for you to check out the statistics resources I’ve listed on the More Resources page on CANVAS.

health <- read.csv("./data/500_cities_health.csv")
glimpse(health) #dplyr's version of head(), cleaner for big datasets like this one
## Observations: 810,103
## Variables: 24
## $ Year                       <int> 2014, 2014, 2014, 2014, 2014, 2014,...
## $ StateAbbr                  <fctr> US, US, US, US, US, US, US, US, US...
## $ StateDesc                  <fctr> United States, United States, Unit...
## $ CityName                   <fctr> , , , , , , , , , , , , , , , , , ...
## $ GeographicLevel            <fctr> US, US, US, US, US, US, US, US, US...
## $ DataSource                 <fctr> BRFSS, BRFSS, BRFSS, BRFSS, BRFSS,...
## $ Category                   <fctr> Prevention, Prevention, Health Out...
## $ UniqueID                   <fctr> 59, 59, 59, 59, 59, 59, 59, 59, 59...
## $ Measure                    <fctr> Current lack of health insurance a...
## $ Data_Value_Unit            <fctr> %, %, %, %, %, %, %, %, %, %, %, %...
## $ DataValueTypeID            <fctr> AgeAdjPrv, CrdPrv, AgeAdjPrv, CrdP...
## $ Data_Value_Type            <fctr> Age-adjusted prevalence, Crude pre...
## $ Data_Value                 <dbl> 14.9, 14.1, 23.5, 25.6, 16.8, 16.0,...
## $ Low_Confidence_Limit       <dbl> 14.6, 13.8, 23.3, 25.4, 16.6, 15.8,...
## $ High_Confidence_Limit      <dbl> 15.2, 14.3, 23.7, 25.9, 17.1, 16.2,...
## $ Data_Value_Footnote_Symbol <fctr> , , , , , , , , , , , , , , , , , ...
## $ Data_Value_Footnote        <fctr> , , , , , , , , , , , , , , , , , ...
## $ PopulationCount            <int> 308745538, 308745538, 308745538, 30...
## $ GeoLocation                <fctr> , , , , , , , , , , , , , , , , , ...
## $ CategoryID                 <fctr> PREVENT, PREVENT, HLTHOUT, HLTHOUT...
## $ MeasureId                  <fctr> ACCESS2, ACCESS2, ARTHRITIS, ARTHR...
## $ CityFIPS                   <int> NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ TractFIPS                  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ Short_Question_Text        <fctr> Health Insurance, Health Insurance...

There are some pretty interesting health metrics collected in this data set:

##  [1] Current lack of health insurance among adults aged 18–64 Years                                                                                                                              
##  [2] Arthritis among adults aged >=18 Years                                                                                                                                                        
##  [3] Binge drinking among adults aged >=18 Years                                                                                                                                                   
##  [4] High blood pressure among adults aged >=18 Years                                                                                                                                              
##  [5] Taking medicine for high blood pressure control among adults aged >=18 Years with high blood pressure                                                                                         
##  [6] Cancer (excluding skin cancer) among adults aged >=18 Years                                                                                                                                   
##  [7] Current asthma among adults aged >=18 Years                                                                                                                                                   
##  [8] Coronary heart disease among adults aged >=18 Years                                                                                                                                           
##  [9] Visits to doctor for routine checkup within the past Year among adults aged >=18 Years                                                                                                        
## [10] Cholesterol screening among adults aged >=18 Years                                                                                                                                            
## [11] Fecal occult blood test, sigmoidoscopy, or colonoscopy among adults aged 50–75 Years                                                                                                        
## [12] Chronic obstructive pulmonary disease among adults aged >=18 Years                                                                                                                            
## [13] Physical health not good for >=14 days among adults aged >=18 Years                                                                                                                           
## [14] Older adult men aged >=65 Years who are up to date on a core set of clinical preventive services: Flu shot past Year, PPV shot ever, Colorectal cancer screening                              
## [15] Older adult women aged >=65 Years who are up to date on a core set of clinical preventive services: Flu shot past Year, PPV shot ever, Colorectal cancer screening, and Mammogram past 2 Years
## [16] Current smoking among adults aged >=18 Years                                                                                                                                                  
## [17] Visits to dentist or dental clinic among adults aged >=18 Years                                                                                                                               
## [18] Diagnosed diabetes among adults aged >=18 Years                                                                                                                                               
## [19] High cholesterol among adults aged >=18 Years who have been screened in the past 5 Years                                                                                                      
## [20] Chronic kidney disease among adults aged >=18 Years                                                                                                                                           
## [21] No leisure-time physical activity among adults aged >=18 Years                                                                                                                                
## [22] Mammography use among women aged 50–74 Years                                                                                                                                                
## [23] Mental health not good for >=14 days among adults aged >=18 Years                                                                                                                             
## [24] Obesity among adults aged >=18 Years                                                                                                                                                          
## [25] Papanicolaou smear use among adult women aged 21–65 Years                                                                                                                                   
## [26] Sleeping less than 7 hours among adults aged >=18 Years                                                                                                                                       
## [27] Stroke among adults aged >=18 Years                                                                                                                                                           
## [28] All teeth lost among adults aged >=65 Years                                                                                                                                                   
## 28 Levels: All teeth lost among adults aged >=65 Years ...

Imagine you work for the CDC. You’ve got a pile of money to allocate across US cities to promote the general health and well-being of urban populations. Your task is to generate a report to guide the allocation of available financial resources. How would you proceed?

Descriptive statistics

A good first step is for you to familiarize yourself with the data. It’s crucial for you to have a deep understanding of how the data was collected and the assumptions made when collecting the data. Key things to explore and understand:

In this tutorial, we will focus on the Measure of the prevalence of Cancer (excluding skin cancer) among adults aged >=18 years. This variable is the ratio of Respondantss aged >18 who report ever having been told by a doctor, nurse, or other health professional that they have any other types of cancer divided by the total number of respondents aged 18 or older. I know from reading about this variable on the 500 cities website that it was collected from the Behavioral Risk Factor Surveillance System which is a national telephone survey that collects state-level data about U.S. resident behaviors and health outcomes. The BRFSS completes around 400,000 phone interviews annually. Phone surveys are notoriously non-representative. How do you reach people without landlines or cellphones? People who work odd hours and don’t answer when you call? People who have three kids and no time to answer a phone call to take a survey about their health? You get the idea… No dataset’s is perfect, and it is important to keep each dataset’s limitations in mind as we work through our analyses.

Once you familiarize yourself with the dataset content and variable construction, it’s good practice explore variables of interest by plotting their distribution. This helps you get a sense of the variable’s center (mean/median/mode) and spread (standard deviation/variance). The best way to quickly visualize a variable’s distribution is using a histogram. Histograms are a visual representation of the count of instances of a variable that fall into an interval of values. They are easy to plot in base R with the hist() function:

cancer <- health %>% filter(Measure == "Cancer (excluding skin cancer) among adults aged >=18 Years")
hist(cancer$Data_Value, 100, main = "Adult cancer rate", xlab = "Rate amongst adults older than 18", ylab = "Frequency")

The 100 specifies the number of bins into which we want to place our data set. Try changing the number of bins and see what happens. main specifies the main title. You can change axis labels using xlab and ylab.

When I look at this histogram, the first thing I see is that the data has a long right-hand tail. This suggests that there is a group of cities that have abnormally high cancer rates. dplyr can help us quickly identify these cities. By looking at the histogram, counties with rates above 10 seem to be abnormally low. But we’re data scientists, so let’s use something a bit more serious than visual inspection of a histogram. Let’s start with our measures of center:

## [1] NA
## [1] NA

Huh!? What’s going on here!? Important R lesson. Most datasets have missing values, which we signal in R using NA. NA means Not Available, i.e. this data was not collected or is missing for some other reason. You may also see NaN, which is also important but different. NaN means Not a Number. If you try to do something crazy like 0/0, R will return NaN (more info about this can be found here. When you try to compute base R functions on vectors or matrices that contain NAs, R will return NA. So are we doomed? No, it’s OK, simply add this:

mean(cancer$Data_Value, na.rm = T)
## [1] 5.448483
median(cancer$Data_Value, na.rm = T)
## [1] 5.3

Ok, it’s not actually that simple (nothing ever is guys). If you have missing data, you need to figure out WHERE your data is missing and if possible WHY. Missing data, just like sampling strategies, can introduce bias into any analyses you conduct with your data. This is especially important when working with survey data where characteristics of respondents may create systematic variations in response rates for certain questions. For example, marginalized folks may not want to disclose information because it makes them more vulnerable. Let’s use dplyr to see where we have missing data:

cancer %>% filter(is.na(Data_Value)) %>% 
  group_by(CityName) %>% 
  summarize(n = n()) %>%
## # A tibble: 277 x 2
##    CityName        n
##    <fctr>      <int>
##  1 Houston        50
##  2 New York       23
##  3 Kansas City    16
##  4 Dallas         15
##  5 San Antonio    15
##  6 Knoxville      14
##  7 Tucson         12
##  8 San Diego      10
##  9 Gastonia        9
## 10 Los Angeles     9
## # ... with 267 more rows

This code uses dplyr to filter code where the Data_Value is listed as NA using the is.na() function. is.na() returns a TRUE or FALSE based on, well, whether or not your data is NA. If x <- NA, then is.na(x) is TRUE.

I don’t see any clear patterns in the cities with missing data here except for under-representation of Houston. Let’s look at another variable, ’GeographicLevel`, which is the level at which data is collected (national, census tract, or city):

cancer %>% filter(is.na(Data_Value)) %>%
  group_by(GeographicLevel) %>%
  summarize(n=n()) %>%
## # A tibble: 1 x 2
##   GeographicLevel     n
##   <fctr>          <int>
## 1 Census Tract      800

Huh, interesting. Here we see that all of our missing data is at the Census Tract level. If we extract only data collected at the level of a city, we should remove missing observations.

cancer <- cancer %>% filter(GeographicLevel == "City")
## [1] 5.7263
## [1] 5.8

Excellent. We no longer need to add na.rm=T to our mean() and median() functions since we’ve removed rows containing missing data.

After inspecting the cleaned cancer data set, I realized that there are also multiple Data_Value_Types for this Measure. This includes Crude Prevalence and Age-adjusted prevalence, i.e two different ways to “adjust” the prevalance of different health issues, one overall prevalence (Crude Prevalence) and the other adjusted by age (Age-adjusted prevalence). We’ll need to select one to proceed. Let’s go with Age-adjusted prevalence:

cancer <- cancer %>% filter(Data_Value_Type == "Age-adjusted prevalence")

Let’s re-plot the histogram1:

hist(cancer$Data_Value, 100, main = "Adult cancer rate", xlab = "Age-adjusted prevalence amongst adults older than 18", ylab = "Frequency")

Now that we’ve resolved the NA issue, let’s get back to our first task: to identify cities with abnormally high and low rates of cancer. We typically refer to observations with abnormally high or low observations as outliers. Outliers are observations whose values are abnormally far from the other values in a sample. The standard deviation of a variable describes the spread of observations of that variable around the central value of the data set (mean). We can compute standard deviation to identify variables at the tails of the distribution above. Let’s compute the standard deviation of these rates and generate lists of cities with exceptionally high (> 2 standard deviations) and low (<2 standard deviations) cancer rates:

sd <- sd(cancer$Data_Value)
print(paste("Our standard deviation is:", sd))
## [1] "Our standard deviation is: 0.494553745121425"
highsd <- mean(cancer$Data_Value) + 2*sd
lowsd <- mean(cancer$Data_Value) - 2*sd
print(paste("Very low rates of cancer are below", lowsd, "while very high rates are above", highsd, sep=" "))
## [1] "Very low rates of cancer are below 4.86329250975715 while very high rates are above 6.84150749024285"

Now let’s use dplyr to identify cities of high and low cancer rates:

high_rate <- cancer %>% filter(Data_Value > highsd) %>% 
  select(CityName, StateAbbr, Data_Value) %>% 

low_rate <- cancer %>% filter(Data_Value < lowsd) %>% 
  select(CityName, StateAbbr, Data_Value) %>% 

First, our high rate cities:

##         CityName StateAbbr Data_Value
## 1        Spokane        WA        6.9
## 2 Spokane Valley        WA        6.9

And our low rate:

##        CityName StateAbbr Data_Value
## 1         Miami        FL        4.8
## 2        Cicero        IL        4.8
## 3        Camden        NJ        4.8
## 4        Newark        NJ        4.8
## 5       Lynwood        CA        4.7
## 6     Santa Ana        CA        4.7
## 7    Union City        CA        4.7
## 8       Passaic        NJ        4.7
## 9      Paterson        NJ        4.7
## 10    Daly City        CA        4.6
## 11   South Gate        CA        4.6
## 12      Mission        TX        4.6
## 13     Alhambra        CA        4.5
## 14 Baldwin Park        CA        4.5
## 15     Milpitas        CA        4.5
## 16      Hialeah        FL        4.5
## 17      El Paso        TX        4.5
## 18     El Monte        CA        4.4
## 19   Union City        NJ        4.4
## 20      McAllen        TX        4.4
## 21        Pharr        TX        4.4
## 22  Brownsville        TX        4.3
## 23     Edinburg        TX        4.3
## 24       Laredo        TX        4.1

In future classes we’ll learn how to visualize this data spatially and detect spatial components of our outliers. WOWWWW!

Outliers can significantly affect the statistics we compute. Mean and standard deviation are easily affected by extreme observations since these extremes are a part of their computation. A number of robust statistics exist that are not easily affected by outliers and skew. If you are worried about the impact of outliers and skew on your analyses, consider learning more about these robust statistics such as the Interquartile Range (IQR) (middle 50% of the data set) and median.

Descriptive visualization tools

Histograms are a fantastic way to visualize the spread of numerical data for a single variable. Bar charts can help us visualize data patterns across categories. Say we want to quickly see the number of observations in our cancer data collected in each state:

ggplot(data = cancer) +
  geom_bar(mapping = aes(x = StateAbbr))

OK, that’s a nice start, but that’s a fantastically ugly bar chart. Let’s clean it up:

cancer %>% group_by(StateAbbr) %>% 
  summarize(n = n()) %>%
  ungroup() %>%
  arrange(desc(n)) %>%
  mutate(StateAbbr = ordered(StateAbbr, levels = unique(StateAbbr))) %>%
  ggplot(data = .) +
  geom_bar(mapping = aes(x = StateAbbr, y = n), stat="identity", fill = "purple", color = "orange") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  xlab("State") +
  ylab("Number of cities") +
  ggtitle("500 cities dataset")

First, YES, you can pipe straight into ggplot(). You can tell ggplot() to use the piped data with the . symbol in ggplot(data = .), which essentially tells ggplot to use what you’ve handed it via pipes. Note that once you send the data to ggplot() you’re no longer piping (%>%), but adding specifications to ggplot() using +.

Second, I’ve use dplyr to clean and organize the data I want to plot by grouping by state, counting observations by state, and arranging rows by the number of observations for each state. Notice the ungroup() and mutate() steps. We pipe the data into groups and summarize it, but to plot the nice ordered bar plot you see above with high count states at the left and low count cities on the right, you then need to ungroup the data and mutate things to order the group ID (StateAbbr) and factorize it with levels. Whew. Or you can just ignore those steps and plot bars that aren’t in order; that’s totally fine too. I also tilted the text on the x-axis so state names are easier to read and I added some color. I know, I know, purple and orange are bold choices (ahem, Go Clemson Tigers), but I wanted to show you how to change both the fill and the border of your bars. Finally, don’t forget to add stat="identity", per the geom_bar() documentation: By default, geom_bar uses stat=“count” which makes the height of the bar proportion to the number of cases in each group (or if the weight aethetic is supplied, the sum of the weights). If you want the heights of the bars to represent values in the data, use stat=“identity” and map a variable to the y aesthetic. For more on the many, many, many ways you can display your data with geom_bar(), go here.

Comparative statistics

Box and whisker plots

Back to our central challenge. We have money to promote health in regions around the country. Imagine you have to prioritize certain states, i.e. Utah gets the money but the poor folks back in Tennessee are left hanging. How would you begin to determine which state should receive support?

One of the most important things you can take away from this lab is the following:


Let me just repeat that:


Now I’ll show you how this is true. Let’s begin by grouping our data at the state level and computing the average cancer rate for each state:

cancer %>% group_by(StateAbbr) %>% 
  summarize(mean = mean(Data_Value)) %>%
  ungroup() %>%
  arrange(desc(mean)) %>%
  mutate(StateAbbr = ordered(StateAbbr, levels = unique(StateAbbr))) %>%
  ggplot(data = .) +
  geom_bar(mapping = aes(x = StateAbbr, y = mean), stat="identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  xlab("State") +
  ylab("Average rate of cancer") +
  ggtitle("500 cities dataset")

Easy. Let’s give money to Oregon and Kentucky. They seem to be doing much worse than anyone else. Probably all the fancy micro-brews and Bourbon. But wait! Remember what I said:


Enter an extremely powerful data visualization tool: box and whisker plots.

cancer %>% filter(StateAbbr %in% c("OR", "UT", "NJ", "KY", "CA", "HI")) %>%
  ggplot(data = .) +
  geom_boxplot(mapping = aes(x = StateAbbr, y = Data_Value)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  xlab("State") +
  ylab("Average rate of cancer") +
  ggtitle("500 cities dataset")

What’s going on here? I’ve selected six states, two that are the “worst off” (OR and KY), two that appear to have low cancer rates (NJ and HI), and two states in the middle (UT and CA). If you look at the bar graph, you’d say that Oregon has unquestionably higher cancer rates than California. Well, look again. Think of the box and whisker plot above as variable distributions turned on their sides. The mean of the distribution is shown by the horizontal bar. The white box around the mean represents the 25% to 75% quartiles in the distribution, and the thin vertical line represents the area describing 95% of the distribution. See that dot for NJ? That’s an outlier, a city whose value falls outside of the 95% distribution for the data set.

There are a number of different rules for determining if a point is an outlier, but the method that R and ggplot2 use is the “1.5 rule”. If a data point is:

  • less than Q1 - 1.5*IQR
  • greater than Q3 + 1.5*IQR

then that point is classed as an “outlier”. Here, Q1 and Q3 refer to the data set’s quartiles, or points at which the ranked data set can be divided into four equal groups. The IQR is the Interquartile Range or the middle 50% of a data set (Q3-Q1). The whiskers are defined as:

  • upper whisker = min(max(x), Q3 + 1.5 * IQR)
  • lower whisker = max(min(x), Q1 – 1.5 * IQR)

A way to visualize this is as follows:

Now that we understand what the box and whisker plots are helping us visualize, let’s interpret our box and whisker Boxes that are relatively tall indicate a state with cities that have many different cancer rates. California, for example, is a large state containing many observations (cities). Hawaii, on the other hand, contains only one observation so there is no variation in observed cancer rates. Notice that Oregon and Kentucky are much higher than the other box plots. This is suggestive of a difference between these states and the other states in rates of cancer. Notice, however, that the whiskers of California and Oregon overlap, suggesting that (albeit with very low probability) the distribution of cancer rates in California may not be significantly different from the distribution of cancer rates in Oregon.


So far, we’ve focused on the variation of a single variable. It is often very useful to explore how two variables covary. We can think of the covariance of two variables as the joint variation of two variables about their common mean. To remove any influence from difference in the measurement units of the two variables, we often compute correlation, which is simply the covariance of two variables divided by the product of the standard deviations of each variable.2 A correlation of +1 indicates a perfect direct relationship between two variables while -1 indicates a perfect indirect relationship.

Before we begin, we’ll need to clean our data set. Refer back to R for Data Science, Chapter 12.3 for a reminder of (1) what tidy data should look like and (2) how to gather and spread data. In our case, an observation is spread across multiple rows, i.e. each city-time is repeated in multiple rows for each Measure. Using the spread() function from the tidyr package in the tidyverse we can create a separate column for each measure. This will make it much easier to select Data_Values for specific measures to compute correlation. I start by filtering out observations we’re not interested in (select only 2014, focus on cities, and select only Age-adjusted prevalence outcomes). Then I select the subset of columns I’m interested in and spread() the data:

# spread data and subset based on Year and Data_Value_Type
health_tidy <- health %>%
  filter(GeographicLevel == "City" & Data_Value_Type == "Age-adjusted prevalence" & Year == 2014) %>%
  select(CityName, StateAbbr, Measure, Data_Value, PopulationCount) %>%
  spread(key = Measure, value = Data_Value)

Now we have tidy data! Each row is an observation (City-Year) and each column is a variable. Back to our original objective… compute correlation. Imagine you are interested in the correlation between physical activity and mental health:

cor(health_tidy$`Mental health not good for >=14 days among adults aged >=18 Years`, health_tidy$`Physical health not good for >=14 days among adults aged >=18 Years`)
## [1] 0.9402613

This suggests that there is a strong direct correlation between physical activity and mental health, i.e. higher levels of physical activity are generally associated with higher levels of mental health. So working out decreases depression? Not so fast.


Let me repeat that:


Should I repeat again? Ok.


Visit this website for many examples of how two random and unrelated variables may covary across space or time (‘Age of Miss America’ and ‘Murders by steam, hot vapours and hot objects’), though I think there’s some truth to this one as Nicolas Cage generally makes me want to drown myself:

Establishing a causal link between two variables requires fairly sophisticated statistics. If you are interested in learning more about this, I highly recommend looking into the field of causal inference. I also recommend befriending an econometrician. The point here is that you, as data producers and consumers, should be very critical of any claims you read/produce establishing a causal link between one thing and another.

Back to our health data example. A nice way to visualize correlation between two variables is with a scatter plot. Again, this won’t help us establish causality but it will help is explore general and important patterns in our data:

ggplot(data = health_tidy, aes(`Mental health not good for >=14 days among adults aged >=18 Years`, `Physical health not good for >=14 days among adults aged >=18 Years`)) +
  geom_point() +
  xlab("Mental health") +
  ylab("Physical activity")

geom_point() makes it easy for us to change the symbology of these points based on groups they belong to or other variables. We could, for example, change point size based on the population of the city.

ggplot(data = health_tidy, aes(`Mental health not good for >=14 days among adults aged >=18 Years`, `Physical health not good for >=14 days among adults aged >=18 Years`)) +
  geom_point(aes(size = PopulationCount))+
  xlab("Mental health") +
  ylab("Physical activity")