- Understand the difference between a sample and a population.
- Identify and query missing data.
- Compute and visualize descriptive statistics.
- Identify and visualize outliers.
- Visualize and describe variable distributions.
- Begin multivariate data visualization.
- Compute and visualize correlation.

- Review the chapters on Exploratory Data Analysis in the R for Data Science text and on data visualization. Think of these as reference texts. They are the best written overviews of exploratory data analysis and visualization I’ve come across.

- I am assuming you’ve followed a basic statistics course and have a sense of what the following variables mean and how they are computed: mean, median, mode; standard deviation, standard error, variance; quartiles; confidence intervals.
- We’re working with a big, messy dataset this week because most datasets are big and messy. If this dataset is
*too*big for to load on your computer, shoot me an email and I’ll send you a subset of the full dataset.

If you’re new to `R`

, look at the following functions before beginning the tutorial:

`read.csv()`

- this function lets you load CSV files, or comma separated value files, in`R`

, more info here`unique()`

- this computes the unique values in a vector, so for a list of 1, 1, 2, 2, 3, the unique function will return 1, 2, 3; more info here`hist()`

- our old friend the histogram functino! This will plot the histogram of a variable. More info here.`$`

- remember that this is used to select columns from`data.frames`

, so`data$var1`

returns the column called`var`

in the`data.frame`

called`data`

.`mean()`

,`median()`

,`sd()`

compute the mean, median, and mode of a vector respectively.- At some point, I play with the
`print()`

and`paste()`

functions.`paste()`

let’s you pull together strings. Try running`paste("Hey", " you")`

.`print()`

just spits out the contents of a variable to the console.

For this lab, you’ll need to make sure the following packages are installed (`install.pakages()`

) and loaded (`library()`

) into your `RStudio`

session.

```
library(tidyverse)
library(reshape2)
library(corrplot)
```

This week we’ll be playing with the 500 Cities data set. Here’s a brief introduction to our data:

*The 500 Cities project is a collaboration between CDC, the Robert Wood Johnson Foundation, and the CDC Foundation. The purpose of the 500 Cities Project is to provide city- and census tract-level small area estimates for chronic disease risk factors, health outcomes, and clinical preventive service use for the largest 500 cities in the United States. These small area estimates will allow cities and local health departments to better understand the burden and geographic distribution of health-related variables in their jurisdictions, and assist them in planning public health interventions.*

The project website has tons of fantastic mapping and visualization tools, so go check it out. You can find specific information about each of the variables (columns) in the data set here. Columns to identify the data include `State`

, `CityName`

, `GeographicLocation`

, and `Year`

. You can organize the data topically using the `Category`

column or by data source with `DataSource`

. The actual variables being measured are found in the `Measure`

column and the units for each of these in the `Data_Value_Unit`

column. The actual value for the `Measure`

is found in the `Data_Value`

column. Notice that this data set is *not* tidy. Ideally, we would split each different `Measure`

into a separate column. For now, however, we’ll work with the data set as is. We’ll fix things below when we work through correlation.

Crucially, in addition to the `Data_Value`

, low and high limit confidence intervals are listed in the `Low_Confidence_Limit`

and `High_Confidence_Limit`

columns. Don’t know what a confidence limit is? We’ll get there, but this might be a gentle nudge for you to check out the statistics resources I’ve listed on the More Resources page on CANVAS.

```
health <- read.csv("./data/500_cities_health.csv")
glimpse(health) #dplyr's version of head(), cleaner for big datasets like this one
```

```
## Observations: 810,103
## Variables: 24
## $ Year <int> 2014, 2014, 2014, 2014, 2014, 2014,...
## $ StateAbbr <fctr> US, US, US, US, US, US, US, US, US...
## $ StateDesc <fctr> United States, United States, Unit...
## $ CityName <fctr> , , , , , , , , , , , , , , , , , ...
## $ GeographicLevel <fctr> US, US, US, US, US, US, US, US, US...
## $ DataSource <fctr> BRFSS, BRFSS, BRFSS, BRFSS, BRFSS,...
## $ Category <fctr> Prevention, Prevention, Health Out...
## $ UniqueID <fctr> 59, 59, 59, 59, 59, 59, 59, 59, 59...
## $ Measure <fctr> Current lack of health insurance a...
## $ Data_Value_Unit <fctr> %, %, %, %, %, %, %, %, %, %, %, %...
## $ DataValueTypeID <fctr> AgeAdjPrv, CrdPrv, AgeAdjPrv, CrdP...
## $ Data_Value_Type <fctr> Age-adjusted prevalence, Crude pre...
## $ Data_Value <dbl> 14.9, 14.1, 23.5, 25.6, 16.8, 16.0,...
## $ Low_Confidence_Limit <dbl> 14.6, 13.8, 23.3, 25.4, 16.6, 15.8,...
## $ High_Confidence_Limit <dbl> 15.2, 14.3, 23.7, 25.9, 17.1, 16.2,...
## $ Data_Value_Footnote_Symbol <fctr> , , , , , , , , , , , , , , , , , ...
## $ Data_Value_Footnote <fctr> , , , , , , , , , , , , , , , , , ...
## $ PopulationCount <int> 308745538, 308745538, 308745538, 30...
## $ GeoLocation <fctr> , , , , , , , , , , , , , , , , , ...
## $ CategoryID <fctr> PREVENT, PREVENT, HLTHOUT, HLTHOUT...
## $ MeasureId <fctr> ACCESS2, ACCESS2, ARTHRITIS, ARTHR...
## $ CityFIPS <int> NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ TractFIPS <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ Short_Question_Text <fctr> Health Insurance, Health Insurance...
```

There are some pretty interesting health metrics collected in this data set:

`unique(health$Measure)`

```
## [1] Current lack of health insurance among adults aged 18â64 Years
## [2] Arthritis among adults aged >=18 Years
## [3] Binge drinking among adults aged >=18 Years
## [4] High blood pressure among adults aged >=18 Years
## [5] Taking medicine for high blood pressure control among adults aged >=18 Years with high blood pressure
## [6] Cancer (excluding skin cancer) among adults aged >=18 Years
## [7] Current asthma among adults aged >=18 Years
## [8] Coronary heart disease among adults aged >=18 Years
## [9] Visits to doctor for routine checkup within the past Year among adults aged >=18 Years
## [10] Cholesterol screening among adults aged >=18 Years
## [11] Fecal occult blood test, sigmoidoscopy, or colonoscopy among adults aged 50â75 Years
## [12] Chronic obstructive pulmonary disease among adults aged >=18 Years
## [13] Physical health not good for >=14 days among adults aged >=18 Years
## [14] Older adult men aged >=65 Years who are up to date on a core set of clinical preventive services: Flu shot past Year, PPV shot ever, Colorectal cancer screening
## [15] Older adult women aged >=65 Years who are up to date on a core set of clinical preventive services: Flu shot past Year, PPV shot ever, Colorectal cancer screening, and Mammogram past 2 Years
## [16] Current smoking among adults aged >=18 Years
## [17] Visits to dentist or dental clinic among adults aged >=18 Years
## [18] Diagnosed diabetes among adults aged >=18 Years
## [19] High cholesterol among adults aged >=18 Years who have been screened in the past 5 Years
## [20] Chronic kidney disease among adults aged >=18 Years
## [21] No leisure-time physical activity among adults aged >=18 Years
## [22] Mammography use among women aged 50â74 Years
## [23] Mental health not good for >=14 days among adults aged >=18 Years
## [24] Obesity among adults aged >=18 Years
## [25] Papanicolaou smear use among adult women aged 21â65 Years
## [26] Sleeping less than 7 hours among adults aged >=18 Years
## [27] Stroke among adults aged >=18 Years
## [28] All teeth lost among adults aged >=65 Years
## 28 Levels: All teeth lost among adults aged >=65 Years ...
```

Imagine you work for the CDC. You’ve got a pile of money to allocate across US cities to promote the general health and well-being of urban populations. Your task is to generate a report to guide the allocation of available financial resources. How would you proceed?

A good first step is for you to familiarize yourself with the data. It’s crucial for you to have a deep understanding of *how* the data was collected and the assumptions made when collecting the data. Key things to explore and understand:

- Target population and sampling strategy
- Variable measurement and construction
- Potential bias

In this tutorial, we will focus on the `Measure`

of the prevalence of `Cancer (excluding skin cancer) among adults aged >=18 years`

. This variable is the ratio of `Respondantss aged >18 who report ever having been told by a doctor, nurse, or other health professional that they have any other types of cancer`

divided by the total number of respondents aged 18 or older. I know from reading about this variable on the 500 cities website that it was collected from the Behavioral Risk Factor Surveillance System which is a national telephone survey that collects state-level data about U.S. resident behaviors and health outcomes. The BRFSS completes around 400,000 phone interviews annually. Phone surveys are notoriously non-representative. How do you reach people without landlines or cellphones? People who work odd hours and don’t answer when you call? People who have three kids and no time to answer a phone call to take a survey about their health? You get the idea… No dataset’s is perfect, and it is important to keep each dataset’s limitations in mind as we work through our analyses.

Once you familiarize yourself with the dataset content and variable construction, it’s good practice explore variables of interest by plotting their *distribution*. This helps you get a sense of the variable’s center (mean/median/mode) and spread (standard deviation/variance). The best way to quickly visualize a variable’s distribution is using a *histogram*. Histograms are a visual representation of the count of instances of a variable that fall into an interval of values. They are easy to plot in base `R`

with the `hist()`

function:

```
cancer <- health %>% filter(Measure == "Cancer (excluding skin cancer) among adults aged >=18 Years")
hist(cancer$Data_Value, 100, main = "Adult cancer rate", xlab = "Rate amongst adults older than 18", ylab = "Frequency")
```

The `100`

specifies the number of bins into which we want to place our data set. Try changing the number of bins and see what happens. `main`

specifies the main title. You can change axis labels using `xlab`

and `ylab`

.

When I look at this histogram, the first thing I see is that the data has a long right-hand tail. This suggests that there is a group of cities that have abnormally high cancer rates. `dplyr`

can help us quickly identify these cities. By looking at the histogram, counties with rates above `10`

seem to be abnormally low. But we’re data scientists, so let’s use something a bit more serious than visual inspection of a histogram. Let’s start with our measures of center:

`mean(cancer$Data_Value)`

`## [1] NA`

`median(cancer$Data_Value)`

`## [1] NA`

Huh!? What’s going on here!? Important `R`

lesson. Most datasets have missing values, which we signal in `R`

using `NA`

. `NA`

means `Not Available`

, i.e. this data was not collected or is missing for some other reason. You may also see `NaN`

, which is also important but *different*. `NaN`

means `Not a Number`

. If you try to do something crazy like `0/0`

, R will return `NaN`

(more info about this can be found here. When you try to compute base `R`

functions on vectors or matrices that contain `NAs`

, `R`

will return `NA`

. So are we doomed? No, it’s OK, simply add this:

`mean(cancer$Data_Value, na.rm = T)`

`## [1] 5.448483`

`median(cancer$Data_Value, na.rm = T)`

`## [1] 5.3`

Ok, it’s not actually that simple (nothing ever is guys). If you have missing data, you need to figure out WHERE your data is missing and if possible WHY. Missing data, just like sampling strategies, can introduce bias into any analyses you conduct with your data. This is especially important when working with survey data where characteristics of respondents may create systematic variations in response rates for certain questions. For example, marginalized folks may not want to disclose information because it makes them more vulnerable. Let’s use `dplyr`

to see where we have missing data:

```
cancer %>% filter(is.na(Data_Value)) %>%
group_by(CityName) %>%
summarize(n = n()) %>%
arrange(desc(n))
```

```
## # A tibble: 277 x 2
## CityName n
## <fctr> <int>
## 1 Houston 50
## 2 New York 23
## 3 Kansas City 16
## 4 Dallas 15
## 5 San Antonio 15
## 6 Knoxville 14
## 7 Tucson 12
## 8 San Diego 10
## 9 Gastonia 9
## 10 Los Angeles 9
## # ... with 267 more rows
```

This code uses `dplyr`

to filter code where the `Data_Value`

is listed as `NA`

using the `is.na()`

function. `is.na()`

returns a `TRUE`

or `FALSE`

based on, well, whether or not your data is `NA`

. If `x <- NA`

, then `is.na(x)`

is `TRUE`

.

I don’t see any clear patterns in the cities with missing data here except for under-representation of Houston. Let’s look at another variable, ’GeographicLevel`, which is the level at which data is collected (national, census tract, or city):

```
cancer %>% filter(is.na(Data_Value)) %>%
group_by(GeographicLevel) %>%
summarize(n=n()) %>%
arrange(desc(n))
```

```
## # A tibble: 1 x 2
## GeographicLevel n
## <fctr> <int>
## 1 Census Tract 800
```

Huh, interesting. Here we see that all of our missing data is at the `Census Tract`

level. If we extract only data collected at the level of a city, we should remove missing observations.

```
cancer <- cancer %>% filter(GeographicLevel == "City")
mean(cancer$Data_Value)
```

`## [1] 5.7263`

`median(cancer$Data_Value)`

`## [1] 5.8`

Excellent. We no longer need to add `na.rm=T`

to our `mean()`

and `median()`

functions since we’ve removed rows containing missing data.

After inspecting the cleaned `cancer`

data set, I realized that there are also multiple `Data_Value_Types`

for this `Measure`

. This includes `Crude Prevalence`

and `Age-adjusted prevalence`

, i.e two different ways to “adjust” the prevalance of different health issues, one overall prevalence (`Crude Prevalence`

) and the other adjusted by age (`Age-adjusted prevalence`

). We’ll need to select one to proceed. Let’s go with `Age-adjusted prevalence`

:

`cancer <- cancer %>% filter(Data_Value_Type == "Age-adjusted prevalence")`

Let’s re-plot the histogram^{1}:

`hist(cancer$Data_Value, 100, main = "Adult cancer rate", xlab = "Age-adjusted prevalence amongst adults older than 18", ylab = "Frequency")`

Now that we’ve resolved the `NA`

issue, let’s get back to our first task: to identify cities with abnormally high and low rates of cancer. We typically refer to observations with abnormally high or low observations as *outliers*. Outliers are observations whose values are abnormally far from the other values in a sample. The *standard deviation* of a variable describes the spread of observations of that variable around the central value of the data set (mean). We can compute standard deviation to identify variables at the tails of the distribution above. Let’s compute the standard deviation of these rates and generate lists of cities with exceptionally high (> 2 standard deviations) and low (<2 standard deviations) cancer rates:

```
sd <- sd(cancer$Data_Value)
print(paste("Our standard deviation is:", sd))
```

`## [1] "Our standard deviation is: 0.494553745121425"`

```
highsd <- mean(cancer$Data_Value) + 2*sd
lowsd <- mean(cancer$Data_Value) - 2*sd
print(paste("Very low rates of cancer are below", lowsd, "while very high rates are above", highsd, sep=" "))
```

`## [1] "Very low rates of cancer are below 4.86329250975715 while very high rates are above 6.84150749024285"`

Now let’s use `dplyr`

to identify cities of high and low cancer rates:

```
high_rate <- cancer %>% filter(Data_Value > highsd) %>%
select(CityName, StateAbbr, Data_Value) %>%
arrange(desc(Data_Value))
low_rate <- cancer %>% filter(Data_Value < lowsd) %>%
select(CityName, StateAbbr, Data_Value) %>%
arrange(desc(Data_Value))
```

First, our high rate cities:

```
## CityName StateAbbr Data_Value
## 1 Spokane WA 6.9
## 2 Spokane Valley WA 6.9
```

And our low rate:

```
## CityName StateAbbr Data_Value
## 1 Miami FL 4.8
## 2 Cicero IL 4.8
## 3 Camden NJ 4.8
## 4 Newark NJ 4.8
## 5 Lynwood CA 4.7
## 6 Santa Ana CA 4.7
## 7 Union City CA 4.7
## 8 Passaic NJ 4.7
## 9 Paterson NJ 4.7
## 10 Daly City CA 4.6
## 11 South Gate CA 4.6
## 12 Mission TX 4.6
## 13 Alhambra CA 4.5
## 14 Baldwin Park CA 4.5
## 15 Milpitas CA 4.5
## 16 Hialeah FL 4.5
## 17 El Paso TX 4.5
## 18 El Monte CA 4.4
## 19 Union City NJ 4.4
## 20 McAllen TX 4.4
## 21 Pharr TX 4.4
## 22 Brownsville TX 4.3
## 23 Edinburg TX 4.3
## 24 Laredo TX 4.1
```

In future classes we’ll learn how to visualize this data spatially and detect spatial components of our outliers. WOWWWW!

Outliers can significantly affect the statistics we compute. Mean and standard deviation are easily affected by extreme observations since these extremes are a part of their computation. A number of *robust* statistics exist that are not easily affected by outliers and skew. If you are worried about the impact of outliers and skew on your analyses, consider learning more about these *robust* statistics such as the Interquartile Range (IQR) (middle 50% of the data set) and median.

Histograms are a fantastic way to visualize the spread of numerical data for a single variable. Bar charts can help us visualize data patterns across categories. Say we want to quickly see the number of observations in our cancer data collected in each state:

```
ggplot(data = cancer) +
geom_bar(mapping = aes(x = StateAbbr))
```

OK, that’s a nice start, but that’s a fantastically ugly bar chart. Let’s clean it up:

```
cancer %>% group_by(StateAbbr) %>%
summarize(n = n()) %>%
ungroup() %>%
arrange(desc(n)) %>%
mutate(StateAbbr = ordered(StateAbbr, levels = unique(StateAbbr))) %>%
ggplot(data = .) +
geom_bar(mapping = aes(x = StateAbbr, y = n), stat="identity", fill = "purple", color = "orange") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
xlab("State") +
ylab("Number of cities") +
ggtitle("500 cities dataset")
```

First, YES, you can pipe straight into `ggplot()`

. You can tell `ggplot()`

to use the piped data with the `.`

symbol in `ggplot(data = .)`

, which essentially tells `ggplot`

to use what you’ve handed it via pipes. Note that once you send the data to `ggplot()`

you’re no longer piping (`%>%`

), but adding specifications to `ggplot()`

using `+`

.

Second, I’ve use `dplyr`

to clean and organize the data I want to plot by grouping by state, counting observations by state, and arranging rows by the number of observations for each state. Notice the `ungroup()`

and `mutate()`

steps. We pipe the data into groups and summarize it, but to plot the nice ordered bar plot you see above with high count states at the left and low count cities on the right, you then need to ungroup the data and mutate things to order the group ID (`StateAbbr`

) and factorize it with levels. Whew. Or you can just ignore those steps and plot bars that aren’t in order; that’s totally fine too. I also tilted the text on the x-axis so state names are easier to read and I added some color. I know, I know, purple and orange are bold choices (ahem, Go Clemson Tigers), but I wanted to show you how to change both the fill and the border of your bars. Finally, don’t forget to add `stat="identity"`

, per the `geom_bar()`

documentation: *By default, geom_bar uses stat=“count” which makes the height of the bar proportion to the number of cases in each group (or if the weight aethetic is supplied, the sum of the weights). If you want the heights of the bars to represent values in the data, use stat=“identity” and map a variable to the y aesthetic.* For more on the many, many, many ways you can display your data with `geom_bar()`

, go here.

Back to our central challenge. We have money to promote health in regions around the country. Imagine you have to prioritize certain states, i.e. Utah gets the money but the poor folks back in Tennessee are left hanging. How would you begin to determine which state should receive support?

One of the most important things you can take away from this lab is the following:

NOT ALL OBSERVED DIFFERENCES ARE STATISTICALLY DIFFERENT.

Let me just repeat that:

**NOT ALL OBSERVED DIFFERENCES ARE STATISTICALLY DIFFERENT.**

Now I’ll show you how this is true. Let’s begin by grouping our data at the state level and computing the average cancer rate for each state:

```
cancer %>% group_by(StateAbbr) %>%
summarize(mean = mean(Data_Value)) %>%
ungroup() %>%
arrange(desc(mean)) %>%
mutate(StateAbbr = ordered(StateAbbr, levels = unique(StateAbbr))) %>%
ggplot(data = .) +
geom_bar(mapping = aes(x = StateAbbr, y = mean), stat="identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
xlab("State") +
ylab("Average rate of cancer") +
ggtitle("500 cities dataset")
```

Easy. Let’s give money to Oregon and Kentucky. They seem to be doing much worse than anyone else. Probably all the fancy micro-brews and Bourbon. But wait! Remember what I said:

**NOT ALL OBSERVED DIFFERENCES ARE STATISTICALLY DIFFERENT.**

Enter an extremely powerful data visualization tool: box and whisker plots.

```
cancer %>% filter(StateAbbr %in% c("OR", "UT", "NJ", "KY", "CA", "HI")) %>%
ggplot(data = .) +
geom_boxplot(mapping = aes(x = StateAbbr, y = Data_Value)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
xlab("State") +
ylab("Average rate of cancer") +
ggtitle("500 cities dataset")
```

What’s going on here? I’ve selected six states, two that are the “worst off” (OR and KY), two that appear to have low cancer rates (NJ and HI), and two states in the middle (UT and CA). If you look at the bar graph, you’d say that Oregon has unquestionably higher cancer rates than California. Well, look again. Think of the box and whisker plot above as variable distributions turned on their sides. The mean of the distribution is shown by the horizontal bar. The white box around the mean represents the 25% to 75% quartiles in the distribution, and the thin vertical line represents the area describing 95% of the distribution. See that dot for `NJ`

? That’s an *outlier*, a city whose value falls outside of the 95% distribution for the data set.

There are a number of different rules for determining if a point is an outlier, but the method that `R`

and `ggplot2`

use is the “1.5 rule”. If a data point is:

- less than Q1 - 1.5*IQR
- greater than Q3 + 1.5*IQR

then that point is classed as an “outlier”. Here, `Q1`

and `Q3`

refer to the data set’s quartiles, or points at which the ranked data set can be divided into four equal groups. The `IQR`

is the Interquartile Range or the middle 50% of a data set (Q3-Q1). The whiskers are defined as:

- upper whisker = min(max(x), Q3 + 1.5 * IQR)
- lower whisker = max(min(x), Q1 Ã¢ÂÂ 1.5 * IQR)

A way to visualize this is as follows:

Now that we understand what the box and whisker plots are helping us visualize, let’s interpret our box and whisker Boxes that are relatively tall indicate a state with cities that have many different cancer rates. California, for example, is a large state containing many observations (cities). Hawaii, on the other hand, contains only one observation so there is no variation in observed cancer rates. Notice that Oregon and Kentucky are much higher than the other box plots. This is suggestive of a difference between these states and the other states in rates of cancer. Notice, however, that the whiskers of California and Oregon overlap, suggesting that (albeit with very low probability) the distribution of cancer rates in California may not be significantly different from the distribution of cancer rates in Oregon.

So far, we’ve focused on the variation of a single variable. It is often very useful to explore how two variables *covary*. We can think of the *covariance* of two variables as the joint variation of two variables about their common mean. To remove any influence from difference in the measurement units of the two variables, we often compute *correlation*, which is simply the covariance of two variables divided by the product of the standard deviations of each variable.^{2} A correlation of +1 indicates a perfect *direct* relationship between two variables while -1 indicates a perfect *indirect* relationship.

Before we begin, we’ll need to clean our data set. Refer back to R for Data Science, Chapter 12.3 for a reminder of (1) what tidy data should look like and (2) how to gather and spread data. In our case, an observation is spread across multiple rows, i.e. each city-time is repeated in multiple rows for each `Measure`

. Using the `spread()`

function from the `tidyr`

package in the `tidyverse`

we can create a separate column for each measure. This will make it much easier to select `Data_Values`

for specific measures to compute correlation. I start by filtering out observations we’re not interested in (select only 2014, focus on cities, and select only `Age-adjusted prevalence`

outcomes). Then I select the subset of columns I’m interested in and `spread()`

the data:

```
# spread data and subset based on Year and Data_Value_Type
health_tidy <- health %>%
filter(GeographicLevel == "City" & Data_Value_Type == "Age-adjusted prevalence" & Year == 2014) %>%
select(CityName, StateAbbr, Measure, Data_Value, PopulationCount) %>%
spread(key = Measure, value = Data_Value)
```

Now we have tidy data! Each row is an observation (City-Year) and each column is a variable. Back to our original objective… compute correlation. Imagine you are interested in the correlation between physical activity and mental health:

`cor(health_tidy$`Mental health not good for >=14 days among adults aged >=18 Years`, health_tidy$`Physical health not good for >=14 days among adults aged >=18 Years`)`

`## [1] 0.9402613`

This suggests that there is a strong direct correlation between physical activity and mental health, i.e. higher levels of physical activity are generally associated with higher levels of mental health. So working out decreases depression? Not so fast.

CORRELATION DOES NOT IMPLY CAUSATION.

Let me repeat that:

**CORRELATION DOES NOT IMPLY CAUSATION**

Should I repeat again? Ok.

**CORRELATION DOES NOT IMPLY CAUSATION**

Visit this website for many examples of how two random and unrelated variables may covary across space or time (‘Age of Miss America’ and ‘Murders by steam, hot vapours and hot objects’), though I think there’s some truth to this one as Nicolas Cage generally makes me want to drown myself:

Establishing a causal link between two variables requires fairly sophisticated statistics. If you are interested in learning more about this, I highly recommend looking into the field of causal inference. I also recommend befriending an econometrician. The point here is that you, as data producers and consumers, should be very critical of any claims you read/produce establishing a causal link between one thing and another.

Back to our health data example. A nice way to visualize *correlation* between two variables is with a scatter plot. Again, this won’t help us establish causality but it will help is explore general and important patterns in our data:

```
ggplot(data = health_tidy, aes(`Mental health not good for >=14 days among adults aged >=18 Years`, `Physical health not good for >=14 days among adults aged >=18 Years`)) +
geom_point() +
xlab("Mental health") +
ylab("Physical activity")
```

`geom_point()`

makes it easy for us to change the symbology of these points based on groups they belong to or other variables. We could, for example, change point size based on the population of the city.

```
ggplot(data = health_tidy, aes(`Mental health not good for >=14 days among adults aged >=18 Years`, `Physical health not good for >=14 days among adults aged >=18 Years`)) +
geom_point(aes(size = PopulationCount))+
xlab("Mental health") +
ylab("Physical activity")
```