- Determine whether there is spatial structure to your data.
- Figure out how to model this using both spatial error models and spatial lag models.
- Compare spatial regression models.

- Read through the RSpatial chapter on spatial regression
- If you are unfamiliar with regression, consider reading this or watching a bit of this or this. If you like math, read this.

- I am going to assume the word “regression” is not new to you, that you know what a \(\beta\) coefficient is, that you get what \(y = mx + b\) is about, and *crucially* that you know what a residual is. Confused? Watch this.
- Regression is a complex art form. Junk regressions are everywhere. This tutorial focuses on how to deal with SPATIAL data and less on the art of regression and all the assumptions underlying OLS.
- I’ve peppered this tutorial with additional resources. If you’re unfamiliar with some of the basic stats stuff here, I’m assuming you’ll go look into the resources I’ve provided.
- You completed last week’s tutorial and understand how to make a spatial weights matrix and neighborhood.
- You remember that correlation DOES NOT IMPLY causation.

```
library(tidyverse)
library(ggExtra) # for awesome plots
library(rgdal)
library(spdep)
library(spgwr)
```

Classic linear regression estimates a *linear* relationship between a dependent variable \(y\) and a set of explanatory variables *X*:

\[ y_i = X_i \beta + \epsilon_i \] Ordinary least squares (OLS) estimation is referred to as **BLUE**, the Best Linear Unbiased Estimator. If we are interested in estimating the effect of education (\(x_1\)) on income (\(y\)), our OLS estimate of \(\beta\) in the equation above is obtained by minimizing the sum of the squared prediction errors. Say what? Let’s visualize this:

Say you’re interested in predicting the effect of education on income because you want to know why all the PhDs make the BIG BUCKS (*sarcasm*). You go out and collect information on folks’ income and education levels and plot them on the nice plot above. OLS estimates the line that *minimizes* the sum of the squared prediction errors, or the distance between the line and each point (vertical lines). This distance is the difference between the observed data from our *sample* (Bob’s income and education) and our predicted estimate of the *population* relationship (average effect of education on income for white male individuals).^{1} In this figure, the line represents the *predicted* effect of education on income. The *slope* of this predicted effect line is our \(\beta\) coefficient in the equation above. The fit of our modeled effect depends on how far the predicted effect is from each actual observation.
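If you want to recreate a plot like this yourself, here’s a minimal sketch using *simulated* data (every number and variable name below is made up for illustration): it fits an OLS line to fake income/education data and draws the residuals as vertical segments.

```r
library(ggplot2)

set.seed(42)
education <- runif(50, 8, 22)                        # fake years of schooling
income <- 15 + 2.5 * education + rnorm(50, sd = 8)   # fake income, in $1000s
df <- data.frame(education, income)

fit <- lm(income ~ education, data = df)   # OLS fit
df$predicted <- fitted(fit)                # points on the fitted line

# Observations, the fitted line, and the residuals as vertical segments
ggplot(df, aes(x = education, y = income)) +
  geom_segment(aes(xend = education, yend = predicted), color = "grey60") +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
```

The slope of the fitted line is the \(\beta\) estimate; the grey segments are the residuals whose squares OLS minimizes.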

If the key assumptions required to use this BLUE approach are true, then we can make statistical inferences about the entire *population* based on the estimated \(\beta\) coefficient. These assumptions include:

- Random errors have a mean of zero (i.e. no systematic bias or mis-specification in the population regression equation);
- Random errors have a constant variance (i.e. they are *homoskedastic*) and are *uncorrelated*;
- Random errors have a normal distribution.

**Do these assumptions hold with spatial data?**

When a value in one location depends on the values of its neighbors, our errors are no longer uncorrelated and may not have a normal distribution. Depending on the nature of the spatial dependence, OLS will be either inefficient with incorrect standard errors or biased and inconsistent. The take home message? OLS tends to break down in the face of spatial dependence…

We’ve already talked a lot about spatial dependence, but let’s take a second to think through *why* spatial dependence might occur. First, our actual data collection process may induce measurement error. The spatial units at which the data is collected may not reflect the underlying process (i.e. population sliced at Census tract boundaries or yield sliced at county levels). Second, our variable of interest may have an important spatial dynamic, i.e. location and position may drive the observed values of the process of interest.

The best visualization of these different types of spatial dependence I have found comes from this overview of spatial regression. First, let’s look at a situation in which our *error* is correlated across space, henceforth referred to as a **SPATIAL ERROR MODEL**:

Here, the subscripts \(i\) and \(j\) refer to two different locations in space, for example county \(i\) and county \(j\). Say we are looking at how county-level average education influences county-level average income. We may not expect location and distance to influence income and education.^{2} We will, however, expect that the way in which counties are sliced could mean that neighboring counties are more similar than distant counties, and that this similarity may influence our income/education estimates. In this case, the *ERROR* in our model is correlated. This means we violate our OLS BLUE assumption that our error terms are uncorrelated; our OLS estimates remain unbiased, but they are inefficient and their standard errors are wrong. These spatially correlated errors are indicative of omitted (spatially correlated) covariates. We can specify how counties neighbor one another (you know how to do this from last week) and explicitly estimate this spatial structure in the model to control for correlation in the error term.

The other type of spatial dependence is a bit more complex. In this case, the *value* of our outcome variable at position \(i\) is influenced by the *value* of our outcome variable at position \(j\). We also expect that values of our covariates at location \(i\) influence values at location \(j\). In addition, we anticipate that our errors are correlated. Basically, this is a case in which you feel fairly confident that location/distance really drive the values of the variables you’re interested in… perhaps events in one place predict an increased likelihood of similar events in neighboring places (spatial diffusion). We’ll refer to this type of model as a **SPATIAL LAG MODEL,** though you may also hear it referred to as a spatial auto-regressive model in the literature.

Formally, if we write the standard OLS as follows:

\[ y_i = X_i \beta + \epsilon_i \]

Then the *spatial error model* and *spatial lag models* differ in the way we define our error term \(\epsilon_i\). For *spatial error models*, we define this term as \[\epsilon_i = \lambda w_i \epsilon_i + u_i\] and for *spatial lag models* we define this term as \[\epsilon_i = \rho w_i y_j + u_i\] I’ll explain these key differences below.

Before I dive into the differences between spatial error models and spatial lag models, I want to provide a brief overview of the steps you should take when modeling your data with a spatial regression. Before you ever do anything with any data, you should tidy and visualize your data to understand/identify outliers and missing values, and to familiarize yourself with your data. Plot your data and look for signs of spatial patterns. Test for spatial autocorrelation in your variables of interest (`moran.test()`). If you feel like your data has spatial dependence, then I recommend working through the following steps:

- Inspect your data, construct an appropriate spatial neighborhood, and assign weights.
- Run an OLS regression with your variables of interest. This regression assumes spatial independence.
- Inspect results and plot your residuals.
- Compute the Moran’s I value for the residuals using the weights you constructed in Step 1. If you see signs of spatial dependence, continue!
- To determine whether to estimate a **spatial error model** or a **spatial lag model**, compute the Lagrange Multiplier statistics. If these statistics are the same, go robust. Don’t worry, I explain this below.
- Run that spatial regression!
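Here’s a rough sketch of what these steps might look like in code with the `ny` data we use later in this tutorial (the contiguity neighborhood and row-standardized weights are one reasonable choice, not the only one):

```r
library(spdep)

# Step 1: build a neighborhood from the tract polygons and assign weights
ny.nb <- poly2nb(ny)       # contiguity-based neighbors
ny.w  <- nb2listw(ny.nb)   # row-standardized weights (style = "W")

# Step 2: plain OLS, assuming spatial independence
ny.ols <- lm(Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME, data = ny)

# Step 3: Moran's I on the OLS residuals, using the Step 1 weights
lm.morantest(ny.ols, ny.w)

# Step 4: Lagrange Multiplier tests (standard and robust versions) to
# choose between a spatial error model and a spatial lag model
lm.LMtests(ny.ols, ny.w, test = c("LMerr", "LMlag", "RLMerr", "RLMlag"))
```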

To be really thorough, you should also test the sensitivity of your major test results and regressions to different neighborhood/weighting assumptions.

I’ll explain each of these steps below. First, I’ll provide a bit more information about the two types of spatial dependence we will explore in this lab.

Remember that spatial error models assume that only the error terms in the regression are correlated. Let’s look at this another way:

\[ y_i = X_i \beta + \epsilon_i \\ \epsilon_i = \lambda w_i \epsilon_i + u_i\] where \(u_i \sim N(0, \sigma^2 I)\).

Therefore, our final spatial error model can be specified as:

\[ y_i = X_i \beta + \lambda w_i \epsilon_i + u_i \]

Mmmmm, mathy. This is actually pretty easy to understand. In a normal regression we estimate a single error term, \(\epsilon_i\). In a spatial error model, we split this error into random error and spatially structured error. Here, the random error (unexplained by our model) is \(u_i\), a set of independent, identically distributed (i.i.d.) errors. Our spatially structured error is composed of the spatial error coefficient, \(\lambda\), and our original \(\epsilon\) error term, now weighted by the weight matrix we learned how to build last week, \(w_i\). If there is *no* spatial correlation between our errors, then \(\lambda = 0\). If \(\lambda \neq 0\), OLS is unbiased and consistent, but our standard errors will be wrong and the betas will be inefficient.
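When we get to fitting this model, the workhorse is `errorsarlm()`, which estimates exactly this specification by maximum likelihood (the function lived in `spdep` and has since moved to the `spatialreg` package). A quick sketch, assuming a weights list `ny.w` built from the `ny` polygons as in last week’s tutorial:

```r
# Sketch: maximum-likelihood spatial error model; ny.w is an assumed
# row-standardized listw object (e.g. nb2listw(poly2nb(ny)))
ny.err <- errorsarlm(Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME,
                     data = ny, listw = ny.w)
summary(ny.err)   # lambda in the output is the spatial error coefficient
```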

Let’s move on to implementing this in `R`. A quick note about our data before we begin: our data comes from the Applied Spatial Data Analysis in R text by Bivand and colleagues. This dataset covers the 281 census tracts of New York.

`ny <- readOGR("./data/NY8_utm18.shp")`

```
## OGR data source with driver: ESRI Shapefile
## Source: "./data/NY8_utm18.shp", layer: "NY8_utm18"
## with 281 features
## It has 17 fields
```

`glimpse(ny@data)`

```
## Observations: 281
## Variables: 17
## $ AREANAME <fctr> Binghamton city, Binghamton city, Binghamton city,...
## $ AREAKEY <fctr> 36007000100, 36007000200, 36007000300, 36007000400...
## $ X <dbl> 4.069397, 4.639371, 5.709063, 7.613831, 7.315968, 8...
## $ Y <dbl> -67.3533, -66.8619, -66.9775, -65.9958, -67.3183, -...
## $ POP8 <dbl> 3540, 3560, 3739, 2784, 2571, 2729, 3952, 993, 1908...
## $ TRACTCAS <dbl> 3.08, 4.08, 1.09, 1.07, 3.06, 1.06, 2.09, 0.02, 2.0...
## $ PROPCAS <dbl> 0.000870, 0.001146, 0.000292, 0.000384, 0.001190, 0...
## $ PCTOWNHOME <dbl> 0.32773109, 0.42682927, 0.33773959, 0.46160483, 0.1...
## $ PCTAGE65P <dbl> 0.14661017, 0.23511236, 0.13800481, 0.11889368, 0.1...
## $ Z <dbl> 0.14197, 0.35555, -0.58165, -0.29634, 0.45689, -0.2...
## $ AVGIDIST <dbl> 0.2373852, 0.2087413, 0.1708548, 0.1406045, 0.15777...
## $ PEXPOSURE <dbl> 3.167099, 3.038511, 2.838229, 2.643366, 2.758587, 2...
## $ Cases <dbl> 3.08284, 4.08331, 1.08750, 1.06515, 3.06017, 1.0638...
## $ Xm <dbl> 4069.397, 4639.371, 5709.063, 7613.831, 7315.968, 8...
## $ Ym <dbl> -67353.3, -66861.9, -66977.5, -65995.8, -67318.3, -...
## $ Xshift <dbl> 423391.0, 423961.0, 425030.6, 426935.4, 426637.5, 4...
## $ Yshift <dbl> 4661502, 4661993, 4661878, 4662859, 4661537, 466192...
```

Our dependent variable, `Z`, is the log transformation of the mean counts of leukemia cases by census tract in NY.^{3} Our covariates of interest (independent variables) include `PEXPOSURE`, the inverse distance to the closest inactive hazardous waste site, `PCTAGE65P`, the proportion of people 65 or older, and `PCTOWNHOME`, the percentage of people who own their own home. Now it may be reasonable to assume that space/location/distance are driving our observed rates of leukemia, but for now, we’ll assume we only expect to see spatial interactions in the error terms due to the random slicing of our data into census tracts. We’ll relax this when we estimate a spatial lag model below, then we can compare the two!

Anytime you start a regression, or any data analysis for that matter, start by LOOKING at your data. Let’s start by inspecting the relationship between our covariate of interest, distance to the closest TCE location (`PEXPOSURE`), and rates of leukemia:

```
library(ggExtra)
p <- ggplot(data = ny@data, aes(x = PEXPOSURE, y = Z, color = AREANAME)) +
  geom_point() +
  theme(legend.position = "none") +
  xlab("Exposure") +
  ylab("Leukemia rates")
ggMarginal(p, type = "histogram")
```

I don’t see any clear relationship here, but let’s keep playing. As I said above in the **Steps** section, it’s often a good idea just to start with a simple linear model that doesn’t account for spatial dynamics. You can pull the residuals out of the model (the noise not explained by your model, \(\hat{\epsilon}_i = y_i - \hat{y}_i\)) and see if there are interesting spatial patterns at play. This would suggest that spatial dynamics explain some of the variation in your data and that you should explore a bit more! Let’s estimate a nice old OLS regression with our variables of interest:

```
ny.ols <- lm(Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME, data = ny)
summary(ny.ols)
```

```
##
## Call:
## lm(formula = Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME, data = ny)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7417 -0.3957 -0.0326 0.3353 4.1398
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.51728 0.15856 -3.262 0.00124 **
## PEXPOSURE 0.04884 0.03506 1.393 0.16480
## PCTAGE65P 3.95089 0.60550 6.525 3.22e-10 ***
## PCTOWNHOME -0.56004 0.17031 -3.288 0.00114 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6571 on 277 degrees of freedom
## Multiple R-squared: 0.1932, Adjusted R-squared: 0.1844
## F-statistic: 22.1 on 3 and 277 DF, p-value: 7.306e-13
```

`lm()` takes a formula and a dataset as its arguments. You specify the formula as `DEPENDENT_VAR ~ INDEPENDENT_VAR_1 + INDEPENDENT_VAR_2, etc`. Our results show us the `Estimate` (think \(\beta\) coefficient), the standard error of that estimate (`Std. Error`), and whether that estimate is significant (`Pr(>|t|)`). These results suggest that `PEXPOSURE` is not significant, but that our other census variables are (home ownership and percent oldies). It’s also useful to glance at our `R-squared` estimates. \(R^2\) gives us a sense of how well our model fits the data… it tells us how much variation we’ve explained out of the TOTAL variation in the model. A higher \(R^2\) means we’ve explained more of the variation. If you add tons of variables to a regression, you’ll probably explain more of the variation. Adjusted \(R^2\) penalizes you as you add more variables to your model by taking into account the ratio \((N-1)/(N-k-1)\), where \(N\) is the number of observations and \(k\) is the number of predictors in your model. In our case, the fit ain’t stellar, i.e. there are probably other significant factors that explain variation in leukemia rates that we haven’t included in our model. More on \(R^2\) here and here.
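You can watch the adjusted \(R^2\) penalty in action with a toy simulation (all data below is fabricated): adding a pure-noise predictor nudges \(R^2\) up, but the penalty usually drags adjusted \(R^2\) back down.

```r
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 2 * x + rnorm(n)   # y depends on x plus noise
junk <- rnorm(n)        # a predictor with no relationship to y

fit1 <- lm(y ~ x)
fit2 <- lm(y ~ x + junk)

# Multiple R^2 never decreases when you add a predictor...
summary(fit1)$r.squared
summary(fit2)$r.squared

# ...but adjusted R^2 applies the (N-1)/(N-k-1) penalty
summary(fit1)$adj.r.squared
summary(fit2)$adj.r.squared
```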

What we really care about is our residuals. In classic \(y_i = X_i\beta + \epsilon_i\) regression world, our residuals are caught up in \(\epsilon\), and the residual standard error measures the standard deviation of \(\epsilon\). If the residual standard error were `0`, our model would fit the data perfectly. This never happens… there’s always some noise in our data that we can’t explain, and this is captured in our residuals. Remember that residuals are the distance between the observed values (Bob’s income and education) and the predicted relationship between variables of interest (the estimated linear effect of education on income).

Let’s look at our residuals.^{4}

```
ny$resid <- residuals(ny.ols) # pull residuals out of the OLS object and add them to our ny shapefile
spplot(ny, "resid") # map the residuals across census tracts
```