Pre-lab assignment

My assumptions

Set up

library(ggExtra) # for awesome plots


Classic linear regression estimates a linear relationship between a dependent variable \(y\) and a set of explanatory variables X:

\[ y_i = X_i \beta + \epsilon_i \] Ordinary least squares (OLS) estimation is referred to as BLUE or as a Best Linear Unbiased Estimator. If we are interested in estimating the effect of education (\(x_1\)) on income (\(y\)), our OLS estimate of \(\beta\) in the equation above is estimated by minimizing the sum of the squared prediction errors. Say what? Let’s visualize this:

Say you’re interested in predicting the effect of education on income because you want to know why all the PhDs make the BIG BUCKS (sarcasm). You go out and collect information on folks’ income and education levels and plot them on the nice plot above. OLS estimates the line that minimizes the sum of the squared prediction errors, or the distance between the line and each point (vertical lines). This distance is the difference between the observed data from our sample (Bob’s income and education) and our predicted estimate of the population relationship (average effect of education on income for white male individuals).1 In this figure, the line represents the predicted effect of education on income. The slope of this predicted effect line is our \(\beta\) coefficient in the equation above. The fit of our modeled effect depends on how far the predicted effect is from each actual observation.

If the key assumptions required to use this BLUE approach are true, then we can make statistical inferences about the entire population based on the estimated \(\beta\) coefficient. These assumptions include:

  1. Random errors have a mean of zero (i.e. no systematic bias or mis-specification in the population regression equation);
  2. The random errors have a constant variances (i.e. homoskedastic) and are uncorrelated;
  3. The random errors have a normal distribution.

Do these assumptions hold with spatial data?

When a value in one location depends on the values of its neighbors, our errors are no longer uncorrelated and may not have a normal distribution. Depending on the nature of the spatial dependence, OLS will be either inefficient with incorrect standard errors or biased and inconsistent. The take home message? OLS tends to break down in the face of spatial dependence…

We’ve already talked a lot about spatial dependence, but let’s take a second to think through why spatial dependence might occur. First, our actual data collection process may induce measurement error. The spatial units at which the data is collected may not reflect the underlying process (i.e. population sliced at Census tract boundaries or yield sliced at county levels). Second, our variable of interest may have an important spatial dynamic, i.e. location and position may drive the observed values of the process of interest.

The best visualization of these different types of spatial dependence I have found comes from this overview of spatial regression. First, let’s look at a situation in which our error is correlated across space, henceforth referred to as a SPATIAL ERROR MODEL:

Here, the subscripts \(i\) and \(j\) refer to two different locations in space, for example county \(i\) and county \(j\). Say we are looking at how average county-level average education influences county-level average income. We may not expect location and distance to influence income and education.2 We will, however, expect that the way in which counties are sliced could mean that neighboring counties are more similar than distant counties, and that this similarity may influence our income/education estimates. In this case, the ERROR in our model is correlated. This means we violate our OLS BLUE assumption that our error terms are uncorrelated and therefore any OLS estimates will be inefficient and biased. These spatially correlated errors are indicative of omitted (spatially correlated) covariates. We can specify how counties neighbor one another (you know how to do this from last week) and explicitly estimate this spatial structure in the model to control for correlation in the error term.

The other type of spatial dependence is a bit more complex. In this case, the value of our outcome variable at position \(i\) is influenced by the value of our outcome variable at position \(j\). We also expect that values of our covariates at location \(i\) influence values at location \(j\). In addition, we anticipate that our errors are correlated. Basically, this is a case in which you feel fairly confident that location/distance really drive the values of the variables you’re interested in… perhaps events in one place predict an increased likelihood of similar events in neighboring places (spatial diffusion). We’ll refer to this type of model as a SPATIAL LAG MODEL, though you may also hear it referred to as a spatial auto-regressive model in the literature.

Formally, if we write the standard OLS as follows:

\[ y_i = X_i \beta + \epsilon_i \]

Then the spatial error model and spatial lag models differ in the way we define our error term \(\epsilon_i\). For spatial error models, we define this term as \[\epsilon_i = \lambda w_i \epsilon_i + u_i\] and for spatial lag models we define this term as \[\epsilon_i = \rho w_i y_j + u_i\] I’ll explain these key differences below.


Before I dive into the differences between spatial error models and spatial lag models, I want to provide a brief overview of the steps you should take when modeling your data with a spatial regression. Before you ever do anything with any data, you should tidy and visualize your data to understand/identify outliers and missing values, and to familiarize yourself with your data. Plot your data and look for signs of spatial patterns. Test for spatial autocorrelation in your variables of interest (moran.test()). If you feel like your data has spatial dependence, then I recommend working through the following steps:

  1. Inspect your data, construct an appropriate spatial neighborhood, and assign weights.
  2. Run an OLS regression with your variables of interest. This regression assumes spatial independence.
  3. Inspect results and plot your residuals.
  4. Compute the Moran’s I value for the residuals using the weights you constructed in Step 1. If you see signs of spatial dependence, continue!
  5. To determine whether to estimate a spatial error model or a spatial lag model, compute the Lagrange Multiplier statistics. If these statistics are the same, go robust. Don’t worry, I explain this below.
  6. Run that spatial regression!

To be really thorough, I would test sensitivity of your major test results and regressions to different neighborhood/weighting assumptions.

I’ll explain each of these steps below. First, I’ll provide a bit more information the two types of spatial dependence we will explore in this lab.

Spatial error models

Remember that spatial error models assume that only the error terms in the regression are correlated. Let’s look at this another way:

\[ y_i = X_i \beta + \epsilon_i \\ \epsilon_i = \lambda w_i \epsilon_i + u_i\] where \(u_i \sim N(0, \sigma^2 I)\).

Therefore, our final spatial error model can be specified as:

\[ y_i = X_i \beta + \lambda w_i \epsilon_i + u_i \]

Mmmmm, mathy. This is actually pretty easy to understand. In a normal regression we estimate a single error term, \(\epsilon_i\). In a spatial error model, we split this error into random error and spatially structured error. Here, random error (unexplained by our model) is \(u_i\), independent identically distributed (i.i.d.) errors. Our spatially structured error is composed of the the spatial error coefficient, \(\lambda\), and our original \(\epsilon\) error term now weighted by our weight matrix we learned how to build last week, \(w_i\). If there is no spatial correlation between our errors, then \(\lambda = 0\). If \(\lambda \neq 0\), OLS is unbiased and consistent, but our standard errors will be wrong and the betas will be inefficient.

Let’s move on to implementing this in R. A quick note about our data before we begin: Our data comes from the Applied Spatial Data Analysis in R text by Bivand and colleagues. This dataset covers the 281 census tracts of New York.

ny <- readOGR("./data/NY8_utm18.shp")
## OGR data source with driver: ESRI Shapefile 
## Source: "./data/NY8_utm18.shp", layer: "NY8_utm18"
## with 281 features
## It has 17 fields
## Observations: 281
## Variables: 17
## $ AREANAME   <fctr> Binghamton city, Binghamton city, Binghamton city,...
## $ AREAKEY    <fctr> 36007000100, 36007000200, 36007000300, 36007000400...
## $ X          <dbl> 4.069397, 4.639371, 5.709063, 7.613831, 7.315968, 8...
## $ Y          <dbl> -67.3533, -66.8619, -66.9775, -65.9958, -67.3183, -...
## $ POP8       <dbl> 3540, 3560, 3739, 2784, 2571, 2729, 3952, 993, 1908...
## $ TRACTCAS   <dbl> 3.08, 4.08, 1.09, 1.07, 3.06, 1.06, 2.09, 0.02, 2.0...
## $ PROPCAS    <dbl> 0.000870, 0.001146, 0.000292, 0.000384, 0.001190, 0...
## $ PCTOWNHOME <dbl> 0.32773109, 0.42682927, 0.33773959, 0.46160483, 0.1...
## $ PCTAGE65P  <dbl> 0.14661017, 0.23511236, 0.13800481, 0.11889368, 0.1...
## $ Z          <dbl> 0.14197, 0.35555, -0.58165, -0.29634, 0.45689, -0.2...
## $ AVGIDIST   <dbl> 0.2373852, 0.2087413, 0.1708548, 0.1406045, 0.15777...
## $ PEXPOSURE  <dbl> 3.167099, 3.038511, 2.838229, 2.643366, 2.758587, 2...
## $ Cases      <dbl> 3.08284, 4.08331, 1.08750, 1.06515, 3.06017, 1.0638...
## $ Xm         <dbl> 4069.397, 4639.371, 5709.063, 7613.831, 7315.968, 8...
## $ Ym         <dbl> -67353.3, -66861.9, -66977.5, -65995.8, -67318.3, -...
## $ Xshift     <dbl> 423391.0, 423961.0, 425030.6, 426935.4, 426637.5, 4...
## $ Yshift     <dbl> 4661502, 4661993, 4661878, 4662859, 4661537, 466192...

Our dependent variable, Z, is the log transformation of the mean counts of leukemia cases by census tract in NY.3 Our covariates of interest (independent variables) include PEXPOSURE or the inverse distance to the closet inactive hazardous waste site, PCTAGE65P or the proportion of people 65 or older, and PCTOWNHOME, or the percentage of people who own their own home. Now it may be reasonable to assume that space/location/distance are driving our observed rates of leukemia, but for now, we’ll assume we only expect to see spatial interactions in the error terms due to the random slicing of our data into census tracts. We’ll relax this when we estimate a spatial lag model below, then we can compare the two!

Anytime you start a regression, or any data analysis for that matter, start by LOOKING at your data. Let’s start by inspecting the relationship between our covariate of interest, distance to the closest TCE location (PEXPOSURE) and rates of leukemia:


p <- ggplot(data = ny@data, aes(x = ny$PEXPOSURE, y = ny$Z, color = ny$AREANAME)) +
  geom_point() +
  theme(legend.position = "none") +
  xlab("Exposure") +
  ylab("Leukemia rates")

ggMarginal(p, type = "histogram")

I don’t see any clear relationship here, but let’s keep playing. As I said above in the Steps section, it’s often a good idea just to start with a simple linear model that doesn’t account for spatial dynamics. You can pull the residuals out of the model (the noise not explained by your model, \(\hat{\epsilon_i} = \hat{y_i} - y\)) and see if there are interesting spatial patterns at play. This would suggest that spatial dynamics explain some of the variation in your data and that you should explore a bit more! Let’s estimate a nice old OLS regression with our variables of interest:

ny.ols <- lm(Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME, data = ny)
## Call:
## lm(formula = Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME, data = ny)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7417 -0.3957 -0.0326  0.3353  4.1398 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.51728    0.15856  -3.262  0.00124 ** 
## PEXPOSURE    0.04884    0.03506   1.393  0.16480    
## PCTAGE65P    3.95089    0.60550   6.525 3.22e-10 ***
## PCTOWNHOME  -0.56004    0.17031  -3.288  0.00114 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 0.6571 on 277 degrees of freedom
## Multiple R-squared:  0.1932, Adjusted R-squared:  0.1844 
## F-statistic:  22.1 on 3 and 277 DF,  p-value: 7.306e-13

lm() takes an equation and a data as it’s argument. You specify the function as INDEPENDENT_VAR ~ DEPENDENT_VAR_1 + DEP_VAR_2, etc. Our results show us the Estimate (think \(\beta\) coefficient), the standard error of that estimate (Std.Error) and whether that estimate is significant (Pr(>|t|)). These results suggest that PEXPOSURE is not significant, but that our other census variables are (home ownership and percent oldies). It’s also useful to glance at our R-squared estimates. \(R^2\) gives us a sense of how well our model fits the data… it tells us how much variation we’ve explained out of the TOTAL variation in the model. A higher \(R^2\) means we’ve explained more of the variation. If you add tons of variables to a regression, you’ll probably explain more of the variation. Adjusted \(R^2\) penalizes you as you add more variables to your model by taking into account the ratio \((N-1)/(N-k-1)\) where \(N\) is the number of observations and \(k\) is the number of predictors in your model. In our case, the fit ain’t stellar, i.e. there are probably other significant factors that explain variation in leukemia rates that we haven’t included in our model. More on \(R^2\) here and here.

What we really care about is our residuals. In classic \(y_i = X_i\beta + \epsilon_i\) regression world, our residuals are caught up in \(\epsilon\) and the residual standard error is measuring the standard deviation of \(\epsilon\). If residual standard error were 0, then our model fits the data perfectly. This never happens… there’s always some noise in our data that we can’t explain, and this is captured in our residuals. Remember that residuals are the distance between the observed values (Bob’s income and education) and the predicted relationship between variables of interest (the estimated linear effect of education on income).

Let’s look at our residuals.4

ny$resid <- residuals(ny.ols) # pull residuals out of OLS object
spplot(ny, "resid") # add resid column to our ny shapefile