Objectives


Pre-lab assignment


My assumptions


Set up

library(tidyverse)
library(ggExtra) # for awesome plots
library(rgdal)   # for reading shapefiles
library(spdep)   # for neighborhoods, weights, and spatial regression
library(spgwr)   # for geographically weighted regression

OLS

Classic linear regression estimates a linear relationship between a dependent variable \(y\) and a set of explanatory variables \(X\):

\[ y_i = X_i \beta + \epsilon_i \] Ordinary least squares (OLS) estimation is referred to as BLUE, the Best Linear Unbiased Estimator. If we are interested in estimating the effect of education (\(x_1\)) on income (\(y\)), we obtain our OLS estimate of \(\beta\) in the equation above by minimizing the sum of the squared prediction errors. Say what? Let’s visualize this:

Say you’re interested in predicting the effect of education on income because you want to know why all the PhDs make the BIG BUCKS (sarcasm). You go out and collect information on folks’ income and education levels and plot them on the nice plot above. OLS estimates the line that minimizes the sum of the squared prediction errors, i.e. the squared vertical distances between the line and each point. Each distance is the difference between an observed data point in our sample (Bob’s income and education) and our predicted estimate of the population relationship (the average effect of education on income for white male individuals).1 In this figure, the line represents the predicted effect of education on income. The slope of this predicted line is our \(\beta\) coefficient in the equation above. The fit of our modeled effect depends on how far the predicted effect is from each actual observation.
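To make this concrete, here’s a minimal sketch with simulated (entirely hypothetical) income and education data; the slope lm() returns is the OLS estimate of \(\beta\), and resid() gives the vertical distances described above.

set.seed(1)
education <- rnorm(100, mean = 14, sd = 2)          # hypothetical years of schooling
income <- 20 + 3 * education + rnorm(100, sd = 10)  # hypothetical income; true slope is 3
fit <- lm(income ~ education)
coef(fit)["education"]  # OLS estimate of beta, the slope of the predicted-effect line
head(resid(fit))        # prediction errors: observed income minus the fitted line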

If the key assumptions required to use this BLUE approach are true, then we can make statistical inferences about the entire population based on the estimated \(\beta\) coefficient. These assumptions include:

  1. Random errors have a mean of zero (i.e. no systematic bias or mis-specification in the population regression equation);
  2. The random errors have a constant variance (i.e. homoskedastic) and are uncorrelated;
  3. The random errors have a normal distribution (a quick sketch of eyeballing these assumptions on a fitted model follows below).
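Here’s a minimal sketch (on simulated, non-spatial data) of how you might eyeball these assumptions using base R’s regression diagnostics:

set.seed(42)
x <- rnorm(200)
y <- 1 + 2 * x + rnorm(200)  # errors are i.i.d. normal by construction here
fit <- lm(y ~ x)
par(mfrow = c(1, 2))
plot(fit, which = 1)  # residuals vs. fitted: mean around zero and constant variance?
plot(fit, which = 2)  # normal Q-Q plot: roughly normal residuals?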

Do these assumptions hold with spatial data?

When a value in one location depends on the values of its neighbors, our errors are no longer uncorrelated and may not have a normal distribution. Depending on the nature of the spatial dependence, OLS will be either inefficient with incorrect standard errors or biased and inconsistent. The take home message? OLS tends to break down in the face of spatial dependence…

We’ve already talked a lot about spatial dependence, but let’s take a second to think through why spatial dependence might occur. First, our actual data collection process may induce measurement error. The spatial units at which the data is collected may not reflect the underlying process (i.e. population sliced at Census tract boundaries or yield sliced at county levels). Second, our variable of interest may have an important spatial dynamic, i.e. location and position may drive the observed values of the process of interest.

The best visualization of these different types of spatial dependence I have found comes from this overview of spatial regression. First, let’s look at a situation in which our error is correlated across space, henceforth referred to as a SPATIAL ERROR MODEL:

Here, the subscripts \(i\) and \(j\) refer to two different locations in space, for example county \(i\) and county \(j\). Say we are looking at how county-level average education influences county-level average income. We may not expect location and distance to influence income and education.2 We will, however, expect that the way in which counties are sliced could mean that neighboring counties are more similar than distant counties, and that this similarity may influence our income/education estimates. In this case, the ERROR in our model is correlated. This means we violate our OLS BLUE assumption that our error terms are uncorrelated, so our OLS estimates will be inefficient and their standard errors wrong. These spatially correlated errors are indicative of omitted (spatially correlated) covariates. We can specify how counties neighbor one another (you know how to do this from last week) and explicitly estimate this spatial structure in the model to control for correlation in the error term.

The other type of spatial dependence is a bit more complex. In this case, the value of our outcome variable at position \(i\) is influenced by the value of our outcome variable at position \(j\). We also expect that values of our covariates at location \(i\) influence values at location \(j\). In addition, we anticipate that our errors are correlated. Basically, this is a case in which you feel fairly confident that location/distance really drive the values of the variables you’re interested in… perhaps events in one place predict an increased likelihood of similar events in neighboring places (spatial diffusion). We’ll refer to this type of model as a SPATIAL LAG MODEL, though you may also hear it referred to as a spatial auto-regressive model in the literature.

Formally, if we write the standard OLS as follows:

\[ y_i = X_i \beta + \epsilon_i \]

Then the spatial error model and spatial lag models differ in the way we define our error term \(\epsilon_i\). For spatial error models, we define this term as \[\epsilon_i = \lambda w_i \epsilon_i + u_i\] and for spatial lag models we define this term as \[\epsilon_i = \rho w_i y_j + u_i\] I’ll explain these key differences below.


Steps

Before I dive into the differences between spatial error models and spatial lag models, I want to provide a brief overview of the steps you should take when modeling your data with a spatial regression. Before you ever do anything with any data, you should tidy and visualize your data to understand/identify outliers and missing values, and to familiarize yourself with your data. Plot your data and look for signs of spatial patterns. Test for spatial autocorrelation in your variables of interest (moran.test()). If you feel like your data has spatial dependence, then I recommend working through the following steps:

  1. Inspect your data, construct an appropriate spatial neighborhood, and assign weights.
  2. Run an OLS regression with your variables of interest. This regression assumes spatial independence.
  3. Inspect results and plot your residuals.
  4. Compute the Moran’s I value for the residuals using the weights you constructed in Step 1. If you see signs of spatial dependence, continue!
  5. To determine whether to estimate a spatial error model or a spatial lag model, compute the Lagrange Multiplier statistics. If both are significant, go robust. Don’t worry, I explain this below.
  6. Run that spatial regression!

To be really thorough, I would test sensitivity of your major test results and regressions to different neighborhood/weighting assumptions.

I’ll explain each of these steps below. First, I’ll provide a bit more information on the two types of spatial dependence we will explore in this lab.


Spatial error models

Remember that spatial error models assume that only the error terms in the regression are correlated. Let’s look at this another way:

\[ y_i = X_i \beta + \epsilon_i \\ \epsilon_i = \lambda w_i \epsilon_i + u_i\] where \(u_i \sim N(0, \sigma^2 I)\).

Therefore, our final spatial error model can be specified as:

\[ y_i = X_i \beta + \lambda w_i \epsilon_i + u_i \]

Mmmmm, mathy. This is actually pretty easy to understand. In a normal regression we estimate a single error term, \(\epsilon_i\). In a spatial error model, we split this error into random error and spatially structured error. Here, random error (unexplained by our model) is \(u_i\), independent identically distributed (i.i.d.) errors. Our spatially structured error is composed of the spatial error coefficient, \(\lambda\), and our original \(\epsilon\) error term, now weighted by the weight matrix we learned how to build last week, \(w_i\). If there is no spatial correlation between our errors, then \(\lambda = 0\). If \(\lambda \neq 0\), OLS is unbiased and consistent, but our standard errors will be wrong and the betas will be inefficient.
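As a toy illustration (using a made-up four-unit weights matrix, not our data), you can build a spatially structured error from i.i.d. noise by solving the definition above for \(\epsilon\): \(\epsilon = (I - \lambda W)^{-1} u\).

# Toy example: turn i.i.d. noise into spatially structured error (hypothetical 4-unit chain)
W <- matrix(c(0, 1, 0, 0,
              1, 0, 1, 0,
              0, 1, 0, 1,
              0, 0, 1, 0), nrow = 4, byrow = TRUE)
W <- W / rowSums(W)                       # row-standardize the weights
lambda <- 0.6                             # spatial error coefficient
u <- rnorm(4)                             # i.i.d. random error
eps <- solve(diag(4) - lambda * W) %*% u  # solves eps = lambda * W %*% eps + u
eps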

Let’s move on to implementing this in R. A quick note about our data before we begin: Our data comes from the Applied Spatial Data Analysis in R text by Bivand and colleagues. This dataset covers the 281 census tracts of New York.

ny <- readOGR("./data/NY8_utm18.shp")
## OGR data source with driver: ESRI Shapefile 
## Source: "./data/NY8_utm18.shp", layer: "NY8_utm18"
## with 281 features
## It has 17 fields
glimpse(ny@data)
## Observations: 281
## Variables: 17
## $ AREANAME   <fctr> Binghamton city, Binghamton city, Binghamton city,...
## $ AREAKEY    <fctr> 36007000100, 36007000200, 36007000300, 36007000400...
## $ X          <dbl> 4.069397, 4.639371, 5.709063, 7.613831, 7.315968, 8...
## $ Y          <dbl> -67.3533, -66.8619, -66.9775, -65.9958, -67.3183, -...
## $ POP8       <dbl> 3540, 3560, 3739, 2784, 2571, 2729, 3952, 993, 1908...
## $ TRACTCAS   <dbl> 3.08, 4.08, 1.09, 1.07, 3.06, 1.06, 2.09, 0.02, 2.0...
## $ PROPCAS    <dbl> 0.000870, 0.001146, 0.000292, 0.000384, 0.001190, 0...
## $ PCTOWNHOME <dbl> 0.32773109, 0.42682927, 0.33773959, 0.46160483, 0.1...
## $ PCTAGE65P  <dbl> 0.14661017, 0.23511236, 0.13800481, 0.11889368, 0.1...
## $ Z          <dbl> 0.14197, 0.35555, -0.58165, -0.29634, 0.45689, -0.2...
## $ AVGIDIST   <dbl> 0.2373852, 0.2087413, 0.1708548, 0.1406045, 0.15777...
## $ PEXPOSURE  <dbl> 3.167099, 3.038511, 2.838229, 2.643366, 2.758587, 2...
## $ Cases      <dbl> 3.08284, 4.08331, 1.08750, 1.06515, 3.06017, 1.0638...
## $ Xm         <dbl> 4069.397, 4639.371, 5709.063, 7613.831, 7315.968, 8...
## $ Ym         <dbl> -67353.3, -66861.9, -66977.5, -65995.8, -67318.3, -...
## $ Xshift     <dbl> 423391.0, 423961.0, 425030.6, 426935.4, 426637.5, 4...
## $ Yshift     <dbl> 4661502, 4661993, 4661878, 4662859, 4661537, 466192...

Our dependent variable, Z, is the log transformation of the mean counts of leukemia cases by census tract in NY.3 Our covariates of interest (independent variables) include PEXPOSURE, the inverse distance to the closest inactive hazardous waste site; PCTAGE65P, the proportion of people 65 or older; and PCTOWNHOME, the percentage of people who own their own home. Now it may be reasonable to assume that space/location/distance are driving our observed rates of leukemia, but for now, we’ll assume we only expect to see spatial interactions in the error terms due to the random slicing of our data into census tracts. We’ll relax this when we estimate a spatial lag model below, then we can compare the two!
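One quick way to get a numeric feel for these variables is a summary of the relevant columns in the ny object we just loaded:

summary(ny@data[, c("Z", "PEXPOSURE", "PCTAGE65P", "PCTOWNHOME")])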

Anytime you start a regression, or any data analysis for that matter, start by LOOKING at your data. Let’s start by inspecting the relationship between our covariate of interest, distance to the closest TCE location (PEXPOSURE) and rates of leukemia:

library(ggExtra)

p <- ggplot(data = ny@data, aes(x = PEXPOSURE, y = Z, color = AREANAME)) +
  geom_point() +
  theme(legend.position = "none") +
  xlab("Exposure") +
  ylab("Leukemia rates")

ggMarginal(p, type = "histogram")

I don’t see any clear relationship here, but let’s keep playing. As I said above in the Steps section, it’s often a good idea just to start with a simple linear model that doesn’t account for spatial dynamics. You can pull the residuals out of the model (the noise not explained by your model, \(\hat{\epsilon}_i = y_i - \hat{y}_i\)) and see if there are interesting spatial patterns at play. This would suggest that spatial dynamics explain some of the variation in your data and that you should explore a bit more! Let’s estimate a nice old OLS regression with our variables of interest:

ny.ols <- lm(Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME, data = ny)
summary(ny.ols)
## 
## Call:
## lm(formula = Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME, data = ny)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7417 -0.3957 -0.0326  0.3353  4.1398 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.51728    0.15856  -3.262  0.00124 ** 
## PEXPOSURE    0.04884    0.03506   1.393  0.16480    
## PCTAGE65P    3.95089    0.60550   6.525 3.22e-10 ***
## PCTOWNHOME  -0.56004    0.17031  -3.288  0.00114 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6571 on 277 degrees of freedom
## Multiple R-squared:  0.1932, Adjusted R-squared:  0.1844 
## F-statistic:  22.1 on 3 and 277 DF,  p-value: 7.306e-13

lm() takes a formula and a data frame as its arguments. You specify the formula as DEPENDENT_VAR ~ INDEPENDENT_VAR_1 + INDEPENDENT_VAR_2, etc. Our results show us the Estimate (think \(\beta\) coefficient), the standard error of that estimate (Std. Error) and whether that estimate is significant (Pr(>|t|)). These results suggest that PEXPOSURE is not significant, but that our other census variables are (home ownership and percent oldies). It’s also useful to glance at our R-squared estimates. \(R^2\) gives us a sense of how well our model fits the data… it tells us how much variation we’ve explained out of the TOTAL variation in the model. A higher \(R^2\) means we’ve explained more of the variation. If you add tons of variables to a regression, you’ll probably explain more of the variation. Adjusted \(R^2\) penalizes you as you add more variables to your model by taking into account the ratio \((N-1)/(N-k-1)\), where \(N\) is the number of observations and \(k\) is the number of predictors in your model. In our case, the fit ain’t stellar, i.e. there are probably other significant factors that explain variation in leukemia rates that we haven’t included in our model. More on \(R^2\) here and here.
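If you want to see where that penalty comes from, here’s a quick by-hand check of the adjusted \(R^2\) reported above using the ratio from the paragraph:

n <- nobs(ny.ols)                     # 281 observations
k <- length(coef(ny.ols)) - 1         # 3 predictors
r2 <- summary(ny.ols)$r.squared
1 - (1 - r2) * (n - 1) / (n - k - 1)  # should match the Adjusted R-squared from summary()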

What we really care about is our residuals. In classic \(y_i = X_i\beta + \epsilon_i\) regression world, our residuals are caught up in \(\epsilon\) and the residual standard error is measuring the standard deviation of \(\epsilon\). If the residual standard error were 0, our model would fit the data perfectly. This never happens… there’s always some noise in our data that we can’t explain, and this is captured in our residuals. Remember that residuals are the distance between the observed values (Bob’s income and education) and the predicted relationship between variables of interest (the estimated linear effect of education on income).
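And since the residual standard error is just the typical size of those unexplained residuals, you can reproduce the value summary() reports directly:

sqrt(sum(residuals(ny.ols)^2) / df.residual(ny.ols))  # ~0.657, the residual standard error above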

Let’s look at our residuals.4

ny$resid <- residuals(ny.ols) # add the OLS residuals as a column in our ny spatial object
spplot(ny, "resid") # map the residuals

There appears to be some spatial autocorrelation at play, but let’s test this formally. To do this, we will compute the Moran’s I for the residuals in the model. We first have to construct our neighborhood and weights (see last tutorial). I am assuming a KNN neighborhood where the four nearest neighbors are relevant.

library(spdep)
id <- row.names(as(ny, "data.frame"))
coords <- coordinates(ny)
ny_nb <- knn2nb(knearneigh(coords, k = 4), row.names = id)
plot(ny)
plot(ny_nb, coords, add=T, main = "KNN = 4 Neighborhood")

We then assign weights to our neighborhood and use lm.morantest() to test the spatial autocorrelation in our detrended data (i.e. regression residuals):

ny_w <- nb2listw(ny_nb, style = "B", zero.policy = T)
lm.morantest(ny.ols, ny_w)
## 
##  Global Moran I for regression residuals
## 
## data:  
## model: lm(formula = Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME, data =
## ny)
## weights: ny_w
## 
## Moran I statistic standard deviate = 2.1191, p-value = 0.01704
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##      0.069602396     -0.010104423      0.001414785

HOUSTON, WE HAVE SPATIAL AUTOCORRELATION. Note here that we are NOT using the moran.test() function, but a special function intended to test for spatial autocorrelation in regression residuals, lm.morantest(). This test takes into account the fact that the variable under consideration is a residual, computed from a regression; the usual Moran’s I test statistic does not. Since our p-value is small (0.017), we can reject our null of randomly distributed data and assume we’re dealing with spatially autocorrelated residuals. What this really means is that if we include spatial dependence in our model, we can remove spatial noise from our estimate of the effect of proximity to hazardous sites on leukemia rates. Important. Let’s do it. We’ll start with our simple spatial error model in which only the error terms are correlated. We can do this with the errorsarlm() function:

ny.sem <- errorsarlm(formula = Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME, data = ny, listw = ny_w, quiet = T)
summary(ny.sem)
## 
## Call:errorsarlm(formula = Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME, 
##     data = ny, listw = ny_w, quiet = T)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -1.618805 -0.387857 -0.024901  0.368261  3.991433 
## 
## Type: error 
## Coefficients: (asymptotic standard errors) 
##              Estimate Std. Error z value  Pr(>|z|)
## (Intercept) -0.553553   0.172557 -3.2079  0.001337
## PEXPOSURE    0.049443   0.041132  1.2020  0.229345
## PCTAGE65P    3.853170   0.618974  6.2251 4.813e-10
## PCTOWNHOME  -0.475158   0.184211 -2.5794  0.009897
## 
## Lambda: 0.041922, LR test value: 3.3325, p-value: 0.067925
## Asymptotic standard error: 0.021396
##     z-value: 1.9594, p-value: 0.050071
## Wald statistic: 3.8391, p-value: 0.050071
## 
## Log likelihood: -277.0626 for error model
## ML residual variance (sigma squared): 0.4186, (sigma: 0.64699)
## Number of observations: 281 
## Number of parameters estimated: 6 
## AIC: 566.13, (AIC for lm: 567.46)

We don’t see any changes in the direction of our estimated effects, but we do see changes in the magnitude. Notice that Lambda (\(\lambda\)), which is indicative of spatial autocorrelation, is marginally significant (p ≈ 0.05). Let’s visualize the trend component of our model and the stochastic component (residuals):

ny$fitted_sem <- ny.sem$fitted.values
spplot(ny, "fitted_sem", main = "Trend")

The trend shows the tract-specific fitted values, i.e. the leukemia rates (Z) predicted by the covariates in the model. Now for the residuals…

ny$resid_sem <- ny.sem$residuals
spplot(ny, "resid_sem", main = "Residuals")

The residuals show us the variation in leukemia rates (Z) that we haven’t accounted for in our model. Visualizing these residuals can help us inspect spatial patterns in this unexplained variation and may give us ideas about other (spatially-structured) predictors to include in our model!
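One informal follow-up: lm.morantest() is designed for OLS residuals, so as a rough exploratory check (not a formal test) you can run a plain Moran’s I test on the spatial error model residuals with the same weights to see whether much spatial structure remains:

moran.test(ny$resid_sem, ny_w)  # rough check for leftover autocorrelation in the SEM residuals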


Spatial lag model (SAR)

The second type of spatial dependence is a bit more complex. In this case, we believe that the values of \(y\) in one unit \(i\) are directly influenced by the values of \(y\) found in \(i\)'s neighbors.5 We’ll refer to this type of model as a SPATIAL LAG MODEL.

Formally:

\[ y_i = X_i \beta + \epsilon_i \\ \epsilon_i = \rho w_i y_j + u_i \]

Therefore, our final spatial lag model can be specified as:

\[ y_i = X_i \beta + \rho w_i y_j + u_i \] where a positive value of \(\rho\) indicates that counties are expected to have higher rates of leukemia if, on average, their neighbors have higher leukemia values. Our \(\beta\) coefficient is different than in classic OLS in that we are estimating the effect of exposure on leukemia while controlling for the extent to which variation in a county’s level of leukemia can be explained by the rate of leukemia in connected counties. \(w_i y_j\) is the spatially lagged dependent variable for weights matrix \(w_i\), \(X_i\) is a matrix of observations on the explanatory variables, \(u_i\) is a vector of error terms, and \(\rho\) is the spatial coefficient. If there is no spatial dependence, and \(y\) does not depend on neighboring \(y\) values, then \(\rho = 0\). Rho (\(\rho\)) reflects the spatial dependence inherent in our sample data by measuring the average influence on observations of their neighboring observations. If \(\rho\) is significant, OLS is biased and inconsistent (so don’t use it!). Assuming we’ve already confirmed that we have residual autocorrelation after specifying a simple OLS (lm.morantest() and visualizations of residuals) and that our variables of interest are spatially autocorrelated, we can incorporate this spatial dependence in our previous OLS model as follows:

ny.lag <- lagsarlm(formula = Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME, data = ny, listw = ny_w, zero.policy = T)
summary(ny.lag)
## 
## Call:
## lagsarlm(formula = Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME, data = ny, 
##     listw = ny_w, zero.policy = T)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.55786 -0.38297 -0.01927  0.36193  3.95890 
## 
## Type: lag 
## Coefficients: (asymptotic standard errors) 
##              Estimate Std. Error z value  Pr(>|z|)
## (Intercept) -0.468433   0.157234 -2.9792  0.002890
## PEXPOSURE    0.030857   0.034826  0.8860  0.375597
## PCTAGE65P    3.701524   0.601607  6.1527 7.616e-10
## PCTOWNHOME  -0.470071   0.168432 -2.7909  0.005257
## 
## Rho: 0.04704, LR test value: 5.5728, p-value: 0.018242
## Asymptotic standard error: 0.019371
##     z-value: 2.4283, p-value: 0.015168
## Wald statistic: 5.8969, p-value: 0.015168
## 
## Log likelihood: -275.9424 for lag model
## ML residual variance (sigma squared): 0.41473, (sigma: 0.64399)
## Number of observations: 281 
## Number of parameters estimated: 6 
## AIC: 563.88, (AIC for lm: 567.46)
## LM test for residual autocorrelation
## test value: 1.267, p-value: 0.26033

As expected, our spatial autoregressive parameter (\(\rho\), Rho) is significant (p ≈ 0.015). Our test for residual autocorrelation is not significant, which suggests that we cannot reject the null of randomly distributed residuals. Let’s inspect our residuals:

ny$lag_resid <- residuals(ny.lag)
spplot(ny, "lag_resid")

It’s always worth restructuring your weights matrix to see how residual dependence changes. Let’s create a row-standardized queen neighborhood and re-estimate our model:

queen <- poly2nb(ny)
ny_qw <- nb2listw(queen) # remember this defaults to row-standardized weights

Now let’s re-estimate our model:

ny.lag.queen <- lagsarlm(formula = Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME, data = ny, listw = ny_qw, zero.policy = T)
summary(ny.lag.queen)
## 
## Call:
## lagsarlm(formula = Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME, data = ny, 
##     listw = ny_qw, zero.policy = T)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -1.623321 -0.393140 -0.013717  0.346620  4.056860 
## 
## Type: lag 
## Coefficients: (asymptotic standard errors) 
##              Estimate Std. Error z value  Pr(>|z|)
## (Intercept) -0.500508   0.155993 -3.2085  0.001334
## PEXPOSURE    0.043560   0.034486  1.2631  0.206547
## PCTAGE65P    3.635007   0.598853  6.0700 1.279e-09
## PCTOWNHOME  -0.408772   0.169152 -2.4166  0.015667
## 
## Rho: 0.23013, LR test value: 7.5038, p-value: 0.0061569
## Asymptotic standard error: 0.082188
##     z-value: 2.8001, p-value: 0.0051087
## Wald statistic: 7.8405, p-value: 0.0051087
## 
## Log likelihood: -274.9769 for lag model
## ML residual variance (sigma squared): 0.41046, (sigma: 0.64067)
## Number of observations: 281 
## Number of parameters estimated: 6 
## AIC: 561.95, (AIC for lm: 567.46)
## LM test for residual autocorrelation
## test value: 1.385, p-value: 0.23926

It looks like this model does a bit better, but not much. If you encounter this issue with your data, explore different neighborhood structures (a quick sketch of this follows) and, alas, you may have to explore more complex spatial models than what we’ve explored here.
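Here’s a rough sketch of that kind of check, reusing the coords and id objects from the KNN neighborhood above and refitting the lag model for a few (arbitrarily chosen) values of k:

# Sensitivity of the lag model to the choice of k in the KNN neighborhood
for (k in c(2, 4, 6, 8)) {
  nb_k <- knn2nb(knearneigh(coords, k = k), row.names = id)
  w_k <- nb2listw(nb_k, style = "B", zero.policy = TRUE)
  fit_k <- lagsarlm(Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME, data = ny, listw = w_k, zero.policy = TRUE)
  cat("k =", k, "logLik =", round(as.numeric(logLik(fit_k)), 2), "AIC =", round(AIC(fit_k), 2), "\n")
}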

Interpreting effects in spatial lag models is complicated. A change in a predictor at one location has a direct effect on the outcome at that location as well as indirect effects that spill over through its neighbors. You can read more about how to tease out these effects here.
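A minimal sketch of pulling these apart with spdep’s impacts() method for the lag model we fitted above (treat it as a starting point, not the whole story):

impacts(ny.lag, listw = ny_w)  # direct, indirect (spillover), and total effects implied by the lag model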

Comparing models

You can test for spatial autocorrelation in specific variables using moran.test() and spatial autocorrelation in the residuals of your regression using lm.morantest(). These tests tell you whether spatial autocorrelation is a concern and whether your OLS estimates are garbage. If you find spatial autocorrelation, however, you don’t really know whether you should estimate a spatial error model or a spatial lag model. Lagrange Multiplier tests specify the alternative hypotheses of a spatial lag in the dependent variable and of spatial dependence specifically in the error term (Nice). A test for spatial dependence in linear models!?! Thank you R!

We can use the lm.LMtests() function and use the option test = "all" to test for different types of spatial structure. You need to pass the function a regression object (ny.ols) and a spatial listw weights object (ny_w). This returns the LMerr and LMlag statistics, their robust counterparts RLMerr and RLMlag, and the joint SARMA statistic.

If you look back at our equations, these statistics test for \(\lambda = 0\) (in the case of our spatial error model) and \(\rho = 0\) (in the case of our spatial lag model). We have to use the same weights matrix for these tests as in our residual diagnostics. You can and should test sensitivity of the test to different weights matrices.

LM <- lm.LMtests(ny.ols, ny_w, test = "all")
LM
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME, data =
## ny)
## weights: ny_w
## 
## LMerr = 3.266, df = 1, p-value = 0.07073
## 
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME, data =
## ny)
## weights: ny_w
## 
## LMlag = 5.9832, df = 1, p-value = 0.01444
## 
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME, data =
## ny)
## weights: ny_w
## 
## RLMerr = 1.0467, df = 1, p-value = 0.3063
## 
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME, data =
## ny)
## weights: ny_w
## 
## RLMlag = 3.7639, df = 1, p-value = 0.05237
## 
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME, data =
## ny)
## weights: ny_w
## 
## SARMA = 7.0299, df = 2, p-value = 0.02975

Here LMlag refers to our spatial lag model and LMerr refers to our spatial error model. Since LMerr and LMlag both show evidence of spatial dependence (LMlag at the 5% level, LMerr at the 10% level), we need to look at their robust counterparts. These “robust” tests are robust to the presence of the other type of autocorrelation: RLMlag tests for a spatial lag while allowing for error dependence, and RLMerr does the reverse. Their outputs are the RLMlag and RLMerr results above. The SARMA statistic tests for both forms of dependence jointly and equals LMerr + RLMlag.6

# As noted in footnote 6, sacsarlm() fits a combined model with both a spatial lag and a spatially autocorrelated error
ny.sac <- sacsarlm(formula = Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME, data = ny, listw = ny_w, zero.policy = T)


The robust tests point toward the lag specification: RLMlag is marginally significant while RLMerr is not, so we should run a spatial lag model on our data. What does this mean in the real world? That there is spatial autocorrelation between leukemia rates in neighboring census tracts in the state of New York. If you’re interested in learning more about Lagrange Multiplier tests, I recommend this.

Luc Anselin (2005) developed a fantastic visualization of this process, which you can read all about in this extremely detailed overview of spatial regression in R.

Another common way to compare model fit and make a decision about which model to use is to compare the log likelihood of the models you estimate. You can think of this statistic as the likelihood of the parameters given the data. Look back at the summary() outputs. You’ll see that the log likelihood is reported with each test. We want to maximize the log likelihood, so go with the model with the highest value. As you test different weight matrices, it’s good practice to compare how the log likelihood changes.
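Assuming the four models fitted above are still in your workspace, a quick side-by-side comparison might look like this:

sapply(list(OLS = ny.ols, SEM = ny.sem, LAG = ny.lag, LAG_queen = ny.lag.queen),
       function(m) as.numeric(logLik(m)))  # higher log likelihood is better
AIC(ny.ols, ny.sem, ny.lag, ny.lag.queen)  # lower AIC is better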

Geographic weighted regression

Another nice exploratory tool is geographic weighted regression. This is good for exploring spatial heterogeneity in the relationships between variables… or, in other words, whether your effect (\(\beta\)) varies across space.

\[ y_i = \sum_{j} \beta_j (u_i, v_i) x_{ij} + \epsilon_i \]

where our simple \(\beta\) coefficient is replaced with functions \(\beta_j(.,.)\) of location \((u,v)\). Think of this as estimating the relationship between \(y_i\) and our data \(x_{ij}\) within a circle centered on location \((u,v)\) with radius \(r\). Our findings are sensitive to this radius \(r\), also known as the bandwidth. The spgwr package includes a function called gwr.sel() that computes a cross-validation (CV) score of prediction error for GWRs with different bandwidths. The function spits out the bandwidth minimizing this score (i.e. the best-fit model).

library(spgwr)
bw <- gwr.sel(Z ~ PEXPOSURE, data = ny)
## Bandwidth: 68843.15 CV score: 148.1085 
## Bandwidth: 111279.3 CV score: 147.9664 
## Bandwidth: 137506.4 CV score: 147.9317 
## Bandwidth: 146795.2 CV score: 147.9234 
## Bandwidth: 159456.3 CV score: 147.9143 
## Bandwidth: 167281.4 CV score: 147.9096 
## Bandwidth: 172117.5 CV score: 147.907 
## Bandwidth: 175106.4 CV score: 147.9055 
## Bandwidth: 176953.7 CV score: 147.9046 
## Bandwidth: 178095.3 CV score: 147.9041 
## Bandwidth: 178800.9 CV score: 147.9038 
## Bandwidth: 179237 CV score: 147.9036 
## Bandwidth: 179506.5 CV score: 147.9034 
## Bandwidth: 179673 CV score: 147.9034 
## Bandwidth: 179776 CV score: 147.9033 
## Bandwidth: 179839.6 CV score: 147.9033 
## Bandwidth: 179878.9 CV score: 147.9033 
## Bandwidth: 179903.2 CV score: 147.9033 
## Bandwidth: 179918.3 CV score: 147.9033 
## Bandwidth: 179927.5 CV score: 147.9033 
## Bandwidth: 179933.3 CV score: 147.9033 
## Bandwidth: 179936.8 CV score: 147.9033 
## Bandwidth: 179939 CV score: 147.9033 
## Bandwidth: 179940.4 CV score: 147.9033 
## Bandwidth: 179941.2 CV score: 147.9033 
## Bandwidth: 179941.7 CV score: 147.9033 
## Bandwidth: 179942 CV score: 147.9033 
## Bandwidth: 179942.2 CV score: 147.9033 
## Bandwidth: 179942.4 CV score: 147.9033 
## Bandwidth: 179942.4 CV score: 147.9033 
## Bandwidth: 179942.5 CV score: 147.9033 
## Bandwidth: 179942.5 CV score: 147.9032 
## Bandwidth: 179942.5 CV score: 147.9032 
## Bandwidth: 179942.5 CV score: 147.9032 
## Bandwidth: 179942.5 CV score: 147.9032 
## Bandwidth: 179942.5 CV score: 147.9032 
## Bandwidth: 179942.6 CV score: 147.9032 
## Bandwidth: 179942.6 CV score: 147.9032
bw
## [1] 179942.6
ny.gwr <- gwr(Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME, data = ny, bandwidth = bw)
ny.gwr
## Call:
## gwr(formula = Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME, data = ny, 
##     bandwidth = bw)
## Kernel function: gwr.Gauss 
## Fixed bandwidth: 179942.6 
## Summary of GWR coefficient estimates at data points:
##                   Min.   1st Qu.    Median   3rd Qu.      Max.  Global
## X.Intercept. -0.522172 -0.520740 -0.520154 -0.514439 -0.511092 -0.5173
## PEXPOSURE     0.047176  0.048032  0.049527  0.049722  0.050477  0.0488
## PCTAGE65P     3.911526  3.933832  3.959192  3.962334  3.979552  3.9509
## PCTOWNHOME   -0.559358 -0.557968 -0.557682 -0.555498 -0.554563 -0.5600

We can attach our estimated coefficients to our original ny dataset to visualize how the relationship between exposure and leukemia varies across space:

ny$gwr_exp_coeff <- ny.gwr$SDF$PEXPOSURE
spplot(ny, "gwr_exp_coeff")

There appears to be an interesting spatial gradient here. What about our other variables?

ny$gwr_th_coeff <- ny.gwr$SDF$PCTOWNHOME
spplot(ny, "gwr_th_coeff")

ny$gwr_pop_coeff <- ny.gwr$SDF$PCTAGE65P
spplot(ny, "gwr_pop_coeff")

Note that this approach does NOT account for the spatial structure of the data. It’s really just like a bunch of mini-regressions within a circle of bandwidth \(r\). This is why I recommend using GWR only as an exploratory tool.

If you’re interested in learning more about geographic weighted regression, see here and here.

Additional resources


  1. It’s very important to think about the population your sample actually represents. If you collect data from a bunch of old white dudes and make statements about the average effect of education on income for the entire population of the U.S., an army of women and minorities will prove you wrong.

  2. I.e. if you live in a valley versus mountain you make more money or are better educated, or if you live near highly educated people you’re also likely to be highly educated. In actuality, these assumptions may be valid. Think about how elevation affects house size and income along the Wasatch Front! And think about how folks with similar socioeconomic and educational (even political!) backgrounds tend to cluster in the same neighborhoods!

  3. \(Z_i = \frac{1000(Y_i + 1)}{n_i}\)

  4. Be careful when adding residuals back to the data.frame - if you have NAs in any of the rows for the variables included in the regression, these rows will be skipped, so the residuals(ols) will no longer align with the original object. You can avoid this by filtering out NAs prior to regression, but be sure to think deeply about WHY the NAs exist and what the implications of their elimination may be.

  5. Here we assume you’re working with a continuous variable. If you have binary dependent variables, things get more complex… Ward and Gleditsch 2002

  6. If you want to explore this, you may want to look at the sacsarlm() tool that lets you model both spatially autocorrelated error and spatial lag in one model.