Objectives


Pre-lab assignment


My assumptions


Set up

library(tidyverse)
library(rgdal)
library(spdep)
library(spdplyr)
library(stringr)
library(RColorBrewer)
library(gstat)

Our data

Today we’ll be working with a dataset near and dear to my heart… corn yield! Sound super boring? Well, let me take a second to discuss the importance of corn in this fine nation of ours…

The US is the world’s leading producer of corn. Most of the corn we grow is used to feed livestock and produce ethanol. Lots of corn shows up in our food, even when we don’t think of our food as containing corn (high fructose corn syrup, anyone?). Corn is also the main grain used to feed livestock in the U.S. That burger you’re eating? Corn. Those chicken nuggets? Corn. Corn and all of the resources (water, energy, land, labor) required to grow that corn. More than 90 MILLION acres are cultivated with corn (out of a total of 1.9 billion acres in the U.S.). Take a second to ponder the environmental impacts of all that corn. Ready to drop everything to study corn? Yeah, I know.

We’ll be working with data from the USDA’s National Agricultural Statistics Service (NASS). Click the footnote to learn how to download this data on your own.2

corn <- read.csv("./data/iowa.csv")
glimpse(corn)
## Observations: 99
## Variables: 21
## $ Program          <fctr> SURVEY, SURVEY, SURVEY, SURVEY, SURVEY, SURV...
## $ Year             <int> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 201...
## $ Period           <fctr> YEAR, YEAR, YEAR, YEAR, YEAR, YEAR, YEAR, YE...
## $ Week.Ending      <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ Geo.Level        <fctr> COUNTY, COUNTY, COUNTY, COUNTY, COUNTY, COUN...
## $ State            <fctr> IOWA, IOWA, IOWA, IOWA, IOWA, IOWA, IOWA, IO...
## $ State.ANSI       <int> 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 1...
## $ Ag.District      <fctr> CENTRAL, CENTRAL, CENTRAL, CENTRAL, CENTRAL,...
## $ Ag.District.Code <int> 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 5...
## $ County           <fctr> BOONE, DALLAS, GRUNDY, HAMILTON, HARDIN, JAS...
## $ County.ANSI      <int> 15, 49, 75, 79, 83, 99, 127, 153, 157, 169, 1...
## $ Zip.Code         <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ Region           <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ watershed_code   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Watershed        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ Commodity        <fctr> CORN, CORN, CORN, CORN, CORN, CORN, CORN, CO...
## $ Data.Item        <fctr> CORN, GRAIN - YIELD, MEASURED IN BU / ACRE, ...
## $ Domain           <fctr> TOTAL, TOTAL, TOTAL, TOTAL, TOTAL, TOTAL, TO...
## $ Domain.Category  <fctr> NOT SPECIFIED, NOT SPECIFIED, NOT SPECIFIED,...
## $ Value            <dbl> 208.4, 204.0, 198.5, 209.0, 208.0, 214.7, 211...
## $ CV....           <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...

First, I’ll use dplyr to clean things up a bit. I downloaded only data for CORN, GRAIN - YIELD, MEASURED IN BU/ACRE, so I am going to drop the Data.Item column that tells me this is what I downloaded and rename the Value column (which actually holds the yield data) to Yield. I’m also going to drop the Program variable (all observations are from the survey), Year (all data is from 2016), and the Week.Ending and CV variables (all values are NA). It’s a bummer, but for whatever reason, the Watershed and watershed_code data is not available (though we now know how to use a spatial join to link watersheds to counties), so I’ll drop those too… basically I’m keeping the actual yield data plus all of the info we could use to geolocate it.

corn <- corn %>% select(State, State.ANSI, Ag.District, Ag.District.Code, County, County.ANSI, Yield = Value) 
glimpse(corn)
## Observations: 99
## Variables: 7
## $ State            <fctr> IOWA, IOWA, IOWA, IOWA, IOWA, IOWA, IOWA, IO...
## $ State.ANSI       <int> 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 1...
## $ Ag.District      <fctr> CENTRAL, CENTRAL, CENTRAL, CENTRAL, CENTRAL,...
## $ Ag.District.Code <int> 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 5...
## $ County           <fctr> BOONE, DALLAS, GRUNDY, HAMILTON, HARDIN, JAS...
## $ County.ANSI      <int> 15, 49, 75, 79, 83, 99, 127, 153, 157, 169, 1...
## $ Yield            <dbl> 208.4, 204.0, 198.5, 209.0, 208.0, 214.7, 211...

A first interesting question is whether there’s significant variation in yields across counties. One way to quickly visualize this is by building a histogram:

hist(corn$Yield, 100, xlab = "Corn yields (bu/ac)", main="")
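To put rough numbers on that spread, here’s a quick check in base R (nothing assumed beyond the corn data loaded above):

summary(corn$Yield)  # five-number summary of county-level yields (bu/ac)
sd(corn$Yield)       # standard deviation of yields across counties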

We know that yields vary across space (and across time, but that’s for another day). Therefore, our first challenge is to visualize this spatial variation by making this dataset spatial. I know this data is at the county level and I have some attributes describing counties (County, County.ANSI, State, State.ANSI). ANSI stands for the American National Standards Institute; ANSI codes are used across government institutions to help standardize data. What does this mean for us? That we should be able to join our data to a U.S. Census County shapefile! I just happen to have one here (minus counties outside of the contiguous US, sorry AK and HI, you guys are just too far away… and I’m not sure Alaskan corn is a thing). I’ll start by filtering the shapefile down to Iowa:

library(spdplyr)
library(rgdal)
cty <- readOGR("./data/cb_2015_us_county_500k.shp")
## OGR data source with driver: ESRI Shapefile 
## Source: "./data/cb_2015_us_county_500k.shp", layer: "cb_2015_us_county_500k"
## with 3108 features
## It has 11 fields
## Integer64 fields read as strings:  ALAND AWATER
cty <- cty %>% filter(STATEFP == 19)
plot(cty)

I know, from looking up the FIPS code for Iowa, that the values of State.ANSI and County.ANSI in the USDA dataset are the same as the STATEFP and COUNTYFP in the cty dataset. Hooray! We can join the data! The last issue is that the COUNTYFP and STATEFP variables in our cty shapefile have leading zeros (i.e. 015 instead of 15). Let’s add leading zeros to the State.ANSI and County.ANSI columns in our corn data using the str_pad() function in the stringr package:

library(stringr)
corn$State.ANSI <- str_pad(corn$State.ANSI, 2, pad="0")
corn$County.ANSI <- str_pad(corn$County.ANSI, 3, pad = "0")

Now let’s join our corny data to the county shapefile. Remember that some polygons in this shapefile will not have data. If we do this join right, these values will be listed as NA. First, I’m going to concatenate our state and county identifiers into a single ID that is unique to each county (otherwise County.ANSI has multiple instances of the same value across different states).

corn$ID <- paste0(corn$State.ANSI, corn$County.ANSI)
cty@data$ID <- paste0(cty@data$STATEFP, cty@data$COUNTYFP)
head(corn$ID)
## [1] "19015" "19049" "19075" "19079" "19083" "19099"
head(cty@data$ID)
## [1] "19041" "19045" "19111" "19135" "19027" "19075"

I’ll also take a quick peek at how many shared values there are across the two datasets:

length(intersect(cty@data$ID, corn$ID))
## [1] 99

99… out of 99 counties. YES! Let’s merge these suckers:

cty_corn <- merge(cty, corn, by.x = "ID", by.y = "ID", all.x=T)

Now this new shapefile contains all of our data:

glimpse(cty_corn@data)
## Observations: 99
## Variables: 18
## $ ID               <chr> "19041", "19045", "19111", "19135", "19027", ...
## $ STATEFP          <fctr> 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, ...
## $ COUNTYFP         <fctr> 041, 045, 111, 135, 027, 075, 137, 157, 163,...
## $ COUNTYNS         <fctr> 00465625, 00465211, 00465244, 00465256, 0046...
## $ AFFGEOID         <fctr> 0500000US19041, 0500000US19045, 0500000US191...
## $ GEOID            <fctr> 19041, 19045, 19111, 19135, 19027, 19075, 19...
## $ NAME             <fctr> Clay, Clinton, Lee, Monroe, Carroll, Grundy,...
## $ LSAD             <fctr> 06, 06, 06, 06, 06, 06, 06, 06, 06, 06, 06, ...
## $ ALAND            <fctr> 1469139469, 1799830897, 1340377291, 11233172...
## $ AWATER           <fctr> 13866649, 39514681, 55290807, 1522790, 21453...
## $ AREA_SQKM        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ State            <fctr> IOWA, IOWA, IOWA, IOWA, IOWA, IOWA, IOWA, IO...
## $ State.ANSI       <chr> "19", "19", "19", "19", "19", "19", "19", "19...
## $ Ag.District      <fctr> NORTHWEST, EAST CENTRAL, SOUTHEAST, SOUTH CE...
## $ Ag.District.Code <int> 10, 60, 90, 80, 40, 50, 70, 50, 60, 10, 20, 7...
## $ County           <fctr> CLAY, CLINTON, LEE, MONROE, CARROLL, GRUNDY,...
## $ County.ANSI      <chr> "041", "045", "111", "135", "027", "075", "13...
## $ Yield            <dbl> 202.0, 213.8, 194.6, 180.0, 203.1, 198.5, 185...

FINALLY! LET’S PLOT THIS:

library(RColorBrewer)
my.palette <- brewer.pal(n = 9, name = "YlGn") 
spplot(cty_corn, "Yield", col.regions = my.palette, cuts = 8, col = "gray", main = "Corn yields", xlab = "", ylab ="")

Yeah, I was fancy and made my own color palette with the RColorBrewer package. I used the all-powerful spplot() function to quickly plot a choropleth map of this stuff. If you weren’t convinced that there are significant variations in yield (bushels/acre) across space, I hope this convinces you. What might be driving these variations? Do you see any obvious spatial patterns?

Let’s talk about polygons…

Polygon attributes are often aggregations of finer resolution data at a coarser scale, e.g. averaging of field-scale corn yields at the county scale or summing household population to the county scale. This aggregation and averaging process often adds error to our measurements. In addition, the boundaries of our polygons significantly impact our observation of the process we are interested in, and oftentimes these boundaries are fairly arbitrary. If you take a walk to the end of Cache County, Utah and cross into a neighboring county, do you expect to see any sudden changes? Probably not. What about if you’re on the run from the cops, heading north to Canada, and hoping for some indication of when and where you cross the US-Canada border? Well, lucky for you that certain parts of the border make this easy (see real picture below - kind of crazy). But in other parts, you’re just running through fairly homogeneous forest. What’s my point here? Boundaries and borders are quite arbitrary but often quite powerful.

MAUP

I haven’t been to Iowa, but if it’s anything like other parts of the Midwestern Corn-belt, it should look something like this:

Sea of corn. Amazing what we humans can do to the landscape! Think about this landscape. Now think about the data. Kind of arbitrary to measure yield at the county level, right? We could also measure yield at the plant level, field level, farm level, city level, state level, agro-eco region level, or watershed level. As I said above, counties are pretty random lines drawn across an otherwise continuous landscape. Of course, counties are an important scale to consider because many human dynamics vary at this level… markets, policies, schools, etc. But still pretty random at the end of the day.3

The point here is that the scale at which you measure things MATTERS and CHANGES the patterns you are interested in.4 The modifiable areal unit problem (MAUP) essentially states that aggregation causes bias… said differently, areal units are a problem if their modification could create different results. A map showing dis-aggregated population estimates at a census block level will look much different than a map showing population at a state level. If we are able to design our study to match areal entities, we can reduce boundary-bias. In most cases, however, we have to use the arbitrary boundaries at which our data was collected. When this is the case, we have to think a lot about the spatial autocorrelation between neighboring entities and explicitly control for this in our analyses. What is spatial autocorrelation? Since corn yield doesn’t magically manifest at the county level, we might imagine that corn yields in surrounding counties are NOT independent. Farmers in one county may share techniques with those in neighboring counties. All farmers in a state may receive certain subsidies or extension support. Farmer A in County X may be the cousin of Farmer B in County Y and they influence each other’s decision making. You get the idea. Spatial connections and relationships often eliminate independence in the data, and when our observations are not independent, we have to treat our data differently.
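As a quick illustration of how the unit of aggregation changes what we see (a minimal sketch using only the Ag.District column already in our corn data), compare the spread of yields across the 99 counties to the spread once yields are averaged up to USDA agricultural districts:

# Spread of yields at the county scale
sd(corn$Yield)

# Average yields up to the agricultural district scale and look at the spread again
district_yields <- corn %>% 
  group_by(Ag.District) %>% 
  summarise(Yield = mean(Yield))
sd(district_yields$Yield)  # typically much smaller: aggregation smooths away variation

Same underlying corn, different areal units, different story.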

In what follows we are going to learn how to compute and control for spatial autocorrelation. The first step in computing spatial autocorrelation is defining how your polygons are related in space using a neighborhood. I’ll start by explaining how to build neighborhoods in R and common types of neighborhoods, then move on to the actual computation of spatial autocorrelation.


Neighboring objects and spatial weights

Just about any spatial analysis you conduct with polygon data will require the construction of a spatial weights matrix. This matrix basically describes how polygons are connected in space. Any analysis of areal data is crucially dependent on the assumptions implicit in the construction of your spatial weights matrix; therefore, we’ll spend a bit of time reviewing common approaches to constructing weight matrices. Building a weights matrix has two steps: (1) define your neighborhood and (2) assign weights to neighbors.

Defining a neighborhood

Neighborhoods describe how polygons in a shapefile are connected. We can define neighborhoods in a number of ways:

Image from this website.


We can think of neighborhoods in terms of contiguity (shared boundary), distance (distance or K-nearest neighbors), or other weights (social distance, distance decay functions). The way we formalize neighbors will significantly impact our models, so choose wisely!

We can feed a polygon object (like our cty shapefile) to the poly2nb() function in the spdep package to construct a new nb neighborhood object, which is a list of the neighbors of each polygon in our shapefile. As the figure above indicates, we can define what constitutes a neighbor in a number of ways:

queen <- poly2nb(cty) 
rook <- poly2nb(cty, queen = F)

How are these different?

coords <- coordinates(cty)  # coordinates for polygon centroids
plot(cty)
plot(queen, coords, add=T, main = "Queen")

plot(cty)
plot(rook, coords, add=T, main = "Rook")

If you look closely, you’ll see that connected counties are different in the rook and queen neighbor objects.

We can also choose the \(k\) nearest polygons as neighbors:

id <- row.names(as(cty, "data.frame"))
kn1 <- knn2nb(knearneigh(coords, k = 1), row.names = id)
kn2 <- knn2nb(knearneigh(coords, k = 2), row.names = id)
kn3 <- knn2nb(knearneigh(coords, k = 3), row.names = id)

plot(cty)
plot(kn1, coords, add=T, main = "KNN = 1")

plot(cty)
plot(kn2, coords, add=T, main = "KNN = 2")

plot(cty)
plot(kn3, coords, add=T, main = "KNN = 3")

There are many other ways to build neighborhood structures. The neighborhood you construct largely depends on your data and your underlying hypotheses about how your data relates spatially.
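One other common option (a sketch for illustration, not used in the rest of this lab) is a distance band, which treats any pair of polygons whose centroids fall within a threshold distance as neighbors. Here I’m assuming the county shapefile is unprojected longitude/latitude, so longlat = TRUE makes dnearneigh() measure distances in kilometers; the 50 km cutoff is arbitrary:

# Distance-band neighbors: counties whose centroids are within 50 km of each other
dist_nb <- dnearneigh(coords, d1 = 0, d2 = 50, row.names = id, longlat = TRUE)

plot(cty)
plot(dist_nb, coords, add = T, main = "Distance band = 50 km")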

Assigning weights to the neighborhoods

The objects we visualized above are of class nb (neighbor). We can apply standard functions like print(), summary(), or plot() to these objects to interrogate their contents. These objects list the sets of neighbors for our study area.
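For example, a quick look at the queen contiguity object we built above:

summary(queen)  # number of regions, total links, and the distribution of neighbor counts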

Once we figure out who would like to be our neighbor, we need to assign weights to our list of neighbors. This basically transforms our list of neighbors into a matrix. This becomes important as we move towards calculating spatial autocorrelation and conducting spatial regression. When we assign weights, we can either assign binary (1-0) weights or more continuous weights (first neighbor 0.3, second neighbor 0.2, third 0.1, etc.). Binary weights are easy to understand: neighboring counties have a weight of 1, non-neighbors zero. Imagine we have a very simple shapefile:

With binary weights, neighbors (1,2 and 2,3) are given weights of one. Non-neighbors (1,3) are given weights of zero.

If we didn’t want to go binary, we could also assign a value of 1 to immediate neighbors and a value of 0.5 to neighbors one step further away (neighbors of neighbors). Note, however, that this approach increases the influence of links from observations with few neighbors, whereas binary weights tend to overemphasize observations with many neighbors.

We could use any function we deem theoretically relevant (inverse distance weighting, decay functions, etc.) to weight our neighborhood structure. In cases where features have an unequal number of neighbors, it’s important to use row standardization, i.e. divide each neighbor weight by the sum of ALL neighbor weights in that row:
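In symbols (using the same \(w_{ij}\) notation as the Moran’s I formula below), the row-standardized weight is

\[ w_{ij}^{std} = \frac{w_{ij}}{\sum_j w_{ij}} \]

so that the weights in each row of the matrix sum to one.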

If we know little about the assumed spatial process, it’s a good idea to stick with a fairly simple weighting system (e.g. binary). In R, we assign these weights using the nb2listw() function:

rstd <- nb2listw(rook)
print(rstd)
## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 99 
## Number of nonzero links: 444 
## Percentage nonzero weights: 4.53015 
## Average number of links: 4.484848 
## 
## Weights style: W 
## Weights constants summary:
##    n   nn S0      S1       S2
## W 99 9801 99 46.6781 399.6922

Here, we’re taking our rook contiguity nb object, rook, and transforming it into a listw weights object. The default conversion is to standardize the weights for each areal entity so they sum to 1 (i.e. the default is row standardization). When we print(rstd), we can read through information about the characteristics of the underlying weight structure. We can also extract the underlying nb object using rstd$neighbours.

We can check that our weights list is row standardized by quickly summing the weights for each region and confirming that every row sums to one:

summary(sapply(rstd$weights, sum))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1       1       1       1       1       1

We can also look at the distribution of our weights as follows:

summary(unlist(rstd$weights))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1429  0.1667  0.2000  0.2230  0.2500  0.5000

If you prefer a binary weights matrix, perhaps because you know very little about the spatial structure of your data, you can add style = B:

bin <- nb2listw(rook, style='B')

Another commonly-used weighting approach is inverse distance weighting. This assigns weights proportional to the inverse distance between points representing our polygons. We can use nbdists() to compute the distances between neighboring centroids for a given nb object. We can then use lapply() to invert these distances.

dists <- nbdists(rook, coordinates(cty))
idw <- lapply(dists, function(x) 1/(x/1000))

We can then use this idw list as our list of weights in the nb2listw() function:

idwlist <- nb2listw(rook, glist = idw)
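As a quick sanity check (nothing assumed beyond the idwlist object just built), we can peek at the distribution of the resulting weights the same way we did for rstd; since the default style is still "W", the inverse-distance weights end up row standardized within each county:

summary(unlist(idwlist$weights))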

Spatial autocorrelation

All of our work building weights matrices was necessary to compute spatial autocorrelation. We need to tell R how our polygons are connected in space.5 Once we establish spatial connections, we can use these connections to see how related the values of polygons are at different lags, or differences in the location between polygons.

Spatial autocorrelation captures the degree to which near and distant things are related. The idea emerged in the late 1800s when a Brit named Francis Galton started to question the independence of his data. He was interested in examining the correlations between marriage and social complexity. Galton realized that cultures are not independent, and that interactions across cultures prevented these cultures from being independent, in both a statistical and sociocultural sense. This became known as “Galton’s problem” and was later formalized through the concept of spatial autocorrelation.

Why compute spatial autocorrelation in the first place? First, it’s interesting to see how observations of our variable of interest covary as a function of space. Second, one of the big assumptions in regression analysis is that observations in our sample are independent. If you think about all of our spatial data, it describes a process \(Y()\) that varies as a function of location, i.e. \(Y(x, y)\). The locations \((x,y)\) are fixed and not random, but the attribute values \(Y\) are assumed to have a random component (i.e. a quantity that is sampled and whose values are assumed to be distributed according to some probability distribution). Our sample observations are not independent if they are spatially autocorrelated (i.e. closer observations covary more than distant observations). When our observations are spatially autocorrelated, we have to account for this structure in our residuals - we’ll learn about this next week.

We can think about any spatial dataset \(Y(x,y)\) as containing different components: a deterministic trend \(T(x,y)\), a spatially autocorrelated random process \(\eta(x,y)\) and a spatially uncorrelated random process \(\epsilon(x,y)\). Formally, we can think of this as follows:

\[ Y(x,y) = T(x,y) + \eta(x,y) + \epsilon(x,y) \]

In our corn example, general trends \(T(x,y)\) in our yield data may be driven by large-scale variations in temperature, precipitation, and soil quality.6 There are also smaller variations in the data that are spatially explicit \(\eta(x,y)\), or spatially autocorrelated random processes, such as slight topographic changes or clusters of management practices. Finally, there are random variations in the data \(\epsilon(x,y)\) due to measurement error or any unmodeled variation that is not spatially structured.


Global Moran’s I

Testing for spatial autocorrelation is just like any hypothesis test where we test against some null hypothesis. In our case, the null is that the data are randomly assigned to locations. Just like a classic hypothesis test, we need to compute a test statistic (like the t-statistic in a t-test) and the probability (a p-value) of observing a value of this statistic at least as extreme as the one we actually observe. If the probability is low, we reject the null that the data are randomly arranged by location. In this class, we’ll use the Moran’s I statistic to test our hypothesis:

\[ I = \frac{n}{S_0} \frac{\sum_i \sum_j w_{ij} (Y_i - \bar Y)(Y_j - \bar Y)}{\sum_i (Y_i - \bar Y)^2} \]

where \(n\) is the number of spatial units indexed by \(i\) and \(j\), \(Y\) is the variable of interest and \(\bar Y\) is the mean of \(Y\), \(w_{ij}\) is our weights matrix with zeros on the diagonal, and \(S_0\) is the sum of all \(w_{ij}\), or \[ S_0 = \sum_i \sum_j w_{ij} \] Note here that centering on the mean is like asserting that the correct model has a constant mean across space (stationarity) and that any remaining patterning after centering is caused by the spatial relationships in the spatial weights.
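To make the formula concrete, here’s a minimal sketch that computes \(I\) by hand from a dense version of our rstd weights matrix (the names y, W, d, and I_by_hand are just for illustration); the result should match the estimate from moran.test() below:

y  <- cty_corn$Yield
W  <- listw2mat(rstd)       # 99 x 99 row-standardized weights matrix, zeros on the diagonal
d  <- y - mean(y)           # deviations from the mean
S0 <- sum(W)                # sum of all weights
I_by_hand <- (length(y) / S0) * sum(W * outer(d, d)) / sum(d^2)
I_by_hand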

Think of Global Moran’s I as a sort of correlation of a variable with a spatially lagged version of itself.7 It’s fairly simple to test Moran’s I in R. The moran.test() function from the spdep package takes your variable of interest (i.e. yield varying across counties) and your spatial weight object (a listw object generated as described above) and computes Moran’s I:

gmoran <- moran.test(cty_corn$Yield, listw = rstd, randomisation = T)
gmoran
## 
##  Moran I test under randomisation
## 
## data:  cty_corn$Yield  
## weights: rstd  
## 
## Moran I statistic standard deviate = 9.5424, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##       0.630026082      -0.010204082       0.004501538

Remember, for Global Moran’s I, the null hypothesis is that the attribute being analyzed (corn yield in Iowa) is randomly distributed (randomisation = T) among the spatial features in our region of interest. This means that when the p-value returned by our test is significant, we can reject the null of randomly distributed data. Our results suggest that our data is spatially autocorrelated. We can also run the test under the assumption of normality (randomisation = F), i.e. that the attribute values are independent draws from a normal distribution:

gmoran <- moran.test(cty_corn$Yield, listw = rstd, randomisation = F)
gmoran
## 
##  Moran I test under normality
## 
## data:  cty_corn$Yield  
## weights: rstd  
## 
## Moran I statistic standard deviate = 9.4882, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##       0.630026082      -0.010204082       0.004553101

As in classic hypothesis testing, if the p-value is not statistically significant, you can NOT reject the null that your data is the result of a random spatial process. If the p-value is statistically significant, you can reject the null.

The moran.test() function is fine, but a more robust way to assess the significance of Global Moran’s I is Monte Carlo simulation. The Monte Carlo method generates simulated data via random sampling. In this case, our Monte Carlo simulation randomly reassigns the observed values to the polygons in our cty shapefile. In each iteration of the simulation, Moran’s I is computed. This allows us to generate a distribution of expected Moran’s I values under the null of no spatial structure. We can compare these expected values for random data to our actual value to determine whether our data exhibits spatial autocorrelation.

moran_monte_carlo <- moran.mc(cty_corn$Yield, rstd, nsim = 99)
moran_monte_carlo
## 
##  Monte-Carlo simulation of Moran I
## 
## data:  cty_corn$Yield 
## weights: rstd  
## number of simulations + 1: 100 
## 
## statistic = 0.63003, observed rank = 100, p-value = 0.01
## alternative hypothesis: greater

Here we ran 99 simulations (nsim = 99) using our cty_corn$Yield data and our rstd weights object. Our observed statistic (0.63) ranks above all 99 simulated values (observed rank = 100), which is why the reported p-value is 0.01, the smallest value possible with 99 simulations. In other words, none of the random reshufflings produced as much spatial structure as the actual data.
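One way to see what the simulation did (a small sketch; moran.mc() stores the simulated statistics, with the observed value last, in the res element of its result) is to plot the null distribution against our observed Moran’s I:

hist(moran_monte_carlo$res, breaks = 20, xlab = "Simulated Moran's I", main = "")
abline(v = moran_monte_carlo$statistic, col = "red", lwd = 2)  # observed value sits far in the right tail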

Remember that Global Moran’s I is a global test of spatial autocorrelation… a sort of average across multiple distances (or lags) between polygons within the region of interest. It can also be helpful to assess how spatial autocorrelation changes as a function of distance. A correlogram can be used to help us visualize autocorrelation as a function of distance.8 As distance between polygons (x-axis) increases, how does our spatial autocorrelation change (y-axis)? The sp.correlogram function from the spdep package is a great tool to quickly visualize changes in spatial autocorrelation at increasing spatial lags.

cgm <- sp.correlogram(rook, cty_corn$Yield, order = 8, method = "I")
plot(cgm, main = "Correlogram")

The lags here are successive orders of neighbors (neighbors, then neighbors of neighbors, and so on) based on our rook object. We can specify that we want sp.correlogram() to compute Moran’s I at each lag by adding the argument method = "I". We can also compute Geary’s C (another global indicator of spatial autocorrelation) or simple correlation. Like the moran.test() function, this function assumes randomization rather than normality. There are several other packages you can use to compute correlograms. An overview and comparison can be found here.

Another way to visualize the Moran statistic is with a Moran scatterplot. This plot places our variable of interest on the x-axis (yield) and the spatially weighted sum of neighbor values (i.e. spatially lagged values) on the y axis. We can think of the global Moran’s I as a linear relationship between these (drawn as a slope in the figure).

moran.plot(cty_corn$Yield, listw = rstd)

So here we see a positive relationship between yield values and lagged, weighted yield values. moran.plot() calls the influence.measures() function to identify points that have a particularly strong influence on our predicted slope. Those points are indicated with numbers above. Looking at the moran.plot() scatterplot can help us detect strange observations or clusters of strange observations that affect our overall detection of spatial autocorrelation.
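As a check on this interpretation (a small sketch using only objects built above), we can compute the spatially lagged yields ourselves with lag.listw() and confirm that the slope of a simple regression of the lag on the observed values is close to the Moran’s I estimate from earlier:

lag_yield <- lag.listw(rstd, cty_corn$Yield)   # weighted average of each county's neighbors
coef(lm(lag_yield ~ cty_corn$Yield))           # slope should be close to the Moran's I of ~0.63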


Additional resources


  1. These are discussed in great detail on pages 100-104 of Spatial Data Analysis in Ecology and Agriculture Using R by Richard E. Plant.

  2. Go to the NASS QuickStats page, select Sector::Crops, Group::Field Crops, Commodity::Corn, Category::Yield, Data Item::Corn, Grain Yield measured in Bu/Acre, broken down into irrigated and non-irrigated. I like this metric as it seems to capture the productivity of cultivated land. I selected Geographic Level::County, State::Iowa, County::All, Year::2016, Period::Annual.

  3. If you’re interested in how county boundaries have moved through time, check out this COOLEST TOOL EVER. You can also take a second to ponder the political implications of changing county boundaries (ahem, gerrymandering). But that’s another class for another day.

  4. In fact, if you really want to ponder things, MEASUREMENT in general radically changes things, even at a quantum level.

  5. Much of the material in this section is based on an excellent overview provided by Richard Plant in the Spatial Data Analysis in Ecology and Agriculture Using R text.

  6. We can and should detrend our data prior to estimating Moran’s I. We’ll learn about how to do this next week, but for now, know that you can estimate Moran’s I on the residuals of a linear regression using lm.morantest().

  7. You can also compute bivariate Moran’s statistics, i.e. the correlation of a lagged variable \(y\) with another variable \(x\).

  8. Similar to a variogram which we’ll learn about in future classes.