Objectives


Pre-lab assignments


Set-up

library(gstat)
library(tidyverse)
library(sp)
library(raster)

Our data

I’m using the precipitation data featured in the RSpatial pre-lab reading assignment. You can find the dataset here.

precip <- read.table("./data/precip.txt", sep = ",", header=T)
glimpse(precip)
## Observations: 456
## Variables: 17
## $ ID   <fctr> ID741, ID743, ID744, ID753, ID754, ID758, ID760, ID762, ...
## $ NAME <fctr> DEATH VALLEY, THERMAL/FAA AIRPORT, BRAWLEY 2SW, IMPERIAL...
## $ LAT  <dbl> 36.47, 33.63, 32.96, 32.83, 33.28, 32.82, 32.76, 33.74, 3...
## $ LONG <dbl> -116.87, -116.17, -115.55, -115.57, -115.51, -115.67, -11...
## $ ALT  <int> -59, -34, -31, -18, -18, -13, -9, -6, 2, 2, 3, 3, 3, 3, 4...
## $ JAN  <dbl> 7.4, 9.2, 11.3, 10.6, 9.0, 9.8, 9.0, 16.7, 106.3, 89.5, 6...
## $ FEB  <dbl> 9.5, 6.9, 8.3, 7.0, 8.0, 1.6, 7.0, 12.1, 107.1, 88.3, 65....
## $ MAR  <dbl> 7.5, 7.9, 7.6, 6.1, 9.0, 3.7, 5.0, 9.2, 72.9, 72.4, 49.6,...
## $ APR  <dbl> 3.4, 1.8, 2.0, 2.5, 3.0, 3.0, 1.0, 2.2, 32.1, 30.1, 22.5,...
## $ MAY  <dbl> 1.7, 1.6, 0.8, 0.2, 0.0, 0.4, 1.0, 1.3, 7.6, 2.0, 4.3, 1....
## $ JUN  <dbl> 1.0, 0.4, 0.1, 0.0, 1.0, 0.0, 0.0, 0.2, 2.2, 1.1, 1.3, 0....
## $ JUL  <dbl> 3.7, 1.9, 1.9, 2.4, 8.0, 3.0, 2.0, 3.5, 0.6, 0.6, 0.4, 3....
## $ AUG  <dbl> 2.8, 3.4, 9.2, 2.6, 9.0, 10.8, 9.0, 6.5, 0.6, 0.5, 1.7, 7...
## $ SEP  <dbl> 4.3, 5.3, 6.5, 8.3, 7.0, 0.2, 8.0, 6.4, 5.3, 1.4, 6.2, 8....
## $ OCT  <dbl> 2.2, 2.0, 5.0, 5.4, 8.0, 0.0, 8.0, 3.8, 11.4, 11.6, 6.0, ...
## $ NOV  <dbl> 4.7, 6.3, 4.8, 7.7, 7.0, 3.3, 7.0, 7.3, 47.8, 45.3, 33.6,...
## $ DEC  <dbl> 3.9, 5.5, 9.7, 7.3, 9.0, 1.4, 11.0, 7.4, 63.7, 58.0, 42.5...

Our dataset includes:

  • ID and NAME: the weather station identifier and name
  • LAT and LONG: the station coordinates (decimal degrees)
  • ALT: the station altitude (meters)
  • JAN through DEC: monthly precipitation totals (millimeters)

Let’s inspect this dataset by visualizing total annual precipitation across stations:

p <- precip %>% mutate(ANNUAL = rowSums(.[6:17])) %>% arrange(ANNUAL)
plot(p$ANNUAL, xlab = "Station (ranked by annual precipitation)", ylab = "Annual precipitation (mm)")

The bulk of our weather stations show annual precipitation under 500 millimeters (~20 inches in a year! And CA is where most of our food is grown… !?), with a few stations receiving higher precipitation. Seen another way:

hist(p$ANNUAL, 100)

It may also be interesting to look at how altitude and precipitation relate… here’s a quick visualization of the relationship between the two:

plot(p$ALT, p$ANNUAL, xlab = "Altitude (m above SL)", ylab = "Precipitation (mm)")

Huh, a bit of a mess. Let’s see if we can visually detect spatial patterns in our data. First, I coerce our data.frame into a SpatialPointsDataFrame, then I visualize annual precipitation variations across stations using spplot() (yep, you can also use this great function for point data!):

psp <- SpatialPoints(cbind(p$LONG, p$LAT), proj4string=CRS("+proj=longlat +datum=NAD83")) # proj info taken from RSpatial website
psp <- SpatialPointsDataFrame(psp, p)
psp <- spTransform(psp, CRS("+proj=aea +lat_1=34 +lat_2=40.5 +lat_0=0 +lon_0=-120 +x_0=0 +y_0=-4000000 +ellps=GRS80 +datum=NAD83 +units=m +no_defs"))
spplot(psp, "ANNUAL")

It does look like there are clear spatial patterns in precipitation across CA. Given what you know about the state, does this make sense? Imagine that from this collection of point estimates you want to generate a gridded dataset that estimates precipitation in the areas between our weather stations. We can use the spatial relatedness of our data to interpolate to a grid and do just that! We’ll start with a simple, fairly ad hoc approach called inverse distance weighted (IDW) interpolation and then focus on the art of kriging.


Inverse distance weighted (IDW) interpolation

So far, we’ve thought of spatial autocorrelation as a bad thing… a nuisance to be extracted and controlled for in our models. But since our spatial autocorrelation estimates tell us the extent to which observation values are related in space, we can also use spatial autocorrelation to predict values at locations where no measurements have been made. This process is generally known as interpolation.

You might be thinking “well why can’t we just use the kernel density estimates we learned about during our point-pattern analysis overview?” Good question. In the case of our CA data, if we ran a kernel density estimate on the points, we would be estimating the density of weather stations in CA. Instead, what we want to do is interpolate the values of precipitation at these stations, i.e. estimate the values of precipitation in areas where precipitation could have been measured but was not. The interpolation techniques we discuss in this tutorial are great ways to do just that!

We’ll start with one of the simpler approaches to interpolation: inverse distance weighted interpolation. This approach assumes that things that are close together are more similar than things that are far apart. IDW estimates values at unmeasured locations using the values of the nearest surrounding measured locations, where each measured point has an influence on the estimate that diminishes as a function of distance:

\[ \hat{Y}(x) = \sum_{i=1}^{n} \phi_i Y(x_i), \\ \phi_i = \frac{d(x, x_i)^{-p}}{\sum_{i=1}^n d(x, x_i)^{-p}} \]

The two parameters of interest are \(p\) and \(n\). \(n\) is the number of neighbors considered in the IDW interpolation; the larger \(n\), the more measured values contribute to each estimate. Increasing \(p\), or the power, reduces the influence of more distant neighbors relative to closer ones. Most software uses a default \(p\) value of \(2\).
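
To make the weighting formula concrete, here’s a quick sketch that computes the IDW weights by hand for a single unmeasured point with three neighbors (the distances and values here are invented for illustration; pwr plays the role of \(p\)):

# Made-up example: three neighbors at distances d from the point we want
# to estimate, with observed precipitation values y (mm)
d <- c(10, 20, 40)
y <- c(300, 200, 100)
pwr <- 2

w <- d^(-pwr) / sum(d^(-pwr))  # the weights phi_i; they sum to 1
round(w, 3)
sum(w * y)                     # the IDW estimate at the unmeasured point

With pwr = 2, the nearest neighbor gets about 76% of the weight; raise the power and that share grows, making the interpolated surface more local.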

The first thing we’ll need for our IDW interpolation is a grid on which to interpolate the data.

ca <- readRDS("./data/counties.rds")
ca <- spTransform(ca, CRS("+proj=aea +lat_1=34 +lat_2=40.5 +lat_0=0 +lon_0=-120 +x_0=0 +y_0=-4000000 +ellps=GRS80 +datum=NAD83 +units=m +no_defs"))
grd <- SpatialPixels(SpatialPoints(makegrid(ca, n=2900)), proj4string = CRS(proj4string(ca)))
plot(grd)

A grid! You can play with the size of each cell in the grid by adjusting n, the (approximate) number of cells makegrid() aims for. Let’s subset this to the extent of the ca shapefile so we don’t interpolate where we have no data:

grd <- grd[ca,] 
plot(grd)

Note that this is NOT a raster object, but a sp SpatialPixels object. Our kriging and interpolation functions (in the gstat package) expect both the observations (precipitation) and the grid to which we’ll interpolate those observations (grd) to be sp objects. We can access the resolution and projection of our grd SpatialPixels object as follows:

grd@grid@cellsize
##    x1    x2 
## 18000 18000
grd@proj4string
## CRS arguments:
##  +proj=aea +lat_1=34 +lat_2=40.5 +lat_0=0 +lon_0=-120 +x_0=0
## +y_0=-4000000 +ellps=GRS80 +datum=NAD83 +units=m +no_defs
## +towgs84=0,0,0

Here, we see that our cells are 18000 meters wide, or 18 km. This seems like a reasonable resolution for precipitation estimates. If you happen to have a raster with the specific extent and resolution on which you’d like to interpolate, you can also simply transform that raster into a SpatialGrid object like so: new_grid <- as(old_raster, "SpatialGrid"). If these two approaches don’t work for you, you can read more about creating grids here.
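
For instance, here’s a minimal sketch of that raster route; the template raster is hypothetical, standing in for whatever raster you already have:

library(raster)
# Hypothetical template raster: CA's extent at 18 km resolution in our CRS
old_raster <- raster(extent(ca), res = 18000, crs = proj4string(ca))
new_grid <- as(old_raster, "SpatialGrid")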

Once we’ve built the grid, the interpolation is fairly straightforward:

p.idw <- idw(formula = ANNUAL ~ 1, psp, grd,  idp = 2, nmax = 12)
## [inverse distance weighted interpolation]
plot(p.idw)

You set the values of \(p\) and \(n\) with the idp and nmax arguments respectively. Here, I’ve used fairly standard values of 2 and 12, but in your analyses, test the sensitivity of your results to changing values of \(p\) and \(n\). I have also specified the formula ANNUAL ~ 1, which tells R to model annual precipitation with an intercept only, i.e. to interpolate the values using nothing but the spatial configuration of the observations. Finally, I fed the idw() function our data (psp) and our grid (grd). The function returns a SpatialPixelsDataFrame with our interpolated results!
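
One way to run that sensitivity test is a quick loop over candidate powers; here’s a sketch (the object names are just for illustration):

# Re-run the interpolation for a few powers and eyeball the differences:
# higher idp values produce patchier, more local surfaces
for (p_val in c(1, 2, 4)) {
  fit <- idw(ANNUAL ~ 1, psp, grd, idp = p_val, nmax = 12)
  print(spplot(fit, "var1.pred", main = paste("IDW with idp =", p_val)))
}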


Kriging

IDW interpolates based on our \(p\) and \(n\) specifications, applied uniformly across the region of interest. Kriging instead fits a variogram to compute the weights of neighboring points based on the distribution of the observed values, i.e. kriging lets the localized pattern of points drive the interpolation process rather than applying a single set of uniform assumptions to the whole region.

Kriging (pronounced “kree-ging”) is a family of generalized linear regression techniques in which the value of a property at an unsampled location is estimated from values at neighboring locations. The technique is named for the South African mining engineer D.G. Krige, who developed it.

Sample variogram

To perform kriging, you first calculate a sample variogram, or the variogram constructed from your data. Then you find a model variogram that best fits your empirical, sample variogram. Wait, but what’s a variogram? A variogram visualizes spatial relatedness as a function of lag, or \(h\): the separation distance between the locations of two measurements. This should remind you of the correlograms we computed a few weeks ago. The two are very similar, but unlike in a correlogram, high values on the y-axis indicate low spatial autocorrelation, while low values indicate high spatial autocorrelation.

We will initially assume our precipitation data are isotropic, meaning our variogram properties are independent of direction. We will then look at anisotropy in our data, or whether our variogram properties depend on both lag (\(h\)) and direction.

Assume the process that generated our sample is a random function \(Z(s)\) composed of a mean and a residual, \(Z(s) = m + e(s)\), with a constant mean \(E(Z(s)) = m\). In this case, the variogram is defined as:

\[ \gamma(h) = \frac{1}{2} E\left[(Z(s) - Z(s+h))^2\right] \]

Under the assumption listed above, the variance of \(Z\) is constant and the spatial correlation of \(Z\) does not depend on location \(s\), only on separation distance \(h\). A wider class of models is obtained when the mean varies spatially and can, for example, be modeled as a function of known predictors \(X_j(s)\), as in:

\[ Z(s) = \sum_{j=0}^p X_j (s) \beta_j + e(s) = X\beta + e(s) \]

where the \(X_j(s)\) are spatial regressors and the \(\beta_j\) are unknown regression coefficients. For varying-mean models, the stationarity properties refer to the residual \(e(s)\), and the sample variogram needs to be computed from estimated residuals. In this lab, we’ll continue with the simpler model in which the mean does not vary spatially.
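
Though we won’t pursue it here, gstat expresses varying-mean models through the formula interface: put the predictors on the right-hand side and the sample variogram is computed from the regression residuals. A minimal sketch using station altitude (ALT, a column already in our data):

# Sample variogram of the residuals from a mean that varies with altitude;
# gstat regresses ANNUAL on ALT internally and variograms the residuals
vg_alt <- variogram(ANNUAL ~ ALT, psp)
plot(vg_alt)

Note that kriging with a model like this would also require ALT values at every cell of the prediction grid.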

The variogram() function from gstat computes the sample variogram: all possible squared differences of observation pairs, \((Z(s_i) - Z(s_j))^2\), binned and averaged by their separation distance \(h_{ij}\).

vg <- variogram(ANNUAL ~ 1, psp)
plot(vg)

If you’re worried about outlier observations, add cressie=T to the variogram() call; this computes Cressie’s robust estimate of the sample variogram:

p.vgm <- variogram(ANNUAL ~ 1, psp, cressie=T)
plot(p.vgm)

You can interpret a variogram a bit like a correlogram, only the values of semivariance are “flipped”. Here, higher values of semivariance indicate that observations are less related, so, as we would expect, semivariance increases as distance increases. When you look at your variogram, you should look for the following:

  • The nugget is the semivariance value at which your variogram intersects the y-axis, i.e. the apparent semivariance at a lag of zero.
  • The sill and partial sill represent the limits of the variogram. Think of the sill as the \(y\) value at which the variogram starts to plateau; the partial sill is the sill minus the nugget.
  • The range is the distance at which the variogram approaches the sill value. (See the sketch after this list for one way to read rough values off your sample variogram.)
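
The object returned by variogram() is just a data.frame of binned distances (dist) and semivariances (gamma), so you can pull rough starting values straight from it. A sketch of one crude heuristic (this is eyeballing in code, not a gstat feature):

# Crude starting values read off the sample variogram:
# sill ~ the plateau, nugget ~ the shortest-lag semivariance,
# range ~ the lag at which the semivariance peaks
sill_guess   <- max(vg$gamma)
nugget_guess <- vg$gamma[1]
range_guess  <- vg$dist[which.max(vg$gamma)]
c(sill = sill_guess, nugget = nugget_guess, range = range_guess)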

The variogram() function makes a number of default decisions, notably that point pairs are merged on the basis of distance, not direction. You can look at the variogram across different cardinal directions as follows:

vgm.dir <- variogram(ANNUAL ~ 1, data = psp, alpha = c(0, 45, 90, 135))
plot(vgm.dir)

This assigns point pairs to panels by their principal direction, e.g. any point pair oriented between 22.5 and 67.5 degrees is used for the 45-degree panel. You can pass finer subdivisions to the alpha argument. You can also change the default cutoff distance, the maximum separation distance at which the variogram is estimated. gstat uses a default of 1/3 of the largest diagonal of the bounding box, so if you’re working with big datasets, it can be helpful to specify a cutoff:

p.vgm.cutoff <- variogram(ANNUAL ~ 1, psp, cutoff = 100000)
plot(p.vgm.cutoff)

Note that the distances are in METERS, the units of our data’s projection:

psp@proj4string
## CRS arguments:
##  +proj=aea +lat_1=34 +lat_2=40.5 +lat_0=0 +lon_0=-120 +x_0=0
## +y_0=-4000000 +ellps=GRS80 +datum=NAD83 +units=m +no_defs
## +towgs84=0,0,0

Model variogram

Once we’ve used variogram() to visualize the variogram constructed from our actual observations, we need to fit a model to this sample variogram which we will then use to krige. There are a number of types of model variograms available in gstat:

show.vgms()

and

vgm()
##    short                                      long
## 1    Nug                              Nug (nugget)
## 2    Exp                         Exp (exponential)
## 3    Sph                           Sph (spherical)
## 4    Gau                            Gau (gaussian)
## 5    Exc        Exclass (Exponential class/stable)
## 6    Mat                              Mat (Matern)
## 7    Ste Mat (Matern, M. Stein's parameterization)
## 8    Cir                            Cir (circular)
## 9    Lin                              Lin (linear)
## 10   Bes                              Bes (bessel)
## 11   Pen                      Pen (pentaspherical)
## 12   Per                            Per (periodic)
## 13   Wav                                Wav (wave)
## 14   Hol                                Hol (hole)
## 15   Log                         Log (logarithmic)
## 16   Pow                               Pow (power)
## 17   Spl                              Spl (spline)
## 18   Leg                            Leg (Legendre)
## 19   Err                   Err (Measurement error)
## 20   Int                           Int (Intercept)

A good first step is to look at your sample variogram and decide which model shape looks like a reasonable match. I’m going to start with a spherical model, Sph:

# fit model to sample
p.fit <- fit.variogram(p.vgm, model = vgm(psill = 200000, "Sph", range = 400000, nugget = 50000))
plot(p.vgm, pch = 20, cex = 1.5, col = "black", ylab = "Semivariance", xlab = "Distance (m)", model = p.fit)

fit.variogram() takes your sample variogram and a vgm() object that you construct. The vgm() call should include estimates of the psill (partial sill), the range, and the nugget, and these can and should come from the plots of your sample variogram. Hmmm, this fit is ok, but it looks like the spherical shape may not be best. Let’s try a Matern model:

p.fit <- fit.variogram(p.vgm, model = vgm(psill = 200000, "Mat", range = 400000, nugget = 50000))
plot(p.vgm, pch = 20, cex = 1.5, col = "black", ylab = "Semivariance", xlab = "Distance (m)", model = p.fit)

Let’s try a different parameterization of a Matern model, Ste:

p.fit <- fit.variogram(p.vgm, model = vgm(psill = 200000, "Ste", range = 400000, nugget = 25000))
plot(p.vgm, pch = 20, cex = 1.5, col = "black", ylab = "Semivariance", xlab = "Distance (m)", model = p.fit)

These Matern models seem to be doing a good job. I’ll use the last p.fit object we created for kriging! When you’re fitting your model variogram, play with the different model specifications until the fit looks appropriate for your data (the sketch below shows one way to compare fits numerically). You can change the model form (e.g. Ste versus Mat) as well as the psill, range, and nugget parameters based on what your data actually look like.
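
If you’d like a number to back up the eyeball test, fit.variogram() attaches the weighted sum of squared errors of its fit as the "SSErr" attribute; here’s a sketch comparing our three candidate shapes on that basis (starting values as in the fits above):

# Compare candidate model shapes by the weighted SSE of each fit
# (smaller is better)
candidates <- c("Sph", "Mat", "Ste")
sapply(candidates, function(m) {
  fit <- fit.variogram(p.vgm,
                       model = vgm(psill = 200000, m, range = 400000, nugget = 25000))
  attr(fit, "SSErr")
})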

Once you’ve built a model variogram you like, you can use the krige() function to do the actual interpolation. krige() requires that you specify the variable of interest you want to model (ANNUAL~1), the original input data (psp), the grid to which you want to krige this data (grd), and the model variogram you created (p.fit).1

p.krig <- krige(ANNUAL ~ 1, psp, grd, model = p.fit)
## [using ordinary kriging]
plot(p.krig)

Nice! We have now used a kriging model to estimate precipitation in areas where it was not measured!

If you plan to krige your own data, here are a few reminders and tips. For ordinary kriging to apply to your data, the following assumptions need to hold:

  1. Observations are a partial realization of a random function \(Z(x)\) where \(x\) denotes spatial location.
  2. That random function is second-order stationary, so the mean, spatial covariance and semivariance don’t depend on the location \(x\).

Point 2 can be tricky, so you may need to detrend your data using regression or other tools PRIOR to kriging.
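
As one sketch of what detrending might look like, you could regress precipitation on the projected coordinates and work with the residuals (whether a simple trend surface like this is appropriate depends on your data):

# Fit a first-order trend surface in the projected coordinates...
psp$X <- coordinates(psp)[, 1]
psp$Y <- coordinates(psp)[, 2]
trend <- lm(ANNUAL ~ X + Y, data = psp@data)

# ...then variogram/krige the detrended residuals instead of the raw values
psp$ANNUAL_resid <- residuals(trend)
plot(variogram(ANNUAL_resid ~ 1, psp))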

A few programming tips if you plan to krige your own data:

  • Check for duplicates in your data! Kriging won’t work if you have multiple observations at the same point (unless you’re doing space-time kriging); see the sketch after this list.
  • HOLD OUT some of your data before you krige to check goodness of fit.
  • Make sure your data are projected.
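
For the duplicate check in the first tip, sp’s zerodist() returns the index pairs of observations that share an identical location; a minimal sketch:

# Each row of zerodist() output is a pair of point indices at distance zero
dupes <- zerodist(psp)
dupes
# If any exist, keep only the first member of each duplicate pair
if (nrow(dupes) > 0) psp <- psp[-dupes[, 2], ]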

Assessing fit

To assess the fit of our interpolated or kriged data, it’s a good idea to “hold out” some of the data prior to running the interpolation/krige. In the case of our data, this means we remove some weather stations from the dataset we use to krige/interpolate and see how well our modeled results predict the known values of precipitation at the held out stations. We can do this first by selecting a random subset of the data:

random_rn <- sample(nrow(psp@data), 50) # randomly pick 50 stations to hold out
psp_HO <- psp[random_rn,]               # held-out stations for validation
psp_sample <- psp[-random_rn,]          # stations used for kriging

Next, we krige with the subset of our data:

p.krig <- krige(ANNUAL ~ 1, psp_sample, grd, model = p.fit)
## [using ordinary kriging]
plot(p.krig)

Next we need to compute the root mean squared error (RMSE), which is computed as follows:

RMSE <- function(observed, predicted) {
  sqrt(mean((predicted - observed)^2, na.rm=TRUE))
}

The RMSE gives us an indication of how far our predicted values are from our actual observed values. To get predicted values at each of our held-out points, let’s extract the values from the kriged dataset at our psp_HO points. Then we simply compare the observed and predicted values using our RMSE function!

library(raster)
# extract the kriged predictions (var1.pred) at the held-out stations
pred_precip <- raster::extract(raster(p.krig), psp_HO, sp=T)

This pred_precip object now contains our original precipitation observations (ANNUAL) and our extracted precipitation predictions at these stations (var1.pred). Let’s compute our RMSE:

rmse_precip <- RMSE(observed = pred_precip$ANNUAL, predicted = pred_precip$var1.pred)
rmse_precip
## [1] 151.4203

This is a fairly high RMSE, which suggests our model isn’t stellar. You can compare RMSE across models to figure out which model best predicts your data; a lower RMSE indicates a better-fitting model. If we compare our model to a null model that just uses mean precipitation to predict precipitation everywhere, you’ll see that we’re doing much better (whew!):

null <- RMSE(observed = pred_precip$ANNUAL, predicted = mean(pred_precip$ANNUAL))
null
## [1] 341.3516
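
One handy way to combine these two numbers is the proportion of the null model’s error that kriging removes; a quick sketch:

# Fraction of the null-model RMSE eliminated by kriging (closer to 1 is better)
1 - rmse_precip / null

With the values above this works out to roughly 0.56, i.e. kriging cuts the naive prediction error roughly in half.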

Additional Resources


  1. We can also use our friend spplot() to visualize both the predicted values and the variance of the predicted values across space; simply call spplot(p.krig).
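
For example, a quick sketch mapping just the variance surface (gstat names the kriged columns var1.pred and var1.var):

# Map the kriging variance; high-variance cells are far from any station
spplot(p.krig, "var1.var", main = "Kriging variance")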