Objectives


Pre-lab assignments


Set up

Yeah, lots of packages this week. I tried to find the most efficient/elegant tools in the universe of spatial packages in R, which led me to several different packages.

library(maptools)
library(maps)
library(rgdal)
library(sp)
library(spdplyr)
library(raster)
library(rgeos)

My assumptions


Our data

Today we’ll be joining data collected at a hydrological scale (watershed) to data collected at an administrative scale (county). We’ll be working with a county-level shapefile of the entire US and with the Watershed Boundary Dataset (WBD). Our objective is to join counties to the larger hydrologic unit of which they are a part. The US Geological Survey (USGS) has tons of hydrological data available for you to play with. The WBD contains the boundaries of USGS “hydrologic units” (HU) at multiple scales including region (HU2), sub-region (HU4), basin (HU6), sub-basin (HU8), watershed (HU10) and sub-watershed (HU12). Each HU has a HU Code or HUC that uniquely identifies that water entity. It neat to think about your location in the world not as a member of a town, county, state, and nation, but also as a member of a region, sub-region, basin, watershed, etc. For example, here in Logan, Utah, we are members of:

If you’re not in Logan, you can find out your hydro-citizenry here. If you want an even more detailed version of your hydro-citizenry, check out this amazing, but 3 gigabyte dataset.

Ok, back to the objective. I have a map of national sub-regions (HU 4). I want to determine which counties are located within each region (HU 2). Luckily, some of the HU 2 data is contained in the HU 4 data set. Let’s begin by looking at our data:

cty <- readRDS("./data/county.RDS")
ws <- readRDS("./data/wdb_sub.RDS")

To help facilitate data transfer, I originally loaded the shapefile (.shp) versions of each of these files into R and then saved it as an RDS file (which preserves the R-specific sp structure of the data) using saveRDS(). This is why you’re able to load this file into R using the readRDS() function.

First, always be sure to check the projections of your data:

cty@proj4string
## CRS arguments:
##  +proj=longlat +datum=NAD83 +no_defs +ellps=GRS80 +towgs84=0,0,0
ws@proj4string
## CRS arguments:
##  +proj=longlat +datum=NAD83 +no_defs +ellps=GRS80 +towgs84=0,0,0

Data is in the same projection - we should be good to go!


Working with sp objects

Before we begin, I want to review a few useful things you can do with sp objects in R. First, it’s fairly easy to create a new shapefile that is subset of the original dataset. Let’s say I want a shapefile with only HU4 watersheds located in our sub-region of the world, the Bear River sub-region. To find these, I need to know that the HU4 ID for the Bear River sub-region is 01 and then I can simply call:

great_basin <- ws[ws$HUC4 == "01",]
class(great_basin)
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"
head(great_basin@data)
##     OBJECTID                                  TNMID METASOURCE SOURCEDATA
## 176      176 {7492C028-E858-42FB-848C-A49CB81CEE4B}       <NA>       <NA>
##     SOURCEORIG SOURCEFEAT   LOADDATE GNIS_ID AREAACRES AREASQKM   STATES
## 176       <NA>       <NA> 2012/06/11       0   4809301 19462.57 ID,UT,WY
##     HUC4 NAME Shape_Leng Shape_Area HUC2
## 176   01 Bear   12.23277   2.111604   16
plot(great_basin)

Or say we want to make a shapefile with only one of the attributes in our larger ws object:

ws_sub <- ws[, 'HUC4']
class(ws_sub)
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"

Again, what’s nice here is that you can apply simple indexing to a fairly complex spatial object to subset based on row and column values. You can also use base R to add a variable to a spatial object:

ws$ID <- seq(1, nrow(ws))
head(ws$ID)
## [1] 1 2 3 4 5 6

Dissolve polygons

The next step is to dissolve the smaller HUC4 polygons into larger HUC2 polygons. To do this, we can use a nifty tool from the maptools package called unionSpatialPolygons() to create a coarser resolution HUC2 shapefile from the higher resolution HUC4 dataset.

# note that the union may take a second to run
library(maptools)
huc2 <- unionSpatialPolygons(SpP = ws, IDs = ws$HUC2)
plot(huc2, border="red")

Wow. This is cool. Notice, however, that the new object we built lost the attributes in our original higher resolution data and is now just a SpatialPolygon. R can handle dissolve polygons, but it doesn’t know how to dissolve the values in the attribute table unless we tell it how to (i.e. sum, mean, or some other function). This is where the aggregate() function from the sp package comes in. I like this function because it allows you to apply a function to the objects you’re aggregating. This would be useful if you wanted to, say, compute the total area of the new HUC2 watershed by summing the area of the HUC4 watersheds dissolved into the larger HUC2 area. Note that whatever function you use in aggregate() must return a single value which will be associated with the larger polygon.

watershed_size <- aggregate(ws["AREAACRES"], huc2, FUN = sum)
watershed_size@data$AREAACRE
## [1] 90699794

Finally, there’s a nifty tool in the rgeos package that lets you dissolve all polygons in a shapefile into a large shapefile (i.e. all counties in the US into the boundary of the US):

library(rgeos)
boundary <- gUnaryUnion(cty)
plot(boundary)


Spatial Joins

In many cases, you have several spatial datasets you want to connect in some way. If you’re lucky, you have unique identifiers in each dataset that are formatted exactly the same and can be used to join the two datasets using only an attribute join and the merge() function (see Attribute Join section below). In many cases, however, you’ll only have the spatial location of the two datasets. Maybe you want to link a point shapefile documenting Big Foot sigtings to a raster describing land use. Maybe you want to do something a bit more mundane like compute the amount of roads within a county. Or perhaps you want to extract remotely sensed metrics of temperature or vegetation health to points in space. You can do this using spatial joins. I’ll walk you through a few examples below, but know that each spatial join is special, i.e. full of it’s own challenges and particularities.1 In general, we’ll be working with the over() function for spatial joins. over(x,y) does the following:

Here, correspondence means that the two spatial features intersect (touch, overlap, cover, include, etc.). If a feature of x contains matches of multiple features (e.g. points) in y, then over(x,y) returns an arbitrary first match. If you want ALL features of y that fall within each shape x, then add over(x, y, returnList = TRUE). You can also use over() to apply a function to all instances of y within each instance of x, i.e. compute the mean population of cities within a county shapefile. You can do this by passing a function to over(), i.e. over(x, y, fun = mean()). Below, I’ll work through some examples.

Extract point data to polygons

Suppose you have point data indicating the location of cities in the US. You’re interested in identifying watersheds with a high number of cities; you hypothesize that human activities in these watersheds may impact water quality and want to test this hypothesis. You start by finding a shapefile of US cities (R and the maps package makes this easy):

library(maps)
data(us.cities) #loads dataset from maps package
head(us.cities)
##         name country.etc    pop   lat    long capital
## 1 Abilene TX          TX 113888 32.45  -99.74       0
## 2   Akron OH          OH 206634 41.08  -81.52       0
## 3 Alameda CA          CA  70069 37.77 -122.26       0
## 4  Albany GA          GA  75510 31.58  -84.18       0
## 5  Albany NY          NY  93576 42.67  -73.80       2
## 6  Albany OR          OR  45535 44.62 -123.09       0

Notice that this is a data.frame and not a sp object. To sp this thing, we need to use the SpatialPointsDataFrame() function and input coordinates, data, and projection information.

coords <- cbind(us.cities$long, us.cities$lat)
us.cities <- SpatialPointsDataFrame(coords = coords, data = us.cities, proj = huc2@proj4string)
# note that I was lazy and did NOT check the datum for us.cities.  You'd want to do this to correctly set the proj parameter
plot(us.cities)

Let’s crop this dataset to only include the cities within the boundaries of our original ws watershed object. First check projections (identical() is a cool function to do that quickly) reproject as necessary, and then crop!

identical(huc2@proj4string, us.cities@proj4string)
## [1] TRUE
city_crop <- crop(us.cities, huc2)
plot(city_crop)
plot(ws, add=T)