Objectives


Pre-lab assignments


Set up

Yeah, lots of packages this week. I tried to find the most efficient/elegant tools in the universe of spatial packages in R, which led me to several different packages.

library(maptools)
library(maps)
library(rgdal)
library(sp)
library(spdplyr)
library(raster)
library(rgeos)

My assumptions


Our data

Today we’ll be joining data collected at a hydrological scale (watershed) to data collected at an administrative scale (county). We’ll be working with a county-level shapefile of the entire US and with the Watershed Boundary Dataset (WBD). Our objective is to join counties to the larger hydrologic unit of which they are a part. The US Geological Survey (USGS) has tons of hydrological data available for you to play with. The WBD contains the boundaries of USGS “hydrologic units” (HU) at multiple scales, including region (HU2), sub-region (HU4), basin (HU6), sub-basin (HU8), watershed (HU10), and sub-watershed (HU12). Each HU has a Hydrologic Unit Code (HUC) that uniquely identifies that water entity. It’s neat to think about your location in the world not only as a member of a town, county, state, and nation, but also as a member of a region, sub-region, basin, watershed, etc. For example, here in Logan, Utah, we are members of region 16 (the Great Basin) and sub-region 1601 (the Bear River), on down to our local watershed and sub-watershed.

If you’re not in Logan, you can find out your hydro-citizenry here. If you want an even more detailed version of your hydro-citizenry, check out this amazing (but 3 gigabyte) dataset.

Ok, back to the objective. I have a map of national sub-regions (HU4). I want to determine which counties are located within each region (HU2). Luckily, some of the HU2 data is contained in the HU4 dataset. Let’s begin by looking at our data:

cty <- readRDS("./data/county.RDS")
ws <- readRDS("./data/wdb_sub.RDS")

To help facilitate data transfer, I originally loaded the shapefile (.shp) versions of each of these files into R and then saved them as RDS files (which preserve the R-specific sp structure of the data) using saveRDS(). This is why you’re able to load these files into R using the readRDS() function.
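For reference, that round trip looks roughly like this (a sketch; the paths are illustrative):

# a sketch of the shapefile-to-RDS round trip (paths are illustrative)
cty_shp <- rgdal::readOGR("./data/county.shp")
saveRDS(cty_shp, "./data/county.RDS")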

First, always be sure to check the projections of your data:

cty@proj4string
## CRS arguments:
##  +proj=longlat +datum=NAD83 +no_defs +ellps=GRS80 +towgs84=0,0,0
ws@proj4string
## CRS arguments:
##  +proj=longlat +datum=NAD83 +no_defs +ellps=GRS80 +towgs84=0,0,0

Both datasets are in the same projection - we should be good to go!
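If they hadn’t matched, the fix is a one-line reprojection before doing anything else. A minimal sketch:

# a sketch: reproject one layer to match the other (only needed if the CRSs differ)
# ws <- spTransform(ws, cty@proj4string)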


Working with sp objects

Before we begin, I want to review a few useful things you can do with sp objects in R. First, it’s fairly easy to create a new shapefile that is a subset of the original dataset. Let’s say I want a shapefile with only the HU4 watersheds located in our sub-region of the world, the Bear River sub-region. To find these, I need to know that the HUC4 ID for the Bear River sub-region is 01, and then I can simply call:

great_basin <- ws[ws$HUC4 == "01",]
class(great_basin)
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"
head(great_basin@data)
##     OBJECTID                                  TNMID METASOURCE SOURCEDATA
## 176      176 {7492C028-E858-42FB-848C-A49CB81CEE4B}       <NA>       <NA>
##     SOURCEORIG SOURCEFEAT   LOADDATE GNIS_ID AREAACRES AREASQKM   STATES
## 176       <NA>       <NA> 2012/06/11       0   4809301 19462.57 ID,UT,WY
##     HUC4 NAME Shape_Leng Shape_Area HUC2
## 176   01 Bear   12.23277   2.111604   16
plot(great_basin)

Or say we want to make a shapefile with only one of the attributes in our larger ws object:

ws_sub <- ws[, 'HUC4']
class(ws_sub)
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"

Again, what’s nice here is that you can apply simple indexing to a fairly complex spatial object to subset based on row and column values. You can also use base R to add a variable to a spatial object:

ws$ID <- seq(1, nrow(ws))
head(ws$ID)
## [1] 1 2 3 4 5 6

Dissolve polygons

The next step is to dissolve the smaller HUC4 polygons into larger HUC2 polygons. To do this, we can use a nifty tool from the maptools package called unionSpatialPolygons() to create a coarser resolution HUC2 shapefile from the higher resolution HUC4 dataset.

# note that the union may take a second to run
library(maptools)
huc2 <- unionSpatialPolygons(SpP = ws, IDs = ws$HUC2)
plot(huc2, border="red")

Wow. This is cool. Notice, however, that the new object we built lost the attributes of our original higher-resolution data and is now just a SpatialPolygons object. R can dissolve the polygons, but it doesn’t know how to dissolve the values in the attribute table unless we tell it how (i.e. sum, mean, or some other function). This is where the aggregate() function from the sp package comes in. I like this function because it allows you to apply a function to the objects you’re aggregating. This would be useful if you wanted to, say, compute the total area of the new HUC2 watershed by summing the areas of the HUC4 watersheds dissolved into the larger HUC2 area. Note that whatever function you use in aggregate() must return a single value, which will be associated with the larger polygon.

watershed_size <- aggregate(ws["AREAACRES"], huc2, FUN = sum)
watershed_size@data$AREAACRES
## [1] 90699794

Finally, there’s a nifty tool in the rgeos package that lets you dissolve all polygons in a shapefile into a single polygon (i.e. all counties in the US into the boundary of the US):

library(rgeos)
boundary <- gUnaryUnion(cty)
plot(boundary)


Spatial Joins

In many cases, you have several spatial datasets you want to connect in some way. If you’re lucky, you have unique identifiers in each dataset that are formatted exactly the same and can be used to join the two datasets using only an attribute join and the merge() function (see the Attribute Joins section below). In many cases, however, you’ll only have the spatial locations of the two datasets. Maybe you want to link a point shapefile documenting Bigfoot sightings to a raster describing land use. Maybe you want to do something a bit more mundane, like compute the total length of roads within a county. Or perhaps you want to extract remotely sensed metrics of temperature or vegetation health to points in space. You can do this using spatial joins. I’ll walk you through a few examples below, but know that each spatial join is special, i.e. full of its own challenges and particularities.[1] In general, we’ll be working with the over() function for spatial joins. over(x, y) retrieves, at the spatial locations of the features in x, the index or attributes of the features in y with which they spatially correspond.

Here, correspondence means that the two spatial features intersect (touch, overlap, cover, include, etc.). If a feature of x matches multiple features (e.g. points) in y, then over(x, y) returns only the first match. If you want ALL features of y that fall within each shape of x, use over(x, y, returnList = TRUE). You can also use over() to apply a function to all instances of y within each instance of x, i.e. compute the mean population of the cities within each county in a county shapefile. You can do this by passing a function to over() via the fn argument, e.g. over(x, y, fn = mean). Below, I’ll work through some examples.

Extract point data to polygons

Suppose you have point data indicating the location of cities in the US. You’re interested in identifying watersheds with a high number of cities; you hypothesize that human activities in these watersheds may impact water quality and want to test this hypothesis. You start by finding a shapefile of US cities (R and the maps package make this easy):

library(maps)
data(us.cities) #loads dataset from maps package
head(us.cities)
##         name country.etc    pop   lat    long capital
## 1 Abilene TX          TX 113888 32.45  -99.74       0
## 2   Akron OH          OH 206634 41.08  -81.52       0
## 3 Alameda CA          CA  70069 37.77 -122.26       0
## 4  Albany GA          GA  75510 31.58  -84.18       0
## 5  Albany NY          NY  93576 42.67  -73.80       2
## 6  Albany OR          OR  45535 44.62 -123.09       0

Notice that this is a data.frame and not a sp object. To sp this thing, we need to use the SpatialPointsDataFrame() function and input coordinates, data, and projection information.

coords <- cbind(us.cities$long, us.cities$lat)
us.cities <- SpatialPointsDataFrame(coords = coords, data = us.cities, proj4string = huc2@proj4string)
# note that I was lazy and did NOT check the datum for us.cities.  You'd want to do this to correctly set the proj4string argument
plot(us.cities)

Let’s crop this dataset to only include the cities within the boundaries of our dissolved huc2 object. First check the projections (identical() is a cool function to do that quickly), reproject as necessary, and then crop!

identical(huc2@proj4string, us.cities@proj4string)
## [1] TRUE
city_crop <- crop(us.cities, huc2)
plot(city_crop)
plot(ws, add=T)

A lot of nothing in the middle of this watershed, huh? But what appears to be the Wasatch Front definitely pops out! It would be helpful to know which cities are in each smaller watershed. To return a list of large cities in each watershed, let’s use our new over() function:

city_list <- over(ws, us.cities, returnList=T)
city_list$`26`
##               name country.etc    pop   lat    long capital
## 144 Carson City NV          NV  58350 39.15 -119.74       2
## 746        Reno NV          NV 206626 39.54 -119.82       0
## 858      Sparks NV          NV  88518 39.54 -119.74       0

This returns a list of the cities in each HUC2 polygon. Even cooler, you can apply functions to all points within a HUC polygon using the fn argument. Want to know the average population within the HUC2 region?

avg_pop <- over(ws, us.cities[,"pop"], fn = mean)
avg_pop
##           pop
## 26  117831.33
## 71   44614.00
## 88         NA
## 143        NA
## 174  89596.55
## 175  45262.00

What about the total population?

total_pop <- over(ws, us.cities[,"pop"], fn = sum)
total_pop
##        pop
## 26  353494
## 71   44614
## 88      NA
## 143     NA
## 174 985562
## 175  45262

You can also write your own crazy functions and apply them to the data!
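For instance, here’s a minimal sketch of a homemade function (my own toy example, not from the lab data): count the cities with more than 100,000 people in each polygon.

# a sketch: count the cities over 100,000 people within each polygon
count_big <- function(x) sum(x > 100000)
n_big <- over(ws, us.cities[, "pop"], fn = count_big)
head(n_big)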

Extract rasters to polygons

Another useful thing to know how to do is to extract data from a raster to a polygon. Let’s say you’re working with some CropScape data from Oconee County, South Carolina (a lovely place, y’all):

oconee <- raster("./data/CDL_2016_45073.tif")
plot(oconee)

…and imagine you’d like to compute the percent of land in the county covered in Open Water (code 111).[2] Using the extract() function in the raster package, you can pull this information out to the county level (impressed?)[3]:

extract <- raster::extract #tell R that when I call extract, I'm talking about the raster function

# extract OC shapefile from the larger county dataset
oconee_shp <- cty[cty$GEOID == "45073",]

# reproject oconee shape to the projection of the CdL data
oconee_shp <- spTransform(oconee_shp, oconee@crs)

# extract all pixels values for this county
all_pixels <- extract(oconee, oconee_shp, na.rm=T, df=T)  # may take a second

So after prepping the data (all functions you should be familiar with by now), I used extract() to extract a list of pixels in the oconee shapefile. I added the na.rm=T argument to drop NA pixels and the df=T argument to return the extracted pixels as a data.frame rather than as a list, which allows me to apply the length() function and indexing to the extracted data. In the case of this raster, extract() returned a data.frame with two columns: an ID column and a column with the actual data called CDL_2016_45073. The ID column can be useful when you’re working with multiband rasters; it lets you apply extract() to multiple layers and keep track of which pixels belong to which band. Let’s find the percent of Oconee County covered in water (code 111 in the CropScape data). First, we have to pull the data out of the extracted data.frame. In this case we don’t need to worry about the ID column since we only have one band of data, but if you had multi-band data you could use filter() to pull out data only for the band of interest to you:

ras_data <- all_pixels$CDL_2016_45073

# find the percent of pixels == 111
length(ras_data[ras_data == 111])/length(ras_data)
## [1] 0.06894612
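Since ras_data is just a numeric vector, base R makes it easy to go one step further and tabulate the share of every land cover class at once (a quick sketch of my own):

# a sketch: the share of each land cover class in the county
class_shares <- sort(table(ras_data) / length(ras_data), decreasing = TRUE)
head(round(class_shares, 3))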

Another more elegant option here is to create a function that computes the share of water pixels and feed that function to extract(). This returns only the percent water, rather than all the pixel values for the county boundary.

# write a function to compute the share of pixels equal to 111 (open water)
water <- function(x, na.rm = TRUE) {
  length(x[x == 111])/length(x)
}
extent_water <- extract(oconee, oconee_shp, fun = water, na.rm=T, df=T)
extent_water
##   ID CDL_2016_45073
## 1  1     0.06894612

You can also extract based on a buffer around a point, which can come in handy. More info on how to do this can be found here.
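The basic pattern looks something like this (a sketch; mypoints is a hypothetical SpatialPoints object in the raster’s CRS, and buffer is in map units, here meters):

# a sketch with a hypothetical `mypoints` object; buffer is in map units (meters)
pt_water <- extract(oconee, mypoints, buffer = 1000, fun = water, na.rm = TRUE)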

A tip here: as you’re extracting rasters, it’s much less computationally intensive to reproject the vector layer you’re extracting to than the raster itself… rasters are often much bigger and take much longer to reproject.
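Concretely, both directions are possible; a hedged sketch (note that for categorical data like the CDL you’d want method = "ngb", since the default bilinear interpolation would invent meaningless category values):

# cheap: reproject the vector layer to the raster's CRS (what we did above)
oconee_shp <- spTransform(oconee_shp, oconee@crs)

# expensive: reproject the raster itself; use nearest-neighbor for categorical data
# oconee_ll <- projectRaster(oconee, crs = cty@proj4string, method = "ngb")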

If you think you’ll be extracting raster values to points or polygons, I strongly recommend this tutorial. Finally, for the very bold folks who have to extract massive amounts of high resolution raster data, you will definitely want to check out this post.

Raster to point

Extracting from a raster to a series of points follows the same pattern. If you have a set of points in Oconee County (alas, my home county has little GIS data so I was unable to find any), simply do the following:

value <- extract(oconee, mypoints)

This will return the value of the CropScape raster at each point in mypoints.

Cropping

We’ve already learned how to crop shapefiles and rasters, but here are a few reminders and a new cropping tool from the rgeos package just to broaden your toolkit. Say I want to extract us.cities in the state of California. As you now know, we can do this using the crop() function from the raster package.

library(raster)
states <- shapefile("./data/states.shp")
ca <- states[states@data$STATE_NAME == "California",]
ca_cities <- crop(us.cities, extent(ca))

plot(ca)
plot(ca_cities, add=T)

Don’t forget to check the projection of both shapefiles before you crop! Notice too that the crop() function crops to the extent of the CA shapefile, which is its rectangular bounding box. This is why some cities outside of CA are included. If you want to crop to the shape itself, use the gIntersection() tool from rgeos or some of the tools we saw during the raster module:

library(rgeos)
ca_cities_only <- gIntersection(us.cities, ca)
plot(ca)
plot(ca_cities_only, add=T)


Attribute Joins

The best way to merge an existing data.frame to a spatial object is with the merge() function. Remember that for this function to work, you’ll need to have a column in each dataset with a unique identifier that can be used to link the two datasets. These two columns can have different names, but their content must be the same (i.e. if the unique ID for Logan, UT in one dataset is “10012”, then the other dataset’s ID for Logan needs to be exactly the same, “10012”). This can be tricky at times and can involve some string manipulation to get IDs into the same format; see the sketch just below. Then we’ll walk through a quick example of how the merge() function works. Imagine you want to link our us.cities data to a shapefile of US states.
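A classic formatting fix is zero-padding numeric FIPS codes so they match five-character string GEOIDs. A minimal sketch (df and fips are hypothetical names):

# a sketch: zero-pad a numeric county FIPS (e.g. 1001 -> "01001") so it matches
# a five-character GEOID string; `df` and `fips` are hypothetical names
df$GEOID <- sprintf("%05d", as.integer(df$fips))

Now, on to the states example: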

states <- readOGR("./data/states.shp")
## OGR data source with driver: ESRI Shapefile 
## Source: "./data/states.shp", layer: "states"
## with 51 features
## It has 5 fields
plot(states)
plot(us.cities, add=T, col = "red")

Merging these datasets is pretty easy since they both contain a column that holds the abbreviations of US states. We’ll use these columns to merge the datasets. I always like to start by inspecting the two columns I’ll use to perform the merge to check for any differences or weird values:

unique(states@data$STATE_ABBR)
##  [1] HI WA MT ME ND SD WY WI ID VT MN OR NH IA MA NE NY PA CT RI NJ IN NV
## [24] UT CA OH IL DC DE WV MD CO KY KS VA MO AZ OK NC TN TX NM AL MS GA SC
## [47] AR LA FL MI AK
## 51 Levels: AK AL AR AZ CA CO CT DC DE FL GA HI IA ID IL IN KS KY LA ... WY
unique(us.cities@data$country.etc)
##  [1] "TX" "OH" "CA" "GA" "NY" "OR" "NM" "LA" "VA" "PA" "FL" "IA" "AK" "IN"
## [15] "MA" "MI" "MD" "MN" "WI" "IL" "CO" "NC" "NJ" "AL" "WA" "ME" "AZ" "TN"
## [29] "NE" "MT" "MS" "ND" "MO" "ID" "UT" "KY" "CT" "OK" "NV" "WY" "SC" "WV"
## [43] "NH" "AR" "RI" "DE" "HI" "KS" "VT" "SD" "DC"
mode(states@data$STATE_ABBR)
## [1] "numeric"
mode(us.cities@data$country.etc)
## [1] "character"

The first issue is that the states column is a factor (that’s why mode() reports it as numeric: factors are stored as integer codes under the hood), while the us.cities column is a character vector. Let’s fix this:

states@data$STATE_ABBR <- as.character(states@data$STATE_ABBR)
unique(states@data$STATE_ABBR)
##  [1] "HI" "WA" "MT" "ME" "ND" "SD" "WY" "WI" "ID" "VT" "MN" "OR" "NH" "IA"
## [15] "MA" "NE" "NY" "PA" "CT" "RI" "NJ" "IN" "NV" "UT" "CA" "OH" "IL" "DC"
## [29] "DE" "WV" "MD" "CO" "KY" "KS" "VA" "MO" "AZ" "OK" "NC" "TN" "TX" "NM"
## [43] "AL" "MS" "GA" "SC" "AR" "LA" "FL" "MI" "AK"

Nice. Now let’s merge these datasets. The by.x and by.y arguments let us specify the differently named columns in each dataset that we intend to use for the merge:

city_state <- merge(us.cities@data, states@data, by.x = "country.etc", by.y = "STATE_ABBR")
head(city_state)
##   country.etc          name    pop   lat    long capital STATE_NAME
## 1          AK     Juneau AK  31187 58.30 -134.42       2     Alaska
## 2          AK  Anchorage AK 279428 61.18 -149.19       0     Alaska
## 3          AL     Dothan AL  62449 31.24  -85.41       0    Alabama
## 4          AL Birmingham AL 229300 33.53  -86.80       0    Alabama
## 5          AL     Mobile AL 188467 30.68  -88.09       0    Alabama
## 6          AL Huntsville AL 169155 34.71  -86.63       0    Alabama
##   DRAWSEQ STATE_FIPS         SUB_REGION
## 1      51         02            Pacific
## 2      51         02            Pacific
## 3      43         01 East South Central
## 4      43         01 East South Central
## 5      43         01 East South Central
## 6      43         01 East South Central

Our final dataset links every entry in the us.cities shapefile (1005 in total) to its state and appends data from the state shapefile to the us.cities information. Notice that I’m referencing the data.frames inside the spatial objects using @data for the merge. This is because merge() will not bring together two shapefiles with different geometries. This makes sense… it’s impossible for R to create a spatial object with both points and polygons held in the same place. If you need to merge spatial objects, remember the over() function we discussed above! If, however, you want to merge a data.frame to your spatial object (say, data describing characteristics of the us.cities or states), then you can simply use the merge() function as shown above.
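One last hedged note: if you want the merged result itself to stay spatial, merge the plain data.frame onto the sp object rather than onto its @data slot; sp supplies a merge() method for exactly this. A sketch, assuming a hypothetical data.frame state_stats with a STATE_ABBR column:

# a sketch: sp's merge() keeps the result spatial; `state_stats` is hypothetical
states_plus <- merge(states, state_stats, by.x = "STATE_ABBR", by.y = "STATE_ABBR")
class(states_plus)  # still a SpatialPolygonsDataFrame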


Computing Geometries

Distance

ATTENTION: Don’t even think about computing geometries if your data is NOT projected!!

Oftentimes, we want to do something like compute the distance between two objects or the area of an object. We can do this in R! Let’s say I’ve got data on cemeteries in Utah (weird, but this data exists). Say I want to find the average distance between cemeteries in the state:

library(rgeos)
cem <- shapefile("./data/Cemeteries.shp")
dist_matrix <- gDistance(spgeom1 = cem, spgeom2 = NULL, byid=T)
dist_matrix[1:10, 1:10]
##            1          2         3        4          5         6         7
## 1       0.00 259099.244 315823.76  95417.1 251323.499  81156.52 338058.50
## 2  259099.24      0.000 125252.03 341570.0   7812.565 195608.38 169224.33
## 3  315823.76 125252.034      0.00 375628.8 126793.686 236327.68  44596.34
## 4   95417.10 341570.008 375628.81      0.0 333774.765 146882.00 388880.24
## 5  251323.50   7812.565 126793.69 333774.8      0.000 187863.80 170313.75
## 6   81156.52 195608.382 236327.68 146882.0 187863.799      0.00 257072.46
## 7  338058.50 169224.330  44596.34 388880.2 170313.752 257072.46      0.00
## 8  376135.18 131112.787 206926.27 464773.1 138104.619 322029.01 247787.01
## 9  398451.36 151137.782 217565.53 486778.6 158354.598 343563.16 256707.15
## 10 346127.99  96206.562 175803.40 433140.8 103394.623 289131.68 218057.53
##            8         9        10
## 1  376135.18 398451.36 346127.99
## 2  131112.79 151137.78  96206.56
## 3  206926.27 217565.53 175803.40
## 4  464773.06 486778.64 433140.79
## 5  138104.62 158354.60 103394.62
## 6  322029.01 343563.16 289131.68
## 7  247787.01 256707.15 218057.53
## 8       0.00  22491.96  35497.02
## 9   22491.96      0.00  54962.35
## 10  35497.02  54962.35      0.00

gDistance() from the rgeos package computes distances between points either in two separate SPDF objects (you can specify spgeom1 and spgeom2) or among the points within a single shapefile (spgeom2 = NULL, as above). The function returns a matrix of distances between points. Note that the diagonal is zero, i.e. the distance between a point and itself is zero. To avoid skewing our average distance, I’ll set the diagonal of this matrix to NA and then compute the average distance between cemeteries:

diag(dist_matrix) <- NA
mean(dist_matrix, na.rm=T)
## [1] 216501
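The full matrix is good for more than a grand mean. For example, a quick sketch (my own) of each cemetery’s distance to its nearest neighbor:

# a sketch: each cemetery's distance (in meters) to its nearest neighbor
nearest <- apply(dist_matrix, 1, min, na.rm = TRUE)
summary(nearest)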

Area

We can also use rgeos to compute the area of polygons. Let’s compute the surface area of all counties in Utah:

ut <- cty[cty@data$STATEFP == 49,]
ut_areas <- gArea(ut, byid=T)

HAH! Gotcha! Be sure to project your data!! I will project my Utah dataset to UTM Zone 12N, a projection used frequently by folks in the state of Utah:

crs <- "+proj=utm +zone=12 +datum=NAD83 +units=m +no_defs +ellps=GRS80 +towgs84=0,0,0"
ut <- spTransform(ut, crs)

ut_areas <- gArea(ut, byid=T)
head(ut_areas)
##         252         253         254         255         256         257 
##  6697859970 17422718654 10632734968  1580948776  2085507623  4869604901

These are huge numbers because our projection’s units are meters, so areas come back in square meters. If you’ll be working a lot with distance, area, or other geometries, check out the rgeos package. There are other great functions like gIntersection() and gOverlaps() that may be of interest to you.
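To give a flavor, a small sketch (gTouches() is my addition here, another rgeos predicate): convert the areas to square kilometers and ask which county pairs share a border.

# convert square meters to square kilometers
head(ut_areas / 1e6)

# which Utah counties share a border? gTouches() returns a logical matrix
adjacency <- gTouches(ut, byid = TRUE)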


Getting dplyr to talk to sp

You may have noticed a few annoying things about how dplyr talks to objects in the sp package. If I want to take a shapefile, say my us.cities shapefile, and use dplyr to extract cities with a population of more than 500,000 people, then I do the following:

big_cities <- us.cities@data %>% filter(pop > 500000) %>% arrange(desc(pop))
head(big_cities)
##              name country.etc     pop   lat    long capital
## 1     New York NY          NY 8124427 40.67  -73.94       0
## 2  Los Angeles CA          CA 3911500 34.11 -118.41       0
## 3      Chicago IL          IL 2830144 41.84  -87.68       0
## 4      Houston TX          TX 2043005 29.77  -95.39       0
## 5      Phoenix AZ          AZ 1450884 33.54 -112.07       2
## 6 Philadelphia PA          PA 1439814 40.01  -75.13       0

That’s easy enough. But did you notice something about our data? We started with a nice SpatialPointsDataFrame (us.cities) and ended up with a data.frame (big_cities):

class(us.cities)
## [1] "SpatialPointsDataFrame"
## attr(,"package")
## [1] "sp"
class(big_cities)
## [1] "data.frame"

The big_cities data.frame still has the coordinate information in it, so it’s easy to transform it back into a SPDF:

big_cities_SPATIAL <- SpatialPointsDataFrame(coords = cbind(big_cities$long, big_cities$lat), data = big_cities, proj4string = us.cities@proj4string)
plot(big_cities_SPATIAL)

…but this is annoying, right? There’s a fairly new package called spdplyr which, as you may have guessed, integrates the sp package with dplyr. If you haven’t already installed this package, be sure to do so. It’s pretty useful.

library(spdplyr)
big_cities <- us.cities %>% filter(pop > 500000) %>% arrange(desc(pop))
class(big_cities)
## [1] "SpatialPointsDataFrame"
## attr(,"package")
## [1] "sp"

Hooray! And just in case you don’t believe it’s spatial:

plot(big_cities)

Keep the spdplyr package in mind as you wrangle spatial data. It’ll save the annoying step of having to transform your data between a spatial object and a data.frame.
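One more hedged sketch before you go (my example, not from the spdplyr docs): other dplyr verbs like mutate() work on sp objects too, and the result stays spatial.

# a sketch: mutate() on a SpatialPointsDataFrame stays spatial with spdplyr
cities_k <- us.cities %>% mutate(pop_k = pop / 1000)
class(cities_k)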

Additional resources


  1. If you want to get super nerdy about spatial or attribute joins, or if you face join challenges with your own data, type vignette("over") to learn more about the join potential in the sp package.

  2. You can find the CropScape metadata here.

  3. FYI, you can look up the Census STATEFP and COUNTYFP for a county here.