Objectives

  • Learn how R opens, stores, and organizes shapefiles.
  • Learn how R handles projection.
  • dplyr a shapefile.
  • MAKE A MAP

Pre-lab assignment

  • If you are not yet familiar with GIS and spatial analysis, you should probably read this overview.
  • The entire Spatial Data Science with R online textbook is excellent. You can start by reading the chapter on vector data. If you’re unfamiliar with projection, you should also look at the chapter on projection. Even if you’re familiar with projection, you may want to review this chapter as it addresses projection in R.

Set up

Hooray! We’ve finally made it to my favorite type of data: spatial data! Well, actually nearly all of the data we’ve already played contains some sort of spatial information (country names, state names, etc…), it just did not contain the coordinate and projection information required to make maps and conduct spatial analyses. This tutorial will focus on giving you the tools required to import, project, and create spatial data. There are several packages used by the R community to bring spatial data into R. These are:

  • sp
  • rgdal
  • sf
  • Our friend the tidyverse

Make sure you have these packages installed and loaded on your machine.

library(sp) # old school - don't really need this package, just loading for reference
library(sf)
library(spData) # a bonus package that has some cool shapefiles
library(ggspatial)  # if you want to boost your ggplot mapping skillz

# plus, as always:
library(tidyverse)

Spatial objects in R

First, some history…

sf (simple features) is the new go-to package for spatial analysis in R, replacing the sp package of yore.1 sf is a descendant of the sp package, made by the same folks, and has good inter-operability with sp. You can look through the sf and sp documentation to get a sense of all the things you can do with these powerful packages. The team behind the sp package initially created a set of new data classes that integrate our best friend the data.frame with some friends who at times can be very difficult to hang out with: point, polygon, and raster. These new data classes include the following:

getClass("Spatial")
## Class "Spatial" [package "sp"]
## 
## Slots:
##                               
## Name:         bbox proj4string
## Class:      matrix         CRS
## 
## Known Subclasses: 
## Class "SpatialPoints", directly
## Class "SpatialMultiPoints", directly
## Class "SpatialGrid", directly
## Class "SpatialLines", directly
## Class "SpatialPolygons", directly
## Class "SpatialPointsDataFrame", by class "SpatialPoints", distance 2
## Class "SpatialPixels", by class "SpatialPoints", distance 2
## Class "SpatialMultiPointsDataFrame", by class "SpatialMultiPoints", distance 2
## Class "SpatialGridDataFrame", by class "SpatialGrid", distance 2
## Class "SpatialLinesDataFrame", by class "SpatialLines", distance 2
## Class "SpatialPixelsDataFrame", by class "SpatialPoints", distance 3
## Class "SpatialPolygonsDataFrame", by class "SpatialPolygons", distance 2

sf on the other hand has a single sf class that incorporates a special geometry column into a dataframe. This geometry column can include different types of spatial data (e.g. points, polygons, lines, or multiple types). This is convenient because all sf objects can basically be treated like data.frames (PARTY!).

In sp world, the sub-classes followed by DataFrame like SpatialPointsDataFrame, include additional attributes in the form of a data.frame. For example, if you have a SpatialPolygon object of US Census blocks, then this object would only include the spatial information required to plot the Census block polygons. With a SpatialPolygonDataFrame, you can store this spatial information and additional data such as population, race, and income associated with each polygon. This may seem trivial, but, as I’ll show you below, this makes for an incredibly powerful way to organize, visualize, and analyze spatial data. In this class, we’ll focus on using sf objects since this is quickly becoming the norm, but I wanted you to also know about sf’s cool grandma, sp.

Enough about the history of spatial packages in R… let’s open some shapefiles! If you’re like me, you have folders full of shapefiles. A single shapefile is often split into many files (e.g. .cpg, .dbf, .prj, etc.), the most important of which is the sacred .SHP file. To get all of this info into an organized spatial object in R is easy. As I mentioned before, you can use lots of packages to do this (rgdal, sf, and sp notably), but in this tutorial, we’ll focus on sf:

states <- st_read("./data/states.shp")

Here, I use the st_read() function to open my .shp file. Note that this function automatically finds all of the ancillary files associated with our shapefile (.dbf, .prj, etc). If for some reason these files are missing or broken, st_read() will throw an error, so make sure your shapefile contains all necessary pieces prior to loading in R!

To make things easier, in this tutorial we’ll mostly use data from the spData package. It contains some generic shapefiles for use in R, already formatted in sf or sp format and easy to load. Check out the full list here. Today, we’ll work with the us_states object:

# make sure you've installed and loaded spData first!
data("us_states")
class(us_states)
## [1] "sf"         "data.frame"

Excellent. Note that this us_states object is listed as both a simple feature (sf) and as a data.frame. Let’s inspect our data:

glimpse(us_states)
## Rows: 49
## Columns: 7
## $ GEOID        <chr> "01", "04", "08", "09", "12", "13", "16", "18", "20", ...
## $ NAME         <chr> "Alabama", "Arizona", "Colorado", "Connecticut", "Flor...
## $ REGION       <fct> South, West, West, Norteast, South, South, West, Midwe...
## $ AREA         [km^2] 133709.27 [km^2], 295281.25 [km^2], 269573.06 [km^2],...
## $ total_pop_10 <dbl> 4712651, 6246816, 4887061, 3545837, 18511620, 9468815,...
## $ total_pop_15 <dbl> 4830620, 6641928, 5278906, 3593222, 19645772, 10006693...
## $ geometry     <MULTIPOLYGON [°]> MULTIPOLYGON (((-88.20006 3..., MULTIPOLY...

Note that if you love sp (full disclosure, I still use sp from time to time), it’s really easy to coerce this back to a sp object.

states_sp <- as_Spatial(us_states)
class(us_states)
## [1] "sf"         "data.frame"

But for now we’ll stick with sf. Let’s take a peak at the other info stored in our states object. We can also access information about the bounding box of the spatial object:

st_bbox(us_states)
##       xmin       ymin       xmax       ymax 
## -124.70415   24.55868  -66.98240   49.38436

This is a matrix of bounding box coordinates with column names c(min,max) with the first row showing the longitude (x-axis) and the second row showing the latitude (y-axis).

We can also access projection information:

st_crs(us_states)
## Coordinate Reference System:
##   User input: EPSG:4269 
##   wkt:
## GEOGCS["NAD83",
##     DATUM["North_American_Datum_1983",
##         SPHEROID["GRS 1980",6378137,298.257222101,
##             AUTHORITY["EPSG","7019"]],
##         TOWGS84[0,0,0,0,0,0,0],
##         AUTHORITY["EPSG","6269"]],
##     PRIMEM["Greenwich",0,
##         AUTHORITY["EPSG","8901"]],
##     UNIT["degree",0.0174532925199433,
##         AUTHORITY["EPSG","9122"]],
##     AUTHORITY["EPSG","4269"]]

This slot holds a string that describes the projection of the data, including both the EPSG code and the proj4string format of the projection. Wait, what? Hang on, I’ll go through the projection stuff in more detail below.

Finally, the moment we’ve all been waiting for… let’s plot this baby! For now, we will use base R to plot this object, but in a future class we’ll learn about more sophisticated mapping tools.

If you just run plot(us_states), you’ll get several maps, one for each attribute in the sf object:

plot(us_states)

To just plot the geometry, you use the st_geometry() function:

plot(st_geometry(us_states))

And to plot only a single attribute, try this:

plot(us_states["total_pop_15"], main = "Population in 2015")


Projection

SO… what the heck is a projection?

Have you ever thought about how a map is two-dimensional but the world is, in fact, three-dimensional? Anytime we visualize our three-dimensional reality2 in two-dimensions, we have to distort (squish, stretch) the real shape of things so they can fit on a piece of paper or a screen. Maps are two-dimensional objects; the Earth is not. The way we portray (pseudo-)spherical Earth on a flat surface is by using a projection. Each projection is associated with a particular coordinate system. A coordinate system is a way to locate everything on Earth’s surface in x and y space. There are a bazillion coordinate systems, and each coordination system comes with strong assumptions and distortions (and fan-base):

  • Geographic Coordinate System (GCS): good ol’ latitude and longitude (degrees, minutes, seconds); we think of data in GCS as unprojected (if this doesn’t make sense to you, review the tutorial I’ve made exclusively on projection);
  • Universal Transverse Mercator (UTM): lots of zones to choose from, Utah sits in 12N;
  • State Plane Coordinate System: some states have their own systems, i.e. North Carolina Coordinate System
  • National systems: If working in another country, check out country-specific projections. Kandawala Sri Lanka Grid anyone?

Each coordinate system is tied to a datum. This is essentially a reference point, the starting point from which coordinates are measured (i.e. lat and lon are based on the equator and prime meridian). In R we can access the projection information for a sf object using the st_crs() function you saw above:

st_crs(us_states)
## Coordinate Reference System:
##   User input: EPSG:4269 
##   wkt:
## GEOGCS["NAD83",
##     DATUM["North_American_Datum_1983",
##         SPHEROID["GRS 1980",6378137,298.257222101,
##             AUTHORITY["EPSG","7019"]],
##         TOWGS84[0,0,0,0,0,0,0],
##         AUTHORITY["EPSG","6269"]],
##     PRIMEM["Greenwich",0,
##         AUTHORITY["EPSG","8901"]],
##     UNIT["degree",0.0174532925199433,
##         AUTHORITY["EPSG","9122"]],
##     AUTHORITY["EPSG","4269"]]

Excellent. Our data is currently in the NAD83 Geographic Coordinate System which plots locations on a sphere using longitude and latitude with the North American Datum (NAD) 1983 as a reference line (but does not project these coordinates into two-dimensional space). This is fine if we’re only interested in visualizing our data spatially, but if we have any plans to use the spatial data to compute distance or area, then we need to project our data (from 3D to 2D). This is fairly straightforward in R. Say we want to reproject this data into USA Contiguous Albers Equal Area Conic:

states <- st_transform(us_states, CRS("+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=37.5 +lon_0=-96 +x_0=0 +y_0=0 +ellps=GRS80 +datum=NAD83 +units=m +no_defs"))

I got that crazy long string by searching for the projection on spatialreference.org and clicking the Proj4 link. R expects projection information to come in this proj4 format.

If you happen to have a shapefile that has an undefined projection in R, but you happen to know what projection the data is in, you can set the projection as follows:

shp <- st_read("./path/to/shp.shp")
st_crs(shp) <- CRS("+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=37.5 +lon_0=-96 +x_0=0 +y_0=0 +ellps=GRS80 +datum=NAD83 +units=m +no_defs")

People tend to hate projections3 and generally don’t understand them very well. It is, however, VERY important that you project your data correctly. If you fail to do this, it’s highly likely that your spatial analyses will be wrong. So let’s learn how this stuff works once and for all.

Finally, don’t forget that the map projection you use strongly impacts the way you visualize and analyze your data (see here or here).


Creating a new shapefile in R

Let’s say I’m a volcanologist and I find a really cool data describing all eruptions on Earth during the Holocene.4 The data is stored in a .csv document. We now know how to open the .csv in R and inspect its contents:

volcano <- read.csv('./data/volcano.csv')
glimpse(volcano)
## Rows: 1,496
## Columns: 13
## $ ï..Volcano.Number    <int> 210010, 210020, 210030, 210040, 211001, 211003...
## $ Volcano.Name         <chr> "West Eifel Volcanic Field", "Chaine des Puys"...
## $ Country              <chr> "Germany", "France", "Spain", "Spain", "Italy"...
## $ Primary.Volcano.Type <chr> "Maar(s)", "Lava dome(s)", "Pyroclastic cone(s...
## $ Activity.Evidence    <chr> "Eruption Dated", "Eruption Dated", "Evidence ...
## $ Last.Known.Eruption  <chr> "8300 BCE", "4040 BCE", "Unknown", "3600 BCE",...
## $ Region               <chr> "Mediterranean and Western Asia", "Mediterrane...
## $ Subregion            <chr> "Western Europe", "Western Europe", "Western E...
## $ Latitude             <dbl> 50.170, 45.775, 42.170, 38.870, 43.250, 42.600...
## $ Longitude            <dbl> 6.850, 2.970, 2.530, -4.020, 10.870, 11.930, 1...
## $ Elevation..m.        <int> 600, 1464, 893, 1117, 500, 800, 949, 458, 1281...
## $ Dominant.Rock.Type   <chr> "Foidite", "Basalt / Picro-Basalt", "Trachybas...
## $ Tectonic.Setting     <chr> "Rift zone / Continental crust (>25 km)", "Rif...

But we’re not here to learn how to work with .csv data… we have now graduated to spatial data! So how do we turn our .csv into a sf object?

volcano_sf <- st_as_sf(volcano, coords=c("Longitude", "Latitude")) 
st_crs(volcano_sf) <- "+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs"
plot(st_geometry(volcano_sf))

What’s happening here? First, we feed the st_as_sf() function the two things it needs to turn this .csv into an sf object: data and coordinates. The latitude and longitude are stored in original .csv. Coordinates are always listed in format x,y, so take a second to think about why longitude is associated with x (vertical lines, moving along x-axis of a graph) and latitude with y (horizontal lines, moving along the y-axis of a graph).

Notice here that I had to define the projection of my data using the prj variable. I happen to know that this data was unprojected using the WGS84 datum. For your data, you’ll need to rummage through the metadata to figure out what the projection is.

Ohhh how I want to transform this ugly base R plot into something better. That’s coming. For now, here’s a plot with base R and some data from the maptools package:

data("world") # from spData
plot(st_geometry(world))
plot(volcano_sf["Elevation..m."], pch=17, main = "How high those 'canos", add=T) # all the dots from R reading in the csv poorly

We’ll learn how to use ggplot for sf plotting magic soon. For now, note how you can plot a single attribute by calling plot(sf["attribute_name"]). So in this plot, the color of the point represents the elevation of the volcano! The pch argument changes the shape of the points. I went with volcanic triangles, but there are many other options. Also added a title using the main argument.


Maps with ggplot2

Loading spatial data in R is cool, but spatial data becomes extremely powerful when you map it. Though it wasn’t originally designed for mapping, you can do quite a bit of mapping with our friend ggplot2. Check this out:

ggplot() +
  geom_sf(data = us_states)

From there, you can do all of the usual ggplot tuning to make things look a bit better:

ggplot() +
  geom_sf(data = us_states, color = "blue", fill = "beige") +
  labs(title = "The US of A") +
  theme_minimal()

What if we want to make a new map visualizing the total population in each major region in the US?

us_states %>%
  group_by(REGION) %>%
  summarize(POP_TOTAL = sum(total_pop_15)) %>%
  ggplot() +
  geom_sf(aes(fill = POP_TOTAL)) +
  labs(fill = "Population") +
  theme_bw()

That’s right guys. I just piped a spatial object into a ggplot to make a map. One of the best things about the sf package is that it allows us to apply all of the amazing tools we’ve learned to date (hellooooo tidyverse) to wrangle and explore data.frames to spatial data.

So, say we want to highlight the fine state of Georgia. We could do something like this:

us_states %>%
  mutate(IS_IT_GA = ifelse(NAME == "Georgia", "Georgia", "Just not Georgia")) %>%
  ggplot() +
  geom_sf(aes(fill = IS_IT_GA)) + # drop data = us_states; the data is now what you pipe into the plot
  # also, move the fill into the aes() b/c it applies differently to each shape
  labs(title = "The US of A",
       fill = "Which state is it??") +
  theme_minimal() +
  scale_fill_manual(values = c("dark red", "white"))

And say we have a new shapefile that shows the roads across the US? Then we’d simply add a new geom_sf() call to the map with this new shapefile, like so:

ggplot() +
  geom_sf(data = us_states) +
  geom_sf(data = us_roads)

Note that to visualize multiple shapefiles on the same plot, they need to be in the same projection.

Finally, this post has some other useful tricks including using the ggspatial package to add a scale bar and compass:

library(ggspatial)
us_states %>%
  mutate(IS_IT_GA = ifelse(NAME == "Georgia", "Georgia", "Just not Georgia")) %>%
  ggplot() +
  geom_sf(aes(fill = IS_IT_GA)) + 
  labs(title = "The US of A",
       fill = "Which state is it??") +
  theme_minimal() +
  scale_fill_manual(values = c("dark red", "white")) +
  annotation_scale(location = "bl", width_hint = 0.5) +
  annotation_north_arrow(location = "br", which_north = "true",
                         style = north_arrow_fancy_orienteering())

There are many things we can do with spatial data I haven’t mentioned here. One of the most useful is linking multiple spatial datasets so they can “talk” to one another. You can do this by extracting information from one dataset to another (so, for example, extracting to points values of a shapefile or raster) or by merging data spatially or using columns in the dataset. We’ll talk more about extraction when we talk about rasters. To merge, check out the base R merge() function, which works like this:

shp1 <- st_read("./data/shp1.shp")
shp2 <- st_read("./data/shp2.shp")

# assume both shapefiles represent US states and have a column that lists the abbreviation for each state called ST_ABBR
# we can then do a merge like this!
shp_final <- merge(shp1, shp1, by = "ST_ABBR", all = T)

When we merge like this, the new shapefile will include all of the columns in both shp1 and shp2. Careful, for this to work, you need to be sure the state abbreviations are formatted the same. This is why I add the all = T argument which includes even rows for which the merge didn’t work. This lets you inspect your merged data to see where the merge may have failed.


Advanced mapping tools

…that we won’t cover, but that are very cool and that I encourage you to explore:


Additional Resources

  • Chapter 2 of Applied Spatial Data Analysis with R (reference on Canvas More Resources page). The authors of the book are also the authors of the sp package.
  • This tutorial created by David Beskow.
  • DataCamp’s Working with Geospatial Data in R tutorial created by Charlotte Wickham (I believe she’s the sister of Hadley Wickham of tidyverse family, pretty prolific family!).
  • More information on opening spatial data in R
  • For more on projection in R, see this tutorial.
  • For more on transforming csv files into shapefiles, see this tutorial
  • As I was building this tutorial, I really enjoyed this overview of mapping with ggplot2
  • This tutorial goes much further than I did with the maps package.
  • Tons of maps you can make with base R and spplot() can be found here.
  • Sometimes it’s very useful to visualize space-time dynamics with animations. A short intro can be found here.
  • Great set of slides reviewing tmap capabilities.
  • I also really like this overview of mapping in R.

  1. I’ve got a whole tutorial on how to work with this package here for those who are interested!

  2. Ok, ok, I’m not getting into the dimensionality of reality, but you get the idea

  3. The only person who was ever pumped about projections was an old Flemish dude named Mercator.

  4. Volcanologists rejoice! This data exists and I encourage you to use it!