Pre-lab assignment

My assumptions

base R

All new packages and functions this week, so the base R we’ve covered to date should be sufficient!

Set up

Hooray! We’ve finally made it to spatial data! Well, actually nearly all of the data we’ve already played with is spatial, it just did not contain the coordinate and projection information required to make maps and conduct spatial analyses. This tutorial will focus on giving you the tools required to import, project, and create spatial datasets. There are three main packages used by the R community to bring spatial data into R. These are:

Make sure you have these packages installed on your machine.


Spatial objects in R

sp is the mama-package of all the spatial packages. You can look through the sp documentation to get a sense of all the things you can do with this powerful package. The team behind the sp package created a set of new data classes that integrate our best friend the data.frame with some friends who at times can be very difficult to hang out with: point, polygon, and raster. These new data classes include the following:

## Class "Spatial" [package "sp"]
## Slots:
## Name:         bbox proj4string
## Class:      matrix         CRS
## Known Subclasses: 
## Class "SpatialPoints", directly
## Class "SpatialMultiPoints", directly
## Class "SpatialGrid", directly
## Class "SpatialLines", directly
## Class "SpatialPolygons", directly
## Class "SpatialPointsDataFrame", by class "SpatialPoints", distance 2
## Class "SpatialPixels", by class "SpatialPoints", distance 2
## Class "SpatialMultiPointsDataFrame", by class "SpatialMultiPoints", distance 2
## Class "SpatialGridDataFrame", by class "SpatialGrid", distance 2
## Class "SpatialLinesDataFrame", by class "SpatialLines", distance 2
## Class "SpatialPixelsDataFrame", by class "SpatialPoints", distance 3
## Class "SpatialPolygonsDataFrame", by class "SpatialPolygons", distance 2

The sub-classes followed by DataFrame like SpatialPointsDataFrame, include additional attributes in the form of a data.frame. For example, if you have a SpatialPolygon object of US Census blocks, then this object would only include the spatial information required to plot the Census block polygons. With a SpatialPolygonDataFrame, you can store this spatial information and additional data such as population, race, and income associated with each polygon. This may seem trivial, but, as I’ll show you below, this makes for an incredibly powerful way to organize, visualize, and analyze spatial data. You know me, I like data, so in this class, our spatial objects will almost always come in the DataFrame flavor, i.e. with additional attributes attached to the spatial object. Since you guys are now experts on data.frame wrangling, this should come as good news!1

Did you notice I referred to our new sp datasets as objects? This is important. In the programming world, an object stores functions and variables relevant only to that object. Huh? For example, we could create a vehicle object and assign it a number of attributes that are only relevant to vehicles, i.e. numberofwheels, color, weight, etc. We could also build functions associated with that object, i.e. compute_mileage(), enter_Batman_mode(), etc. We can bundle all of these object-specific attributes and functions in our vehicle object. In the programming world, this is known as object-oriented programming and it is very powerful.2

Our Spatial objects have a series of attributes that describe things specific to spatial data such projections and bounding boxes. Let’s say I have loaded a shapefile in R and called the new variable that stores this shapefile states. I’ll show you how to load shapefiles below. For now, let’s just focus on how R organizes the information stored in a shapefile. Let’s start by looking at the class of the shapefile object:

## OGR data source with driver: ESRI Shapefile 
## Source: "./data/states.shp", layer: "states"
## with 51 features
## It has 5 fields
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"

Excellent. We confirmed that states is, in fact, a spatial object. More precisely, it is a SpatialPolygonsDataFrame. Remember, this means that there is a data.frame with more info hiding somewhere in this object. Because this is an object, it is organized differently than a classic data.frame whose variables can be accessed using $. In R, objects contain slots, which store different types of information or even functions specific to the object. We access different slots using the @ symbol. Let’s look at some of the slots stored in our states object. We can access the actual data (or, in ArcGIS speak, the attribute table) using @data:

## 0                Hawaii       1         15            Pacific         HI
## 1            Washington       2         53            Pacific         WA
## 2               Montana       3         30           Mountain         MT
## 3                 Maine       4         23        New England         ME
## 4          North Dakota       5         38 West North Central         ND
## 5          South Dakota       6         46 West North Central         SD
## 6               Wyoming       7         56           Mountain         WY
## 7             Wisconsin       8         55 East North Central         WI
## 8                 Idaho       9         16           Mountain         ID
## 9               Vermont      10         50        New England         VT
## 10            Minnesota      11         27 West North Central         MN
## 11               Oregon      12         41            Pacific         OR
## 12        New Hampshire      13         33        New England         NH
## 13                 Iowa      14         19 West North Central         IA
## 14        Massachusetts      15         25        New England         MA
## 15             Nebraska      16         31 West North Central         NE
## 16             New York      17         36    Middle Atlantic         NY
## 17         Pennsylvania      18         42    Middle Atlantic         PA
## 18          Connecticut      19         09        New England         CT
## 19         Rhode Island      20         44        New England         RI
## 20           New Jersey      21         34    Middle Atlantic         NJ
## 21              Indiana      22         18 East North Central         IN
## 22               Nevada      23         32           Mountain         NV
## 23                 Utah      24         49           Mountain         UT
## 24           California      25         06            Pacific         CA
## 25                 Ohio      26         39 East North Central         OH
## 26             Illinois      27         17 East North Central         IL
## 27 District of Columbia      28         11     South Atlantic         DC
## 28             Delaware      29         10     South Atlantic         DE
## 29        West Virginia      30         54     South Atlantic         WV
## 30             Maryland      31         24     South Atlantic         MD
## 31             Colorado      32         08           Mountain         CO
## 32             Kentucky      33         21 East South Central         KY
## 33               Kansas      34         20 West North Central         KS
## 34             Virginia      35         51     South Atlantic         VA
## 35             Missouri      36         29 West North Central         MO
## 36              Arizona      37         04           Mountain         AZ
## 37             Oklahoma      38         40 West South Central         OK
## 38       North Carolina      39         37     South Atlantic         NC
## 39            Tennessee      40         47 East South Central         TN
## 40                Texas      41         48 West South Central         TX
## 41           New Mexico      42         35           Mountain         NM
## 42              Alabama      43         01 East South Central         AL
## 43          Mississippi      44         28 East South Central         MS
## 44              Georgia      45         13     South Atlantic         GA
## 45       South Carolina      46         45     South Atlantic         SC
## 46             Arkansas      47         05 West South Central         AR
## 47            Louisiana      48         22 West South Central         LA
## 48              Florida      49         12     South Atlantic         FL
## 49             Michigan      50         26 East North Central         MI
## 50               Alaska      51         02            Pacific         AK

Our data! In data.frame form. If you don’t believe me, class(states@data) proves that a real live data.frame is stored in our spatial object. All of our magical dplyr tools apply… want to know how many states are in each sub-region?

states@data %>%
  group_by(SUB_REGION) %>%
  summarize(n_county = n())
## # A tibble: 9 x 2
##   SUB_REGION         n_county
##   <fctr>                <int>
## 1 East North Central        5
## 2 East South Central        4
## 3 Middle Atlantic           3
## 4 Mountain                  8
## 5 New England               6
## 6 Pacific                   5
## 7 South Atlantic            9
## 8 West North Central        7
## 9 West South Central        4

Is your mind blown? Tears streaming down your face? Aren’t you glad you invested all that time in learning about data.frames and the dplyr? Notice, however, that the output of this dplyring is a data.frame, not a sp object. To keep the subset and cleaned shapefile in shapefile form, we have to use a different package called, unsurprisingly, spdplyr, which you’ll learn about in another tutorial.

Let’s take a peak at the other info stored in our states object. We can also access information about the bounding box of the spatial object:

##          min       max
## x -178.21760 -66.96927
## y   18.92179  71.40624

This is a matrix of bounding box coordinates with column names c(min,max) with the first row showing the longitude (x-axis) and the second row showing the latitude (y-axis).

We can also access projection information:

## CRS arguments:
##  +proj=longlat +datum=NAD83 +no_defs +ellps=GRS80 +towgs84=0,0,0

This slot holds a string that describes the projection of the dataset. This string should be in a PROJ.4 format.3 Wait, what? Hang on, I’ll go through the projection stuff in more detail below.

Together, these slots form a spatial object that we can plot. For now, we will use base R to plot this object, but in a future class we’ll learn about more sophisticated mapping tools.


Now that you understand how a spatial object is structured and its slots are accessed, let’s move on to loading different types of spatial data into R.

Loading existing shapefiles into R

If you’re like me, you have folders full of shapefiles. A single shapefile is often split into many files (e.g. .cpg, .dbf, .prj, etc.), the most important of which is the sacred .SHP file. To get all of this info into an organized spatial object in R is easy, and in fact, as is often the case in R, there are multiple packages and functions that will do this for you. Today, we’ll learn how to use the rgdal package to load a shapefile:

states <- readOGR(dsn = "./data/states.shp") 
## OGR data source with driver: ESRI Shapefile 
## Source: "./data/states.shp", layer: "states"
## with 51 features
## It has 5 fields
## Object of class SpatialPolygonsDataFrame
## Coordinates:
##          min       max
## x -178.21760 -66.96927
## y   18.92179  71.40624
## Is projected: FALSE 
## proj4string :
## [+proj=longlat +datum=NAD83 +no_defs +ellps=GRS80 +towgs84=0,0,0]
## Data attributes:
##       STATE_NAME    DRAWSEQ       STATE_FIPS              SUB_REGION
##  Alabama   : 1   Min.   : 1.0   01     : 1   South Atlantic    : 9  
##  Alaska    : 1   1st Qu.:13.5   02     : 1   Mountain          : 8  
##  Arizona   : 1   Median :26.0   04     : 1   West North Central: 7  
##  Arkansas  : 1   Mean   :26.0   05     : 1   New England       : 6  
##  California: 1   3rd Qu.:38.5   06     : 1   East North Central: 5  
##  Colorado  : 1   Max.   :51.0   08     : 1   Pacific           : 5  
##  (Other)   :45                  (Other):45   (Other)           :11  
##  AK     : 1  
##  AL     : 1  
##  AR     : 1  
##  AZ     : 1  
##  CA     : 1  
##  CO     : 1  
##  (Other):45

The dsn argument is simply the path to the shapefile you want to open in R. The summary() of the shapefile provides some important information. I can immediately see that R has read my shapefile in as a SpatialPolygonsDataFrame. I can see the bounding box under the Coordinates heading. I see that my data is not projected (Is projected: FALSE) and the proj4string string that gives me more information about the data’s coordinate reference system. I can also see the data attributes stored in the @data part of the object and some descriptive stats about each variable. Nifty.


People tend to hate projections4 and generally don’t understand them very well. It is, however, VERY important that you project your data correctly. If you fail to do this, it’s highly likely that your spatial analyses will be wrong. So let’s learn how this stuff works once and for all.

Maps tend to be two-dimensional; the Earth is not. A coordinate system is a way to locate everything on Earth’s surface in x and y space. The way we portray (pseudo-)spherical Earth on a flat surface is by using a projection. Each projection is associated with a particular coordinate system. There are a bazillion coordinate systems, and each coordination system comes with strong assumptions and distortions (and fan-base):

Each coordinate system is tied to a datum. This is essentially a reference point, the starting point from which coordinates are measured (i.e. lat and lon are based on the equator and prime meridian). In R we can access the projection information for a Spatial object using the proj4string attribute of the object:

## CRS arguments:
##  +proj=longlat +datum=NAD83 +no_defs +ellps=GRS80 +towgs84=0,0,0

Excellent. Our data is currently in the WGS84 Geographic Coordinate System which plots locations on a sphere using longitude and latitude (but does not project these coordinates into two-dimensional space). This is fine if we’re only interested in visualizing our data spatially, but if we have any plans to use the spatial data to compute distance or area, then we need to project our data (from 3D to 2D). This is fairly straightforward in R. Say we want to reproject this dataset into USA Contiguous Albers Equal Area Conic:

states <- spTransform(states, CRS("+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=37.5 +lon_0=-96 +x_0=0 +y_0=0 +ellps=GRS80 +datum=NAD83 +units=m +no_defs"))

I got that crazy long string by searching for the projection on spatialreference.org and clicking the Proj4 link. R expects projection information to come in this proj4 format.

If you happen to have a shapefile that has an undefined projection in R, but you happen to know what projection the data is in, you can set the projection as follows:

shp <- readOGR("./path/to/shp.shp")
projection(shp) <- CRS("+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=37.5 +lon_0=-96 +x_0=0 +y_0=0 +ellps=GRS80 +datum=NAD83 +units=m +no_defs")

Finally, don’t forget that the map projection you use strongly impacts the way you visualize and analyze your data (see here or here)).

Building new shapefiles in R


Let’s say I’m a volcanologist and I find a really cool dataset describing all eruptions on Earth during the Holocene.5 The data is stored in a .csv document. We now know how to open the .csv in R and inspect its contents:

volcano <- read.csv('./data/volcano.csv')
## Observations: 1,496
## Variables: 13
## $ ï..Volcano.Number    <int> 210010, 210020, 210030, 210040, 211001, 2...
## $ Volcano.Name         <fctr> West Eifel Volcanic Field, Chaine des Pu...
## $ Country              <fctr> Germany, France, Spain, Spain, Italy, It...
## $ Primary.Volcano.Type <fctr> Maar(s), Lava dome(s), Pyroclastic cone(...
## $ Activity.Evidence    <fctr> Eruption Dated, Eruption Dated, Evidence...
## $ Last.Known.Eruption  <fctr> 8300 BCE, 4040 BCE, Unknown, 3600 BCE, 1...
## $ Region               <fctr> Mediterranean and Western Asia, Mediterr...
## $ Subregion            <fctr> Western Europe, Western Europe, Western ...
## $ Latitude             <dbl> 50.170, 45.775, 42.170, 38.870, 43.250, 4...
## $ Longitude            <dbl> 6.850, 2.970, 2.530, -4.020, 10.870, 11.9...
## $ Elevation..m.        <int> 600, 1464, 893, 1117, 500, 800, 949, 458,...
## $ Dominant.Rock.Type   <fctr> Foidite, Basalt / Picro-Basalt, Trachyba...
## $ Tectonic.Setting     <fctr> Rift zone / Continental crust (>25 km), ...

But we’re not here to learn how to work with .csv data… we have now graduated to spatial data! So how do we turn our .csv into a SpatialPointsDataFrame (SpatialPoints because we’re dealing with coordinates of eruptions and DataFrame because there are more attributes than these coordinates, like Volcano.Name and Country)?

prj <- CRS("+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs")

select <- dplyr::select #this tells R to use the select() function from the dplyr package, not the select function stored in some of the spatial packages
coords <- volcano %>% select(Longitude, Latitude)
volcano_pt <- SpatialPointsDataFrame(coords = coords, data = volcano, proj4string = prj)

Notice here that I had to define the projection of my dataset using the prj variable. I make the prj variable a CRS object using the CRS() function. This essentially smooshes the long projection string into a format that R likes. Be sure you’ve formatted your projection data using the CRS() function when you’re creating shapefiles. I have to create a coords variable that stores the longitude and latitude for my point dataset. I used dplyr to select these two columns from the full volcano dataset. Then we mix it all together to bake a beautiful volcanic spatial cake using the SpatialPointsDataFrame() function. To learn more about this function, you know what to do (?SpatialPointsDataFrame).

Ohhh how I want to transform this ugly base R plot into something better. That’s coming. For now, here’s a plot with base R and some data from the maptools package:

plot(volcano_pt, add=T, col='orange', pch=2)

Note the add=T argument. This lets you add multiple layers to one plot. Could be useful for the homework! :)


Most of the polygon data I work with has already been built by someone else, so I don’t often build polygons from scratch. I can, however, think of many, many instances when this would be a very useful skill to master. I found a fantastic tutorial HERE on creating SpatialPolygon objects. I’ll summarize it below.

Our ultimate objective is to have a complete SpatialPolygons object. To do this in R, we have to build each individual polygon and bundle them all together. Let’s start by building a set of simple polygons describing the outline of the main part of a house and of the roof (in this example, the coordinates are arbitrary; in the real world, you’d have actual coordinates). The rbind() function pulls together each set of coordinates (i.e. c(1,1)) and binds them together in separate rows to create a mini-matrix of coordinates. The hole=T option in the house2.door variable is a cool option that tells R to view that polygon as detracting from the area of the larger polygon it is a part of, i.e. when you compute the surface area of House 2, don’t include the door.

house1.building <- Polygon(rbind(c(1, 1), c(2, 1), c(2, 0), c(1, 0)))

house1.roof <- Polygon(rbind(c(1, 1), c(1.5, 2), c(2, 1)))

house2.building <- Polygon(rbind(c(3, 1), c(4, 1), c(4, 0), c(3, 0)))

house2.roof <- Polygon(rbind(c(3, 1), c(3.5, 2), c(4, 1)))

house2.door <- Polygon(rbind(c(3.25, 0.75), c(3.75, 0.75), c(3.75, 0), c(3.25, 
    0)), hole=T)

Now, let’s bind the individual Polygon objects into a Polygons object for House 1 and House 2.

h1 <- Polygons(list(house1.building, house1.roof), "H1")
h2 <- Polygons(list(house2.building, house2.roof, house2.door), "H2")

And finally, let’s bundle everything into a SpatialPolygon and plot our beautiful houses:

# create spatial polygons object from lists A SpatialPolygons is like a
# shapefile or layer.
houses <- SpatialPolygons(list(h1, h2))

SpatialPolygons are cool and all, but SpatialPolygonsDataFrames are much cooler. How do we add data to our new SpatialPolygon? First let’s make some attributes:

attr <- data.frame(attr1=c(0,100), attr2 = c(7, 24), row.names = c("H1", "H2"))
colnames(attr) <- c("Accessibility", "Party duration")
##    Accessibility Party duration
## H1             0              7
## H2           100             24

Now to make our SPDF. We’ll learn more about spplot(), the plotting tool in the sp package in a later lesson:

houses.df <- SpatialPolygonsDataFrame(houses, attr)

Conclusion? If you want to have a party, make sure your house has a door.

You can apply the same basic principles to make a SpatialLines object.

Writing out spatial data

For shapefiles, I recommend using the rgdal package. Here, I’m writing out the houses.df SpatialPolygonDataFrame we built above:

writeOGR(obj = houses.df, dsn="./data/house.shp", layer="house", driver="ESRI Shapefile", overwrite=T)

The writeOGR() function requires you input the name of the object you want to save (obj), the destination for the object (dsn), the layer name you want to give the object (layer) and the driver, or data type, you want to write out (driver). I’ve written out a classic ESRI Shapefile but there are lots of other drivers you can use found here.

There’s lots more information on writing out geospatial data in R here.

Additional Resources

  1. I know, I know… you’re dying to know if this means we can use the dplyr with spatial data!? YES WE CAN! We’ll get there.

  2. If you’re interested in how spatial objects are built in R, check this out

  3. Check out this website for allllll the info you could ever want about proj.4 strings.

  4. The only person who was ever pumped about projections was an old Flemish dude named Mercator.

  5. Volcanologists rejoice! This dataset exists and I encourage you to use it in your class project.