Objectives


My assumptions


Pre-lab assignments

baseR


Set up

The main tool we’ll be working with in this tutorial is the raster package, so go ahead and install/load that. You can work with rasters using the sp package by loading the raster as a SpatialGridDataFrame, but the raster package is much better, so we’ll focus on that!

library(raster)

Raster 101

Raindrops on roses and rasters of land use… these are a few of my favorite things. Rasters are fantastic tools to store massive amounts of massively cool data. Think of a raster as a matrix (NO NOT THAT MATRIX)… or a grid of numbers. Each pixel in the raster has a number associated with it that tells us something about the data of interest (e.g. elevation, reflectance, or a land use category).

Let me pause here and state that you are some of the luckiest people to ever live because you have insanely easy access to incredible, beautiful visualizations of our precious planet… in near real time, all the time, anywhere. Take a second to ponder that. Then go download Google Earth and start playing.


Get to know your raster

The data stored in rasters can be continuous (precipitation, elevation, reflectance) or categorical (land use category). In this tutorial we’ll work with both. For a raster to be a raster, we need a few key pieces of information:

  • Resolution (in space and time!)
  • Extent
  • Values (elevation, land use category, etc.)
  • Projection information

When you’re working with raster data, especially satellite imagery, think carefully about trade-offs between resolution in space and time. Some datasets are available daily at a low spatial resolution; others are available monthly at a high spatial resolution; and others are high spatial and temporal resolution, but massively large and hard for you to work with on your laptop.


The raster package

The raster package includes several raster classes. The three you’ll use most are:

  • RasterLayer: a single layer (band) of raster data
  • RasterStack: multiple layers that can come from multiple files, as long as they share the same extent and resolution
  • RasterBrick: multiple layers linked to a single (multi-layer) file

You’ve already learned how to make a raster from scratch, so we won’t cover that here. Instead, we’ll focus on learning how to load and manipulate the most commonly used types of raster data. There are tons of raster extensions out there: .grd, .asc, .sdat, .rst, .nc, .tif, .envi, .bil, .img, .hdf. I’ve encountered all of these at some point… and the raster package can handle just about all of them. You can find the full list in the raster package documentation. And if you still feel confused about the different ways of storing raster data in the raster package, check this out. And finally, if you’re one of the unlucky ones whose data comes in HDF4 or HDF5 data (ahem NASA, get your stuff together!), read this. You’ll definitely want the HDFView tool.

We’ll get to bricks later in the tutorial. For now, let’s work with a single layer of my current favorite raster dataset, the CropScape dataset. This is a categorical raster that covers most of the country from 2000 to present at a 30 meter - YES T-H-I-R-T-Y meter - resolution. Since the dataset was built by the folks at the U.S. Department of Agriculture, there are more detailed categorizations for agricultural lands than for ecosystems.1

I went to the CropScape website and downloaded data for Cache Valley in 2016. To do this, click on the little USA map at the top of the page, select County, find Cache, then click Submit. The map should zoom to Cache Valley. Next, click the little folder icon with a green arrow on it. Select 2016, Submit and the data should download to your computer. You’ll have to unzip the files and move both files to a local data folder. So I did all that, and can load my raster with the easiest function ever:

cache <- raster("./data/CDL_2016_49005.tif")
plot(cache)

Oh wow that’s pretty. The value of each pixel is indicative of the land use in that pixel (e.g. 1 = CORN). You can download the metadata here. What’s all that pink stuff? That’s alfalfa, and that’s a longggg conversation for another day. The data we downloaded is a GeoTiff, which raster handles seamlessly. Our new cache object is a raster object. What’s stored in this object? We have extent and dimensions:

cache@extent
## class       : Extent 
## xmin        : -1323615 
## xmax        : -1267665 
## ymin        : 2148435 
## ymax        : 2223375
print(paste("This groovy data has", cache@ncols, "columns.", sep = " "))
## [1] "This groovy data has 1865 columns."
print(paste("This groovy data has", cache@nrows, "rows.", sep = " "))
## [1] "This groovy data has 2498 rows."

Or we can simply call:

dim(cache)
## [1] 2498 1865    1

We can also find the resolution of our data:

res(cache)
## [1] 30 30

30 meters!!!!!!!!!!!! Can you believe it!? Go, science, go!!!!

There’s also the projection information (cache@crs) and the data itself (cache@data). The nice thing about raster objects is that you don’t need to specify the @data piece for a number of useful functions to work. Need the max and min values of the raster?

maxValue(cache)
## [1] 255
minValue(cache)
## [1] 0

Let’s see a histogram of our pixel values:

hist(cache)

Lots of low values… possibly some zeros?

length(cache[cache == 0])
## [1] 1283028

Lots of zeros! What percent of pixels are zero?

length(cache[cache == 0])/(cache@ncols*cache@nrows)
## [1] 0.2754006

Ooohh 27 percent? This likely means a value of zero indicates missing data. Look at the plot of our raster. Raster datasets are stored with clear extents, so they are always rectangular. The region of interest, however, is rarely rectangular (unless you’re in one of those states with a bunch of square counties). This leads to a lot of missing data around the edges. Check for this and be wary of missing data when you compute summary stats on your raster… You can quickly check the value of this “black space” in the raster plot by indexing pixels at the corners of the raster, i.e.:

cache[1,1]
##   
## 0

This returns the value of the pixel at location row = 1, column = 1. As we expected, the value is 0. I know from looking at the CropScape metadata that 0 indicates missing data, so let’s set our 0 pixels to NA:

cache[cache == 0] <- NA
minValue(cache)
## [1] 1
maxValue(cache)
## [1] 229

Now that I’ve walked you through this example, I’ll just tell you that trim() will drop the missing data around the edge of the dataset. Sorry. But you still needed to know how to replace raster values:

cache <- trim(cache)
plot(cache)

Alternatively, the extend() function lets you add a buffer of NAs around a raster… this could be useful if you have a smaller raster and want it to have the same extent as a larger raster.
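The padding itself is done by extend() (extent() just reads or sets the bounding box). Here’s a minimal sketch with a toy raster (the extents and values are made up for illustration):

```r
library(raster)

# A tiny toy raster
small <- raster(nrows = 4, ncols = 4, xmn = 0, xmx = 4, ymn = 0, ymx = 4)
values(small) <- 1:16

# Pad out to a bigger extent; the new cells are filled with NA
big <- extend(small, extent(-2, 6, -2, 6))

dim(big)[1:2]   # now 8 x 8 instead of 4 x 4
big[1, 1]       # NA: a padded cell
```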

Another nice trick. You can use the freq() function to build a frequency table for your data. Probably only do this for categorical data, or you’ll have a gigantic frequency table:

freq(cache)
##       value   count
##  [1,]     1   41082
##  [2,]     4     691
##  [3,]    12      30
##  [4,]    21   45817
##  [5,]    23    3692
##  [6,]    24   92755
##  [7,]    28    1652
##  [8,]    33   34813
##  [9,]    36  355800
## [10,]    37   77071
## [11,]    43    1281
## [12,]    49      16
## [13,]    57       3
## [14,]    59     239
## [15,]    61   46904
## [16,]    66      56
## [17,]    67      63
## [18,]    68      10
## [19,]    70       1
## [20,]   111   22427
## [21,]   121   66079
## [22,]   122   52769
## [23,]   123   21041
## [24,]   124    7690
## [25,]   131    5098
## [26,]   141  897396
## [27,]   142  578479
## [28,]   143    7924
## [29,]   152  865163
## [30,]   176   90818
## [31,]   190   13812
## [32,]   195   43205
## [33,]   205    1850
## [34,]   229      15
## [35,]    NA 1276169

Raster reprojection

Oftentimes two raster datasets will not line up because they are stored in different projections. The projectRaster() function from the raster package lets us reproject a raster to a new coordinate reference system. Let’s look at the projection info for our cache raster:

cache@crs
## CRS arguments:
##  +proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=23 +lon_0=-96 +x_0=0
## +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs

This is stored in CRS format, which now you guys know all about. Let’s say we want to transform our raster from its current Albers Equal Area projection to a UTM projection:

crs <- '+proj=utm +zone=11 +ellps=GRS80 +datum=NAD83 +units=m +no_defs'
cache_prj <- projectRaster(cache, res = res(cache), crs = crs, method = "ngb")
cache_prj@crs

Note that this can take awhile depending on the size of your raster. Also, BE SURE to think about whether your data is categorical or continuous before you begin reprojecting. Reprojecting a raster means that the cell centroids, and even the cell sizes, may change. To assign values to the new cells, projectRaster() computes new values using one of two interpolation procedures: nearest neighbor (ngb) or bilinear interpolation (bilinear), which is the default. If you apply bilinear interpolation to a categorical raster with integer category values like 1, 22, 33, and 4, you’ll end up with pixels whose values are averages of these categories (11.5, 27.5, …). This totally destroys your categorization, so use ngb for categorical data.
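To see why this matters, here’s a tiny self-contained sketch. The rasters and category values are made up, and I use resample() rather than projectRaster() so it runs instantly, but the choice between ngb and bilinear behaves the same way:

```r
library(raster)

# A toy "categorical" raster with two classes, 1 and 44
cats <- raster(nrows = 2, ncols = 2, xmn = 0, xmx = 2, ymn = 0, ymx = 2)
values(cats) <- c(1, 44, 44, 1)

# Move it onto a finer grid both ways
target <- raster(nrows = 4, ncols = 4, xmn = 0, xmx = 2, ymn = 0, ymx = 2)
good <- resample(cats, target, method = "ngb")       # categories preserved
bad  <- resample(cats, target, method = "bilinear")  # invents in-between values

unique(good)                          # only 1 and 44 survive
any(!na.omit(unique(bad)) %in% c(1, 44))  # phantom "categories" appear
```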

Another common raster problem is resolution mis-match. What happens if you’ve got a 30 meter resolution and a 1 kilometer resolution raster that you want to have talk to each other through raster math? So my bigtimeruleofthumb for mixed resolution data is that you HAVE TO go with the coarser resolution data when you resample… if you resample data to a higher resolution, you’re lying to yourself and pretending that you have better data than you actually have. You don’t have two 30 meter datasets. You have one awesome 30 meter dataset and one less-than-awesome 1 km dataset. If you want to interact the two, then you really have two 1 km datasets. Bummer, huh? But it’s honest.

Say we want to compute the most common land use category in a 1 kilometer grid. Our cache dataset is a 30 meter dataset, so roughly 33.3 of its pixels fit along the edge of a single 1 kilometer pixel.

cache_1k <- aggregate(cache, fact = 33.3333, fun = modal, na.rm=T)
res(cache_1k)
## [1] 990 990
hist(cache_1k) 

aggregate() is a cool function that lets you coarsen the resolution of your raster. The scaling factor is set with fact, which is rounded to the nearest integer, so our fact = 33.3333 becomes 33 and we move from 30 meter pixels to 990 meter pixels (approximately 1 km). Pixel values in the new raster are determined by a function of your choosing (fun). In our case, we compute the modal land use and assign that value to our new raster.

What if you don’t want a 990 meter resolution dataset? What if you need something that’s exactly 1000 meters (1 km)? resample() can help:

ideal_ras <- raster(ncol = 55, nrow = 75)
extent(ideal_ras) <- extent(cache_1k)
res(ideal_ras) <- c(1000,1000)
cache_1k_2 <- resample(cache_1k, ideal_ras, method = "ngb")
res(cache_1k_2)
## [1] 1000 1000
hist(cache_1k_2) 

resample() is even easier if you have a raster in the projection/extent you desire. Say that raster is called ideal_ras… then you simply call resample(current_ras, ideal_ras, method = "ngb"). This is also a way to do subtle shifts in rasters, like shifting centroids.

Other useful functions for moving your raster around include shift(), flip(), rotate(), and merge() — see the raster package documentation.

Raster math

If you want to find the difference between values in two rasters, it’s as simple as:

diff <- raster_1 - raster_2

Don’t forget that a raster is just a matrix, so math is a breeze! Well, kinda… raster math that’s this simple only works if your rasters have the same extent and resolution and if the calculations are fairly simple. When datasets don’t match or you’re interested in complicated functions, the overlay() function is your friend. overlay() takes two or more rasters and applies the same function to them in a way that’s a bit more efficient than raster math or for-looping. If you get to a place where you think you’ll need overlay() let me know. Otherwise, know that the syntax goes a little something like this:

out <- overlay(raster_1, raster_2, fun = mySpecialFunction)

It’s a good tool to know about, just in case. If your dataset is stacked or bricked, you’ll want to use calc() instead of overlay().
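Here’s a minimal, self-contained sketch of overlay() with toy rasters (the raster values and mySpecialFunction are made up; any vectorized pixel-wise function works):

```r
library(raster)

# Two toy rasters with matching extent and resolution
r1 <- raster(nrows = 3, ncols = 3, xmn = 0, xmx = 3, ymn = 0, ymx = 3)
r2 <- r1
values(r1) <- 1:9
values(r2) <- 9:1

# A pixel-wise function of the two rasters
mySpecialFunction <- function(a, b) sqrt(a^2 + b^2)
out <- overlay(r1, r2, fun = mySpecialFunction)

out[1, 1]   # sqrt(1^2 + 9^2) for the top-left pixel
```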

If you want to compute some statistic on the entire raster, use cellStats():

ras_mean <- cellStats(raster_1, stat = "mean", na.rm=T)

Math across multi-band rasters

calc() is a useful function to know about. Say you have a raster brick (latitude by longitude by time) with observations of landuse from 2008 to 2015 and you want to compute the most frequent category of land use through time. This is fairly simple:

calc(my_brick, fun = modal, na.rm = T)

Here calc() basically extracts a vector of observations (through time in this case) for each pixel and applies the function specified in fun to each vector.
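Here’s a toy version of that idea, with a made-up three-layer brick standing in for three years of land use (the category values are invented):

```r
library(raster)

# A toy 2x2 brick with three layers ("years")
b <- brick(nrows = 2, ncols = 2, nl = 3, xmn = 0, xmx = 2, ymn = 0, ymx = 2)
values(b) <- matrix(c(1, 1, 2, 2,   # year 1
                      1, 2, 2, 3,   # year 2
                      1, 2, 3, 3),  # year 3
                    ncol = 3)

# For each pixel, the most frequent value across the layers
most_common <- calc(b, fun = modal, na.rm = TRUE)
values(most_common)   # 1 2 2 3
```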

Another nice function is stackApply(). You can use this to apply a function to a subset of rasters in a raster brick. You just have to provide the indices (i.e. locations) of the rasters of interest, for example:

stackApply(my_brick, fun = modal, indices = c(1, 1, 1, 2, 2, 2), na.rm = T)

In this case, if our my_brick object has six bands, our indexing approach tells stackApply() to compute the mode of the first three layers and the mode of the last three layers. Useful!
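A self-contained toy version (the brick, its values, and the grouping are made up; I use mean instead of modal so the arithmetic is easy to check):

```r
library(raster)

# A made-up six-layer brick: two pixels, six "months"
b <- brick(nrows = 1, ncols = 2, nl = 6, xmn = 0, xmx = 2, ymn = 0, ymx = 1)
values(b) <- matrix(rep(1:6, each = 2), ncol = 6)   # layer i is all i's

# Mean of layers 1-3 and mean of layers 4-6, per pixel
seasonal <- stackApply(b, indices = c(1, 1, 1, 2, 2, 2), fun = mean)

nlayers(seasonal)   # 2: one output layer per unique index
values(seasonal)    # first layer all 2s, second all 5s
```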

More on raster math here and here.


Changing raster values

You can index a raster just like a matrix. If, for whatever reason, we want to change all pixels in our cache_1k raster with a value of 4 to 938, we can do that using simple base R indexing:

cache_1k[cache_1k == 4] <- 938

Say you want to do this reclassification for many categories. Then you’ll need the reclassify() function. To use it, you supply a sort of “translation” matrix that tells R which values you want to reclassify and what to turn them into. For example, if you want to reclassify values from 1 to 5 as 333, you’d do the following:

new_cache <- reclassify(cache_1k, c(1, 5, 333))
# c(1, 5, 333) as c(from, to, as)
plot(new_cache)
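If you have several ranges to translate at once, give reclassify() a matrix with one (from, to, becomes) row per rule. A toy sketch (raster values and categories made up; note that by default each interval is open on the left and closed on the right, so a value exactly equal to from is not captured unless you set include.lowest = TRUE):

```r
library(raster)

# Toy raster
r <- raster(nrows = 2, ncols = 3, xmn = 0, xmx = 3, ymn = 0, ymx = 2)
values(r) <- c(2, 3, 5, 12, 37, 61)

# One row per rule: from, to, becomes
rcl <- matrix(c( 1,  5, 333,    # (1, 5]   -> 333
                10, 40, 444,    # (10, 40] -> 444
                41, 70, 555),   # (41, 70] -> 555
              ncol = 3, byrow = TRUE)
r2 <- reclassify(r, rcl)

values(r2)   # 333 333 333 444 444 555
```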


Cropping and masking with a shapefile

Cropping rasters in R is pretty straightforward. If you know the coordinates of your new raster extent, then:

new.extent <- extent(c(-125, -100, 20, 50))
out <- crop(raster, new.extent)

You can get the new.extent values from an existing raster by calling extent(new_extent_raster). You can also pull this info out of a shapefile, but you’ll have to slightly tweak the format of the extent by forcing it into a matrix format using matrix():

extent <- matrix(st_bbox(new_extent_shapefile), nrow=2)

If you want to crop to a specific shape rather than the extent of that shape, then you need to add an additional step. Say we want to extract the part of this land use raster located in the municipality of North Logan using a shapefile of North Logan called nl (note that these two spatial objects have to be in the same projection for crop() to work).

nl_raster <- crop(cache, nl)
plot(nl_raster)

plot(nl, add=T)

Notice that this only cropped the land use raster to the extent of the North Logan shapefile, not to the actual boundaries of the shapefile. To do that, we have to add one more step:

nl_clean <- mask(nl_raster, mask = nl)
plot(nl_clean)


Masking by data values

Masking raster data is something we need to do quite a bit in the remote sensing world. We can do this using the mask() function. Let’s say we want to isolate the pixels classified as alfalfa (36) in our cache dataset:

alfalfa <- mask(cache, cache == 36, maskvalue=F)
plot(alfalfa)

Here maskvalue=F masks all pixels that are NOT classified as alfalfa. If we set this to T, we can mask all pixels classified as alfalfa:

not_alfalfa <- mask(cache, cache == 36, maskvalue=T)
plot(not_alfalfa)

If you want to mask multiple values, use %in% with a vector of values:

multi_mask <- mask(cache, cache %in% c(1, 36, 141, 142), maskvalue=T)
plot(multi_mask)


A few extra notes about multi-band rasters

Say you have a Landsat image for the Berkane Region of Morocco:

ls <- brick("./data/maroc.tif")
plot(ls)

The gross lines are because of an error in the Landsat 7 satellite, which you can read more about here. Notice that instead of loading the data with the raster() function, which is great for single-band data, here I use brick(). Remember that the main difference is that a RasterBrick can be linked to only one (though possibly multi-layer) file, in our case maroc.tif. RasterStacks can be made from multiple files (a band from one file merged with a band from another file), though the bands have to have the same extent and resolution. So if we happened to have our Landsat bands in separate .tif files, we’d want to load the data using stack(). Let’s inspect this data:

dim(ls)
## [1] 1159 1603    3
summary(ls)
##         maroc.1 maroc.2 maroc.3
## Min.        327     363     438
## 1st Qu.    2259    1205     995
## Median     2536    1632    1230
## 3rd Qu.    2809    1949    1443
## Max.       4755    3858    3390
## NA's     122118  122118  122118

Most of the functions in the raster package that we’ve used on single-band rasters also work on multi-band rasters; they just spit out a result for each band in the raster.

minValue(ls)
## [1] 296 339 411
maxValue(ls)
## [1] 4858 4089 3647

We can index the third dimension in these rasters as follows:

b1 <- ls[,,1]
# or #
b1 <- ls[[1]] # note the double brackets

The two commas just tell R to select ALL rows and all columns, then only the first band of the multi-band object. Note that ls[[1]] returns the band as a proper RasterLayer, which is usually what you want for plotting and raster math.


Saving rasters

You can write out a raster using the writeRaster() function in the raster package.

writeRaster(my_raster, "my/directory/raster.tif") # you can use whatever extension you like - I prefer .tif
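A quick sketch of the round trip with a toy raster (the tempfile path is just for illustration); overwrite = TRUE lets you clobber an existing file, which you’ll want when re-running scripts:

```r
library(raster)

# A toy raster to write out and read back
r <- raster(nrows = 2, ncols = 2, xmn = 0, xmx = 2, ymn = 0, ymx = 2)
values(r) <- 1:4

f <- tempfile(fileext = ".tif")
writeRaster(r, f, overwrite = TRUE)   # overwrite an existing file if present

r_back <- raster(f)
all.equal(as.numeric(values(r)), as.numeric(values(r_back)))   # TRUE
```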

Finding rasters

USGS data repositories allow you to search for data by time, year, satellite, resolution, etc.:

You can also download data directly from NASA (easiest when downloading large time-series). The University of Maryland maintains a database of pre-processed images for the remote sensing of land surface processes. You can find lots of DEMs here (ASTER and SRTM are the most widely used). You can find cool, free high-resolution topographic and bathymetric data here. This website groups data sources by application (e.g. oceanography, land surface processes, forecasting) and can be helpful to explore.


Additional resources


  1. The way we categorize data in rasters can become EXTREMELY important as we move towards spatial analysis of raster data. More to come later in the class!