Objectives

Understand what constitutes a point process
Detect and create point-pattern density surfaces
Identify how points are clustered in space (uniform, random, clustered).
Visualize market point processes.

Pre-lab assignment

Read Chapter 8: Point Pattern Analysis of the R Spatial online textbook. Our lab will closely follow this chapter, with some fun bonuses weaved in from other sources, the most notable of which is:
Analysing spatial point patterns in R by Adrian Baddeley. This is an immensely helpful overview of point pattern theory and its implementation in R. If you plan on applying point pattern analysis to your data, I strongly recommend you download and read this text, particularly pages 6 - 17 and 178 - 180.

My assumptions

In this tutorial, I’ll be mostly using base R to plot the results of our analyses. You can generate similar plots using ggplot2, but since the focus of this lab is understanding point pattern analysis and the spatstat package, we will only use the simple plot() function.
I assume that you know what I mean when I talk about a variable’s “distribution”, you understand what a goodness-of-fit test is (i.e. \(\chi^2\)), and you are familiar with cumulative distribution functions. No? Then watch this for a review of CDF and this for a review of \(\chi^2\).

Set up

library(sp)
library(tidyverse)
library(spatstat)
library(raster)

NOTE: Point pattern analysis depends on DISTANCES so be absolutely sure that your data is in a PROJECTED COORDINATE SYSTEM.

Our data

Today, I’ll work with a few data sets. Many some standard with the spatstat package (lansing, longleaf, etc.). One set of data you will have to download from Chapter 8 of the R Spatial online textbook. At the top of Chapter 8, you will see hyperlinked crime, city, and census .rds files which you should download and store locally on your computer before beginning the tutorial. These .rds files are shapefiles containing census data, a city boundary, and incidences of crime.

city <- readRDS("./data/city.rds")
census <- readRDS("./data/census2000.rds")
crime <- readRDS("./data/crime.rds")

plot(city)
plot(crime, add=T, col = "red" , cex = 0.5, pch = "*")

The city SpatialPolygonsDataFrame delineates the boundary of a city. The census data set contains lots of data on the demographics of census blocks within the city. Crime contains information about crimes that occurred in the city as well as their location. The types of crimes and number of observations within each crime category are listed below:

crime@data %>% group_by(CATEGORY) %>% summarize(n = n()) %>% arrange(desc(n))

## # A tibble: 15 x 2
##                CATEGORY     n
##                   <chr> <int>
##  1          Petty Theft   665
##  2            Vandalism   355
##  3      Drunk in Public   232
##  4     Vehicle Burglary   221
##  5 Residential Burglary   219
##  6                  DUI   212
##  7             Assaults   172
##  8  Commercial Burglary   143
##  9          Grand Theft   143
## 10   Drugs or Narcotics   134
## 11           Auto Theft    86
## 12              Robbery    49
## 13              Weapons    15
## 14                Arson     9
## 15          Hate Crimes     6

What is a point process?

Point data gives the location of objects (trees) or events (crime) within an area. Often, in addition to coordinates, points have other data attached to them called marks. Marks are essentially attributes of the point (i.e. species of tree, type of crime). Marks could be categorical (type of crime) or continuous (diameter of tree). Marks are different from covariates. Marks are attributes associated with the point. Covariates are any kind of data that can be linked to the points as an explanatory variable (climate data, elevation, socioeconomic data, land use, etc.). We can answer a number of important questions with point-pattern data, including:

What is the intensity of a process (e.g. crime hot spots)?
How are the points distributed in space (e.g. uniform, random, clustered, see below)?
How does the intensity of points covary with other processes (e.g. species density and elevation, cancer and pollution exposure)?
How do points with different marks segregate across space (e.g. how do the distributions of species age and size vary differently across space?)

We can think of a point process as a black box that generates a random observed spatial point pattern according to some rules. Point processes are assumed to be (1) random (locations and numbers of points are random) and (2) do not have a serial order (unless there are marks attached). Point processes are located within a sampling window which is important to identify and define.

Introduction to `spatstat`

In this tutorial, we’ll be working with the spatstat package, so make sure you’ve installed this on your machine. spatstat stores data a bit differently than the sp package, so I’ll begin by explaining the peculiarities of this package. spatstat classifies objects in the following way:

ppp: planar point pattern
owin: spatial region or observation window
im: pixel image
psp: a pattern of line segments (we won’t cover this in our class, but if you are working with polylines and are interested, check out this text)
tess: tesselations, tiling using shapes, think shapefile

There are a series of methods associated with each class. To get this list, type:

methods(class = "ppp") # replace ppp with class name of interest

##   [1] [                [<-              affine           anyDuplicated   
##   [5] as.data.frame    as.im            as.layered       as.owin         
##   [9] as.ppp           auc              berman.test      boundingbox     
##  [13] boundingcentre   boundingcircle   boundingradius   by              
##  [17] cdf.test         circumradius     closepairs       closing         
##  [21] coerce           connected        coords           coords<-        
##  [25] crossdist        crosspairs       cut              density         
##  [29] dilation         distfun          distmap          domain          
##  [33] duplicated       edit             envelope         erosion         
##  [37] fardist          flipxy           Frame<-          has.close       
##  [41] head             identify         initialize       intensity       
##  [45] iplot            is.connected     is.empty         is.marked       
##  [49] is.multitype     kppm             markformat       marks           
##  [53] marks<-          multiplicity     nnclean          nncross         
##  [57] nndensity        nndist           nnfun            nnwhich         
##  [61] nobjects         npoints          opening          pairdist        
##  [65] pcf              periodify        pixellate        plot            
##  [69] ppm              print            quadrat.test     quadratcount    
##  [73] quantess         rebound          relevel          relrisk         
##  [77] rescale          rhohat           roc              rotate          
##  [81] round            rounding         rshift           scalardilate    
##  [85] scanmeasure      segregation.test sharpen          shift           
##  [89] show             slotsFromS3      Smooth           Smoothfun       
##  [93] split            split<-          subset           summary         
##  [97] superimpose      tail             unique           unitname        
## [101] unitname<-       unmark           unstack          Window          
## [105] Window<-        
## see '?methods' for accessing help and source code

It’s fairly easy to transform data between SpatialPoints sp objects and the ppp objects used in spatstat. Let’s start by loading data from the spatstat package:

data("swedishpines")
x <- swedishpines
plot(x)

This sample data sets plots the occurrence of Swedish Pines within a small window.

class(x)

## [1] "ppp"

summary(x)

## Planar point pattern:  71 points
## Average intensity 0.007395833 points per square unit (one unit = 0.1 
## metres)
## 
## Coordinates are integers
## i.e. rounded to the nearest unit (one unit = 0.1 metres)
## 
## Window: rectangle = [0, 96] x [0, 100] units
## Window area = 9600 square units
## Unit of length: 0.1 metres

Notice that the x variable we created is of class ppp. We see that it includes 71 points and that the average intensity (points/unit area) of these points is .0074 points per square unit or in this case 0.74 trees per square meter.

Unless you acquire magical data, your data will not come to you in ppp format. Working things into ppp (then later sp for advanced plotting) will help you easily apply the complex analyses and functions listed below to your data, so we’re going to work through an example of loading a ’.csv of coordinates and attributes into R, transforming this to a ppp object, then to an sp SpatialPoints object. Let’s look at the location of dumps in the western US:

lf <- read.table("./data/landfill.txt", sep=",", stringsAsFactors = F, skip = 1)
colnames(lf) <- c("id", "lat", "lon")

# 1. store x and y coords in two vectors
lon <- lf$lon
lat <- lf$lat

# 2. create two vectors xrange and yrange with dimensions of triangle that contain all points
xrange <- range(lon, na.rm=T)
yrange <- range(lat, na.rm=T)

# 3. create ppp
lf <- ppp(lon, lat, xrange, yrange)

## Warning: data contain duplicated points

plot(lf)

We can add marks to the ppp object by specifying them when we make the object. Let’s say we have an additional attribute that measures the size of the landfill in square kilometers, stored in a vector called size:

lf <- ppp(lon, lat, xrange, yrange, marks = size)
plot(lf, which.marks = "size")

It’s a good idea to check your data, as usual, for missing data or duplicated data. You can do this by looking for repetitions in a two-column data.frame of your coordinates:

coords <- cbind.data.frame(lon, lat)
any(duplicated(coords))

## [1] TRUE

To remove duplicates from a ppp object, simply do:

lf <- unique(lf)

But just like missing data, it’s important to understand why there are duplicates, so be sure to interrogate your data.

To convert this ppp object to a SpatialPoints object, we can use the following from the maptools package:

lfsp <- as(lf, "SpatialPoints")
class(lfsp)

## [1] "SpatialPoints"
## attr(,"package")
## [1] "sp"

plot(lfsp)

Useful `spatstat` functions

Each ppp object contains the following, indexed by a dollar sign:

n, the number of points
x and y, the x and y coordinates of each point
marks, any attributes of the points
window, which describes the observational window for the data.

There are a number of functions that work well with the spatstat package that allow you to extract useful info from your ppp object, x, such as:

npoints(x)
marks(x)
coords(x)
as.owin(x)
as.data.frame(x)

For the FULL list of things you can do with these objects, see the Baddeley text, particularly p. 62.

Distributions of points

It’s common for researchers to be interested in how points are distributed on a two-dimensional surface. Points may represent sample locations, species events, or instances of disease. For example, as a researcher collecting samples or establishing control points, you may be concerned about the uniformity of these points across space. The patterns of points on maps can be classified as:

Random: any sub-area is as likely to contain a point as any other subarea of the same size, regardless of the subareas location; the placement of one point has no influence on the placement of another point.
Uniform: the density of points in any subarea is equal to the density of points in all other subareas of the same size and shape.
Clustered/clumped: the probability of occurrence of a point varies in some inverse manner with distances to preexisting points.

Image from this source

We can informally inspect the distribution of our point process by plotting the point pattern and observing whether points appear to be random, uniform, or clustered. The quickest way to visually inspect our data is to plot() it, as we did above. Another useful visualization tool is to divide our data into quadrants and count the number of points within each quadrant, like so:

Let’s learn a bit more about how to visually inspect the spatial patterns in our point data by reproducing the plot to compute and visualize a point processes’ intensity.

Intensity

We can think of the intensity of a point process as similar to the sample mean. The intensity is the average density of points (i.e. number of points per unit area). Intensity can be constant (uniform or homogenous) or may vary across space. We can compute the intensity manually by summing the the points in a shapefile over the total area of the shapefile:

dens <- nrow(crime)/area(city) #area from rgeos package
print(paste("The density is:", dens))

## [1] "The density is: 9.61537319061061e-06"

We can visualize how the intensity of crime varies across space by creating a grid (often called quadrats) and counting the number of crimes in each grid cell:

# create raster with the extent of the city
r <- raster(city)

# set raster resolution to 1000 feet
res(r) <- 1000

# the r raster fills the entire bounding box of the city shapefile.  To clip only to the city boundaries, we can do the following:
r <- rasterize(city, r)

# add quadrants by turning raster pixels into polygons
quads <- as(r, 'SpatialPolygons')

# add points dataset
plot(quads)
plot(crime, add=T, col="red", cex=0.2)

We can count the number of crimes in each quadrant and turn this count data into a raster using rasterize():

nc <- rasterize(coordinates(crime), r, fun = 'count', background = 0)
plot(nc)
plot(city, add=T)

The areas in green and yellow are areas with a higher intensity of crime. To mask out areas of the nc raster outside of the city, use the mask() function and the r raster we made above with the extent of the city shapefile. If we fail to mask areas outside of the main region, we will have lots of zeros in our crime frequency data, which will skew our results:

ncrimes <- mask(nc, r)
plot(ncrimes)
plot(city, add=T)

Let’s look at the distribution of crime frequencies in the city. Notice how I index the data in the raster object. You could also use getValues():

hist(ncrimes@data@values, 100, main = "Distribution of crimes in quadrats")

This makes sense given our visualizations of the data. Most quadrats have little crime and a few quadrats have exceptionally high rates of crime. Let’s compute the average number of crimes per quadrat:

mean(ncrimes@data@values, na.rm=T)

## [1] 9.261484

Kernel density

These visualizations and descriptions are cool, but what if we want to estimate a continuous crime desity surface for the city? We can use the spatstat package to do this, but first, we need to convert our SpatialPointsDataFrame crime to a ppp object.

lon <- crime@coords[,1]
lat <- crime@coords[,2]

# 2. create two vectors xrange and yrange with dimensions of triangle that contain all points
xrange <- range(lon, na.rm=T)
yrange <- range(lat, na.rm=T)

# 3. create ppp
crime_ppp <- ppp(lon, lat, xrange, yrange, marks = as.factor(crime$CATEGORY)) #as factor piece important for splitting to plot later

## Warning: data contain duplicated points

plot(crime_ppp, main = "Our crime ppp object")

The plot is a mess, but hooray! We’ve successfully created a marked ppp object. Now let’s create quadrats and plot intensity using spatstat:

q <- quadratcount(crime_ppp, nx = 7, ny = 3)
plot(crime_ppp, cex = 0.5, pch = "+")
plot(q, add=T, cex=2, main = "Groovy quadrat plot")

Quadrats are cool and all, but kernel density is way cooler. The density function creates a kernel density surface which is stored in R as spatstat class im or image. This is how spatstat handles rasters.

ds <- density(crime_ppp)
class(ds)

## [1] "im"

Now let’s plot our density surface:

plot(ds, main = "Crime density")

Wait, what’s happening here? Kernel density estimation is an approach to estimate the probability density function of a random variable. Think of KDE this way: for each event (crime), there’s a mini-function that spreads the effect of this crime across space. A KDE essentially sums all of these mini-functions to create a surface indicative of the density of events across space. KDE estimation here uses an isotropic Gaussian kernel¹ To a certain extent, you can think of a KDE as a 3D, smoothed histogram, however the actual function used in KDE is more complex than binning for a histogram. One of the most important considerations when using KDE is the selection of the functions bandwidth. The bandwidth of a KDE is essentially a parameter in the KDE function that indicates how smooth you want the data to be. Too smooth (high bandwidth) can mask local variations in intensity that are important, and too jagged (low bandwidth) can highlight local anomalies that don’t really reflect the overall spatial pattern. Put another way, a large bandwidth results in a very generalized surface of the event density while a small bandwidth focuses on smaller scale, event-level variation. Kernels are typically smoothed such that the standard deviation of the smoothing kernel is used as the bandwidth (see ?density). You can specify the bandwidth easily in the density() function. Look at ?bandwidth for different options currently available.

Visualization is important, but does not stand up to the rigors of science. It’s important to know how to implement statistical tests to determine whether your data is randomly distributed through space.

Tests of CSR

In the tests that follow, we can think of our “null” hypothesis for the spatial distribution of our points as one of complete spatial randomness (CSR). By comparing our data to this null of randomness, we can determine whether our point process is significantly clustered in space. A CSR process has three basic properties:

The number of points in any region \(A\) has a Poisson distribution with mean \(\lambda * area(A)\).
Given that there are \(n\) points in any region \(A\), the locations of these points is independently and identically distributed (i.i.d., i.e. the location of points is randomly distributed and mutually independent) inside \(A\)
The contents of two disjoint regions \(A\) and \(B\) are independent.

Said in plain Engilsh: under the assumption of CSR, points are independent of each other and have the same likelihood of being found at any location. These assumptions are almost never met; it would be pretty boring to be a researcher if they were! Instead, the location of points is driven by point attributes (marks), other covariates, and random noise.

So I just dropped some stats jargon withour explanation. Shame on me. You’re probably wondering what this Poisson thing is about. No, francophiles, I am not talking about fish. I’m talking about a discrete probability distribution named after Monsieur Simeon Denis Poisson in the early 1800s. Poisson was pretty prolific, as many scholars were back in the day. Today, he is best known for developing a distribution that captures the probability of a number of events occurring within a fixed interval in time and space. Poisson distributions are often used for rare event data (meteors, extreme weather, crime) and for the occurrence of events in space, i.e. our point process data sets. Curious to see what a homogenous Poisson point process (i.e. CSR process) looks like? Me too. With the rpoispp() function in spatstat, I can generate a number of points following CRS:

plot(rpoispp(100), main = "CSR 1")
plot(rpoispp(100), main = "CSR 2")

Each time you run the rpoispp() function, it generates a new CSR process. Notice that it’s hard to detect any real patterns in the point process; no clear clustering, no uniformity - i.e. CSR!

We care less about CSR on its own than about comparing our data to CSR models to say, with statistical confidence, whether or not our data is CSR. Non-randomness, of course, justifies fun exploration of covariates, marks, etc. Yeah, believe it or not, this stuff can be fun. The classic way we compare our data to a CSR process is to compute a \(\chi^2\) test based on quadrat counts. If you forgot how a \(\chi^2\) test works, go peruse your intro stats textbook or watch this. We can compare our data to the null of CSR (i.e. fairly uniform quadrat count) using quadrat.test():

quadrat.test(crime_ppp, nx = 10, ny = 15)

## 
##  Chi-squared test of CSR using quadrat counts
##  Pearson X2 statistic
## 
## data:  crime_ppp
## X2 = 9267.5, df = 149, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 
## Quadrats: 10 by 15 grid of tiles

Hey, look, we have a reallllyyy low p-value. This means we can reject the null hypothesis of CSR, i.e. our crime data has some funky spatial stuff going on! HOORAY! Well, kind of. This is a pretty generic test. It basically just tells us that our data is not a homogeneous Poisson process, but not much else. Also, the power of the quadrat test drops to zero as your quadrats become excessively small or large… so choose wisely and interpret with caution.

A better way to test for CSR is the Kolmogorov-Smirnov test (no vodka connoisseurs, not that Smirnov). You may have already encountered this test as it is pretty widely used to assess goodness-of-fit and/or how well sample data from some unknown population fits some hypothetical model of a specific population. All of the tests you’ll see in the next sections use this basic approach: compare our empirical (sampled) data to a theoretical (modeled) ideal and see if the two are significantly different. In our case, the theoretical model is a CSR processes. We can plot both the empirical and theoretical models together in cumulative form, scaled so their cumulative sums are zero (i.e. express both as cumulative probability distributions). We then look for the greatest difference between the two. The maximum distance is the Kolmogorov-Smirnov statistic, i.e. \(D = max|CDF - EDF|\) where \(CDF\) is the theoretical cumulative distribution function and \(EDF\) is the empirical cumulative distribution function.

With point data, we specify a real function \(T(x,y)\) defined at all locations \(x,y\) in our sampling window. We evaluate this function at each of the data points and compare the empirical values of \(T\) with the predicted distribution of \(T\) under the CSR assumptions. spatstat makes this painless:

ks <- cdf.test(crime_ppp, "x")
plot(ks)

pval <- ks$p.value
pval

## [1] 0

Here, I’m just evaluating my sample data using the x coordinate. You can also input a density function (see below) as a covariate, to estimate overall spatial randomness:

ds <- density(crime_ppp)
k <- cdf.test(crime_ppp, ds)
plot(k)

In all cases, our low (zero) p-values suggest that our crime data data was not drawn from a population distributed according to the assumptions of CSR. We can also see this in our plots - the observed and expected (if our data were CSR) cumulative distributions are pretty different.

Tests of spatial dependence

Testing for the complete spatial randomness of our data will tell us whether or not our point process is random. It won’t, however, tell us how our data is non-random, i.e. are points closer together than a random process or more structured than a random process. It’s useful to be able to test for whether our non-random data is regular (i.e. points tend to avoid each other) or clustered (points like to hang together). The main statistical techniques for answering this question compare theoretical and empirical models of distance between points in a point pattern process. We can think of distance between points in several ways, each of which has its own associated statistic of spatial dependence. Distance can be thought of in terms of:

Nearest neighbor distances (G function), or the distance of each point to its nearest neighbor,
Empty space distances (F function), or the distance from a fixed reference location to the nearest data point,
Pairwise distance (K function), or the distance between all pairs of points.

These are easy to compute in spatstat with the pairdist(), nndist(), and distmap() functions, respectively.

G function: Distance to the nearest event

The \(G\) function measures the distribution of distances from an arbitrary event to its nearest event (i.e. uses nearest neighbor distances). By plotting our empirically estimated \(G\) function against our theoretical expectation if the process were \(CSR\), we can assess the extent to which the empirical distribution of our data is different from the theoretical \(CSR\) expectation.

gtest <- Gest(crime_ppp)
gtest

## Function value object (class 'fv')
## for the function r -> G(r)
## .....................................................................
##         Math.label      Description                                  
## r       r               distance argument r                          
## theo    G[pois](r)      theoretical Poisson G(r)                     
## han     hat(G)[han](r)  Hanisch estimate of G(r)                     
## rs      hat(G)[bord](r) border corrected estimate of G(r)            
## km      hat(G)[km](r)   Kaplan-Meier estimate of G(r)                
## hazard  hat(h)[km](r)   Kaplan-Meier estimate of hazard function h(r)
## theohaz h[pois](r)      theoretical Poisson hazard function h(r)     
## .....................................................................
## Default plot formula:  .~r
## where "." stands for 'km', 'rs', 'han', 'theo'
## Recommended range of argument r: [0, 243.1]
## Available range of argument r: [0, 797.87]

plot(gtest)

So what’s happening here? Gest estimates our theoretical CDF if our process were a homogeneous Poisson point process (blue line and theo in the print out). We can compare this with our actual, empirical CDF, here estimated according to three different approaches to account for edge effects (km, bord, han). Accounting for edge effects is crucial, as points at the edge of the sampling window may be close to points just outside of the window, but are counted as far from other points because of the window boundary. Detailed documentation on how these edge assumptions differ can be found by typing ?Gest. As always, test sensitivity to different assumptions!

For all values of r (distances between points) our empirical function is greater than the theoretical one. This suggests that nearest neighbor distances in the point pattern are shorter than for a Poisson process, suggesting a clustered pattern. If our empirical CDF had been lower than the theoretical one, this would suggest a more regular pattern.

F function: Distance from a point to the nearest event

The \(F\) test measures the distribution of all distances from an arbitrary point to its nearest event (i.e. uses empty space distances). The function is also called the empty space function because it is a measure of the average space left between events.

ftest <- Fest(crime_ppp)
ftest

## Function value object (class 'fv')
## for the function r -> F(r)
## .....................................................................
##         Math.label      Description                                  
## r       r               distance argument r                          
## theo    F[pois](r)      theoretical Poisson F(r)                     
## cs      hat(F)[cs](r)   Chiu-Stoyan estimate of F(r)                 
## rs      hat(F)[bord](r) border corrected estimate of F(r)            
## km      hat(F)[km](r)   Kaplan-Meier estimate of F(r)                
## hazard  hat(h)[km](r)   Kaplan-Meier estimate of hazard function h(r)
## theohaz h[pois](r)      theoretical Poisson hazard h(r)              
## .....................................................................
## Default plot formula:  .~r
## where "." stands for 'km', 'rs', 'cs', 'theo'
## Recommended range of argument r: [0, 814.78]
## Available range of argument r: [0, 814.78]

plot(ftest)

The Fest function also computes several versions of the \(F\) function using different edge effect estimators (cs, bord, km - review package documentation for more info). Inversely to the \(G\) function, empirical values above our theoretical values (blue) indicate that empty space distances in our empirical point pattern are shorter than for a Poisson process, suggesting a regularly spaced pattern. The opposite would indicate a clustered pattern. In plain English? Our data appears to be clustered.

K function: Points witin a certain distance of a point

The \(K\) function measures the number of events found up to a given distance of any particular event (i.e. uses pairwise distances). By comparing estimated values of the \(K\) function to the theoretical ones, we can determine what kind of interaction exists between variables. Think of this function as the expected number of points of the process within in a given distance of a typical point in the process. For a homogeneous Poisson process, there is a constant number of expected points falling within some distance. Again, we compare our estimated empirical \(K\) function with the Poisson \(K\) functions. Empirical values less than our theoretical ones suggest a regular pattern. Empirical values greater than theoretical values suggest clustering.

ktest <- Kest(crime_ppp)
ktest

## Function value object (class 'fv')
## for the function r -> K(r)
## ..............................................................
##        Math.label       Description                           
## r      r                distance argument r                   
## theo   K[pois](r)       theoretical Poisson K(r)              
## border hat(K)[bord](r)  border-corrected estimate of K(r)     
## trans  hat(K)[trans](r) translation-corrected estimate of K(r)
## iso    hat(K)[iso](r)   isotropic-corrected estimate of K(r)  
## ..............................................................
## Default plot formula:  .~r
## where "." stands for 'iso', 'trans', 'border', 'theo'
## Recommended range of argument r: [0, 3476.4]
## Available range of argument r: [0, 3476.4]

plot(ktest)

This test confirms what we saw in our \(G\) and \(F\) test results: our data is clustered.

Marked data

If you are working with point data with multiple attributes, you’re working with marked data. Our crime data, for example, is marked with a number of attributes including type of crime. Below, we’ll be working with a dataset that contains the location of trees as well as tree species, diameter and height. There are a number of things you can do with marked data which I’ll review here. Note that for data to be treated as a marked point process, the points should still be random, i.e. looking at temperature variations recirded at coordinates for weather stations does not count since the point observations are at a fixed set of locations (if this is confusing, refer to the pre-assignment readings). What about core sampling locations? Nope. Normally these are selected by a scientist, i.e. not random. Crime? Ostensibly random. Tree location in forest? As long as they weren’t intentionally planted in specific locations, we can assume these are random. Even with seemingly random data, there are different ways to think about the randomness. From p. 179 in the Baddeley text, these include:

Random labeling - given the locations \(X\), marks are conditionally independent and identically distributed (i.i.d)
Independence of components - the sub-process \(X_m\) for each mark \(m\) are independent point processes
CSR and independence - the locations are a uniform Poisson point process and the marks are i.i.d.

These null hypotheses are each different. It’s important to think through this before you start playing with marked point data.

Categorical marked data

An important note before we start. If your marks are categorical, make sure they are stored as factors in your ppp object.

Ok let’s GO. We’ll start by working with the lansing data set stored in the spatstat package that includes categorical information on tree locations and species.

data(lansing)
summary(lansing)

## Marked planar point pattern:  2251 points
## Average intensity 2251 points per square unit (one unit = 924 feet)
## 
## *Pattern contains duplicated points*
## 
## Coordinates are given to 3 decimal places
## i.e. rounded to the nearest multiple of 0.001 units (one unit = 924 feet)
## 
## Multitype:
##          frequency proportion intensity
## blackoak       135 0.05997335       135
## hickory        703 0.31230560       703
## maple          514 0.22834300       514
## misc           105 0.04664594       105
## redoak         346 0.15370950       346
## whiteoak       448 0.19902270       448
## 
## Window: rectangle = [0, 1] x [0, 1] units
## Window area = 1 square unit
## Unit of length: 924 feet

This data includes 2251 trees of six different species.

plot(lansing)

We can easily plot the point process for each species (stored as a factor) using the split() function:

spp <- split(lansing)
plot(spp[1:4], main = "Tree location by type")

To extract one of the patterns:

hick <- spp$hickory
plot(hick)

We can plot the densities of each type of tree by calling:

plot(density(spp[1:4]), main = "Densities of tree type")

Cool. We can also extract information about the marks of points surrounding a point. R indicates the neighborhood radius to use:

m <- marktable(lansing, R = 0.1)
m[1:10,]

##      mark
## point blackoak hickory maple misc redoak whiteoak
##    1         1      17    17    8     20       17
##    2         1       9    18    5     14       28
##    3         1      10    19    6     17       23
##    4         1      10     0    0      8       10
##    5         2      18     0    0      4       15
##    6         6      32     1    0      9       19
##    7         9      38     1    0     10        8
##    8         7      34     1    0     10       16
##    9         3      32     0    0     10       15
##    10        4      42     0    0     11       15

Continuous marked data

Now let’s move to a data set that contains continuous marked data. The longleaf data set contains the locations and sizes (diameter) of Longleaf Pines. markstat can be used to compute the average value of a mark within some distance of each point in the data set. In this case, what’s the count of each species within a certain distance of each point:

data(longleaf)
md <- markstat(longleaf, mean, N = 5)
md[1:10]

##  [1] 43.40 43.40 48.58 21.70 48.38 53.32 40.28 29.82 24.92 21.70

We can also visually inspect the distribution of our continuous marks:

hist(marks(longleaf))

Comparing two point patterns

To study the relationship between two or more density plots, consider the pairs() function:

pairs(density(split(lansing)[c(2,3,5)]))

This command divides the lansing data set into each tree species. It computed the smoothed kernel intensity estimate for each species and displayed scatterplots of the each species pair. The plots look weird because we’re looking at how two density surfaces interact. Think of this as a 3D covariance matrix. The shapes that don’t look like stingrays (hickory and maple interactions) are indicative of species separation, i.e. hickory and maple are rarely found in the same space since a high density of hickories is associated with a low density of maples).

We can also compare the distance between two point pattern objects \(X\) and \(Y\). crossdist(X,Y) computes a matrix of distances from each point in \(X\) to each point in \(Y\) and ncross(X,Y) finds for each point in \(X\) the nearest point of \(Y\) and the distance to this point.

Point patterns and other covariates

It’s often quite helpful to determine whether the intensity of our point pattern process is driven by other covariates. Could our crime intensity be driven by socioeconomic variables? Could the location of tree species be driven by topography? In both cases, the covarying data is not stored in the ppp object, but has to be brought in (as a raster or shapefile) from another data source. We can start answering very important and interesting questions once we bring in this additional data. Let’s start with a pre-made example from the spatstat package. Let’s say we have data on tropical tree location and slope:

data(bei) #in spatstat package
slope <- bei.extra$grad
b <- quantile(slope, probs = (0:4)/4)
slope_break <- cut(slope, breaks = b, labels = 1:4)
v <- tess(image = slope_break)
plot(v)
plot(bei, add=T, pch = "+")

So what are we doing here? After we load our data, we create an object b that stores sample quantiles corresponding to the given probabilities (probs), in our case, 0%, 25%, 50%, 75%, and 100%, of slope having a certain value. We then use cut() from the raster package to reclassify the raster slope object with values 1 to 4 based on which quantile of data they belong to. The tess() or tessellation function creates a collection of disjoint spatial regions, in this case, each region is one of the four categories in the raster image we have created. We can then count the number of points in each tessellation. If the point pattern process was uniform, we’d expect a near-equal number of points in each slope category (1, 2, 3, 4). We can use the quadratcount() function to count the number of points in each category of slope:

qb <- quadratcount(bei, tess=v)
qb

## tile
##    1    2    3    4 
##  271  984 1028 1321

Looks like there are big differences in tree locations. It appears that trees in this dataset prefer steeper slopes (3 and 4).

This is one way to get a rough understanding of how a covariates relates to a point process. I also like estimating, across a gradient of covariate values, the prevalence of points. Here’s some quick math to explain this. Imagine the intensity of a point process is, in fact, a function of a covariate (i.e. tree density is a function of our slope). Cool! How do we test this empirically? One quick way is to compute a function that tells us how the intensity of points depends on the value of the covariates. At any spatial location \(u\), let \(\lambda(u)\) describe the intensity of the point process and \(Z(u)\) be the value of the covariate (slope, in our case). Then what we are interested in is this relationship:

\[ \lambda(u) = \rho(Z(u)) \] where \(\rho\) is the function of interest. The spatstat function estimates the intensity as a function of a covariate using the rhohat() function:

plot(rhohat(bei, slope))

In ecology, this is often known as a resource selection function if the points are organisms and the covariates are a descriptor of the environment or habitat. If you decide to use this method, then be sure to read about the non-parametric smoothing estimator used to generate this function. The smoother we specify changes the function - as always, it’s crucial you understand what assumptions underlie your analyses and visualizations and that you ensure the approach is suitable for your data set.

Here’s another cool thing we can do. Remember our friend Kolmogorov-Smirnov? We can use this test to determine whether the point-pattern intensity of interest to us depends on a specific covariate. We’ve collected evidence above that suggests that tropical tree location depends on slope. We can formally test this by dividing applying the K-S test using slope as our covariate (instead of coordinates or density as we used above):

ks <- cdf.test(bei, slope)
plot(ks)

NOTE, if you’re working with continuous variables (slope), then KS is your guy. If you’ve got categorial variables, you’ll want to consider the classic \(\chi^2\) test. Again, this test shows that the relationship between slope and tree locations is not random, i.e. there is some significant relationship between the two. If you are interested in exploring how your point data relates to spatial covariates in more detail, I recommend you read the section on Maximum likelihood for Poisson processes in the CS text, starting on pg. 95. Think of this as regression for point processes… and remember, as is always the case, that to do this sort of analysis you have to make fairly strong assumptions about your data.

Additional resources

I’ve frequently referenced the Baddeley text as a fantastic overview of point processes and their analysis in R. I relied heavily on this text for the creation of this tutorial, as well as your required pre-assignment reading in the RSpatial textbook.
Another useful tool which we won’t have time to cover in this class is hotspot mapping. There’s a great package to do this in R. More information can be found here.
If you are interested in modeling the intensity of your point data set using other variables, read Chapter 15 and 16 in the Baddeley text.

If you really get into point pattern analysis, you can change this. For example, you can estimate an adaptive estimator of intensity which uses a fraction \(f\) of the data to construct a Dirichlet tessellation and estimates a constant intensity in each tile of the tessellation.↩

Point pattern analysis

Dr. Emily Burchfield