Basic Set-up

When I first open a .R or .Rmd file in RStudio, my very first step is to tell R where I am on my machine. I do this by clicking Session in the menu at the top of my screen, then Set Working Directory, then To Source File Location. This tells RStudio to set the “working directory”, or the directory where RStudio will look to find files or to save files, to the directory in which my script is located. In this directory, I tend to have a data sub-folder that stores my data for the project. My R script sits outside of this data folder in the main folder, so my directory for each project (or each week of work in your case) looks something like this:

With this file structure, once I set my working directory to Source File Location (which just calls the setwd() function behind the scenes), I can load files stored in my data folder without typing out the long, annoying full path to the data. For example, I can load data in the folder shown above using the following:

my_data <- read.table("./data/precip.txt", sep = ",", header=T)

Here, instead of writing the full directory, e.g. "C:/Users/..." the period (.) basically tells R to input the full working directory of my script, then go into the data folder and open the file called precip.txt. I use this file structure for all of my projects and recommend you do the same. It makes life much easier. If you’re curious about what R sees as your working directory, call getwd() in the console.
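If you want to see exactly what the period expands to, here is a minimal sketch. It assumes the same data/precip.txt layout described above; the variable names wd and full_path are just ones I made up for illustration.

```r
# Print the current working directory (the folder that "." refers to)
wd <- getwd()
print(wd)

# Build the full path that "./data/precip.txt" stands for.
# file.path() joins the pieces with "/" on every operating system.
full_path <- file.path(wd, "data", "precip.txt")
print(full_path)

# Check that the file is where we expect before trying to read it
file.exists("./data/precip.txt")
```

If file.exists() returns FALSE, the working directory is probably not set to the folder that holds your script.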

read.table() is a function. Functions are always followed by parentheses (()) because they take arguments. In the case of read.table() the arguments include the file path, "./data/precip.txt", and a header argument which tells the read.table() function whether the precip.txt data has a header, i.e. a row with column names. Since precip.txt does have a header, we tell read.table() that header=T or header is TRUE. The sep argument tells the read.table() function that this file is a comma separated file, i.e. each entry is separated with commas rather than spaces or some other symbol.
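To see header and sep at work without needing the precip.txt file, here is a small sketch that writes a tiny comma separated file to a temporary location and reads it back. The station names and numbers are made up for illustration, as are the names tmp, demo and demo2.

```r
# Write a tiny comma-separated file to a temporary location.
# The contents are invented for this example.
tmp <- tempfile(fileext = ".txt")
writeLines(c("NAME,JAN,FEB",
             "STATION A,7.4,9.5",
             "STATION B,9.2,6.9"), tmp)

# header = TRUE: the first row holds the column names
# sep = ",": entries are separated by commas
demo <- read.table(tmp, sep = ",", header = TRUE)
print(demo)

# read.csv() is a wrapper around read.table() whose defaults
# already include header = TRUE and sep = ","
demo2 <- read.csv(tmp)
print(demo2)
```

Try changing header to FALSE and printing demo again: the column names show up as the first row of data instead.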

If you don’t know what arguments a function requires, you can ask R by typing ? and the name of the function in the Console, i.e. ?read.table. This pops up documentation for the function in the Help pane (bottom right of RStudio) that includes a list of arguments to put in the function as well as some examples of the function in use. Note that you don’t have to put ALL the arguments in a function, i.e. in the case of read.table() we don’t specify stringsAsFactors=F, etc… Any arguments you don’t specify revert to the default values described in the function documentation. Only add arguments when you want to override these defaults.
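Here is a quick sketch of what overriding a default looks like, using mean() rather than read.table() so it runs without any files. The vector vals is made up for illustration.

```r
vals <- c(1, NA, 3)   # NA marks a missing value

# mean() has an argument called na.rm whose default is FALSE,
# so a single missing value makes the whole result NA
default_mean <- mean(vals)
print(default_mean)
## [1] NA

# Override the default to drop missing values before averaging
clean_mean <- mean(vals, na.rm = TRUE)
print(clean_mean)
## [1] 2

# args() lists a function's arguments along with their defaults
args(mean.default)
```

The same idea applies to read.table(): anything you leave out keeps the value listed in ?read.table.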

Base R, or the original R installation on your machine, includes many functions such as read.table(), mean(), head() and others. The real power of R comes from functions built by other users for specific purposes. These functions live in packages, which you have to install on your machine yourself. To install a package, use the install.packages() function. For example, if you want to install the tidyverse suite of packages we’ll use a lot in this class, you’d type the following in your console:

install.packages("tidyverse")  # be sure to use quotes

This installs the tidyverse on your machine. To use a package in a session in R, you need to load that package into the current session using the library() function. For example, if you want to access functions in the tidyverse package, you’d need to load that package at the head of your RMarkdown or R script:

library(tidyverse) # no quotes needed

Did you notice the # no quotes needed comment? You can add comments anywhere in your code by placing a hash symbol, #, in front of text:

# this is a comment, it will not run as code
# anything without a leading hashtag will run as code, i.e.
print("HEY GUYS")
## [1] "HEY GUYS"
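Putting packages and comments together: you only need to install a package once, so one possible pattern is to check whether it is already there first. This is a sketch, not the only way to do it. The helper is_installed() is a name I made up; requireNamespace() is the base R function doing the work.

```r
# TRUE if the named package is installed and can be loaded
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with R, so this check should come back TRUE
stats_ok <- is_installed("stats")
print(stats_ok)

# Install the tidyverse only when it is missing (uncomment to run)
# if (!is_installed("tidyverse")) install.packages("tidyverse")
```

The install line is commented out so that running the script never triggers a download by accident.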

Variables

The simplest data form in R is a variable. A variable stores a single value (though in the future we’ll refer to columns in a spreadsheet as variables). Let’s make some variables:

x <- 1
y <- 2

We’ve made two variables, x and y. When you run this code, you’ll see these variables pop up in your Global Environment pane in the top right of your screen.

Variable names are case sensitive (so X is different from x) and cannot start with a number. The <- symbol is meant to be an arrow pointing to the left, sending the value on the right into the name on the left. You can also use an equals sign, so x <- 1 is the same as x = 1. The <- is used by many folks in the R community (including myself) for historical and habitual reasons.
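A short sketch of these naming and assignment rules. The names temp, Temp, a and b are made up so that they don’t interfere with x and y.

```r
# Names are case sensitive: these are two separate variables
temp <- 1
Temp <- 100
temp == Temp
## [1] FALSE

# <- and = both assign a value at the top level of a script
a <- 5
b = 5
a == b
## [1] TRUE

# make.names() shows how R would repair an invalid name;
# a name that starts with a digit gets an "X" put in front
fixed <- make.names("1st_try")
print(fixed)
## [1] "X1st_try"
```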

What happens if you run the following:

x <- y
print(x)
## [1] 2

We have replaced the original value of x, 1, with the value of y, 2. When you assign new values to existing variables, you overwrite the original value, so be careful!


Vectors

Vectors are multiple elements stored in a single object. For example:

x <- c(1, 1, 3, 5)
print(x)
## [1] 1 1 3 5

x is a vector of numbers. We know this first because we can see it’s composed of numbers but also because the class() (or data type) of the new vector is numeric:

class(x)
## [1] "numeric"

We can also make vectors that contain character data, or strings:

y <- c("Hey", "Hola", "Bonjour", "Yo")
print(y)
## [1] "Hey"     "Hola"    "Bonjour" "Yo"
class(y)
## [1] "character"

The vector is of class character, a class used to indicate text.

We can index vectors using square brackets and numbers indicating the location of an element. For example, if we want to know the third element in the y vector, we’d do the following:

y[3]
## [1] "Bonjour"

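What if you want to create a vector of both character and numeric data? Short answer is you can’t: a vector holds a single data type, so if you mix text and numbers R quietly converts the numbers to text. This is where lists come in. A list can store lots of different types of data, e.g.: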

my_list <- list("Hey", 1, 2, "Hola")
print(my_list)
## [[1]]
## [1] "Hey"
## 
## [[2]]
## [1] 1
## 
## [[3]]
## [1] 2
## 
## [[4]]
## [1] "Hola"
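Going back to the mixed vector question for a second, here is a sketch of what happens if you try it with c() instead of list(). The names mixed and kept are made up for illustration.

```r
# c() forces everything into one type, so the numbers become text
mixed <- c("Hey", 1, 2, "Hola")
class(mixed)
## [1] "character"
mixed[2]
## [1] "1"

# To do math with it you have to convert it back to a number
as.numeric(mixed[2]) + 1
## [1] 2

# A list keeps each element's original type
kept <- list("Hey", 1, 2, "Hola")
class(kept[[2]])
## [1] "numeric"
```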

We can even make lists of lists:

my_list2 <- list(1, 3, "Dog")
big_list <- list(my_list, my_list2)
print(big_list)
## [[1]]
## [[1]][[1]]
## [1] "Hey"
## 
## [[1]][[2]]
## [1] 1
## 
## [[1]][[3]]
## [1] 2
## 
## [[1]][[4]]
## [1] "Hola"
## 
## 
## [[2]]
## [[2]][[1]]
## [1] 1
## 
## [[2]][[2]]
## [1] 3
## 
## [[2]][[3]]
## [1] "Dog"

If we want to extract the first list from our big_list, we’d call:

big_list[[1]]
## [[1]]
## [1] "Hey"
## 
## [[2]]
## [1] 1
## 
## [[3]]
## [1] 2
## 
## [[4]]
## [1] "Hola"

If we want the second element of the first list, we’d call:

big_list[[1]][[2]]
## [1] 1
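You may wonder why lists use double square brackets. A minimal sketch of the difference, using a made-up list called pets: single brackets give you back a smaller list, while double brackets pull out the element itself.

```r
pets <- list("Dog", 3, c(1.5, 2.5))   # made-up example list

# Single brackets return a list that contains the element
one_bracket <- pets[2]
class(one_bracket)
## [1] "list"

# Double brackets return the element itself
two_brackets <- pets[[2]]
class(two_brackets)
## [1] "numeric"

# So math works on the double-bracket version
two_brackets + 1
## [1] 4
```

Trying one_bracket + 1 throws an error, because R cannot add a number to a list.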

We won’t work a lot with lists in the class, but they are great to know about. Instead we’ll work quite a bit with data.frames, which I’ll talk about next.


data.frame

data.frames are another fantastic tool to organize lots of different types of data, and in my opinion are more intuitive than lists. They work a bit like an Excel spreadsheet, with columns indicating different variables and rows indicating observations. Columns can be different data types, so we can zip a vector of string data and a vector of numeric data into a single data.frame. Let’s squish our x and y vectors into a data.frame. The easiest way to do this is using the data.frame() function:

df <- data.frame(x, y)
print(df)
##   x       y
## 1 1     Hey
## 2 1    Hola
## 3 3 Bonjour
## 4 5      Yo

Cool! You can also view df in a separate window by typing View(df) in the Console. If you want more control over the shape of your new data.frame, check out cbind.data.frame() and rbind.data.frame().

You can index a data.frame using square brackets. Say you want the element in row 1, column 2? We can index this as follows:

df[1,2] # python users, note that this is row, column NOT column, row indexing
## [1] Hey
## Levels: Bonjour Hey Hola Yo

We can also index using the dollar sign:

df$x
## [1] 1 1 3 5

This selects the entire column x.

Finally, we can index using the actual column names rather than the index number:

df[,"x"]
## [1] 1 1 3 5

Nice! Here are a few other useful data.frame hacks… to get the list of unique elements in a column (or vector):

unique(df$x)
## [1] 1 3 5

To get the dimensions of a data.frame try these:

length(df$x)
## [1] 4
nrow(df)
## [1] 4
ncol(df)
## [1] 2
dim(df) # row, column
## [1] 4 2
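The indexing styles above can be mixed and matched. Here is a sketch that rebuilds the same small table under the made-up name df_demo so it runs on its own.

```r
df_demo <- data.frame(x = c(1, 1, 3, 5),
                      y = c("Hey", "Hola", "Bonjour", "Yo"))

# Leave the column slot empty to get an entire row
second_row <- df_demo[2, ]
print(second_row)

# Leave the row slot empty to get an entire column
df_demo[, 1]
## [1] 1 1 3 5

# Pick several rows at once and name the column you want
df_demo[c(1, 3), "x"]
## [1] 1 3

# Double brackets with a name do the same job as the dollar sign
identical(df_demo[["x"]], df_demo$x)
## [1] TRUE
```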

data.frames are rad and we’ll spend most of the class working with these. We’ll use dplyr to do most of our indexing of data.frames, but subset() is a useful function to know about. Say you want to find the rows in df where x is equal to one:

subset(df, x == 1)
##   x    y
## 1 1  Hey
## 2 1 Hola

Why the ==? The == is like a TRUE/FALSE statement… for example:

z <- 1
z == 1
## [1] TRUE

We can read z == 1 like the question “Is z equal to one?”. It is, so R returns TRUE. On the other hand:

z == 2
## [1] FALSE

When using subset() (and dplyr further down the line), we’ll use the == for filtering. Just be sure to remember that == tests for relationships, normally returning TRUE and FALSE, while a single = assigns values.
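Under the hood, filtering boils down to a vector of TRUE and FALSE values, one per row. Here is a sketch of that idea, again with a stand-alone copy of the table under the made-up name df_demo.

```r
df_demo <- data.frame(x = c(1, 1, 3, 5),
                      y = c("Hey", "Hola", "Bonjour", "Yo"))

# == is applied to every element, giving one TRUE/FALSE per row
is_one <- df_demo$x == 1
print(is_one)
## [1]  TRUE  TRUE FALSE FALSE

# Putting that vector in the row slot keeps only the TRUE rows,
# the same rows that subset(df_demo, x == 1) returns
ones <- df_demo[is_one, ]
print(ones)

# TRUE counts as 1, so sum() tells you how many rows matched
sum(is_one)
## [1] 2
```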


Real data!

Let’s go back to our precipitation data and use our new data.frame skillz to explore this dataset:

my_data <- read.table("./data/precip.txt", sep = ",", header=T)

We can look at the full data.frame by typing View(my_data) in the console. We can look at the first few rows using the head() function and the final rows with the tail() function:

head(my_data)
##      ID                 NAME   LAT    LONG ALT  JAN FEB MAR APR MAY JUN JUL
## 1 ID741         DEATH VALLEY 36.47 -116.87 -59  7.4 9.5 7.5 3.4 1.7 1.0 3.7
## 2 ID743  THERMAL/FAA AIRPORT 33.63 -116.17 -34  9.2 6.9 7.9 1.8 1.6 0.4 1.9
## 3 ID744          BRAWLEY 2SW 32.96 -115.55 -31 11.3 8.3 7.6 2.0 0.8 0.1 1.9
## 4 ID753 IMPERIAL/FAA AIRPORT 32.83 -115.57 -18 10.6 7.0 6.1 2.5 0.2 0.0 2.4
## 5 ID754               NILAND 33.28 -115.51 -18  9.0 8.0 9.0 3.0 0.0 1.0 8.0
## 6 ID758        EL CENTRO/NAF 32.82 -115.67 -13  9.8 1.6 3.7 3.0 0.4 0.0 3.0
##    AUG SEP OCT NOV DEC
## 1  2.8 4.3 2.2 4.7 3.9
## 2  3.4 5.3 2.0 6.3 5.5
## 3  9.2 6.5 5.0 4.8 9.7
## 4  2.6 8.3 5.4 7.7 7.3
## 5  9.0 7.0 8.0 7.0 9.0
## 6 10.8 0.2 0.0 3.3 1.4
tail(my_data)
##          ID                  NAME   LAT    LONG  ALT JAN FEB MAR APR MAY JUN
## 451 ID42173       HUNTINGTON-LAKE 37.23 -119.21 2140 165 160 151  84  32   9
## 452 ID42992            TWIN-LAKES 38.70 -120.03 2438 210 173 169 104  52  29
## 453 ID43093 BISHOP-CREEK-INTAKE-2 37.25 -118.58 2485  57  42  33  21  15  11
## 454 ID43574              GEM-LAKE 37.75 -119.13 2734  86  70  71  41  19  15
## 455 ID43616          LAKE-SABRINA 37.21 -118.61 2763  75  59  51  33  16  10
## 456 ID43770           ELLERY-LAKE 37.93 -119.23 2940 110  85  77  40  23  16
##     JUL AUG SEP OCT NOV DEC
## 451   5   6  28  48 117 132
## 452  15  24  35  72 183 191
## 453  11  15  17  12  37  45
## 454  16  19  23  24  70  77
## 455  11  12  20  18  50  58
## 456  21  20  23  35  91  96

We can index a column using the $, i.e. my_data$ID.

If you look at my_data in the Global Environment and click on the blue arrow, you’ll see lots of information about each column in the dataset, including the class (e.g. Factor, num, chr), the number of elements (rows) in each vector, and some examples of what the content of each vector looks like. Wait, what the heck is a factor?

Factors are frequently used to store categorical data like identifiers and groupings. For example, imagine you’re working with Census data and have a data.frame with county-level data grouped into states. The State column should be stored as a factor since it groups the county-level data. If there is an error in the data set and North Carolina is spelled as Norht Carolina one time, the incorrectly spelled NC will be a whole separate level of the factor. I think of factors as calling the unique() function on a vector and storing each unique entry as a level. The utility of factors will become more apparent as we move through the class. In our dataset, ID and NAME are stored as factors. (In R 4.0.0 and later, stringsAsFactors defaults to FALSE, so these columns load as character vectors unless you set stringsAsFactors = TRUE.) If we didn’t want them to be factors and instead wanted them to be character vectors, we could add the stringsAsFactors=F argument to read.table().

my_data <- read.table("./data/precip.txt", sep = ",", header=T, stringsAsFactors = F)
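To make the misspelled state example concrete, here is a sketch with a made-up vector of state names. The names states and state_factor are just for illustration.

```r
# Made-up state column with one typo
states <- c("North Carolina", "Norht Carolina",
            "Virginia", "North Carolina")

# factor() stores each unique entry as a level
state_factor <- factor(states)
nlevels(state_factor)   # the typo counts as its own level
## [1] 3
levels(state_factor)

# Fix the typo in the text, then rebuild the factor
states[states == "Norht Carolina"] <- "North Carolina"
state_factor <- factor(states)
nlevels(state_factor)
## [1] 2
```

Checking levels() like this is a quick way to spot spelling errors in a grouping column.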

You can change and/or add columns to a data.frame as follows:

my_data$ALT10 <- my_data$ALT + 10  # creates a new column called ALT10 that is ALT plus 10
my_data$LAT <- NA # replaces ALL values of the LAT column with NA, be careful when you overwrite columns!!
my_data$JUNJUL <- my_data$JUN + my_data$JUL # new column that sums precipitation in June and July

R also makes it VERY easy to quickly visualize your data. We’ll learn how to make visualizations with ggplot2 in this class, but you can also make pretty sweet visualizations with base R.

# histogram of precip in June
hist(my_data$JUN, main = "June precipitation")

# visualization of the relationship between June precip and altitude
plot(my_data$JUN, my_data$ALT, xlab = "June precipitation", ylab = "Altitude", main = "Precipitation and elevation")


More on data types…

Many of the errors you encounter as you code will come from a mismatch between the data type you’re working with and the data type the function or tool you’re using expects. If you feed a character dataset to a function like mean() that’s expecting a vector of numbers, R will become confused:

my_data <- c("Hey", "there", "dude")
mean(my_data)
## Warning in mean.default(my_data): argument is not numeric or logical: returning
## NA
## [1] NA
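A common version of this mismatch is numbers that got stored as text. Here is a sketch of how you might spot and fix that, using made-up values and the made-up names precip_text and precip_num.

```r
# Numbers stored as text (note the quotes)
precip_text <- c("2", "4", "6")
class(precip_text)
## [1] "character"

# as.numeric() converts the text to real numbers
precip_num <- as.numeric(precip_text)
class(precip_num)
## [1] "numeric"

# Now mean() has something it can work with
mean(precip_num)
## [1] 4
```

Text that isn’t a number, like "dude", turns into NA when you convert it, and R warns you about it.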

It’s good practice to check to see how R understands your data - how does R understand the my_data vector I just made?

class(my_data)
## [1] "character"

The class() function returns information about how R is understanding the my_data vector. It says, “Hey, I think this vector is a bunch of character entries” - which sounds right, since the character class is indicative of letters, words, and text - also called strings. You can also ask R what the mode of the vector is:

mode(my_data)
## [1] "character"

In this case, mode() also returns character. The difference between the mode and class of an object in R is subtle and not essential to this class, but here’s a basic explanation, taken from this great overview:

Let’s look at a few examples to really understand data types in R. OK, let’s start by making a variable called x that is equal to 5:

x <- 5
print(x)
## [1] 5

Now let’s create a vector called v that is equal to three numbers, 5, 10, 20:

v <- c(5, 10, 20)
print(v)
## [1]  5 10 20

c(), short for concatenate, basically zips the three numbers into a vector. We could also do this:

x <- 5
y <- 10
z <- 20
v2 <- c(x,y,z)
print(v2)
## [1]  5 10 20

Are the two vectors the same?

v == v2
## [1] TRUE TRUE TRUE

Each pair of elements (numbers) in these vectors is equal, so R returns TRUE for every position. What happens if we only use ONE equals sign?

v = v2

CAREFUL. This replaces the values in v with the values of v2. Remember that == tests for relationships, normally returning TRUE and FALSE. = assigns values, in this case replacing v with v2.
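If you want a single TRUE or FALSE answer rather than one per element, here is a sketch of two options, all() and identical(). The vector v3 is made up to show a mismatch.

```r
v  <- c(5, 10, 20)
v2 <- c(5, 10, 20)
v3 <- c(5, 10, 25)   # differs from v in the last position

# all() collapses the element-by-element answers into one
all(v == v2)
## [1] TRUE
all(v == v3)
## [1] FALSE

# identical() checks whether two objects are exactly the same
identical(v, v2)
## [1] TRUE

# which() reports the positions where the vectors disagree
which(v != v3)
## [1] 3
```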


Additional Resources