Basic R

1 Intro to R

1.1 Welcome

1.2 What is R

R is "GNU S", a freely available language and environment for statistical computing and graphics. It provides a wide variety of statistical and graphical techniques: linear and nonlinear modelling, statistical tests, time series analysis, classification, clustering, etc.

1.3 Why use R?

  • Free & open source
  • Available on nearly every platform
  • Extensible (CRAN)
    • We'll be using base R in this workshop
  • Documentation & community
  • Graphics
  • Nerd cred

1.4 How to use R?

If you have installed RStudio, open it; otherwise, open R.

1.5 RStudio

  • File > New File > R Script opens a new R script
  • Top left: editor
  • Bottom left: R console
  • Tip: nearly all your commands should be typed in an R script, then sent to the console for evaluation
    • exceptions: install.packages(), help queries

2 Data Types

2.1 types of data

  • What are basic data types?
  • logical, numeric, character
    • also complex and raw, but we'll ignore those

2.2 logical

  • Statement of truth-y-ness
c(TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE)
[1]  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE

3 != 4
[1] TRUE

"Democat" %in% c("Democrat", "Independent", "Republican")
[1] FALSE

2.3 logical statements pop quiz

## Statements of truthyness
7 == 2 
3 != 7 
7 >= 2 
2 >= 7 
## And/or
2 == 2 & 2 > 3 
2 == 2 | 2 > 3 

2.4 logical statements, answers

## Statements of truthyness
7 == 2 # FALSE
3 != 7 # TRUE
7 >= 2 # TRUE
2 >= 7 # FALSE
## And/or
2 == 2 & 2 > 3 # FALSE
2 == 2 | 2 > 3 # TRUE

2.5 numeric

  • numeric is an umbrella term for "double" and "integer"
  • stores numbers:
(x <- c(1, exp(1), pi, 10384.287459))
[1]     1.000000     2.718282     3.141593 10384.287459

is.numeric(x)
[1] TRUE
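
A minimal sketch of the double/integer distinction. The names d and i are illustrative only; the L suffix is how R writes an integer literal, and both types count as numeric:

```r
d <- 3    # a plain number literal is stored as a double
i <- 3L   # the L suffix asks for an integer

typeof(d)      # "double"
typeof(i)      # "integer"
is.numeric(d)  # TRUE
is.numeric(i)  # TRUE
```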

2.6 character

  • character type represents letters/words:
(myname <- c("My name is Alex"))
[1] "My name is Alex"

(myname2 <- c("My", "name", "is", "Alex"))
[1] "My"   "name" "is"   "Alex"

length(myname)
[1] 1

length(myname2)
[1] 4

2.7 coercion

  • vectors can have only one data type:
(x <- c("My name", 3 == 4, 7.27))
[1] "My name" "FALSE"   "7.27"

class(x)
[1] "character"

2.8 coercion, continued

  • anything can be coerced to a character
  • logicals can be coerced to numeric
    • TRUE is 1, FALSE is 0
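
The coercion rules above can be sketched with the explicit as.* functions. The votes vector is made-up data for illustration:

```r
as.character(c(TRUE, FALSE))  # "TRUE" "FALSE"
as.character(7.27)            # "7.27"
as.numeric(c(TRUE, FALSE))    # 1 0

## Because TRUE is 1 and FALSE is 0, sum() counts the TRUEs
## and mean() gives the proportion of TRUEs
votes <- c(TRUE, FALSE, TRUE, TRUE)  # hypothetical data
sum(votes)   # 3
mean(votes)  # 0.75
```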

2.9 atomic vectors

  • all the above are called atomic vectors
  • useful to remember this when R yells at you

2.10 lists

  • sometimes we need to store more than one type of data
  • we can do this with a list

2.11 lists, continued

list(c(1.82, 1940, 93.20, 192.917), 
     c("Beyonce", "Lady Gaga", "Pink"),
     c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE))
[[1]]
[1]    1.820 1940.000   93.200  192.917

[[2]]
[1] "Beyonce"   "Lady Gaga" "Pink"     

[[3]]
[1]  TRUE FALSE  TRUE  TRUE FALSE  TRUE FALSE FALSE

2.12 names

  • we can name elements of vectors:
list(n = 10,
     dv = c(1, 3, 5, 7, 9, 19, 92, 4, 10, 4))
$n
[1] 10

$dv
 [1]  1  3  5  7  9 19 92  4 10  4
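
Names work on atomic vectors too, not only lists. A small sketch, with invented names and values:

```r
ages <- c(alice = 31, bob = 27)  # hypothetical values, named at creation
names(ages)                      # "alice" "bob"

scores <- c(90, 85)
names(scores) <- c("midterm", "final")  # names added after creation
scores["final"]                         # the element named "final"
```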

2.13 dimensions

  • all the vectors we've worked with so far have been single-dimension
  • but we often work with two dimensional data
    • rows are observations
    • columns are variables

2.14 matrix

matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
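
Note that the matrix above is filled column by column. As a sketch, the byrow argument of matrix() fills by row instead, and dim() reports the dimensions:

```r
m <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, byrow = TRUE)
m
dim(m)  # 2 3 (rows, then columns)
m[1, ]  # first row: 1 2 3
```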

2.15 data.frame

  • like a matrix, but columns can be different types:
data.frame(x = 1:3,
           y = c("a", "b", "c"), 
           z = c(TRUE, FALSE, TRUE))
  x y     z
1 1 a  TRUE
2 2 b FALSE
3 3 c  TRUE
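
One way to confirm that each column keeps its own type is to ask for its class. A sketch using the same data.frame, stored here under the name dat:

```r
dat <- data.frame(x = 1:3,
                  y = c("a", "b", "c"),
                  z = c(TRUE, FALSE, TRUE))
class(dat$x)  # "integer"
class(dat$z)  # "logical"
dim(dat)      # 3 3
## dat$y is character in R >= 4.0.0; in earlier versions
## it was converted to a factor by default
```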

2.16 data structures

     homogeneous      heterogeneous
1d   atomic vector    list
2d   matrix           data.frame
nd   array
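
The array row has no example in this workshop, so here is a minimal sketch with three dimensions (the sizes are arbitrary):

```r
a <- array(1:24, dim = c(2, 3, 4))  # 2 rows, 3 columns, 4 "slices"
dim(a)      # 2 3 4
a[2, 3, 1]  # row 2, column 3 of the first slice: 6
```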

3 Subsetting

3.1 Subsetting

  • oftentimes, we are interested in subsetting
  • how to refer to a specific column or row?

3.2 [

  • the [ function is how we subset
x <- 1:10

3.3 [ for positive numbers

x[c(1, 7)]
[1] 1 7

3.4 [ for negative numbers

x[-c(1, 7)]
[1]  2  3  4  5  6  8  9 10

3.5 [ for logical statements

x[x > 3]
[1]  4  5  6  7  8  9 10
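
What happens here is that x > 3 is itself a logical vector, and [ keeps the elements where it is TRUE. A sketch (the name keep is illustrative):

```r
x <- 1:10
keep <- x > 3   # a logical vector, one TRUE/FALSE per element of x
keep
x[keep]         # same result as x[x > 3]
x[x > 3 & x %% 2 == 0]  # conditions can be combined: 4 6 8 10
```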

3.6 [ for 2d data

(dat <- data.frame(x = 1:3,
                  y =  c("a", "b", "c"),
                  z = c(TRUE, FALSE, TRUE)))
  x y     z
1 1 a  TRUE
2 2 b FALSE
3 3 c  TRUE

3.7 [ for specific elements

dat[1, 3]
[1] TRUE

3.8 [ for whole rows/columns

dat[, 1]
[1] 1 2 3

dat[3, ]
  x y    z
3 3 c TRUE

3.9 [

When you have a list, [ always returns a list:

mylist <- list(x = 1:10, y = pi, z = c(TRUE, FALSE))
mylist[2]
$y
[1] 3.141593

3.10 [[ for elements of lists

[[ will return the actual element:

mylist[[2]]
[1] 3.141593
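
The difference between [ and [[ matters as soon as you try to use the result. A sketch with the same mylist:

```r
mylist <- list(x = 1:10, y = pi, z = c(TRUE, FALSE))
class(mylist[2])    # "list": still a list, of length one
class(mylist[[2]])  # "numeric": the element itself
mylist[[2]] * 2     # arithmetic works on the element
## mylist[2] * 2 would fail, since a list is not numeric
```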

3.11 Subsetting by name

  • we can subset by name so we don't have to remember/figure out positions
dat[, c("x", "y")]
  x y
1 1 a
2 2 b
3 3 c

3.12 Subsetting by name - $

This is common so $ provides a quicker way:

dat[["x"]]
[1] 1 2 3

dat$x
[1] 1 2 3

4 Distributions

4.1 Distributions in R

  • R has functions dealing with probability distributions built in
  • They share common prefixes depending on what you want:

What you want             prefix
cdf                       p
quantile (inverse cdf)    q
random draw               r
density                   d
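
As a sketch of the four prefixes, here they are attached to norm, R's name for the normal distribution (standard normal by default). The seed is arbitrary:

```r
pnorm(0)       # cdf: P(Z <= 0) = 0.5
qnorm(0.5)     # quantile: the value whose cdf is 0.5, which is 0
dnorm(0)       # density at 0: 1 / sqrt(2 * pi)
set.seed(1)    # arbitrary seed so the draws are reproducible
rnorm(3)       # three random draws
```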

4.2 Common distributions

R's name    name
norm        normal
unif        uniform
t           t
binom       binomial
weibull     Weibull
beta        beta
hyper       hypergeometric
nbinom      negative binomial
gamma       gamma

5 Conditionals

5.1 conditionals

  • conditional statements:
  • If (this one thing [condition]), then (do this other thing), else (do this different other thing)
  • In R, need to consider whether (condition) is of length 1 or > 1
  • Let's start when (condition) is length one

5.2 if, then, else

x <- 3
if (x == 7) {
  print("x is 7")
} else {
  print("x is not 7")
}
[1] "x is not 7"

5.3 if, then, else with logicals

x <- TRUE
if (x) {
  print("That's true")
} else {
  print("That's false")
}
[1] "That's true"

5.4 conditions with length > 1

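  • remember: if/else only works if the condition is of length one
    • the output below is from an older version of R; since R 4.2.0 this is an error rather than a warning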
x <- 1:10
if (x > 5){
  TRUE
} else {
  FALSE
}
[1] FALSE
Warning message:
In if (x > 5) { :
  the condition has length > 1 and only the first element will be used

5.5 ifelse, continued

x <- 1:10
ifelse(x > 5, TRUE, FALSE)
[1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

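
ifelse is not limited to returning TRUE and FALSE. A sketch where each element is mapped to a label (the labels are invented for illustration):

```r
x <- 1:10
size <- ifelse(x > 5, "big", "small")  # one label per element of x
size
```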
6 Writing Functions

6.1 functions

  • You can write a function in R quite easily
  • Let's say we want to write a function to find the mean

6.2 custom mean function

my_mean <- function(x){
  sum(x) / length(x)
}
my_mean(0:10)
[1] 5

6.3 missing values in custom mean function

my_mean <- function(x, na.rm = FALSE){
  if (na.rm) {
    x <- x[!is.na(x)]
  } 
  sum(x) / length(x)
}
x <- c(1, NA, 3)
my_mean(x, na.rm = TRUE)
[1] 2
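
To see what the na.rm argument buys us, compare the default with na.rm = TRUE. A sketch that repeats the definition so it runs on its own:

```r
my_mean <- function(x, na.rm = FALSE){
  if (na.rm) {
    x <- x[!is.na(x)]   # drop missing values before averaging
  }
  sum(x) / length(x)
}
x <- c(1, NA, 3)
my_mean(x)                # NA: the default keeps the missing value
my_mean(x, na.rm = TRUE)  # 2
mean(x, na.rm = TRUE)     # 2, the built-in mean agrees
```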

7 Using Loops

7.1 loops

  • computers are much better than humans at doing repetitive tasks quickly & without error
  • loops are a common way of doing something similar multiple times
  • we'll talk about for, which loops a prescribed number of times
  • R has while and repeat loops as well: while loops until its condition is FALSE, and repeat loops until it reaches break
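
We won't use them in this workshop, but a minimal sketch of the other two loop types looks like this (the counters i and j are illustrative):

```r
i <- 1
while (i <= 3) {   # runs as long as the condition is TRUE
  print(i)
  i <- i + 1
}

j <- 1
repeat {           # runs until break is reached
  if (j > 3) break
  print(j)
  j <- j + 1
}
```

Both loops print 1, 2, 3 and then stop.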

7.2 for

pseudo-code structure of for:

output <- vector("numeric", length = 72) ## pre-allocate output!
for (something in somevector) {          ## defined sequence
  ## body: do stuff, referring to each element of
  ## somevector sequentially with the
  ## placeholder something
}

7.3 for loops, example

x <- 6:10
for (i in x) {
  print(i)
}
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10

7.4 for loops, example 2

x <- 6:10
y <- vector("numeric", length = length(x)) ## pre-allocate a numeric vector
for (i in seq_along(x)) {
  if (i > 1){
    y[i] <- x[i] + y[i - 1]
  } else {
    y[i] <- x[i]
  }
}
## what is y?

7.5 for loops, example 2 answer

y
[1]  6 13 21 30 40
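
The loop computes a running total. Base R happens to have a built-in for this, cumsum, which makes a handy check on the loop's answer:

```r
x <- 6:10
cumsum(x)  # 6 13 21 30 40
```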

7.6 for loops, example 3

means <- vector()

for (i in names(mtcars)) {
  means[[i]] <- mean(mtcars[[i]])
}

## What will means be?

7.7 for loops, example 3 answer

means
      mpg        cyl       disp         hp       drat         wt       qsec 
20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750 
       vs         am       gear       carb 
 0.437500   0.406250   3.687500   2.812500

7.8 for loop quiz

## what does this code do?
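x <- list.files(pattern = "\\.csv$")       ## pattern is a regular expression
data <- vector("list", length = length(x))
names(data) <- x  ## name the slots so data[[i]] fills them instead of appending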
for (i in x) {
  data[[i]] <- read.csv(i)
}
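
A self-contained sketch of the same read-every-CSV pattern, which you can run without any files of your own. The scratch directory and the two small files are invented for the demonstration:

```r
tmp <- tempfile("csvdemo")   # hypothetical scratch directory
dir.create(tmp)
write.csv(data.frame(a = 1:2), file.path(tmp, "one.csv"), row.names = FALSE)
write.csv(data.frame(a = 3:4), file.path(tmp, "two.csv"), row.names = FALSE)

files <- list.files(tmp, pattern = "\\.csv$", full.names = TRUE)
data <- vector("list", length = length(files))  # pre-allocate
names(data) <- basename(files)                  # one named slot per file
for (f in files) {
  data[[basename(f)]] <- read.csv(f)
}
names(data)        # "one.csv" "two.csv"
data[["two.csv"]]  # a data.frame with column a = 3, 4
unlink(tmp, recursive = TRUE)  # clean up the scratch directory
```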

7.9 notes on looping

  • You may see a lot of advice online against loops
  • They used to be slow in R; that is no longer the case
  • So long as you're smart about it (pre-allocate the output!)
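
A sketch of what pre-allocating means in practice. Both loops give the same answer; the first grows the output on every pass, the second writes into slots that already exist. The size n is arbitrary, and you can wrap either loop in system.time() to compare them on your machine:

```r
n <- 1e4

## Growing: the output gets longer on every pass
grown <- c()
for (i in seq_len(n)) {
  grown <- c(grown, i^2)
}

## Pre-allocated: each result is written into an existing slot
filled <- vector("numeric", length = n)
for (i in seq_len(n)) {
  filled[i] <- i^2
}

identical(grown, filled)  # TRUE: same answer either way
```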

8 The apply Family

8.1 the apply family

  • The apply family of functions makes our lives easier by applying functions over "stuff"
  • Like a pre-built loop
  • apply, lapply, sapply, vapply, mapply, rapply, tapply
  • We'll look at apply, lapply, and sapply

8.2 apply

apply(X, MARGIN, FUN)
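
A small sketch of what MARGIN controls, using a made-up 2 x 3 matrix: 1 applies FUN to each row, 2 applies it to each column.

```r
m <- matrix(1:6, nrow = 2)
m
row_sums <- apply(m, 1, sum)  # MARGIN = 1: one result per row:    9 12
col_sums <- apply(m, 2, sum)  # MARGIN = 2: one result per column: 3 7 11
```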

8.3 apply example

apply(mtcars, 2, mean)
      mpg        cyl       disp         hp       drat         wt       qsec 
20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750 
       vs         am       gear       carb 
 0.437500   0.406250   3.687500   2.812500

8.4 apply your own functions

Note that we can apply our own functions!

apply(mtcars, 2, my_mean, na.rm = TRUE)
      mpg        cyl       disp         hp       drat         wt       qsec 
20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750 
       vs         am       gear       carb 
 0.437500   0.406250   3.687500   2.812500

8.5 lapply

lapply(X, FUN) # always returns a list

8.6 lapply example

lapply(mtcars, mean)
$mpg
[1] 20.09062

$cyl
[1] 6.1875

$disp
[1] 230.7219

$hp
[1] 146.6875

$drat
[1] 3.596563

$wt
[1] 3.21725

$qsec
[1] 17.84875

$vs
[1] 0.4375

$am
[1] 0.40625

$gear
[1] 3.6875

$carb
[1] 2.8125

8.7 sapply

  • lapply always returns a list
  • sapply will simplify this (e.g. to a numeric vector) if it can
sapply(mtcars, mean)
      mpg        cyl       disp         hp       drat         wt       qsec 
20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750 
       vs         am       gear       carb 
 0.437500   0.406250   3.687500   2.812500
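
A sketch comparing the return types. It also shows vapply, listed earlier as part of the family, where you declare the type and length you expect from each call (here one number):

```r
res_list <- lapply(mtcars, mean)
res_vec  <- sapply(mtcars, mean)
class(res_list)  # "list"
class(res_vec)   # "numeric"

## vapply is like sapply, but checks each result against FUN.VALUE
res_chk <- vapply(mtcars, mean, FUN.VALUE = numeric(1))
identical(res_vec, res_chk)  # TRUE
```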

9 Putting it all together

9.1 function

Create a function to represent this

\[ f(x) = \left\{ \begin{array}{lr} 0 & \text{for } x < 0 \\ \frac{1}{3} & \text{for } 0 \leq x < 1 \\ \frac{2}{3} & \text{for } 1 \leq x < 2 \\ 0 & \text{for } 2 \leq x \end{array} \right. \]

9.2 function, answer

myfun <- function(x){
  ifelse(x >= 0 & x < 1, 1 / 3,
         ifelse(x >= 1 & x < 2, 2 / 3, 0))
}
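
A quick way to check the answer is to evaluate it at one point from each piece, plus the boundaries. The test points are chosen for illustration:

```r
myfun <- function(x){
  ifelse(x >= 0 & x < 1, 1 / 3,
         ifelse(x >= 1 & x < 2, 2 / 3, 0))
}
vals <- myfun(c(-1, 0, 0.5, 1, 1.5, 2, 3))
vals  # 0, 1/3, 1/3, 2/3, 2/3, 0, 0
```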

9.3 rejection sampler

Using the function from the last slide, construct a function that returns a vector of samples using rejection sampling. Make it take one argument, \(n\), the number of candidate draws.

9.4 rejection sampler, answer

myreject <- function(n){
  x <- runif(n, 0, 2)
  y <- runif(n, 0, 2 / 3)
  reject <- y > myfun(x)
  x[!reject]
}
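
A sketch of how you might sanity-check the sampler. The area under f is 1 and the proposal box has area 2 x 2/3 = 4/3, so about 3/4 of the candidates should be accepted; the mean of f works out to 7/6. The seed and the number of candidates are arbitrary:

```r
myfun <- function(x){
  ifelse(x >= 0 & x < 1, 1 / 3,
         ifelse(x >= 1 & x < 2, 2 / 3, 0))
}
myreject <- function(n){
  x <- runif(n, 0, 2)        # candidate values
  y <- runif(n, 0, 2 / 3)    # uniform heights under the box
  reject <- y > myfun(x)     # reject candidates that land above f
  x[!reject]
}

set.seed(42)                 # arbitrary seed, for reproducibility
draws <- myreject(10000)
length(draws) / 10000        # acceptance rate, about 3/4 in theory
mean(draws)                  # should be near 7/6
range(draws)                 # all accepted values lie between 0 and 2
```

Note that the function returns only the accepted candidates, so the result is usually shorter than n.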

9.5 rejection sampling with multiple n's

  • We want to test the effect of varying n on our rejection sampler.
  • Calculate the mean of the samples from our rejection sampler varying n from 1 to 1,000

9.6 rejection sampling with multiple n, answers

ns <- seq(1, 1000)
means <- sapply(ns, function(n){mean(myreject(n))})
## plot(means)
## summary(means)

Date: October 2017

Author: J. Alexander Branham

Created: 2017-10-20 Fri 14:20
