Introduction to Data Analysis and Reporting with R
This is an html page so that it is easier to search and copy/paste. It won't include the figures and tables but should include all the relevant code.1 Course Information
1.1 Course Information
- We’ll cover tools that should be helpful in nearly any analysis
- Graphing, data manipulation, etc
- We won’t cover specialized, specific tools. But you should get a good enough understanding of how R works to be able to teach yourself these
1.2 Outline
- What is R?
- Graphics
- Basic R
- Data manipulation
- Reporting (time permitting)
2 What is R?
2.1 What is R?
- This is a course about R… mais qu’est-ce que c’est?
- “R is a language and environment for statistical computing and graphics”
- Derived from S, designed at Bell Laboratories
- S first appeared in 1976!
- R is a language … so be prepared for it to hurt a bit to learn!
2.2 Advantages of R
- Free
- Open-source
- Available on nearly every platform
- Extensible via packages — CRAN has over 10,000
- Great community
2.3 Running code
- How to use this R thing?
- If you have R and Rstudio installed, open Rstudio.
- You should see three panes.
- We’ll focus for now on the console, which is on the left and should look something like this:
2.4 The console
R version 3.4.0 (2017-06-15) -- "You Stupid Darkness" Copyright (C) 2017 The R Foundation for Statistical Computing Platform: x86_64-pc-linux-gnu (64-bit) [ ... ] Type 'demo()' for some demos, 'help()' for on-line help, or 'help.start()' for an HTML browser interface to help. Type 'q()' to quit R. >
2.5 R is a big giant calculator
- R can do math
- Really, really fancy math
- Try typing
3 + 3
in the console - After pressing enter, R will return
6
- R understands the order of operations
3 + 3 * 9
is different from(3 + 3) * 9
2.6 A quiz
- Time for a quiz!
- What’s 7 times 149?
- What’s the square root of the previous answer?
- Tip: You can hit the up arrow to get whatever you entered last
2.7 Answers
7 * 149
[1] 1043
(7 * 149) ^ (1 / 2)
[1] 32.29551
2.8 Packages
- At this point, please install a few packages. You’ll need an internet connection.
install.packages(c("tidyverse", "gapminder"))
- If you have already installed some packages, make sure they’re up-to-date:
update.packages()
- If you have already installed some packages, make sure they’re up-to-date:
- Tip: just type
ins
then hit TAB for tab-completion - Don’t worry about what is going on here, I’ll explain it later.
- Depending on your exact setup, R may ask you a few questions about using a personal library. Do so.
- If you get an error, make sure you can access the internet (https://cloud.r-project.org in particular)
2.9 R scripts
- While those packages are installing, let’s go ahead and open up an R script.
- Allows you to save code so it doesn’t disappear into the ether
- If using Rstudio, File, new file, R script (or Ctrl+shift+n)
- Tip: can send a line from R script to console for evaluation using ctrl+enter
- Strongly recommend that you type into a script and use a keyboard shortcut to evaluate code
- Easier to edit & rerun
- Allows you to save code
- You may make comments
## This adds 3 + 3 3 + 3 3 * 2 # same
3 Graphics in R
3.1 Data Analysis with R
- We need some data to work with
- We’re going to use some data that comes with the
gapminder
package you just installed
- To access the data, you need to load it into memory:
library(gapminder)
3.2 Exploring our data
gapminder
is a data.frame
- Can get a sense of what it looks like with some functions
- Let’s get a sense of what
gapminder
has:
View(gapminder)
head(gapminder)
country continent year lifeExp pop gdpPercap 1 Afghanistan Asia 1952 28.801 8425333 779.4453 2 Afghanistan Asia 1957 30.332 9240934 820.8530 3 Afghanistan Asia 1962 31.997 10267083 853.1007 4 Afghanistan Asia 1967 34.020 11537966 836.1971 5 Afghanistan Asia 1972 36.088 13079460 739.9811 6 Afghanistan Asia 1977 38.438 14880372 786.1134
3.3 Descriptive statistics
- R has lots of built-in functions for getting a sense of the data.
- Try running
summary(gapminder)
- What’s the average life expectancy?
summary(gapminder)
country continent year lifeExp pop Afghanistan: 12 Africa :624 Min. :1952 Min. :23.60 Min. :6.001e+04 Albania : 12 Americas:300 1st Qu.:1966 1st Qu.:48.20 1st Qu.:2.794e+06 Algeria : 12 Asia :396 Median :1980 Median :60.71 Median :7.024e+06 Angola : 12 Europe :360 Mean :1980 Mean :59.47 Mean :2.960e+07 Argentina : 12 Oceania : 24 3rd Qu.:1993 3rd Qu.:70.85 3rd Qu.:1.959e+07 Australia : 12 Max. :2007 Max. :82.60 Max. :1.319e+09 (Other) :1632 gdpPercap Min. : 241.2 1st Qu.: 1202.1 Median : 3531.8 Mean : 7215.3 3rd Qu.: 9325.5 Max. :113523.1
3.4 Graphics in R
- Let’s start making graphs
- This is the fun part!
- We’re going to rely on the `ggplot2` package, which we installed earlier (as a part of the tidyverse package)
- “The Grammar of Graphics”
- load it up with
library(ggplot2)
3.5 Our question
What’s the relationship between wealth (gdp) and average life expectancy?
- Scatterplot is a good way to get started looking at data!
3.6 ggplot2
- Use the
ggplot()
function to start a plot. - The first argument is to tell it the data
- Tip: use
?ggplot
to look at the help page, where you can see the names of the arguments
ggplot(data = gapminder) # Please use gapminder data
3.7 geom_point
ggplot()
by itself is pretty useless, it just starts a plot- We then have to tell
ggplot
what to draw! - Tip:
?geom_point
ggplot(data = gapminder) + geom_point(mapping = aes(x = gdpPercap, # Put gdp on x axis y = lifeExp)) # Put lifeExp on y
3.9 Fix that x axis!
- Is there a better way to show this relationship?
ggplot(data = gapminder) + geom_point(mapping = aes(x = log(gdpPercap), # Log x-axis y = lifeExp))
3.11 Aesthetics
ggplot()
creates a coordinate system
- You can then add one or more layers to this to create a plot
- We just added the
geom_point()
layer, which used thex
andy
aesthetics (aes
) to add a layer of points to our plot
- We can add more information to the aesthetics to convey more information like color, shape, and size.
- Example: What if we want to convey info about relationship between wealth and life expectancy by continent?
- One solution: add color by continent
3.12 Color
ggplot(data = gapminder) + geom_point(mapping = aes(x = log(gdpPercap), y = lifeExp, ## colour for the Brits color = continent))
3.14 Multiple aesthetics - color & shape
- Of course, some people are colorblind, and others don’t print things in color, so may be nice to use something like shape in addition:
ggplot(gapminder) + geom_point(aes(x = log(gdpPercap), y = lifeExp, color = continent, shape = continent))
3.16 More about aesthetics
- There are more aesthetic mappings
- Try
size
, andalpha
(transparency) for yourself
- You can set aesthetics directly by mapping the aesthetic to a value outside the call to
aes()
- For example, we may want to make the dots slightly transparent to avoid overplotting
3.17 Aesthetics not mapped to variable
ggplot(data = gapminder) +
geom_point(mapping = aes(x = log(gdpPercap),
y = lifeExp,
color = continent),
alpha = 0.5)
3.19 Facets
- So we can use aesthetics to add variables to our graph like
color
. - We might also want to add variables by splitting up the graph based on values of another variables — e.g. subfigures
- If we want to use just one variable, use
facet_wrap()
ggplot(data = gapminder) +
geom_point(mapping = aes(x = log(gdpPercap),
y = lifeExp)) +
facet_wrap( ~ continent, nrow = 2)
3.21 Facets with two variables
- ggplot can facet with two variables with one by row and the other by column
- Use
facet_grid(row ~ column)
to do so - Our
gapminder
data aren’t very well suited for this, but you could do something like:
ggplot(data = gapminder) + geom_point(mapping = aes(x = log(gdpPercap), y = lifeExp)) + ## year >= 2000 will be TRUE or FALSE; ## we'll learn more about logical statements later on: facet_grid(year >= 2000 ~ continent)
3.23 ggplot
- Review of what we’ve learned so far:
ggplot()
creates a blank coordinate system
aes()
helps us map variables to visual properties (x/y location, color, shape, etc)
facet_wrap()
andfacet_grid()
help us convey variables via subfigures
- But what about plots other than the scatterplot?
3.24 geoms
- A
geom
(geometrical object) isggplot
’s way of representing data
- We’ve been using
geom_point()
to represent data as points, e.g. a scatterplot - A
geom
is (usually) the thing we call the plot - line plots, bar plots, boxplots, etc
- Let’s plot the same relationship between wealth and life expectancy but using
geom_smooth()
rather thangeom_point()
:
ggplot(data = gapminder) + geom_smooth(mapping = aes(x = log(gdpPercap), y = lifeExp))
3.26 geoms
- Hey, that last plot looked pretty linear
- We can use OLS instead:
ggplot(data = gapminder) +
geom_smooth(mapping = aes(x = log(gdpPercap),
y = lifeExp),
method = "lm")
3.28 geoms and aesthetics
- Note that different aesthetics are available for different geoms
- So while
linetype
didn’t really make sense for our scatterplot, it makes total sense for a line:
ggplot(data = gapminder) +
geom_smooth(mapping = aes(x = log(gdpPercap),
y = lifeExp,
color = continent,
linetype = continent),
method = "lm")
3.30 multiple geoms
- To add multiple geoms, just add them one after the other:
ggplot(data = gapminder) + geom_smooth(mapping = aes(x = log(gdpPercap), y = lifeExp)) + geom_point(mapping = aes(x = log(gdpPercap), y = lifeExp))
3.32 inherit aes
- Instead of retyping the
aes
mapping, we can specify a set of defaults in theggplot()
call, and overwrite (or add) then in eachgeom
call:
ggplot(data = gapminder, mapping = aes(x = log(gdpPercap), y = lifeExp)) + geom_smooth() + geom_point(mapping = aes(color = continent))
3.34 Review
- ggplot2 provides a very flexible way to make high-quality graphics
- stuff we didn’t look at:
- Lots of different geoms
- Changing scales
- Position
- How to save to include in your paper (later, I promise!)
4 Basic R
4.1 Basics
- We skipped all of this because plotting is more fun & I wanted to start with something fun
- Let’s talk about basic R
4.2 Calculator
- Remember R can be a calculator:
3 * 3 + 29 ^ 4 + 7
[1] 707297
- But R doesn’t “remember” the answer to that anywhere
- You must assign the output to an object in order for R to remember it:
x <- 3 * 3 + 29 ^ 4 + 7 my_name <- "Alex Branham"
- Tip: In Rstudio, use alt+- (option+-) to get
<-
4.3 Wait, what?
- Yeah, I just assigned letters to an object
- We can inspect the contents of an object by typing it into the R console:
x
[1] 707297
- Here, type
my_
then hit tab to have autocompletion
my_name
[1] "Alex Branham"
4.4 +
- If you forgot the closing
"
—my_name <- "Alex Branham
- The R prompt will change from
>
to+
- This indicates that R is waiting for you.
- Cancel by mashing
ESC
4.5 R is pedantic
- You have to be really specific with R:
x
[1] 707297
X
Error: object 'X' not found
my_nam
Error: object 'my_nam' not found
4.6 Things don’t happen magically
x
[1] 707297
x / 1000
[1] 707.297
x
[1] 707297
4.7 Missing values
- Missing data is represented by
NA
in R - R thinks about this as “something that’s there, but whose value we do not know”
- Missingness propagates
mean(c(1, 2, NA))
[1] NA
4.8 Missingness quiz
- What will be the result?
- We’ll learn more about logical statements in a bit, this asks “Is 3 equal to NA”?
3 == NA NA == NA
4.9 Missingness quiz answer
3 == NA
[1] NA
NA == NA
[1] NA
4.10 Functions
- Functions in R can take zero or more arguments
function(arg1 = object1, arg2 = object2, arg3 = object3)
my_vector <- seq(from = 1, to = 10, by = 1) my_vector
[1] 1 2 3 4 5 6 7 8 9 10
mean(x = my_vector)
[1] 5.5
4.11 Functions, continued
my_vector <- c(1, 2, 3, NA, NA, NA, 3, 2, 1) mean(x = my_vector)
[1] NA
mean(x = my_vector, na.rm = TRUE)
[1] 2
4.12 Function arguments
- You don’t have to specify argument names if you type them in order.
- Since
x
is the first argument ofmean()
, no need to typemean(x = my_vector)
- Instead, can just type
mean(my_vector)
- This cuts down on the amount you have to type
4.13 Data
- OK, so now we know how to assign stuff and functions
- Let’s learn about how R thinks about data
- “data” here doesn’t have to mean data from e.g. a survey
- R cares about the class (type) of data and its dimension(s)
4.14 Data types
- We’ll discuss the four most common data types:
- Numeric
- Logical
- Character
- Factor
- We’ll also cover
NA
4.15 Numeric
- Numeric is how R thinks about numbers!
- These can also be called “integer” (if round numbers) or “double”
class(c(1, 2, 3))
[1] "numeric"
sum(c(1, 2, 3))
[1] 6
class(sum(c(1, 2, 3)))
[1] "numeric"
4.16 Logical
- Logical can take two values —
TRUE
orFALSE
- This is useful for dummy variables and tests
1:10 > 5
[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
4.17 Character
- Characters represent text
- Sometimes these are called “strings”
c("This", "vector", "is", "of", "length", "what?") c("How about this one?")
4.18 Factor
- Factors are how R thinks about categorical variables
- We already worked with these when we used the
continent
variable fromgapminder
head(gapminder$continent)
[1] Asia Asia Asia Asia Asia Asia Levels: Africa Americas Asia Europe Oceania
4.19 Data type quiz
What type of data are the following?
182 c("My name is Alex") "TRUE" FALSE c(1, 2, 3) c(1, "Alex", TRUE)
4.20 Data dimensions
What’s the difference?
[1] 1 2 3 4 5 6
[,1] [,2] [1,] 1 4 [2,] 2 5 [3,] 3 6
- Data can have dimensions
- Numeric, logical, character, and factors are single dimensions (so are lists)
- That matrix is a 3 by 2 matrix
- Why might we want to have two-dimensional data?
4.21 The data.frame
- Matrices must have the same type, but we can mix and match types with a
data.frame
- Remember
gapminder
from earlier? - We used a
data.frame
to store columns with different data types
- We can access (index, subset)
data.frame
objects using notation similar to matrix notation:
4.22 Indexing data.frame
gapminder[2, 1] # get whatever is in the second row, 1st col gapminder[1, ] # get the first row (all) gapminder[, 1] # get the first col (all) gapminder[, "country"] # select by name gapminder$country # slightly different
4.23 Review
- What we learned
- Missingness propagates
- Functions & arguments
- Basic vectors: numeric, logical, character, factor
- Dimensions & the data.frame
5 Data import & manipulation
5.1 Importing data
- Importing data in R is either trivially easy (usually) or super specific and difficult (rarely), so we won’t actually be doing this
- R has a lot of build in functions:
read.csv()
,read.table()
, etc - Packages provide still more:
readr::read_csv()
,haven::read_dta()
, etc
- I prefer the
rio
package because I don’t have to think - Always gives you a
data.frame
:
library(rio) csv_data <- import("file.csv") stata_data <- import("file.dta")
5.2 Working directories & project structure
- R has the concept of a “working directory”
- You can see where this is by typing
getwd()
into the console - I like to store data and code in separate folders:
- Tip: Rstudio can manage “projects” that take care of a lot of this
5.3 Simple project structure
my-paper-project/ |--- code/ | |--- my-script.R | |--- my-alt-script.R |--- data/ | |--- awesome-data.csv |--- output/ | |--- figure1.eps | |--- figure2.eps | |--- table1.tex | |--- table2.tex |--- my-paper.tex
5.4 Relative paths
- If you have code like that, you need to know what a relative path is so that code in your
code/
directory can load data in yourdata/
directory!
- So if we’re running a file from
code/
(that’s the working directory), we can load data by doing:
my_awesome_data <- import("../data/awesome_data.csv")
- Two dots
..
says “go up one directory”, we could chain them to go up two:../..
5.5 dplyr
- We are going to use dplyr, another package you’ve installed, to help us transform data
filter()
drops rows based on columnsselect()
selects columns
mutate()
creates new variablessummarize()
return statistics
group_by()
allows us to do the above by groups
-These functions take data as the first argument and always return a data.frame1
library(dplyr)
5.6 filter
filter()
uses logical statements (that are TRUE) to return rows:
filter(gapminder, continent == "Asia") filter(gapminder, continent == "Asia" & year >= 2000) filter(gapminder, continent == "Asia" & year != 2000) filter(gapminder, continent == "Asia" | year == 2000)
5.7 Quiz
- Use filter to return all the rows containing observations from Asia or Africa
filter(gapminder, continent == "Asia" | continent == "Africa") filter(gapminder, continent %in% c("Asia", "Africa"))
5.8 select
- The
select
function selects one or more columns:
select(gapminder, country) select(gapminder, country, year, continent) select(gapminder, -continent)
- several helper functions (e.g.
starts_with
), see?select
for examples
5.9 mutate
- Mutate creates new variables:
mutate(gapminder, gdp = pop * gdpPercap)
# A tibble: 1,704 x 7 country continent year lifeExp pop gdpPercap gdp <fctr> <fctr> <int> <dbl> <int> <dbl> <dbl> 1 Afghanistan Asia 1952 28.801 8425333 779.4453 6567086330 2 Afghanistan Asia 1957 30.332 9240934 820.8530 7585448670 3 Afghanistan Asia 1962 31.997 10267083 853.1007 8758855797 4 Afghanistan Asia 1967 34.020 11537966 836.1971 9648014150 5 Afghanistan Asia 1972 36.088 13079460 739.9811 9678553274 6 Afghanistan Asia 1977 38.438 14880372 786.1134 11697659231 7 Afghanistan Asia 1982 39.854 12881816 978.0114 12598563401 8 Afghanistan Asia 1987 40.822 13867957 852.3959 11820990309 9 Afghanistan Asia 1992 41.674 16317921 649.3414 10595901589 10 Afghanistan Asia 1997 41.763 22227415 635.3414 14121995875 # ... with 1,694 more rows
5.10 summarize
summarize
(orsummarise
if you prefer) creates summary statistics:
summarize(gapminder, mean_life = mean(lifeExp))
# A tibble: 1 x 1 mean_life <dbl> 1 59.47444
- Though whoop-de-doo, we could’ve just done
mean(gapminder$lifeExp)
to get that!
- Much more useful if we do this by groups
5.11 group_by
- All the functions we just learned can be performed by groups!
- This is really exciting and makes life much easier
- Calculate mean life expectancy by year:
summarize(group_by(gapminder, year), mean_life = mean(lifeExp)) ## Or, to add it to the data: mutate(group_by(gapminder, year), year_mean_life = mean(lifeExp))
# A tibble: 12 x 2 year mean_life <int> <dbl> 1 1952 49.05762 2 1957 51.50740 3 1962 53.60925 4 1967 55.67829 5 1972 57.64739 6 1977 59.57016 7 1982 61.53320 8 1987 63.21261 9 1992 64.16034 10 1997 65.01468 11 2002 65.69492 12 2007 67.00742 # A tibble: 1,704 x 7 # Groups: year [12] country continent year lifeExp pop gdpPercap year_mean_life <fctr> <fctr> <int> <dbl> <int> <dbl> <dbl> 1 Afghanistan Asia 1952 28.801 8425333 779.4453 49.05762 2 Afghanistan Asia 1957 30.332 9240934 820.8530 51.50740 3 Afghanistan Asia 1962 31.997 10267083 853.1007 53.60925 4 Afghanistan Asia 1967 34.020 11537966 836.1971 55.67829 5 Afghanistan Asia 1972 36.088 13079460 739.9811 57.64739 6 Afghanistan Asia 1977 38.438 14880372 786.1134 59.57016 7 Afghanistan Asia 1982 39.854 12881816 978.0114 61.53320 8 Afghanistan Asia 1987 40.822 13867957 852.3959 63.21261 9 Afghanistan Asia 1992 41.674 16317921 649.3414 64.16034 10 Afghanistan Asia 1997 41.763 22227415 635.3414 65.01468 # ... with 1,694 more rows
5.12 group_by
, continued
- Calculate change in life expectancy by country:
mutate(group_by(gapminder, country), life_change = lifeExp - lag(lifeExp))
# A tibble: 1,704 x 7 # Groups: country [142] country continent year lifeExp pop gdpPercap life_change <fctr> <fctr> <int> <dbl> <int> <dbl> <dbl> 1 Afghanistan Asia 1952 28.801 8425333 779.4453 NA 2 Afghanistan Asia 1957 30.332 9240934 820.8530 1.531 3 Afghanistan Asia 1962 31.997 10267083 853.1007 1.665 4 Afghanistan Asia 1967 34.020 11537966 836.1971 2.023 5 Afghanistan Asia 1972 36.088 13079460 739.9811 2.068 6 Afghanistan Asia 1977 38.438 14880372 786.1134 2.350 7 Afghanistan Asia 1982 39.854 12881816 978.0114 1.416 8 Afghanistan Asia 1987 40.822 13867957 852.3959 0.968 9 Afghanistan Asia 1992 41.674 16317921 649.3414 0.852 10 Afghanistan Asia 1997 41.763 22227415 635.3414 0.089 # ... with 1,694 more rows
5.13 group_by
, continued
- You can group by multiple variables
summarize(group_by(gapminder, continent, year), mean_life = mean(lifeExp))
# A tibble: 60 x 3 # Groups: continent [?] continent year mean_life <fctr> <int> <dbl> 1 Africa 1952 39.13550 2 Africa 1957 41.26635 3 Africa 1962 43.31944 4 Africa 1967 45.33454 5 Africa 1972 47.45094 6 Africa 1977 49.58042 7 Africa 1982 51.59287 8 Africa 1987 53.34479 9 Africa 1992 53.62958 10 Africa 1997 53.59827 # ... with 50 more rows
5.14 Chaining
- What if we want to select all countries in Africa and calculate mean life expectancy by year?
- This is easy to do because the dplyr functions always take the data as their first argument and always return a data.frame
5.15 Chaining, continued
- One option:
summarize(group_by(filter(gapminder,
continent == "Africa"),
year),
mean_life = mean(lifeExp))
- Or we could assign to objects along the way
just_africa <- filter(gapminder,continent == "Africa"), africa_by_year <- group_by(just_africa, year) summarize(africa_by_year, mean_life = mean(lifeExp))
5.16 Piping
- Both of those have downsides, though
- We’ll use the pipe
%>%
to “pipe” the thing on the left into the thing on the right: - Tip: In Rstudio, use Ctrl+shift+m (Cmd+shift+m) to get
%>%
gapminder %>% filter(continent == "Africa") %>%
group_by(year) %>%
summarize(meanlife = mean(lifeExp))
5.17 Quiz
- Create a data.frame containing the continent, year, avg life expectancy, and change in avg life expectancy
5.18 Quiz answers
gapminder %>% group_by(continent, year) %>% summarize(avg_life = mean(lifeExp)) %>% mutate(change_life = avg_life - lag(avg_life))
# A tibble: 60 x 4 # Groups: continent [5] continent year avg_life change_life <fctr> <int> <dbl> <dbl> 1 Africa 1952 39.13550 NA 2 Africa 1957 41.26635 2.13084615 3 Africa 1962 43.31944 2.05309615 4 Africa 1967 45.33454 2.01509615 5 Africa 1972 47.45094 2.11640385 6 Africa 1977 49.58042 2.12948077 7 Africa 1982 51.59287 2.01244231 8 Africa 1987 53.34479 1.75192308 9 Africa 1992 53.62958 0.28478846 10 Africa 1997 53.59827 -0.03130769 # ... with 50 more rows
5.19 Ungrouping
- Note that our answer had “continent” as a group
- It’s easy to forget about this, so if you’re saving the object for use later, you may want to run
ungroup()
to undo the grouping on the data.frame.
5.20 Other data manipulation
- Those commands take care of the most common data manipulation tasks
- There’s tons more but we don’t have the time to go over them all
- Search engines and R’s help are your friend
5.21 Review
- We learned how to use some of the most common
dplyr
functions to manipulate data (filter, select, mutate, summarize) group_by
makes doing this by groups super easy- Piping can make it easier to read code
6 Diamonds
6.1 The diamonds dataset
- We just learned a lot, let’s apply some of it to a new dataset
- I’m also going to switch from this powerpoint to a “live demo!”
7 Reporting from R
7.1 Reporting
- We’ve learned most of what you need to do data analysis!
- Now let’s do a new analysis on how to report, so we’ll learn
- How to report
- Review much of what we learned
- Learn a few more tricks and tips
- Right now is a good time to “restart” R and to make a project
- I put mine in
~/research/awesome-paper/
but you can put yours wherever!
7.2 New data
- Let’s change the dataset we’re using, just for something new:
- We’ll use the
midwest
dataset fromggplot2
, which has info on some U.S. midwest counties:
library(tidyverse)
midwest
7.3 Descriptive statistics
- Let’s look at the relationship between college education and the percent living in poverty. And maybe this looks different in metro areas, so let’s keep that in mind too.
- I always like to show some descriptive statistics
- Find the mean and standard deviation of our three variables!
midwest %>% select(percbelowpoverty, percollege, inmetro) %>% summarize_all(funs(mean, sd))
7.4 Functions
- What if we want another function other than mean, sd, etc?
- Very likely that it’s either in base R or someone has written it
- Or you can write a function yourself!
- This is actually really easy in R
7.5 Custom functions
- Let’s pretend R didn’t have a
mean
function - How would we write it?
- What do we need to find?
\[ \frac{1}{n} \sum x \]
sum(x) / length(x)
7.6 Custom functions
my_mean <- function(x){ sum(x) / length(x) } my_mean(-1:10)
[1] 4.5
But what about NA???
7.7 If statements
- An
if
statement allows us to conditionally execute code
my_name <- "Alex" if (my_name == "Alex"){ print("I'm Alex!!!") } else{ print("You aren't Alex!!!") }
7.8 Back to the NA problem
How to modify our function???
my_mean <- function(x){ sum(x) / length(x) }
- Solution: Use an if statement! But we gotta let the user tell us whether to remove NA…
7.9 Arguments and defaults
my_mean <- function(x, na.rm = FALSE){ if(na.rm){ x <- x[!is.na(x)]} sum(x) / length(x) }
7.10 Test your functions
Always test a function to make sure it works!
my_mean(c(NA, 0, 1), TRUE) my_mean(c(NA, 0, 1), FALSE)
7.11 Back to our regularly scheduled program…
midwest %>% select(percbelowpoverty, percollege, inmetro) %>% summarize_all(funs(mean, sd))
- But what if we want to show that in our paper?
7.12 stargazer
- There are several packages that let you easily make \LaTeX tables, let’s use
stargazer
:
library(stargazer)
- Can handle Word too, need to do an html dance. See package docs.
7.13 Descriptive stats, latex table:
midwest %>% select(percbelowpoverty, percollege, inmetro) %>% ## stargazer is picky about tibbles vs data.frames as.data.frame %>% stargazer(out = "../output/desc-stats.tex", title = "Descriptive Statistics")
7.14 Descriptive stats, latex table result
- use
\input{output/desc-stats.tex}
to import the table into your paper
7.15 Plot 1
- Let’s make a scatterplot!
- Make a scatterplot with
percbelowpoverty
on the y-axis and include info onpercollege
andinmetro
7.16 Plot 1, simple
g <- midwest %>% ## inmetro is a number but needs to be discrete. ## as.logical will convert so that a 0 is FALSE mutate(inmetro = as.logical(inmetro)) %>% ggplot(aes(percollege, percbelowpoverty, color = inmetro, shape = inmetro)) + geom_point()
7.18 More about graphs
- Note that I assigned the plot to an object
g
- We might want to change some more stuff about the graph (legends, assign colors, etc)
- This way I don’t have to re-run the same code
7.19 Adjust the scale
- You may want to change the color, label legends, etc
- use
scale_aes_type
to do so - So, for example, we can do
scale_color_manual
to change the properties of the color scale. - Let’s change it so that metro areas are blue and rural areas are red:
## Plain red is super harsh, let's scale it back a bit: g + scale_color_manual(values = c("red3", "blue"))
7.21 Changing legend labels
- Of course,
FALSE
andTRUE
are not good legend labels. We can change those too with thescale_color_manual
command:
g + scale_color_manual(values = c("red3", "blue"), labels = c("Rural", "Urban"))
7.23 UGHHHHHHHHH
- Now the legends are separate, though. Need to tell the shape aesthetic to use the same labels!
- While we’re at it, let’s remove the legend title (name):
- Since we’re done changing the scales, let’s reassign
g
g <- g + scale_color_manual(values = c("red3", "blue"), labels = c("Rural", "Urban"), name = "") + scale_shape_discrete(labels = c("Rural", "Urban"), name = "")
7.25 Axis labels
- We should probably fix up our axis labels
- Note that if you want to give the plot a title, subtitle, or caption, you may do so here
g <- g + labs(y = "Percent below poverty", x = "Percent with a college degree")
7.27 Background and themes
- I’m not a fan of the default grey background.
- You can adjust everything yourself, but there are several themes that come built-in
- The package
ggthemes
has many other themes - You can make it look like you’re graphing for the economist. Or from Stata
7.28 Much themes, wow
g + theme_grey() g + theme_gray() g + theme_bw() g + theme_linedraw() g + theme_light() g + theme_dark() g + theme_minimal() g + theme_classic() g + theme_void()
7.29 Plot 1, full
midwest %>% mutate(inmetro = as.logical(inmetro)) %>% ggplot(aes(percollege, percbelowpoverty, color = inmetro, shape = inmetro)) + geom_point() + scale_color_manual(values = c("red3", "blue"), labels = c("Rural", "Urban"), name = "") + scale_shape_discrete(labels = c("Rural", "Urban"), name = "") + labs(x = "Percent below poverty line", y = "Percent with a college education") + theme_bw()
7.31 How to save ggplots
- The
ggsave
function saves a plot (by default, the last one you plotted) - It’s important to specify the width and height
ggsave("../output/my-scatterplot.eps", ## Important to specify!!! width = 9, height = 6.5)
7.32 Linear regression
Let’s run a linear predicting poverty with education and include an interaction term for inmetro
- Yes, I’m ignoring all kinds of issues with this particular model
my_reg <- lm(percbelowpoverty ~ percollege * inmetro, data = midwest) summary(my_reg)
7.33 Linear regression table
stargazer(my_reg,
out = "../output/my-reg.tex")
- Use
\input{output/my-reg.tex}
in your \LaTeX document to import the table!
7.34
7.35 Multiple models
- Oftentimes, we want multiple models
- You can, of course, copy paste code, but that’s error prone (typos, ugh), and difficult to change later on
- There’s an easy solution for this, but first let’s talk about a data structure we haven’t mentioned much yet:
THE LIST
7.36 Lists
- A list is one dimensions (like numeric, logical, character, factor)
- But each element can be of a different type
- We can create lists with the
list
command - Look at the difference:
7.37 Lists, continued
c(3, TRUE, "Nancy")
[1] "3" "TRUE" "Nancy"
list(3, TRUE, "Nancy")
[[1]] [1] 3 [[2]] [1] TRUE [[3]] [1] "Nancy"
7.38 Lists, more
- Subsetting lists can be a little weird
- We use
[[
or[
to subset - First, create a list:
x <- list(c(1:10), c(TRUE, NA, TRUE), c("Bob", "Alice", "Nancy", "Drew"))
7.39 Subsetting lists
- What is the difference:
x[[1]] x[1]
7.40 Named lists
- The double bracket contains the thing at the position,
- Single bracket returns a list of the thing at the position
- Elements of a list can have names:
names(x) <- c("nums", "logs", "chars") ## Can also specify at creation time e.g. list(nums = 1:10) etc
7.41 Named lists, continued
x
$nums [1] 1 2 3 4 5 6 7 8 9 10 $logs [1] TRUE NA TRUE $chars [1] "Bob" "Alice" "Nancy" "Drew"
7.42 Subsetting named lists
- We can now access elements of the list by name instead of by position:
x$chars
[1] "Bob" "Alice" "Nancy" "Drew"
7.43 Data frames are lists too!
- Remember we can use
dataframe$varname
to access variables from a data frame?
- Does this look similar to what we just did with lists?
- That’s because data frames are secretly lists themselves!
7.44 Back to modeling
- OK, why did we just learn about lists?
- We were modeling percent below poverty with an interaction between college education and metro area status
- What if we want to “build the model” by including constituent variables one at a time?
- One way:
7.45 Multiple models
model1 <- lm(percbelowpoverty ~ percollege, data = midwest) model2 <- lm(percbelowpoverty ~ inmetro, data = midwest) model3 <- lm(percbelowpoverty ~ percollege * inmetro, data = midwest)
7.46 Multiple models
- But if we do that, we now have three models just floating around.
- To get summary measures:
summary(model1) summary(model2) summary(model3)
7.47 Multiple models
Y-hats:
predict(model1) predict(model2) predict(model3)
7.48 The problem
- That’s just with three models!
- Sometimes we run many more and the problem only gets worse!
- Idea! let’s use the list to make life easier!
7.49 Multiple models with a list
my_formulae <- list(model1 = percbelowpoverty ~ percollege,
model2 = percbelowpoverty ~ inmetro,
model3 = percbelowpoverty ~ percollege * inmetro)
7.50 Run the models!
- Base R provides
lapply
which iterates over lists
my_regs <- lapply(my_formulae, function(l_ele){lm(l_ele, data = midwest)})
- First argument is a list, second is a function to apply to each element of the list
- We use an anonymous function - one that we create on the fly. You could’ve created a named function too like we did with
my_mean
previously
7.51 Run the models!
- I don’t really like that syntax though so I use
map
from thepurrr
package. - This will do the same thing; the tilde magically creates an anonymous function in the background
my_regs <- map(my_formulae, ~ lm(.x, data = midwest))
7.52 Summarize the models
map(my_regs, summary)
7.53 Get predicted values
map(my_regs, predict)
7.54 Get residuals
map(my_regs, residuals)
7.55 Broom
The broom package has three functions that turns models into data.frames:
glance()
returns a row with model quality/complexitytidy()
returns a row for each coefficientaugment()
returns a row for every row in the data, adding some values (usually residuals and the like)
7.56 Get fit statistics
map(my_regs, broom::glance)
7.57 broom::tidy
map(my_regs, broom::tidy)
7.58 broom::augment
map(my_regs, broom::augment)
7.59 Reporting multiple models
- The stargazer function is smart enough to figure out multiple models:
stargazer(my_regs,
out = "../output/my-reg.tex")
7.60
7.61 Plotting predicted values
- We usually want to plot the predicted values from our models
- We’ll keep using a linear model, but this can really be anything
- Let’s say we want to compare how adding the interaction term affects our predictions
- One way: Plot predicted values from our first and third regressions!
- Let’s add them to our data.frame
7.62 Getting predicted values
my_midwest <- midwest my_midwest$pred_m1 <- predict(my_regs$model1) my_midwest$pred_m3 <- predict(my_regs$model3)
7.63 Generate plots
ggplot(my_midwest, aes(percollege)) + geom_line(aes(y = pred_m1)) + geom_line(aes(y = pred_m3, linetype = as.logical(inmetro)), color = "blue")
7.65 Merging
- Let’s talk about merging data!
- Oftentimes we have different datasets that we need to merge together for whatever reason
- dplyr refers to “merging” as “joining,” which is language borrowed from SQL
- Let’s go to another “live demo”
- Let’s look at two toy datasets that come with dplyr
7.66 How to get help
- Stack overflow (but have an MRE)
- Twitter (#rstats)
- Me! (branham@utexas.edu)
7.67 Other tools that work well with R
- Git
- \LaTeX
- (r)markdown
7.68
Thanks for coming!
Footnotes:
1
Technically, a tibble
, but the difference isn’t very much, so we’ll ignore that