The dplyr package contains a set of functions that are helpful in wrangling datasets into a useful format. The functions in dplyr are referred to as verbs because they generally describe the action that will be done to the dataset.

The core ‘verbs’

  1. select()
  2. arrange()
  3. filter()
  4. summarize() and group_by()
  5. mutate()

Verbs work in similar ways:

  1. First you specify the data frame or tibble ytou want to apply the action to
  2. Specify the arguements that describe what you want to do using variable names (without quotes)
  3. A new date frame is produced, usually

What is a tibble?

library(tidyverse)

mtcars #a built in dataframe

as_tibble(mtcars) #row names are lost!

as_tibble(mtcars, rownames = "make_model") #row names are now a varible in the tibble called make_model

select(): variables (columns)

data("starwars")

#prints the first parts of an object to the console 

head(starwars)

#prints the last parts of the object to the console

tail(starwars)

#glimpse transposes columns so they are shown vertically in the console. This is helpful for large data sets where the rightmost columns are truncated in other formats

glimpse(starwars)

#This tibble includes variables of the type "list". To avoid complications, lets drop those from the data set. 

#We will use the select verb which allows us to select variables (columns) by name 

starwars_cut1 <- select(starwars, name:species)

#or by column number

starwars_cut2 <- select(starwars, 1:11)

#we can also select all variables that are not (!) films, vehicles, and starships  

starwars_cut3 <- select(starwars, !c(films, vehicles, starships))

#we can also slect all variables that are not (!) lists  

starwars_cut <- select(starwars, !where(is.list))

Helper functions

#We can create a dataset of just the color variables with the helper `matches` and a regular expression

starwars_color <- select(starwars, name, matches("*color*"))

#We could also do this with `contains'

starwars_color2 <- select(starwars, name, contains("color"))

arrange(): reoders rows

#We can sort the data set by the values of a variable 

arrange (starwars_cut, by = height) #note the default ascending order

#We can specify descending 

arrange (starwars_cut, by = desc(height)) 

tail(arrange (starwars_cut, by = desc(height))) #note that the observations with missing values are last

#additional variables can be used to break ties

arrange (starwars_cut, by = eye_color, by = desc(mass))

filter(): select observations by their values

### uses >, >=, <, <=,!=, == (not just one!) for comparisons   
### Logical operators: & | !

# filter automatically excludes NAs, have to ask for them specifically 

#returns all observatons where mass is exactly 77 kg
filter(starwars_cut, mass == 77)

#returns all observatons where mass is less than or equal to 77 kg
filter(starwars_cut, mass <= 77)

#returns all observatons where mass is not 77 kg
filter(starwars_cut, mass != 77) #note that NAs have been dropped

#If we want the NAs, we must specify them. Here I used the or operator "|" to ask for all values of mass not 77 and all values that are NAs 
filter(starwars_cut, mass != 77 | is.na(mass))

#compare the results of these two filtering statements

filter(starwars_cut, eye_color == "black" & eye_color == "blue")

filter(starwars_cut, eye_color == "black" | eye_color == "blue")

%in% is a useful way to filter by several variables. This can be helpful if you want to filter by a vector of data identifiers.

filter(starwars_cut, eye_color %in% c("black","blue"))

This chart from the R for Data Science text is helpful for thinking about how to specify sets for filtering:

Boolean sets

slice() is another useful tool for subsetting that allows you to select specfic rows of data

# return the top 5 rows of the starwars_cut dataset
slice_head(starwars_cut, n=5)

mutate(): create new variables

mutate with arithmetic operators

  • operators: + , - , * , / , ^
starwars_cut <- mutate(starwars_cut, resid_mass = mass / height) #adds new variable resid_mass on to right side of the data set

starwars_resid <- transmutate(starwars_cut, resid_mass = mass / height) #creates a new data set with just resid_mass

#We can nadd name back in to make the new data set more useful
starwars_resid <- transmute(starwars_cut, name = name, resid_mass = mass / height)

#You can refer to the variable you just created in the same command
starwars_resid <- transmute(starwars_cut, name = name, resid_mass = mass / height, log10_resid_mass = log10(resid_mass))

Other operations

  • Offsets: lead() and lag() allow you to reference the values of a variable immediately before or immediately after an observation.
  • Cumulative aggregates: cumsum(),cumprod(),cummin(),cummax(),cummean, allow for the calculation of various values as you step through the observations from first to last. This will be influenced by how the data are arranged.
  • Logicals: You can apply various logical operations >, <, == etc. and return a variable with values of TRUE,FALSE, or NA.
  • Ranking: min_rank with return the rank of a value in a variable in ascending order (smallest first). To rank largest first combine: desc(min_rank()
starwars_heft <- transmute(starwars_cut, name = name, heft = min_rank(mass))
  • Grouping can be a useful tool with mutate. We will explore this more below.
  • Mutate can also be used to create new string (character) variables in conjuction with regular expression and stringr package that is part of the tidyverse (more later).

summarize(): Collapse values into a summary statisitic

Functions:

  • location - mean() and median()
  • spread - sd(), IQR(), mad()
summarize(starwars_cut, mean_mass = mean(mass), SD_mass =sd(mass)) #returns an NA because there are missing values

summarize(starwars_cut, mean_mass = mean(mass,  na.rm=TRUE), SD_mass =sd(mass,  na.rm=TRUE)) #we can remove these missing values with na.rm=TRUE

More functions

  • rank - min(), max()
    • quantile(x, 0.75) - this will return value of x that is grater than 75% of observations and less than 25%
  • Position - first(), last()
    • nth(x, 2) - this will return value at specific rank ordered postion. These commands can be ordered with a second vector.
  • counts
    • n() - size of current grouping
    • sum(!is.na(x)) - returns number of non-missing observations
    • n_distinct(x) - returns the number of unique observations
  • Adding logical operators gives counts and proportions from sum() and mean()

group_by()

Takes an existing tibble and converts it into a grouped tbl where operations are performed “by group”

Summarizing all observation of a variable is not particularly interesting. Most often we want to summarize the values of different groups or treatments. dplyr has a group_by command that allows you to define groups and then apply subsequent operations.

starwars_species<-group_by(starwars_cut, species)

glimpse(starwars_species)

summarize(starwars_species, mean_mass = mean(mass,  na.rm=TRUE), SD_mass =sd(mass,  na.rm=TRUE))

We can simplify this and make it easier to read with piping

Piping

starwars_cut %>% #take the starwars_cut data set and pass it to next step
  filter (mass > 0) %>% #filter the incoming data set and pass only observations where mass > 0. This removes NAs
  group_by(species) %>% #Group the incoming input by species name and  
  summarize(mean_mass = mean(mass), SD_mass =sd(mass)) #calculate mean and sd from the filtered and grouped Data. No intermediate files are created

Our current data set has a number of single observation and missing data so means and sd for some groups are meaningless. Lets just look at groups where we have more than one mass observation.

starwars_cut %>% 
  filter (mass > 0) %>% 
  group_by(species) %>%
  mutate(count = n()) %>% #Here we create a new variable that is a count of the total numbers of observations. Since the data are now grouped this action will be performed group-wise. 
  filter (count > 1) %>% #Now I can filter on that count to select only species where there are more than one observation. 
  summarize(mean_mass = mean(mass), SD_mass =sd(mass)) #calculate mean and sd from the filtered and grouped Data. No intermediate files are created
  
#We can condense this and remove the mutate verb

starwars_cut %>% 
  filter (mass > 0) %>% 
  group_by(species) %>%
  filter (n() > 1) %>% #In one step this will calculate the group count and filter by that count
  summarize(mean_mass = mean(mass), SD_mass =sd(mass))