Wrangle

Module Learning Objectives

By the end of this module, you will be able to:

  • Manipulate rows and columns with dplyr’s select and filter functions
  • Create new columns with dplyr’s mutate function and fill them conditionally with case_when (also from dplyr)
  • Use tidyr’s separate_wider_delim function to split a column into two

What are Tidy Data?

What are some common things you like to do with your data? Maybe remove rows or columns, do calculations and add the results as new columns? These operations (and others) are called “data wrangling”. The data we get to work with are rarely, if ever, in the format we need to do our analyses and data wrangling can help bridge that gap. dplyr and tidyr are two R packages from the Tidyverse that provide a fairly complete and extremely powerful set of functions for us to do virtually all needed wrangling quickly. Here we introduce some commonly used functions from these two packages.

We can use glimpse from the dplyr package to look at part of the data while also getting some relevant structural information (i.e., what type of data are in each column, etc.).

# install.packages("tidyverse", "palmerpenguins")
library(tidyverse)
library(palmerpenguins)

dplyr::glimpse(penguins)
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

Selecting Columns

To start off, how do we do a fundamental action like selecting the columns we want? dplyr’s select function provides us with a straightforward way to do just that. We only need to provide the column names!

Graphic of a dataframe with columns labeled 'A', 'B', and 'C' being pare down to just the 'A' column

Note that even if you select just one column, a dataframe will be returned. Whereas if you use the $ operator you get a vector (e.g., data$column returns a vector, not a dataframe).

select Example: Including & Excluding

Example

To select only the species, island, and body_mass_g columns, we can use the following code:

# Provide the name of the data and then the columns that you want!
penguins_selected <- dplyr::select(.data = penguins, species, island, body_mass_g)

# Look at the product
dplyr::glimpse(penguins_selected)
Rows: 344
Columns: 3
$ species     <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Ad…
$ island      <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Tor…
$ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, 4250, …

If we want to remove specific columns, we can use the - operator.

# Put a "-" in front of each column you would like to remove from your dataframe
penguins_selected <- dplyr::select(.data = penguins, -flipper_length_mm, -sex)

# Look at the product
dplyr::glimpse(penguins_selected)
Rows: 344
Columns: 6
$ species        <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie,…
$ island         <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, …
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.…
$ bill_depth_mm  <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, 20.…
$ body_mass_g    <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, 425…
$ year           <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2…

select Example: Selecting with Helper Functions

Example

If we want to select the columns that contain length measurements, we can manually type bill_length_mm and flipper_length_mm, but there’s actually an easier way using the contains function, also from the dplyr package. Enter a string that matches what you’re looking for among the column names.

# Enter a matching string inside of a `select` call
penguins_selected <- dplyr::select(.data = penguins, dplyr::contains("length"))

# Look at the product
dplyr::glimpse(penguins_selected)
Rows: 344
Columns: 2
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…

select Example: Selecting a Range of Columns

Example

Now what if we wanted all the columns from the first column species to the sixth column body_mass_g? We can use a colon, :, between the first (leftmost) and last (rightmost) columns in the range that we want to include.

# Use a colon to indicate a range of columns you want to select
penguins_selected <- dplyr::select(.data = penguins, species:body_mass_g)

# Look at the product
dplyr::glimpse(penguins_selected)
Rows: 344
Columns: 6
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …

Conveniently selecting a range of columns can be especially useful when you have a big dataframe and don’t want to exhaustively list every column by name.

Subsetting Rows

Instead of selecting certain columns, how can we get a subset of rows that meet certain conditions? For example, in the diagram below, how can we filter for rows that contain a diamond shape? We can use dplyr’s handy filter function along with logical and boolean operators!

Graphic of a table with only an 'A' column and rows with different shapes getting subsetted to only the row with a diamond

For reference, here are the operators we can use to specify our conditions with filter.

  • == – two items are exactly equal to one another
  • != – two items are not equal to one another
  • </> – one item is less/greater than the other
  • <=/>= – one item is less/greater than or equal to the other
  • | – one statement or another is TRUE
  • &both statements are TRUE
  • %in% – a value is in a set of other values

You may have noticed that filter accepts the same operators that base R’s subset function does. This is no accident and filter is one of the more accessible Tidyverse functions because the syntax it shares with its base R equivalent.

To get familiar with these operators, let’s see some examples!

filter Example: Exactly Equal

Example

To make a subset of our data that only contains information on Chinstrap penguins, we would use the == operator for “exactly equal to”

# Get all the rows where the species is "Chinstrap" 
penguins_filtered <- dplyr::filter(.data = penguins, species == "Chinstrap")

head(penguins_filtered)
# A tibble: 6 × 8
  species   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>     <fct>           <dbl>         <dbl>             <int>       <int>
1 Chinstrap Dream            46.5          17.9               192        3500
2 Chinstrap Dream            50            19.5               196        3900
3 Chinstrap Dream            51.3          19.2               193        3650
4 Chinstrap Dream            45.4          18.7               188        3525
5 Chinstrap Dream            52.7          19.8               197        3725
6 Chinstrap Dream            45.2          17.8               198        3950
# ℹ 2 more variables: sex <fct>, year <int>

Note that we need to write the value we’re looking for as a character string bookended by quotation marks.

filter Example: Either / Or

Example

What if we wanted to get all the rows where the penguin species is “Chinstrap” or “Gentoo”? In other words, we want all the rows where either condition is true. There are two options to do this. The first option is to use the “or” operator (|) between each of the conditions.

# Get all the rows where the species is "Chinstrap" or "Gentoo"
penguins_filtered <- dplyr::filter(.data = penguins,
                             species == "Chinstrap" | species == "Gentoo")

unique(penguins_filtered$species)
[1] Gentoo    Chinstrap
Levels: Adelie Chinstrap Gentoo

This method works fine for a few options but begins to get cumbersome when you have many possible conditions that you’d like to retain. In these cases you can use the %in% operator followed by a vector of values that you want to include in your filter.

# Get all the rows where the species is "Chinstrap" or "Gentoo"
penguins_filtered <- dplyr::filter(.data = penguins,
                             species %in% c("Chinstrap", "Gentoo"))

unique(penguins_filtered$species)
[1] Gentoo    Chinstrap
Levels: Adelie Chinstrap Gentoo

filter Example: Multiple Conditions

Example

We can also keep rows where both conditions are met by using the & operator to specify multiple conditions that must all be true. To keep only the rows where the species is “Adelie” and the island is “Dream”, we can use the following code:

# Get all the rows where the species is "Adelie" and the island is "Dream"
penguins_filtered <- dplyr::filter(.data = penguins,
                             species == "Adelie" & island == "Dream")

dplyr::glimpse(penguins_filtered)
Rows: 56
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Dream, Dream, Dream, Dream, Dream, Dream, Dream, Dre…
$ bill_length_mm    <dbl> 39.5, 37.2, 39.5, 40.9, 36.4, 39.2, 38.8, 42.2, 37.6…
$ bill_depth_mm     <dbl> 16.7, 18.1, 17.8, 18.9, 17.0, 21.1, 20.0, 18.5, 19.3…
$ flipper_length_mm <int> 178, 178, 188, 184, 195, 196, 190, 180, 181, 184, 18…
$ body_mass_g       <int> 3250, 3900, 3300, 3900, 3325, 4150, 3950, 3550, 3300…
$ sex               <fct> female, male, female, male, female, male, male, fema…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

filter Example: Greater Than / Less Than

Example

When subsetting by numeric columns, we can use greater than (>) and less than (<) to capture the range of possible values that meet that criteria. If you want to include an “or equal to” clause, just add an equal sign to the right of the greater/less than sign (e.g., >= or <=).

For instance, we can subset the data for only penguins whose bills are longer than 50 millimeters.

# Filter based on bill length
penguins_filtered <- dplyr::filter(.data = penguins, bill_length_mm > 50)

sort(penguins_filtered$bill_length_mm)
 [1] 50.1 50.1 50.2 50.2 50.2 50.3 50.4 50.4 50.5 50.5 50.5 50.5 50.5 50.6 50.7
[16] 50.7 50.8 50.8 50.8 50.8 50.9 50.9 51.0 51.1 51.1 51.3 51.3 51.3 51.3 51.4
[31] 51.5 51.5 51.7 51.9 52.0 52.0 52.0 52.1 52.2 52.2 52.5 52.7 52.8 53.4 53.5
[46] 54.2 54.3 55.1 55.8 55.9 58.0 59.6

Note that when filtering for numeric columns we do not need the quotation marks around the number(s) we use to filter.

filter Example: Exclusion Criteria

Example

Sometimes it’s faster to subset the rows that do not meet a condition, rather than listing everything that we do want to keep. This is where the != operator (or “not equal to”) becomes useful. More generally, the exclamation mark indicates negation in the operator.

# Get all the rows where the species is NOT "Chinstrap"
penguins_filtered <- dplyr::filter(.data = penguins, species != "Chinstrap")

dplyr::glimpse(penguins_filtered)
Rows: 276
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

Challenge: filter

Your Turn!

Using filter, how would you get all of the rows that do not have any NA values in the sex column?

Making and Modifying Columns

Aside from selecting columns and subsetting rows, we may want to create new columns in our data. For instance, in the diagram below, we have a dataframe that only contains column A, and then we add new columns B and C. We can use dplyr’s mutate function to add a new column, while keeping the existing columns.

Graphic of a table with an 'A' column gaining a 'B' and 'C' column

The general syntax to add a new column to your dataframe is as follows:

your_data_v2 <- dplyr::mutate(.data = your_data, new_column_name = what_it_contains)

mutate Example: Making New Columns

Example

If we wanted to add a new column that has the penguin’s body mass in kilograms, we can do some arithmetic on the body_mass_g column and store the result in a new column.

# Create a new column with the penguins' body mass in kilograms
penguins_mutated <- dplyr::mutate(.data = penguins, body_mass_kg = body_mass_g / 1000)

dplyr::glimpse(penguins_mutated)
Rows: 344
Columns: 9
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
$ body_mass_kg      <dbl> 3.750, 3.800, 3.250, NA, 3.450, 3.650, 3.625, 4.675,…

mutate Example: Overwriting Existing Columns

Example

Additionally, mutate can be used to overwrite an existing column. If we give the new column the same name as an existing column, the existing column will be replaced. As you can see, island is currently a factor. To change its class to a character, we would need to overwrite the column.

# Check current format of the `island` column
class(penguins$island)
[1] "factor"
# Modify the existing island column
penguins_mutated <- dplyr::mutate(.data = penguins, island = as.character(island))

# the `island` column is now a character!
class(penguins_mutated$island)
[1] "character"

Now island is a character column!

Conditional Operations

Sometimes in data wrangling we’ll want to generate a new column where the contents of the column are dependent upon an existing column but we have many separate “if X then Y” type statements. Such statements are called “conditional” statements in programming. You may already be familiar with base R’s ifelse function for handling cases where you have an either/or condition.

In the Tidyverse–specifically dplyr–we have case_when for handling multiple conditions in an efficient and relatively straightforward way! Why are we talking about case_when here? Because you can use case_when inside of a mutate to create a new column based on the conditions that you specify.

Here is what the general syntax for this operation looks like:

your_data_v2 <- dplyr::mutate(.data = your_data,
                              new_column_name = dplyr::case_when(
                                  condition1 ~ value_for_condition1,
                                  condition2 ~ value_for_condition2,
                                  condition3 ~ value_for_condition3,
                                  ...
                                  TRUE ~ value_if_no_conditions_are_met))

Let’s look at an example to make this somewhat more tangible.

mutate + case_when Example: Creating a New Column Conditionally

Example

Suppose we want to add a new column called flipper_rank that contains the following:

  • “short” if flipper_length_mm is < 190 mm
  • “long” if flipper_length_mm is >= 190 mm
# Enter your conditions on the left side and the values on the right side of the tilde
penguins_mutated <- dplyr::mutate(.data = penguins,
                                  flipper_rank = dplyr::case_when(
                                    flipper_length_mm < 190 ~ "short",
                                    flipper_length_mm >= 190 ~ "long")
)

dplyr::glimpse(penguins_mutated)
Rows: 344
Columns: 9
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
$ flipper_rank      <chr> "short", "short", "long", NA, "long", "long", "short…

Challenge: mutate + case_when

Your Turn!

Using mutate and case_when, create a new column called size_bin that contains the following:

  • “large” if body mass is greater than 4500 grams
  • “medium” if body mass is greater than 3000 grams, and less than or equal to 4500 grams
  • “small” if body mass is less than or equal to 3000 grams

Splitting a Column into Multiple Columns

Another relatively common task in data wrangling involves splitting the contents of one column into several columns. To demonstrate, let’s first make a new column that contains the full scientific names for these penguins using mutate and case_when.

# Remember that conditions are on the left side and the values are on the right side of the tilde
penguins_v1 <- dplyr::mutate(.data = penguins, 
                             scientific_name = dplyr::case_when(
                               species == "Adelie" ~ "Pygoscelis_adeliae",
                               species == "Chinstrap" ~ "Pygoscelis_antarcticus",
                               species == "Gentoo" ~ "Pygoscelis_papua"))

dplyr::glimpse(penguins_v1)
Rows: 344
Columns: 9
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
$ scientific_name   <chr> "Pygoscelis_adeliae", "Pygoscelis_adeliae", "Pygosce…

If we want to split the scientific name into genus and specific epithet, we can use the separate_wider_delim function from the tidyr package.

separate_wider_delim Example: Splitting a Column Apart

Example

Using our new scientific name column, suppose we want to split it so that scientific_name becomes two columns: genus and epithet. Using tidyr’s separate_wider_delim function we can do this in a single step!

# Indicate the column you want to split, the separator, and the new column names!
penguins_separated <- tidyr::separate_wider_delim(data = penguins_v1,
                                                  cols = scientific_name,
                                                  delim = "_",
                                                  names = c("genus", "epithet"))

dplyr::glimpse(penguins_separated)
Rows: 344
Columns: 10
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
$ genus             <chr> "Pygoscelis", "Pygoscelis", "Pygoscelis", "Pygosceli…
$ epithet           <chr> "adeliae", "adeliae", "adeliae", "adeliae", "adeliae…