Reshape

Module Learning Objectives

By the end of this module, you will be able to:

  • Contrast “long” data with “wide” data
  • Use tidyr’s pivot_wider and pivot_longer functions to reshape data

Defining “Shape”

Before talking about how to reshape your data between wide and long format, let’s talk about what “shape” means in reference to data. Fundamentally, “long” data are data with more rows than columns while “wide” data tend to have more columns than rows.

For example, in community ecology a “wide” dataframe could have each row being a site that researchers visited while each column could be a different species where the value in the row is the number of individuals of that species at that site. On the other hand, the penguins dataframe we’ve been working with so far is in “long” format because it has one row per penguin and multiple penguins are stacked up.

Both wide and long format data can be useful in certain contexts and it is sometimes most intuitive to reshape data from one form to the other (and sometimes back again to the original form!).

Reshaping Data

The tidyr package contains the intuitively-named pivot_wider and pivot_longer for doing exactly this reshaping.

To help demonstrate these two functions, let’s begin by summarizing our dataframe to make changing the shape of the dataframe more visible than it would be with the full dataframe. For example, let’s calculate the average bill length of each penguin species on each island.

# Begin by naming our new data and the data they come from
penguins_simp <- penguins %>%
  # Now group by species and island
  dplyr::group_by(species, island) %>%
  # Calculate average bill length
  dplyr::summarize(avg_bill_length_mm = mean(bill_length_mm, na.rm = TRUE)) %>%
  # And don't forget to ungroup!
  dplyr::ungroup()

# And this is what we're left with:
penguins_simp
# A tibble: 5 × 3
  species   island    avg_bill_length_mm
  <fct>     <fct>                  <dbl>
1 Adelie    Biscoe                  39.0
2 Adelie    Dream                   38.5
3 Adelie    Torgersen               39.0
4 Chinstrap Dream                   48.8
5 Gentoo    Biscoe                  47.5

Great! We can use this smaller data object to demonstrate reshaping more clearly. Let’s begin with an example for pivot_wider.

pivot_wider Example: Reshaping to Wide Format

Example

pivot_wider takes long format data and reshapes it into wide format.

Graphic of a table with 'A' and 'B' columns being pivoted to a table with 'A', 'C', and 'D' columns and a 'B' row

Let’s say that we want to take that data object and reshape it into wide format so that each island is a column and each species of penguin is a row. The contents of each cell then are going to be the average bill length values that we just calculated.

# Begin by naming the objects
penguins_wide <- penguins_simp %>%
  # And now we can pivot wider with `pivot_wider`!
  tidyr::pivot_wider(names_from = island,
                     values_from = avg_bill_length_mm )

# Take a look!
penguins_wide
# A tibble: 3 × 4
  species   Biscoe Dream Torgersen
  <fct>      <dbl> <dbl>     <dbl>
1 Adelie      39.0  38.5      39.0
2 Chinstrap   NA    48.8      NA  
3 Gentoo      47.5  NA        NA  

Great! We now have each island as a column, each row is a penguin species, and the bill length measurement we took is included in each cell. Note that in this specific case this makes the number somewhat ambiguous so we might want to use dplyr’s select or the more specific rename to change the island names to be clearer that those values are bill lengths in milimeters.

pivot_longer Example: Reshaping to Long Format

Example

Now that we have a small wide format data object, we can feed it to pivot_longer and reshape our data into long format! pivot_longer has very similar syntax except that with pivot_longer you need to tell the function which columns should be reshaped.

pivot_wider on the other hand knows which columns to move around because you manually specify them in the “names_from” and “values_from” arguments.

Graphic of a table with 'A' and 'B' columns being pivoted to a table with 'C' and 'D' columns and 'A' and 'B' rows

# Begin with our wide data
penguins_wide %>%
  # And reshape back into long format
  pivot_longer(cols = -species,
               names_to = "island_name",
               values_to = "mean_bill_length_mm" )
# A tibble: 9 × 3
  species   island_name mean_bill_length_mm
  <fct>     <chr>                     <dbl>
1 Adelie    Biscoe                     39.0
2 Adelie    Dream                      38.5
3 Adelie    Torgersen                  39.0
4 Chinstrap Biscoe                     NA  
5 Chinstrap Dream                      48.8
6 Chinstrap Torgersen                  NA  
7 Gentoo    Biscoe                     47.5
8 Gentoo    Dream                      NA  
9 Gentoo    Torgersen                  NA  

Two quick things to note here:

  • First, pivot_longer included the cells that were NA in the wide version of the data.
    • This default behavior is really nice so that you don’t lose any cells implicitly (though you can always filter them out if you don’t want them!).
  • Second, you’ll note that in the “cols” argument I only told pivot_longer to not include the “species” column using the same notation you could use for the select function in the dplyr package.
    • This is very handy because it lets us write really concise values in the “cols” argument and the default becomes “everything except what was specified”.
    • Note that we could have also said cols = Biscoe, Dream, Torgersen and achieved the same reshaping of the data.

Challenge: Reshaping

Your Turn!

The code below creates a data object that includes the flipper length of all Adelie penguins; what code would you add to reshape the data so that each sex is a column with flipper lengths in the cells?

penguins %>%
  # Keep only Adelie penguins of known sex
  dplyr::filter(species == "Adelie" & !is.na(sex)) %>%
  # Calculate the average flipper length by island and sex
  dplyr::group_by(island, sex) %>%
  dplyr::summarize(avg_flipper_length_mm = mean(flipper_length_mm, na.rm = TRUE)) %>%
  # Ungroup (good practice to include this step!)
  dplyr::ungroup()
# A tibble: 6 × 3
  island    sex    avg_flipper_length_mm
  <fct>     <fct>                  <dbl>
1 Biscoe    female                  187.
2 Biscoe    male                    190.
3 Dream     female                  188.
4 Dream     male                    192.
5 Torgersen female                  188.
6 Torgersen male                    195.