📘 Unit 2.1: Tidy Data Philosophy and the Tidyverse Ecosystem

Unit Objectives

By the end of this unit, the student will be able to:

  • Understand and apply the fundamental principles of tidy (rectangular) data.
  • Install, load, and manage the tidyverse ecosystem.
  • Identify the main packages within the tidyverse and their specific purposes.
  • Use the pipe operator (%>%) to chain operations in a readable manner.
  • Begin structuring data analysis workflows following the tidy philosophy.

1. What is Tidy Data?

The tidy data philosophy was formalized by Hadley Wickham in his paper “Tidy Data” (2014). It defines a standard for structuring datasets to facilitate analysis, visualization, and manipulation. A dataset is in tidy format if it adheres to the following three rules:

Rule 1: Each variable forms a column.

A variable represents a measured quantity or characteristic (e.g., age, income, country, date). In tidy data, each variable must occupy exactly one column.

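❌ Non-tidy Example:

# The "2020", "2021", "2022" columns represent years → these are values, not variables!
# check.names = FALSE keeps the column names as written; by default
# data.frame() would rename them to X2020, X2021, X2022.
annual_sales <- data.frame(
  product = c("A", "B"),
  `2020` = c(100, 150),
  `2021` = c(120, 160),
  `2022` = c(130, 170),
  check.names = FALSE
)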

✅ Tidy Example:

library(tidyr)
long_sales <- annual_sales %>%
  pivot_longer(cols = c(`2020`, `2021`, `2022`),
               names_to = "year",
               values_to = "sales")
# Now "year" is a column (variable) and "sales" is another → each variable in its own column!

Rule 2: Each observation forms a row.

An observation is a unique instance of measurement (e.g., the sale of a product in a specific year). Each observation must occupy exactly one row.
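One way to check this rule in practice is to count rows per identifying key. The following is a minimal sketch, assuming a hypothetical table long_sales_demo in which product and year together identify one observation (both the table and its name are illustrative only):

```r
library(dplyr)
library(tibble)

# Hypothetical long-format table: product + year identify one observation
long_sales_demo <- tibble(
  product = c("A", "A", "B", "B"),
  year    = c(2020, 2021, 2020, 2021),
  sales   = c(100, 120, 150, 160)
)

# Count rows per key; any key with n > 1 would mean an observation is duplicated
duplicated_keys <- long_sales_demo %>%
  count(product, year) %>%
  filter(n > 1)

nrow(duplicated_keys)  # 0 here: every observation sits in exactly one row
```

If this count is greater than zero on real data, the table probably needs deduplication or an additional key column before further analysis.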

Rule 3: Each observed value forms a cell.

Each cell must contain a single atomic value. Multiple values should not be combined in one cell (e.g., “Madrid, Spain” should be split into two columns: city and country).
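As a sketch of how such a split might look, the example below uses tidyr's separate() on a hypothetical offices table (the table, its column names, and the second row are assumptions made for illustration):

```r
library(tidyr)

# Hypothetical table with two values packed into a single cell
offices <- data.frame(
  office   = c("HQ", "Branch"),
  location = c("Madrid, Spain", "Lyon, France")
)

# Split "location" into two atomic columns at the comma-plus-space separator
offices_tidy <- separate(offices, location,
                         into = c("city", "country"),
                         sep = ", ")
offices_tidy
```

In tidyr 1.3.0 and later, separate() is marked as superseded in favor of separate_wider_delim(); it still works, and either function fits this case.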


2. Benefits of Tidy Data

  • Consistency: All datasets follow the same structure, allowing code and functions to be reused.
  • Interoperability: tidyverse tools are designed to work with tidy data.
  • Readability: It's easier to understand what each column and row represents.
  • Efficiency: Facilitates the use of vectorized operations and grouping functions.
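To illustrate the efficiency point, here is a small sketch with a hypothetical tidy table sales_long that reuses the figures from the annual sales example. Because all sales values live in one column, a single grouped summary covers every product:

```r
library(dplyr)
library(tibble)

# Hypothetical tidy table: one row per product-year observation
sales_long <- tibble(
  product = c("A", "A", "A", "B", "B", "B"),
  year    = c(2020, 2021, 2022, 2020, 2021, 2022),
  sales   = c(100, 120, 130, 150, 160, 170)
)

# One grouped summary handles all products, however many there are
by_product <- sales_long %>%
  group_by(product) %>%
  summarise(mean_sales = mean(sales), total_sales = sum(sales))

by_product
```

With the wide layout (one column per year), the same summary would require naming each year column explicitly and editing the code whenever a new year is added.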

💡 Professional Tip: Before any analysis, convert your data to tidy format. It will save you hours of debugging and confusing code.


3. Introduction to the Tidyverse

The tidyverse is a collection of R packages designed to work together and facilitate data analysis under the tidy philosophy. It was created and is primarily maintained by Hadley Wickham and the Posit team (formerly RStudio).

Main Packages in the Tidyverse

Package    Primary Purpose                                        Key Functions
ggplot2    Data visualization                                     ggplot(), geom_point(), aes()
dplyr      Data frame manipulation                                filter(), select(), mutate(), group_by(), summarise()
tidyr      Data cleaning and reshaping                            pivot_longer(), pivot_wider(), separate(), unite()
readr      Data import (CSV, TSV, etc.)                           read_csv(), write_csv()
purrr      Functional programming (iteration over lists/vectors)  map(), map_dbl(), walk()
tibble     Modernization of data frames                           tibble(), as_tibble()
stringr    String manipulation                                    str_detect(), str_replace(), str_split()
forcats    Factor manipulation                                    fct_reorder(), fct_lump(), fct_infreq()
lubridate  Date and time manipulation                             ymd(), dmy(), interval(), now()

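📦 Note: As of tidyverse 2.0.0, library(tidyverse) attaches all nine packages listed above (lubridate joined the core set in that release). Other packages, such as readxl, haven, and rvest, are installed alongside the tidyverse but must be loaded explicitly with their own library() call.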
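To see which packages belong to the tidyverse and which of them are attached in your own session, a short sketch along these lines may help. It relies on tidyverse_packages() from the tidyverse package; the variable names are illustrative:

```r
# List every package that is installed as part of the tidyverse
all_pkgs <- tidyverse::tidyverse_packages()

# Of those, find the ones currently attached to the search path
attached <- all_pkgs[paste0("package:", all_pkgs) %in% search()]

all_pkgs
attached
```

The result of the second step depends on your tidyverse version and on what you have loaded so far, so no fixed output is shown here.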


4. Installing and Loading the Tidyverse

Installation (one-time only)

install.packages("tidyverse")
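In shared scripts it can be convenient to install only what is missing. The following is one possible sketch; the vector name needed and the decision to install only in interactive sessions are assumptions of this example, not requirements:

```r
# Packages used in this unit
needed <- c("tidyverse", "gapminder")

# TRUE for each package that is already installed
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
to_install <- needed[!is_installed]

# Install only in interactive sessions, so batch runs never trigger downloads
if (length(to_install) > 0 && interactive()) {
  install.packages(to_install)
}
```

This keeps repeated runs fast, since nothing is reinstalled when the packages are already present.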

Loading in each session

library(tidyverse)
# ── Attaching core tidyverse packages ────────────────────────
# ✔ dplyr     1.1.4     ✔ readr     2.1.5
# ✔ forcats   1.0.0     ✔ stringr   1.5.1
# ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
# ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
# ✔ purrr     1.0.2
# ── Conflicts ──────────────────────────────────────────────
# ✖ dplyr::filter() masks stats::filter()
# ✖ dplyr::lag()    masks stats::lag()

⚠️ Warning: The “Conflicts” message indicates that some tidyverse functions (such as filter() and lag()) mask functions of the same name from the stats package that ships with R. This is normal and intended. To use the stats version, call it with its namespace prefix, for example stats::filter().
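The two filter() functions do unrelated jobs, which the following sketch makes explicit by calling each one with its namespace prefix (the variable names are illustrative):

```r
library(dplyr)

# dplyr::filter() keeps the rows of a data frame that satisfy a condition
four_cyl <- dplyr::filter(mtcars, cyl == 4)
nrow(four_cyl)          # 11

# stats::filter() applies a linear filter to a numeric series;
# here it computes a centered 3-point moving average
moving_avg <- stats::filter(1:5, rep(1 / 3, 3))
as.numeric(moving_avg)  # NA 2 3 4 NA
```

Writing the prefix is also a reasonable habit inside functions and packages, where it is not always obvious which packages are attached.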


5. The Pipe Operator (%>%)

The %>% operator (pronounced “pipe”) is a fundamental tool in the tidyverse. It allows you to chain operations from left to right, improving readability and avoiding complex nesting.

Syntax

data %>% verb(arguments)

This is equivalent to:

verb(data, arguments)
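The equivalence can be checked directly. This small sketch loads magrittr, the package that provides %>%, and compares a nested call with its piped form (the vector values is made up for the example):

```r
library(magrittr)

values <- c(1, 4, 9)

# Nested call: read from the inside out
nested <- round(sum(sqrt(values)), 1)

# Piped call: read from left to right
piped <- values %>% sqrt() %>% sum() %>% round(1)

identical(nested, piped)  # TRUE
```

Each step receives the previous result as its first argument, so round(1) here means round(<previous result>, 1).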

Example without pipe (nested, hard to read)

head(select(filter(mtcars, cyl == 4), mpg, hp))

Example with pipe (clear and sequential)

mtcars %>%
  filter(cyl == 4) %>%
  select(mpg, hp) %>%
  head()

💡 RStudio Shortcut: Ctrl + Shift + M (Windows/Linux) or Cmd + Shift + M (Mac) inserts %>%.

Modern Pipes: |> (Base R 4.1+)

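Since R 4.1, there is a native pipe: |>. It works similarly but does not support the . placeholder; from R 4.2 onward it provides _ as a placeholder, which must be supplied to a named argument. This course uses %>%, which offers more flexible placeholder handling and also runs on older R versions.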
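The difference matters when the piped object is not the first argument of the next function. The sketch below assumes R 4.2 or later, because the _ placeholder of the native pipe is not available in R 4.1; the model itself is only an illustration:

```r
library(magrittr)

# magrittr pipe: "." marks where the left-hand side is inserted
fit_magrittr <- mtcars %>% lm(mpg ~ wt, data = .)

# Native pipe (R >= 4.2): "_" plays the same role and needs a named argument
fit_native <- mtcars |> lm(mpg ~ wt, data = _)

# Both calls fit the same model
all.equal(coef(fit_magrittr), coef(fit_native))
```

For the common case where the data frame is the first argument, as with dplyr verbs, the two pipes can be used interchangeably.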


6. Tidy Workflow: A Practical Example

We will analyze the gapminder dataset (install with install.packages("gapminder")), which contains life expectancy, population, and GDP per capita by country and year.

# Load libraries
library(tidyverse)
library(gapminder)

# Explore structure (already in tidy format!)
gapminder %>%
  glimpse()
# Rows: 1,704
# Columns: 6
# $ country   <fct> "Afghanistan", "Afghanistan", ...
# $ continent <fct> Asia, Asia, ...
# $ year      <int> 1952, 1957, ...
# $ lifeExp   <dbl> 28.8, 30.3, ...
# $ pop       <int> 8425333, 9240934, ...
# $ gdpPercap <dbl> 779, 821, ...

# Filter for Americas, keep the needed columns, create a new variable
# (pop must be kept by select(), because mutate() uses it below)
gapminder %>%
  filter(continent == "Americas") %>%
  select(country, year, lifeExp, pop, gdpPercap) %>%
  mutate(gdp_total = gdpPercap * pop) %>%
  arrange(desc(gdp_total)) %>%
  head()

This workflow demonstrates:

  • Each variable in a column (country, year, lifeExp, etc.).
  • Each observation (country-year) in a row.
  • Use of pipes to chain clear operations.
  • Creation of new variables with mutate().

7. Best Practices with Tidyverse

  1. Always work with data in tidy format. Use tidyr to transform it if it isn't.
  2. Use pipes (%>%) to improve readability. Avoid deep nesting.
  3. Load tidyverse at the start of your scripts. Ensures consistency in the environment.
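  4. Prefer tibble over data.frame. Avoids unexpected behaviors (such as partial matching of column names with $, silent renaming of non-syntactic column names, or automatic string-to-factor conversion in R versions before 4.0).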
  5. Use dplyr verbs instead of base subsetting. They are more explicit and readable.
  6. Document your transformations. Use comments to explain complex steps.
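To make point 4 concrete, the following sketch builds the same hypothetical two-column table with data.frame() and with tibble() and compares their behavior (the column names are invented for the example):

```r
library(tibble)

df <- data.frame(`unit price` = c(1.5, 2.0), label = c("a", "b"))
tb <- tibble(`unit price` = c(1.5, 2.0), label = c("a", "b"))

names(df)  # "unit.price" "label": data.frame() rewrites non-syntactic names
names(tb)  # "unit price" "label": tibble() keeps names as written

df$lab     # partial matching silently returns the "label" column
tb$lab     # NULL, with a warning about an unknown column
```

The tibble behavior is stricter, which tends to surface typos in column names early instead of letting them pass unnoticed.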

8. Practical Exercise

Objective: Transform a non-tidy dataset into tidy format and perform a basic exploratory analysis.

Dataset: Quarterly Sales

sales <- data.frame(
  region = c("North", "South", "East", "West"),
  Q1 = c(250, 300, 200, 350),
  Q2 = c(270, 310, 210, 360),
  Q3 = c(260, 305, 205, 355),
  Q4 = c(280, 320, 220, 370)
)

Tasks:

  1. Load the tidyverse.
  2. Use pivot_longer() to convert sales into long (tidy) format.
  3. Name the resulting columns quarter and amount (using names_to and values_to).
  4. Keep only the rows where the amount is greater than 300.
  5. Calculate the average amount per quarter.
  6. Save the result in an object named quarterly_summary.

Expected Solution:

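library(tidyverse)

# Tasks 2-3: reshape to long format, naming the new columns directly
tidy_sales <- sales %>%
  pivot_longer(cols = Q1:Q4,
               names_to = "quarter",
               values_to = "amount")

# Tasks 4-6: keep amounts above 300, then average per quarter
# (summarise() drops the single grouping level, so no ungroup() is needed)
quarterly_summary <- tidy_sales %>%
  filter(amount > 300) %>%
  group_by(quarter) %>%
  summarise(average = mean(amount))

print(quarterly_summary)
# Averages: Q1 = 350, Q2 = 335, Q3 = 330, Q4 = 345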


✅ Conclusion: Mastering the tidy philosophy and the tidyverse ecosystem is the first step toward efficient, reproducible, and professional data analysis in R. In upcoming units, we will delve deeper into each tool to manipulate, clean, and transform your data like an expert.

Course Info

Course: R-zero-to-hero

Language: EN

Lesson: Module05