📘 Unit 2.1: Tidy Data Philosophy and the Tidyverse Ecosystem

Unit Objectives

By the end of this unit, the student will be able to:

Understand and apply the fundamental principles of tidy (rectangular) data.
Install, load, and manage the tidyverse ecosystem.
Identify the main packages within the tidyverse and their specific purposes.
Use the pipe operator (%>%) to chain operations in a readable manner.
Begin structuring data analysis workflows following the tidy philosophy.

1. What is Tidy Data?

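The tidy data philosophy was formalized by Hadley Wickham in his paper “Tidy Data” (Journal of Statistical Software, 2014). It defines a standard for structuring datasets to facilitate analysis, visualization, and manipulation. The three rules below follow the formulation later used in R for Data Science; the paper's own third rule is that each type of observational unit forms a table. A dataset is in tidy format if it adheres to the following three rules: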

Rule 1: Each variable forms a column.

A variable represents a measured quantity or characteristic (e.g., age, income, country, date). In tidy data, each variable must occupy exactly one column.

❌ Non-tidy Example:

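# The "2020", "2021", "2022" columns represent years → these are values, not variables!
# check.names = FALSE keeps the numeric column names as written;
# by default data.frame() would rename them to X2020, X2021, X2022.
annual_sales <- data.frame(
  product = c("A", "B"),
  `2020` = c(100, 150),
  `2021` = c(120, 160),
  `2022` = c(130, 170),
  check.names = FALSE
)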

✅ TIDY Example:

library(tidyr)
long_sales <- annual_sales %>%
  pivot_longer(cols = c(`2020`, `2021`, `2022`),
               names_to = "year",
               values_to = "sales")
# Now "year" is a column (variable) and "sales" is another → each variable in its own column!
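
After reshaping, the new year column holds the old column names, so it is stored as text. The sketch below shows one way to convert it to integers during the reshape, using the names_transform argument of pivot_longer(). It rebuilds the example data so it can run on its own, and it assumes a tidyr version that supports names_transform (1.1.0 or later).

```r
library(tidyr)

# Same wide data as above; check.names = FALSE keeps "2020" etc. as column names
annual_sales <- data.frame(
  product = c("A", "B"),
  `2020` = c(100, 150),
  `2021` = c(120, 160),
  `2022` = c(130, 170),
  check.names = FALSE
)

long_sales <- annual_sales %>%
  pivot_longer(cols = -product,            # every column except product
               names_to = "year",
               values_to = "sales",
               names_transform = list(year = as.integer))

str(long_sales)  # year is now an integer column, one row per product-year
```

With year stored as a number, later steps such as filter(year >= 2021) or plotting a time axis behave as expected.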

Rule 2: Each observation forms a row.

An observation is a unique instance of measurement (e.g., the sale of a product in a specific year). Each observation must occupy exactly one row.

Rule 3: Each observed value forms a cell.

Each cell must contain a single atomic value. Multiple values should not be combined in one cell (e.g., “Madrid, Spain” should be split into two columns: city and country).
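
As a minimal sketch of Rule 3, the snippet below splits a combined "city, country" cell into two columns with tidyr's separate(). The offices data frame and its column names are invented for illustration. Recent tidyr releases also offer separate_wider_delim() for the same job.

```r
library(tidyr)

# Hypothetical data: the location column packs two values into one cell
offices <- data.frame(
  office   = c("HQ", "Branch"),
  location = c("Madrid, Spain", "Lima, Peru")
)

# Split on the comma-plus-space delimiter into two atomic columns
tidy_offices <- offices %>%
  separate(location, into = c("city", "country"), sep = ", ")

print(tidy_offices)
```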

2. Benefits of Tidy Data

Consistency: All datasets follow the same structure, allowing code and functions to be reused.
Interoperability: tidyverse tools are designed to work with tidy data.
Readability: It’s easier to understand what each column and row represents.
Efficiency: Facilitates the use of vectorized operations and grouping functions.

💡 Professional Tip: Before any analysis, convert your data to tidy format. It will save you hours of debugging and confusing code.

3. Introduction to the Tidyverse

The tidyverse is a collection of R packages designed to work together and facilitate data analysis under the tidy philosophy. It was created and is primarily maintained by Hadley Wickham and the Posit team (formerly RStudio).

Main Packages in the Tidyverse

Package	Primary Purpose	Key Functions
`ggplot2`	Data visualization	`ggplot()`, `geom_point()`, `aes()`
`dplyr`	Data frame manipulation	`filter()`, `select()`, `mutate()`, `group_by()`, `summarise()`
`tidyr`	Data cleaning and reshaping	`pivot_longer()`, `pivot_wider()`, `separate()`, `unite()`
`readr`	Data import (CSV, TSV, etc.)	`read_csv()`, `write_csv()`
`purrr`	Functional programming (iteration over lists/vectors)	`map()`, `map_dbl()`, `walk()`
`tibble`	Modernization of data frames	`tibble()`, `as_tibble()`
`stringr`	String manipulation	`str_detect()`, `str_replace()`, `str_split()`
`forcats`	Factor manipulation	`fct_reorder()`, `fct_lump()`, `fct_infreq()`
`lubridate`	Date and time manipulation	`ymd()`, `dmy()`, `interval()`, `now()`

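📦 Note: Since tidyverse 2.0.0, library(tidyverse) attaches all nine packages listed above (lubridate joined the core in that release; stringr and forcats were already core). Other packages installed alongside it, such as readxl and haven, are not attached and must be loaded explicitly, e.g. library(readxl).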
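
The following sketch gives a quick taste of stringr, forcats, and lubridate from the table. The packages are loaded one by one so the code runs regardless of which packages library(tidyverse) attaches in your installed version. All example values are made up.

```r
library(stringr)
library(forcats)
library(lubridate)

# stringr: which strings mention "Peru"?
cities <- c("Madrid, Spain", "Lima, Peru", "Quito, Ecuador")
has_peru <- str_detect(cities, "Peru")

# forcats: reorder factor levels from most to least frequent
sizes <- factor(c("small", "large", "small", "medium", "small", "large"))
size_levels <- levels(fct_infreq(sizes))

# lubridate: parse a year-month-day string and extract the year
launch <- ymd("2024-03-15")
launch_year <- year(launch)

print(has_peru)
print(size_levels)
print(launch_year)
```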

4. Installing and Loading the Tidyverse

Installation (one-time only)

install.packages("tidyverse")

Loading in each session

library(tidyverse)
# ── Attaching core tidyverse packages ────────────────────────
# ✔ dplyr     1.1.4     ✔ readr     2.1.5
# ✔ forcats   1.0.0     ✔ stringr   1.5.1
# ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
# ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
# ✔ purrr     1.0.2
# ── Conflicts ──────────────────────────────────────────────
# ✖ dplyr::filter() masks stats::filter()
# ✖ dplyr::lag()    masks stats::lag()

⚠️ Warning: The “Conflicts” message indicates that some tidyverse functions (like filter() and lag()) mask functions from the base stats package. This is normal and intended. To use the base version, you can specify stats::filter().
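
To make the masking concrete, the sketch below calls both functions named filter(). Only dplyr is loaded here, which is enough to reproduce the conflict. The package::function() form selects a specific version no matter what is attached.

```r
library(dplyr)

# dplyr::filter() keeps the rows of a data frame that match a condition
four_cyl <- mtcars %>% filter(cyl == 4)
n_four_cyl <- nrow(four_cyl)

# stats::filter() is a different tool: a linear filter for numeric series.
# Here it computes a centered 3-point moving average of 1:5.
moving_avg <- as.numeric(stats::filter(1:5, rep(1 / 3, 3)))

print(n_four_cyl)
print(moving_avg)  # the two end points are NA because the window is incomplete
```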

5. The Pipe Operator (`%>%`)

The %>% operator (pronounced “pipe”) is a fundamental tool in the tidyverse. It allows you to chain operations from left to right, improving readability and avoiding complex nesting.

Syntax

object %>% function(arguments)

This is equivalent to:

function(object, arguments)
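
A small check of this equivalence, using magrittr (the package that provides %>% and is installed with the tidyverse). The variable names are illustrative only.

```r
library(magrittr)

x <- c(1, 4, 9)

# The piped object becomes the first argument of the function on the right
piped  <- x %>% sqrt()
nested <- sqrt(x)
same_result <- identical(piped, nested)

# Extra arguments follow the piped object: round(3.14159, 2)
rounded <- 3.14159 %>% round(2)

print(same_result)
print(rounded)
```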

Example without pipe (nested, hard to read)

head(select(filter(mtcars, cyl == 4), mpg, hp))

Example with pipe (clear and sequential)

mtcars %>%
  filter(cyl == 4) %>%
  select(mpg, hp) %>%
  head()

💡 RStudio Shortcut: Ctrl + Shift + M (Windows/Linux) or Cmd + Shift + M (Mac) inserts %>%.

Modern Pipes: `|>` (Base R 4.1+)

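R 4.1 introduced a native pipe, |>. It behaves like %>% for simple chains but does not support . as a placeholder; from R 4.2 it accepts _ as a placeholder, which must be passed to a named argument. Current tidyverse documentation (including the second edition of R for Data Science) recommends |> for new code, while %>% remains common in existing code and is used throughout this unit.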
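
The sketch below compares how the two pipes pass an object to an argument other than the first. It assumes R 4.2 or later, because the _ placeholder for |> does not exist in R 4.1. The model itself (mpg against wt in mtcars) is only an illustration.

```r
library(magrittr)

# magrittr pipe: "." marks where the piped object goes
fit_magrittr <- mtcars %>% lm(mpg ~ wt, data = .)

# native pipe (R >= 4.2): "_" plays the same role and must be a named argument
fit_native <- mtcars |> lm(mpg ~ wt, data = _)

# Both calls fit the same model, so the coefficients agree
same_fit <- isTRUE(all.equal(coef(fit_magrittr), coef(fit_native)))
print(same_fit)
```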

6. Tidy Workflow: A Practical Example

We will analyze the gapminder dataset (install with install.packages("gapminder")), which contains life expectancy, population, and GDP per capita by country and year.

# Load libraries
library(tidyverse)
library(gapminder)

# Explore structure (already in tidy format!)
gapminder %>%
  glimpse()
# Rows: 1,704
# Columns: 6
# $ country   <fct> "Afghanistan", "Afghanistan", ...
# $ continent <fct> Asia, Asia, ...
# $ year      <int> 1952, 1957, ...
# $ lifeExp   <dbl> 28.8, 30.3, ...
# $ pop       <int> 8425333, 9240934, ...
# $ gdpPercap <dbl> 779, 821, ...

# Filter for Americas, select columns, create new variable
# (pop must be kept by select(), because mutate() uses it below)
gapminder %>%
  filter(continent == "Americas") %>%
  select(country, year, lifeExp, pop, gdpPercap) %>%
  mutate(gdp_total = gdpPercap * pop) %>%
  arrange(desc(gdp_total)) %>%
  head()

This workflow demonstrates:

Each variable in a column (country, year, lifeExp, etc.).
Each observation (country-year) in a row.
Use of pipes to chain clear operations.
Creation of new variables with mutate().
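
Because the data is already tidy, grouped summaries need no reshaping first. The sketch below is one possible next step, not part of the original workflow: it averages life expectancy per continent for 2007, the most recent year in the dataset. It assumes the gapminder package is installed; the output column names are chosen for illustration.

```r
library(dplyr)
library(gapminder)

life_2007 <- gapminder %>%
  filter(year == 2007) %>%                       # one row per country
  group_by(continent) %>%
  summarise(mean_life_exp = mean(lifeExp),       # average across countries
            n_countries   = n()) %>%             # how many countries were averaged
  arrange(desc(mean_life_exp))

print(life_2007)
```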

7. Best Practices with Tidyverse

Always work with data in tidy format. Use tidyr to transform it if it isn't.
Use pipes (%>%) to improve readability. Avoid deep nesting.
Load tidyverse at the start of your scripts. Ensures consistency in the environment.
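Prefer tibble over data.frame. Avoids unexpected behaviors (like partial matching of column names with $, silent renaming of non-syntactic column names, or, in R versions before 4.0, automatic string-to-factor conversion).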
Use dplyr verbs instead of base subsetting. They are more explicit and readable.
Document your transformations. Use comments to explain complex steps.
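
The tibble recommendation can be illustrated with a short comparison. The sketch below contrasts three behaviors of data.frame and tibble; the object names are hypothetical.

```r
library(tibble)

df <- data.frame(value = 1:3)
tb <- tibble(value = 1:3)

# 1. Partial matching: data.frame silently matches "val" to "value";
#    tibble returns NULL and warns about the unknown column.
df_partial <- df$val
tb_partial <- suppressWarnings(tb$val)

# 2. Column names: data.frame rewrites non-syntactic names by default;
#    tibble keeps them as written.
df_names <- names(data.frame(`2020` = 1))
tb_names <- names(tibble(`2020` = 1))

# 3. Single-column subsetting: data.frame drops to a vector;
#    tibble always returns a tibble.
df_col <- df[, "value"]
tb_col <- tb[, "value"]

print(df_partial)
print(tb_partial)
print(c(df_names, tb_names))
print(class(df_col))
print(class(tb_col))
```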

8. Practical Exercise

Objective: Transform a non-tidy dataset into tidy format and perform a basic exploratory analysis.

Dataset: Quarterly Sales

sales <- data.frame(
  region = c("North", "South", "East", "West"),
  Q1 = c(250, 300, 200, 350),
  Q2 = c(270, 310, 210, 360),
  Q3 = c(260, 305, 205, 355),
  Q4 = c(280, 320, 220, 370)
)

Tasks:

Load the tidyverse.
Use pivot_longer() to convert sales into long (tidy) format.
Rename the resulting columns to quarter and amount.
Filter regions where the amount is greater than 300.
Calculate the average amount per quarter.
Save the result in an object named quarterly_summary.

Expected Solution:

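library(tidyverse)

# Tasks 2-3: reshape to long (tidy) format, one row per region-quarter
tidy_sales <- sales %>%
  pivot_longer(cols = Q1:Q4,
               names_to = "quarter",
               values_to = "amount")

# Tasks 4-6: keep amounts above 300, then average per quarter
quarterly_summary <- tidy_sales %>%
  filter(amount > 300) %>%
  group_by(quarter) %>%
  summarise(average = mean(amount))

print(quarterly_summary)
# Averages: Q1 = 350, Q2 = 335, Q3 = 330, Q4 = 345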
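
As an optional follow-up to the exercise, the sketch below shows that pivot_wider() reverses the reshaping, which is handy when a report needs the wide layout back. It rebuilds the sales data so it can run on its own; the object names long_sales and wide_again are illustrative.

```r
library(tidyr)

sales <- data.frame(
  region = c("North", "South", "East", "West"),
  Q1 = c(250, 300, 200, 350),
  Q2 = c(270, 310, 210, 360),
  Q3 = c(260, 305, 205, 355),
  Q4 = c(280, 320, 220, 370)
)

# Wide -> long: one row per region-quarter
long_sales <- sales %>%
  pivot_longer(cols = Q1:Q4, names_to = "quarter", values_to = "amount")

# Long -> wide: quarters become columns again
wide_again <- long_sales %>%
  pivot_wider(names_from = quarter, values_from = amount)

print(dim(long_sales))
print(wide_again)
```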

9. Additional Resources

Original paper: “Tidy Data” by Hadley Wickham (https://vita.had.co.nz/papers/tidy-data.pdf)
Book: R for Data Science (1st edition), Chapter 12: “Tidy data” (https://r4ds.had.co.nz/tidy-data.html)
Official tidyverse cheatsheet: https://www.rstudio.com/resources/cheatsheets/
Video: “What is Tidy Data?” — R Programming 101 (YouTube)

✅ Conclusion: Mastering the tidy philosophy and the tidyverse ecosystem is the first step toward efficient, reproducible, and professional data analysis in R. In upcoming units, we will delve deeper into each tool to manipulate, clean, and transform your data like an expert.
