By the end of this unit, the student will be able to:
- Explain the principles of tidy data and their role in the tidyverse ecosystem.
- Identify the core packages of the tidyverse and their specific purposes.
- Use the pipe operator (%>%) to chain operations in a readable manner.

The tidy data philosophy was formalized by Hadley Wickham in his paper "Tidy Data" (2014). It defines a standard for structuring datasets to facilitate analysis, visualization, and manipulation. A dataset is in tidy format if it adheres to the following three rules:
A variable represents a measured quantity or characteristic (e.g., age, income, country, date). In tidy data, each variable must occupy exactly one column.
❌ Non-tidy Example:
# The "2020", "2021", "2022" columns represent years: these are values, not variables!
annual_sales <- data.frame(
  product = c("A", "B"),
  `2020` = c(100, 150),
  `2021` = c(120, 160),
  `2022` = c(130, 170),
  check.names = FALSE  # keep "2020" etc. as column names instead of "X2020"
)
✅ Tidy Example:
library(tidyr)
long_sales <- annual_sales %>%
  pivot_longer(cols = c(`2020`, `2021`, `2022`),
               names_to = "year",
               values_to = "sales")
# Now "year" is a column (variable) and "sales" is another: each variable in its own column!
An observation is a unique instance of measurement (e.g., the sale of a product in a specific year). Each observation must occupy exactly one row.
Each cell must contain a single atomic value. Multiple values should not be combined in one cell (e.g., "Madrid, Spain" should be split into two columns: city and country).
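The "Madrid, Spain" case can be sketched with tidyr's separate(). The offices data frame and its column names are hypothetical, made up for illustration. separate() still works but is marked as superseded in recent tidyr releases, which also offer separate_wider_delim().

```r
library(tidyr)

# Hypothetical toy data: the "location" column packs two values into one cell
offices <- data.frame(
  office   = c("HQ", "Branch"),
  location = c("Madrid, Spain", "Lima, Peru")
)

# Split "location" at the comma-plus-space into two atomic columns
tidy_offices <- separate(offices, location,
                         into = c("city", "country"),
                         sep = ", ")

tidy_offices
```

After the split, each cell holds one value, so city and country can be filtered or grouped independently.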
The tidyverse tools are designed to work with tidy data.

💡 Professional Tip: Before any analysis, convert your data to tidy format. Doing so up front saves hours of debugging and confusing code later.
The tidyverse is a collection of R packages designed to work together and facilitate data analysis under the tidy philosophy. It was created and is primarily maintained by Hadley Wickham and the Posit team (formerly RStudio).
| Package | Primary Purpose | Key Functions |
|---|---|---|
| ggplot2 | Data visualization | ggplot(), geom_point(), aes() |
| dplyr | Data frame manipulation | filter(), select(), mutate(), group_by(), summarise() |
| tidyr | Data cleaning and reshaping | pivot_longer(), pivot_wider(), separate(), unite() |
| readr | Data import (CSV, TSV, etc.) | read_csv(), write_csv() |
| purrr | Functional programming (iteration over lists/vectors) | map(), map_dbl(), walk() |
| tibble | Modernization of data frames | tibble(), as_tibble() |
| stringr | String manipulation | str_detect(), str_replace(), str_split() |
| forcats | Factor manipulation | fct_reorder(), fct_lump(), fct_infreq() |
| lubridate | Date and time manipulation | ymd(), dmy(), interval(), now() |
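📦 Note: As of tidyverse 2.0.0, library(tidyverse) attaches all nine packages listed above, including stringr, forcats, and lubridate (lubridate joined the core in that release). Other packages installed alongside it, such as readxl and haven, are not attached automatically and must be loaded with their own library() call.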
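As a rough illustration of what a few of these packages do, the sketch below calls one key function each from stringr, purrr, lubridate, and forcats. All input values are hypothetical toy data, and the packages are loaded individually so the snippet does not depend on which ones library(tidyverse) attaches.

```r
library(stringr)
library(purrr)
library(lubridate)
library(forcats)

# Hypothetical toy inputs
cities <- c("Madrid", "Lima", "Quito")
grades <- factor(c("b", "a", "b", "c", "b", "a"))

# stringr: which city names contain the letter "a"?
has_a <- str_detect(cities, "a")

# purrr: apply mean() to each list element and return a numeric vector
means <- map_dbl(list(1:3, 4:6), mean)

# lubridate: parse the same date written in two different orders
same_day <- ymd("2024-03-15") == dmy("15/03/2024")

# forcats: reorder factor levels from most to least frequent
by_freq <- fct_infreq(grades)

has_a
means
same_day
levels(by_freq)
```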
install.packages("tidyverse")
library(tidyverse)
# ── Attaching core tidyverse packages ────────────────────────
# ✔ dplyr     1.1.4     ✔ readr     2.1.5
# ✔ forcats   1.0.0     ✔ stringr   1.5.1
# ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
# ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
# ✔ purrr     1.0.2
# ── Conflicts ──────────────────────────────────────────────
# ✖ dplyr::filter() masks stats::filter()
# ✖ dplyr::lag()    masks stats::lag()
⚠️ Warning: The "Conflicts" message indicates that some tidyverse functions (like filter() and lag()) mask functions of the same name from the stats package that ships with R. This is normal and intended. To use the stats version, call it with its namespace, for example stats::filter().
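The two filter() functions do unrelated things, which a small sketch on the built-in mtcars dataset and a short numeric series can show. The three-point moving average is just one possible use of stats::filter(), chosen for illustration.

```r
library(dplyr)

# dplyr::filter() keeps the rows that satisfy a condition
four_cyl <- filter(mtcars, cyl == 4)
nrow(four_cyl)

# stats::filter() applies a linear filter to a series instead,
# here a 3-point moving average (the end points have no full window, so NA)
moving_avg <- stats::filter(1:5, rep(1 / 3, 3))
as.numeric(moving_avg)
```

Once dplyr is attached, a bare filter() call resolves to the dplyr version, so the namespace prefix is the reliable way to reach the stats one.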
The Pipe Operator (%>%)

The %>% operator (pronounced "pipe") is a fundamental tool in the tidyverse. It allows you to chain operations from left to right, improving readability and avoiding complex nesting.
object %>% function(arguments)
This is equivalent to:
function(object, arguments)
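This equivalence can be checked directly on a small numeric vector. The sketch below loads magrittr, the package that provides %>%; the variable names are illustrative.

```r
library(magrittr)

values <- c(1, 4, 9)

# Piped form: the left-hand side becomes the first argument of each call
piped <- values %>% sqrt() %>% sum()

# Nested form of the same computation
nested <- sum(sqrt(values))

identical(piped, nested)

# Extra arguments follow the piped-in object:
# 3.14159 %>% round(2) is evaluated as round(3.14159, 2)
rounded <- 3.14159 %>% round(2)
rounded
```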
# Nested calls: read from the inside out
head(select(filter(mtcars, cyl == 4), mpg, hp))
# Same operations with the pipe: read from top to bottom
mtcars %>%
  filter(cyl == 4) %>%
  select(mpg, hp) %>%
  head()
💡 RStudio Shortcut: Ctrl + Shift + M (Windows/Linux) or Cmd + Shift + M (Mac) inserts %>%.
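Native Pipe |> (Base R 4.1+)

Since R 4.1, there is a native pipe: |>. It works similarly but does not allow the use of . as a placeholder; since R 4.2 it has its own placeholder, _, which must be passed to a named argument. %>% remains widespread in tidyverse code and is more flexible, so this unit uses it throughout.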
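The placeholder difference matters when the piped object is not the first argument, as with the data argument of lm(). The sketch below fits the same model both ways on mtcars. The anonymous-function workaround shown for the native pipe is one option that runs on R 4.1 and later, and the model formula is only an example.

```r
library(magrittr)

# magrittr pipe: "." marks where the left-hand side should be inserted
fit_magrittr <- mtcars %>% lm(mpg ~ wt, data = .)

# Native pipe (R >= 4.1): no "." placeholder, so one workaround is to
# pipe into an anonymous function that names the argument explicitly
fit_native <- mtcars |> (\(d) lm(mpg ~ wt, data = d))()

# On R >= 4.2 the "_" placeholder with a named argument also works:
#   mtcars |> lm(mpg ~ wt, data = _)

# Both approaches estimate the same coefficients
all.equal(coef(fit_magrittr), coef(fit_native))
```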
We will analyze the gapminder dataset (install with install.packages("gapminder")), which contains data on life expectancy, population, and GDP per capita by country and year.
# Load libraries
library(tidyverse)
library(gapminder)
# Explore structure (already in tidy format!)
gapminder %>%
  glimpse()
# Rows: 1,704
# Columns: 6
# $ country   <fct> Afghanistan, Afghanistan, ...
# $ continent <fct> Asia, Asia, ...
# $ year      <int> 1952, 1957, ...
# $ lifeExp   <dbl> 28.8, 30.3, ...
# $ pop       <int> 8425333, 9240934, ...
# $ gdpPercap <dbl> 779, 821, ...
# Filter for Americas, select columns, create new variable
gapminder %>%
  filter(continent == "Americas") %>%
  select(country, year, lifeExp, pop, gdpPercap) %>%  # keep pop: mutate() below needs it
  mutate(gdp_total = gdpPercap * pop) %>%
  arrange(desc(gdp_total)) %>%
  head()
This workflow demonstrates:
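- Working with tidy data, where each variable has its own column (country, year, lifeExp, etc.).
- Creating a new variable with mutate().

Best practices:
- Check that your data is tidy, and use tidyr to transform it if it isn't.
- Use the pipe (%>%) to improve readability. Avoid deep nesting.
- Load tidyverse at the start of your scripts. Ensures consistency in the environment.
- Prefer tibble over data.frame. Avoids unexpected behaviors (like the automatic string-to-factor conversion that data.frame() performed by default before R 4.0).
- Use dplyr verbs instead of base subsetting. They are more explicit and readable.

Objective: Transform a non-tidy dataset into tidy format and perform a basic exploratory analysis.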
sales <- data.frame(
  region = c("North", "South", "East", "West"),
  Q1 = c(250, 300, 200, 350),
  Q2 = c(270, 310, 210, 360),
  Q3 = c(260, 305, 205, 355),
  Q4 = c(280, 320, 220, 370)
)
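1. Load the tidyverse.
2. Use pivot_longer() to convert sales into long (tidy) format.
3. Name the new columns quarter and amount.
4. Keep the amounts above 300, compute the average amount per quarter, and store the result in quarterly_summary.

library(tidyverse)
# Steps 2-3: one row per region and quarter
tidy_sales <- sales %>%
  pivot_longer(cols = Q1:Q4,
               names_to = "quarter",
               values_to = "amount")
# Step 4: average of the amounts above 300, by quarter
quarterly_summary <- tidy_sales %>%
  filter(amount > 300) %>%
  group_by(quarter) %>%
  summarise(average = mean(amount)) %>%
  ungroup()
print(quarterly_summary)
# Averages: Q1 = 350, Q2 = 335, Q3 = 330, Q4 = 345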
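One way to sanity-check a reshape like this is a round trip: pivot the wide table to long format, then back to wide with pivot_wider(), and confirm that nothing was lost. The sketch below redefines the exercise data so it runs on its own. The names sales_long and sales_wide are illustrative.

```r
library(tidyr)

# Same data as in the exercise, repeated so the snippet is self-contained
sales <- data.frame(
  region = c("North", "South", "East", "West"),
  Q1 = c(250, 300, 200, 350),
  Q2 = c(270, 310, 210, 360),
  Q3 = c(260, 305, 205, 355),
  Q4 = c(280, 320, 220, 370)
)

# Wide -> long: one row per region and quarter (4 regions x 4 quarters)
sales_long <- pivot_longer(sales, cols = Q1:Q4,
                           names_to = "quarter",
                           values_to = "amount")
nrow(sales_long)

# Long -> wide: the quarters become columns again
sales_wide <- pivot_wider(sales_long,
                          names_from = quarter,
                          values_from = amount)

# The round trip should reproduce the original column names and values
names(sales_wide)
all.equal(sales_wide$Q3, sales$Q3)
```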
tidyverse cheatsheet: https://www.rstudio.com/resources/cheatsheets/
Conclusion: Mastering the tidy philosophy and the tidyverse ecosystem is the first step toward efficient, reproducible, and professional data analysis in R. In upcoming units, we will delve deeper into each tool to manipulate, clean, and transform your data like an expert.