tidyrBy the end of this unit, the student will be able to:
pivot_longer() and pivot_wider().separate(), unite(), and extract().tidyr and the Concept of “Tidy Data”The tidyr package is a fundamental part of the tidyverse ecosystem and is specifically designed to clean and restructure data so that it adheres to the principles of tidy data (rectangular and orderly), as defined by Hadley Wickham:
Tidy data follows three rules:
- Each variable forms a column.
- Each observation forms a row.
- Each value is a cell.
Many real-world datasets do not comply with these rules. For example:
tidyr provides intuitive and consistent functions to solve these problems.
pivot_longer(): From Wide to LongConverts columns into rows. Useful when multiple columns represent measurements of the same variable under different conditions.
library(tidyr)
library(dplyr)
# Example: survey dataset with columns per year
encuesta_ancho <- tibble(
pais = c("España", "México", "Argentina"),
`2020` = c(75, 70, 78),
`2021` = c(77, 72, 80),
`2022` = c(79, 74, 81)
)
# Convert to long format
encuesta_largo <- encuesta_ancho %>%
pivot_longer(
cols = c(`2020`, `2021`, `2022`), # columns to pivot
names_to = "año", # name of the new column for names
values_to = "esperanza_vida" # name of the new column for values
)
# Result:
# pais año esperanza_vida
# España 2020 75
# España 2021 77
# ...
🔹 Key Parameters:
cols: columns to transform (you can use starts_with(), ends_with(), matches(), or ranges like año_2020:año_2022).names_to: name of the column that will contain the former column names.values_to: name of the column that will contain the values.pivot_wider(): From Long to WideConverts rows into columns. Useful for summaries, cross-tabulations, or visualizations.
# Starting from the previous long dataset
encuesta_largo %>%
pivot_wider(
names_from = año, # column whose values will become column names
values_from = esperanza_vida # column whose values will fill the new columns
)
# Result:
# pais `2020` `2021` `2022`
# España 75 77 79
# México 70 72 74
# ...
🔹 Key Parameters:
names_from: column that defines the new column names.values_from: column that provides the values.values_fill: value to fill empty cells (default is NA).names_prefix, names_sep: to customize resulting names.separate(): Split One Column into MultipleSplits a column into several using a separator (e.g., hyphen, space, semicolon).
datos <- tibble(
nombre_completo = c("Ana García", "Luis Pérez", "María López")
)
datos_separado <- datos %>%
separate(
col = nombre_completo,
into = c("nombre", "apellido"),
sep = " " # separator: space
)
# Result:
# nombre apellido
# Ana García
# Luis Pérez
# María López
🔹 Key Parameters:
col: column to split.into: vector of names for the new columns.sep: separator (can be a string, numeric position, or regex).remove: whether to remove the original column (TRUE by default).convert: attempts to automatically convert to numeric or logical type (FALSE by default).unite(): Combine Multiple Columns into OneCombines several columns into one, with an optional separator.
datos_separado %>%
unite(
col = nombre_completo,
nombre, apellido,
sep = " - "
)
# Result:
# nombre_completo
# Ana - García
# Luis - Pérez
# ...
🔹 Key Parameters:
col: name of the new combined column....: columns to unite (you can use :, c(), or individual names).sep: separator between values.remove: whether to remove the original columns.extract(): Split Using Regular ExpressionsSimilar to separate(), but uses regex to extract groups defined by parentheses.
datos_fecha <- tibble(
fecha_texto = c("2023-05-15", "2024-12-01", "2022-07-30")
)
datos_fecha %>%
extract(
fecha_texto,
into = c("año", "mes", "día"),
regex = "(\\d{4})-(\\d{2})-(\\d{2})"
)
# Result:
# año mes día
# 2023 05 15
# 2024 12 01
# ...
🔹 Key Parameters:
regex: pattern with capture groups ( ).remove, convert: same as in separate().NA)drop_na(): Remove Rows with Missing ValuesRemoves rows containing NA in the specified columns.
datos_con_na <- tibble(
x = c(1, 2, NA, 4),
y = c("a", NA, "c", "d"),
z = c(10, 20, 30, NA)
)
# Remove rows with NA in any column
datos_con_na %>% drop_na()
# Remove rows with NA only in column 'x'
datos_con_na %>% drop_na(x)
# Remove rows with NA in columns 'x' or 'y'
datos_con_na %>% drop_na(x, y)
replace_na(): Replace Missing ValuesReplaces NA with a specific value, column by column.
datos_con_na %>%
replace_na(
list(
x = 0,
y = "desconocido",
z = mean(datos_con_na$z, na.rm = TRUE)
)
)
🔹 Important: You must pass a list with replacements per column.
fill(): Fill Forward or BackwardUseful for data where a value applies to several consecutive rows (e.g., group headers).
ventas <- tibble(
region = c("Norte", NA, NA, "Sur", NA, NA),
ventas = c(100, 150, 200, 300, 400, 500)
)
ventas %>%
fill(region, .direction = "down")
# Result:
# region ventas
# Norte 100
# Norte 150
# Norte 200
# Sur 300
# Sur 400
# Sur 500
🔹 Parameters:
.direction: "down" (default), "up", "downup", "updown".forcats (integrated into tidyverse)Although technically belonging to forcats, its use is essential for cleaning categorical data.
fct_reorder(): Reorder Levels by Another VariableOrders the levels of a factor according to a function applied to another variable (e.g., mean, sum).
library(ggplot2) # for mpg
mpg %>%
mutate(class = fct_reorder(class, hwy, .fun = median)) %>%
ggplot(aes(x = class, y = hwy)) +
geom_boxplot() +
coord_flip()
fct_infreq() and fct_inorder()fct_infreq(): orders by descending frequency.fct_inorder(): orders by order of appearance.ejemplo <- tibble(
categoria = c("B", "A", "C", "A", "B", "A")
)
ejemplo %>%
mutate(
por_frecuencia = fct_infreq(categoria),
por_orden = fct_inorder(categoria)
)
fct_collapse() and fct_lump(): Group Infrequent Levels# Group manually
ejemplo %>%
mutate(categoria_agrupada = fct_collapse(categoria,
Grupo1 = c("A", "B"),
Otros = "C"
))
# Group automatically: levels with frequency < n or % become "Other"
ejemplo %>%
mutate(categoria_lump = fct_lump(categoria, n = 2)) # keeps the 2 most frequent
Dataset: Satisfaction survey data by region and quarter (wide format, with combined columns and missing values).
encuesta_sucia <- tibble(
"Region-Trimestre" = c("Norte-Q1", "Sur-Q2", "Este-Q1", "Oeste-Q3"),
Satisfaccion_2023 = c(85, NA, 90, 78),
Satisfaccion_2024 = c(88, 82, NA, 80),
Comentarios = c("Muy bueno", "Mejorable", "Excelente", NA)
)
# Step 1: Split combined column
encuesta_limpia <- encuesta_sucia %>%
separate(`Region-Trimestre`, into = c("region", "trimestre"), sep = "-", remove = TRUE)
# Step 2: Convert to long format
encuesta_limpia <- encuesta_limpia %>%
pivot_longer(
cols = starts_with("Satisfaccion"),
names_to = "año",
values_to = "puntuacion",
names_prefix = "Satisfaccion_"
)
# Step 3: Handle missing values
encuesta_limpia <- encuesta_limpia %>%
replace_na(list(puntuacion = mean(encuesta_limpia$puntuacion, na.rm = TRUE)))
# Step 4: Reorder regions by average score
encuesta_limpia <- encuesta_limpia %>%
mutate(region = fct_reorder(region, puntuacion, .fun = mean))
# Step 5: Visualize
library(ggplot2)
encuesta_limpia %>%
ggplot(aes(x = region, y = puntuacion, fill = año)) +
geom_col(position = "dodge") +
labs(title = "Satisfaction by Region and Year") +
theme_minimal()
fill() to complete a “category” column that only has a value in the first row of each group.fct_lump() to group all categories with fewer than 5 observations into “Others”.tidyr and generate a plot with ggplot2.glimpse(), count(), summary().names_to and values_to.tidyr with dplyr: filter before pivoting, summarize after.convert = TRUE in separate() if you expect numbers.tidyr documentation: https://tidyr.tidyverse.org/ ✅ With this unit, the student masters the essential tools to prepare any dataset for analysis, visualization, or modeling, following modern standards of the R ecosystem.