In the real world, data rarely arrives perfectly structured. Much of a data scientist's work involves cleaning, transforming, and preparing data for analysis. Three data types that require special attention are strings (text), dates and times, and factors (categorical variables).
In this unit, you will learn to handle these three data types with tidyverse packages: stringr for strings, lubridate for dates, and forcats for factors. These tools will enable you to transform messy data into clear, consistent information ready for analysis.
## stringr

The stringr package provides a coherent, intuitive, and efficient interface for working with text strings in R. All of its functions start with str_, making them easy to discover and use.
library(stringr)
library(dplyr)
# Sample data
names <- c("Ana GarcĂa", "Carlos Ruiz", "MarĂa LĂłpez", "JUAN PEREZ")
# Detect patterns
str_detect(names, "a") # TRUE if contains "a" (case-sensitive)
str_detect(names, regex("a", ignore_case = TRUE)) # Case-insensitive
# Count occurrences
str_count(names, "a") # Number of "a"s in each string
# Locate position
str_locate(names, "a") # Position of first "a"
str_locate_all(names, "a") # Positions of all "a"s
# Extract substrings
str_extract(names, "[A-Z]+") # Extracts first uppercase sequence
str_extract_all(names, "[A-Z]+") # Extracts all uppercase sequences
# Replace text
str_replace(names, "a", "X") # Replaces first "a" with "X"
str_replace_all(names, "a", "X") # Replaces all "a"s with "X"
# Split strings
str_split(names, " ", simplify = TRUE) # Splits by space
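Beyond pattern matching, stringr also covers case conversion and whitespace handling, which the case study later in this unit relies on. A quick sketch with made-up values:
# Case and whitespace helpers (sample values invented for illustration)
messy <- c("  ana garcía ", "CARLOS RUIZ")
str_to_lower(messy) # all lowercase
str_to_upper(messy) # all uppercase
str_to_title(messy) # "  Ana García ", "Carlos Ruiz"
str_trim(messy) # removes leading/trailing whitespace
str_pad(messy, width = 15, side = "right", pad = ".") # pads to a fixed width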
Regular expressions are patterns that describe sets of strings. stringr fully supports them.
emails <- c("ana@gmail.com", "carlos@outlook.es", "invalido", "maria@empresa.org")
# Validate emails (basic)
str_detect(emails, ".+@.+\\..+")
# Extract domain
str_extract(emails, "@(.+)$") %>% str_replace("@", "")
# Clean text: only letters and spaces
dirty_texts <- c("Hola123!", "¿Qué tal?", "Precio: $50")
str_replace_all(dirty_texts, "[^A-Za-zÁÉÍÓÚáéíóúñ ]", "")
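Stripping characters can leave stray spaces behind (for example where "$50" sat between words). str_squish trims the ends and collapses internal runs of whitespace in one step:
# Follow-up: collapse leftover whitespace after the removal
str_replace_all(dirty_texts, "[^A-Za-zÁÉÍÓÚáéíóúñ ]", "") %>% str_squish()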
# Sample dataset
customers <- tibble(
full_name = c("GarcĂa, Ana", "Ruiz, Carlos", "LĂłpez, MarĂa"),
email = c("ana@gmail.com", "carlos@outlook.es", "maria@empresa.org"),
phone = c("(555) 123-4567", "555-987-6543", "555 321 7890")
)
# Separate first and last name
customers <- customers %>%
mutate(
last_name = str_extract(full_name, "^[^,]+"),
first_name = str_extract(full_name, "[^,]+$") %>% str_trim(),
email_domain = str_extract(email, "@(.+)$") %>% str_replace("@", ""),
clean_phone = str_replace_all(phone, "[^0-9]", "")
)
customers
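Capture groups, parentheses in the pattern referenced as \\1, \\2, ... in the replacement, let you reuse pieces of a match. As a sketch, reformatting the cleaned digits into one consistent style (the column name pretty_phone is made up):
# Rebuild a uniform (XXX) XXX-XXXX format from the bare digits
customers %>%
  mutate(pretty_phone = str_replace(clean_phone, "^(\\d{3})(\\d{3})(\\d{4})$", "(\\1) \\2-\\3"))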
## lubridate

The lubridate package simplifies working with dates and times in R. It provides intuitive functions to parse, manipulate, and format dates.
library(lubridate)
# Functions by order: y=year, m=month, d=day
text_dates <- c("2023-12-01", "01/12/2023", "Dec 1, 2023", "20231201")
ymd(text_dates[1]) # 2023-12-01
dmy(text_dates[2]) # 2023-12-01
mdy(text_dates[3]) # 2023-12-01
ymd(text_dates[4]) # 2023-12-01
# Flexible parsing
parse_date_time(text_dates, orders = c("ymd", "dmy", "mdy", "Ymd"))
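When a value matches none of the supplied orders, parse_date_time returns NA and raises a warning, so it is worth checking for failures after a flexible parse. A minimal sketch:
# Unparseable values come back as NA
parsed <- parse_date_time(c("2023-12-01", "not a date"), orders = "ymd")
sum(is.na(parsed)) # 1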
today <- ymd("2024-06-15")
# Extract components
year(today) # 2024
month(today) # 6
day(today) # 15
wday(today, label = TRUE) # "Sat"
# Modify components
today %>%
update(year = 2025, month = 12) # 2025-12-15
# Add/subtract time
today + days(10) # 2024-06-25
today + months(1) # 2024-07-15
today + years(1) # 2025-06-15
# Differences
difftime(ymd("2025-01-01"), today, units = "days")
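A related idiom computes the number of whole calendar years between two dates by integer-dividing an interval by a period, the usual way to derive an age. A sketch with a made-up birth date:
# Whole calendar years between two dates (e.g., an age)
birth <- ymd("1990-08-20")
interval(birth, today) %/% years(1) # 33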
start <- ymd_hms("2024-06-01 08:00:00")
end <- ymd_hms("2024-06-01 17:30:00")
# Duration (physical time); avoid calling the variable "duration", which masks lubridate::duration()
shift_length <- end - start
as.duration(shift_length) # 34200s (~9.5 hours)
# Period (calendar time)
one_month <- months(1)
start + one_month # 2024-07-01 08:00:00
# Intervals
work_interval <- interval(start, end)
int_start(work_interval)
int_end(work_interval)
int_length(work_interval) # in seconds
# Is a date within an interval?
ymd("2024-06-02") %within% work_interval # FALSE
# Create with time zone
ny_time <- ymd_hms("2024-06-15 12:00:00", tz = "America/New_York")
# Convert to another zone: same instant, different clock time
london_time <- with_tz(ny_time, "Europe/London") # 17:00:00
# Force a zone: same clock time, different instant
ny_forced <- force_tz(ny_time, "Europe/London") # still 12:00, but in London
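Time zone names come from the IANA database and follow the "Area/City" pattern; base R can list every zone your system knows about, which helps when you are unsure of the exact spelling:
# List valid IANA time zone names (base R)
head(OlsonNames())
length(OlsonNames()) # several hundred zones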
sales <- tibble(
sale_date = c("2024-01-15", "2024-02-20", "2024-03-10", "2024-04-05"),
sale_time = c("14:30:00", "09:15:00", "16:45:00", "11:20:00"),
amount = c(150.50, 200.00, 75.25, 300.75)
)
clean_sales <- sales %>%
mutate(
datetime = ymd_hms(paste(sale_date, sale_time)),
year = year(datetime),
month = month(datetime, label = TRUE),
weekday = wday(datetime, label = TRUE),
hour = hour(datetime)
) %>%
select(-sale_date, -sale_time)
clean_sales
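With the components extracted, aggregation is straightforward. A small sketch summarising these (made-up) sales by weekday:
# Total and count of sales per weekday
clean_sales %>%
  group_by(weekday) %>%
  summarise(total = sum(amount), sales = n())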
## forcats

Factors represent categorical variables in R. The forcats package provides tools to intuitively reorder, recode, and manipulate factor levels.
library(forcats)
# Sample data
countries <- c("México", "Argentina", "Brasil", "Chile", "Argentina", "México", "Brasil")
# Convert to factor
countries_f <- factor(countries)
# Reorder by frequency
fct_infreq(countries_f) # Levels ordered by count, most frequent first (Chile, with 1, goes last)
# Reorder manually
fct_relevel(countries_f, "Brasil", "Argentina", "México", "Chile")
# Reorder by another variable (e.g., GDP)
gdp <- c(México = 1.5, Argentina = 0.6, Brasil = 2.1, Chile = 0.3)
fct_reorder(countries_f, gdp[countries]) # default .fun = median; GDP is constant within each country
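fct_reorder pays off most in plots, where bars ordered by value read far better than alphabetical ones. A sketch (ggplot2 loaded explicitly here; the GDP figures are the illustrative ones above):
# Bar chart with countries ordered by GDP instead of alphabetically
library(ggplot2)
tibble(country = names(gdp), value = gdp) %>%
  ggplot(aes(x = fct_reorder(country, value), y = value)) +
  geom_col() +
  labs(x = "Country", y = "GDP (illustrative units)")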
# Recode manually
regions <- fct_recode(countries_f,
"North America" = "México",
"South America" = "Argentina",
"South America" = "Brasil",
"South America" = "Chile"
)
# Collapse infrequent levels
set.seed(123)
categories <- sample(c("A", "B", "C", "D", "E", "F"), 100, replace = TRUE)
categories_f <- factor(categories)
# Keep only top 3 most frequent, rest as "Other"
fct_lump_n(categories_f, n = 3)
# Collapse by minimum proportion
fct_lump_prop(categories_f, prop = 0.15)
# Collapse by minimum number of observations
fct_lump_min(categories_f, min = 15)
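For explicit control instead of a frequency rule, fct_other keeps exactly the levels you name and lumps everything else:
# Keep only the named levels; everything else becomes "Other"
fct_other(categories_f, keep = c("A", "B"))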
# Reverse levels
fct_rev(fct_infreq(countries_f))
# Expand levels (useful for plots)
fct_expand(countries_f, "Perú", "Colombia")
# Remove unused levels
countries_sub <- countries_f[countries_f != "Chile"]
countries_sub # Chile still in levels
fct_drop(countries_sub) # Chile removed from levels
# Anonymize levels
fct_anon(countries_f) # Replaces levels with anonymous numbers ("1", "2", ...) in random order
# Survey dataset
survey <- tibble(
satisfaction = c("Very Dissatisfied", "Dissatisfied", "Neutral", "Satisfied", "Very Satisfied"),
frequency = c(5, 15, 25, 40, 15)
)
# Convert to factor with logical order
survey <- survey %>%
mutate(
satisfaction_f = factor(satisfaction,
levels = c("Very Dissatisfied", "Dissatisfied",
"Neutral", "Satisfied", "Very Satisfied"))
)
# For models or plots, we may want levels ordered by how common they are;
# since the counts here are stored in the frequency column, reorder by it
survey %>%
  mutate(
    satisfaction_freq = fct_reorder(satisfaction_f, frequency, .desc = TRUE)
  )
# Or collapse extreme categories
survey %>%
mutate(
sat_collapsed = fct_collapse(satisfaction_f,
"Dissatisfied" = c("Very Dissatisfied", "Dissatisfied"),
"Satisfied" = c("Satisfied", "Very Satisfied"),
"Neutral" = "Neutral"
)
)
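The level order pays off in presentation: a bar chart of the survey follows the logical scale rather than alphabetical order. A sketch, assuming ggplot2 is available:
# The x-axis follows the factor's level order, not the alphabet
library(ggplot2)
survey %>%
  ggplot(aes(x = satisfaction_f, y = frequency)) +
  geom_col() +
  labs(x = "Satisfaction", y = "Responses")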
We will now apply everything we have learned to a realistic case study: customer satisfaction surveys, simulated here so the messiness is reproducible.
library(tidyverse)
library(lubridate)
library(forcats)
# Simulate a messy dataset
set.seed(123)
dirty_data <- tibble(
id = 1:100,
name = paste(sample(c("Juan", "MarĂa", "Carlos", "Ana", "Luis"), 100, replace = TRUE),
sample(c("GĂłmez", "PĂ©rez", "LĂłpez", "Ruiz", "DĂaz"), 100, replace = TRUE)),
email = paste0(tolower(sample(letters, 100, replace = TRUE)),
sample(100:999, 100, replace = TRUE), "@",
sample(c("gmail.com", "hotmail.com", "empresa.org", "univ.edu"), 100, replace = TRUE)),
survey_date = sample(seq(ymd("2023-01-01"), ymd("2024-06-15"), by = "day"), 100, replace = TRUE),
survey_time = sprintf("%02d:%02d", sample(9:18, 100, replace = TRUE), sample(0:59, 100, replace = TRUE)),
satisfaction = sample(c("Very Dissatisfied", "dissatisfied", "NEUTRAL", "Satisfied ", "Very Satisfied!"), 100, replace = TRUE),
comments = c(
rep("Excellent service, very fast", 20),
rep("Slow and unfriendly", 15),
rep("Good, but can be improved", 25),
rep("Horrible, never again", 10),
rep("Very good, I'll be back", 30)
)
)
# Complete cleaning
clean_data <- dirty_data %>%
# Clean satisfaction
mutate(
satisfaction = str_to_title(str_trim(str_replace_all(satisfaction, "[^A-Za-z ]", ""))),
satisfaction_f = factor(satisfaction,
levels = c("Very Dissatisfied", "Dissatisfied", "Neutral", "Satisfied", "Very Satisfied"))
) %>%
# Extract email domain
mutate(
domain = str_extract(email, "@(.+)$") %>% str_replace("@", "")
) %>%
# Combine date and time
mutate(
datetime = ymd_hm(paste(survey_date, survey_time)),
month = month(datetime, label = TRUE),
weekday = wday(datetime, label = TRUE)
) %>%
# Clean comments and create sentiment variable
mutate(
comments = str_to_sentence(str_trim(comments)),
sentiment = case_when(
str_detect(comments, regex("excellent|very good|I'll be back", ignore_case = TRUE)) ~ "Positive",
str_detect(comments, regex("slow|unfriendly|horrible|never", ignore_case = TRUE)) ~ "Negative",
TRUE ~ "Neutral"
),
sentiment_f = fct_relevel(factor(sentiment), "Negative", "Neutral", "Positive")
) %>%
# Select final columns
select(id, name, email, domain, datetime, month, weekday, satisfaction_f, sentiment_f, comments)
# View result
glimpse(clean_data)
head(clean_data)
# Exploratory analysis
clean_data %>%
count(satisfaction_f) %>%
mutate(pct = n / sum(n))
clean_data %>%
count(month, satisfaction_f) %>%
ggplot(aes(x = month, y = n, fill = satisfaction_f)) +
geom_col(position = "dodge") +
labs(title = "Satisfaction by Month", x = "Month", y = "Number of Surveys") +
theme_minimal()
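Because sentiment was derived from the comments independently of the satisfaction answers, a quick cross-tabulation of the two is a useful sanity check (here they are unrelated by construction, since the data is simulated):
# Cross-tab of derived sentiment vs. reported satisfaction
clean_data %>%
  count(satisfaction_f, sentiment_f)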
| Type | Function | Description |
|---|---|---|
| Strings | `str_detect()` | Detects whether a pattern matches |
| Strings | `str_replace_all()` | Replaces all matches |
| Strings | `str_extract()` | Extracts the first match |
| Strings | `str_split()` | Splits a string into parts |
| Dates | `ymd()`, `dmy()`, `mdy()` | Parse dates by component order |
| Dates | `year()`, `month()`, `day()` | Extract date components |
| Dates | `ymd() + days(10)` | Adds days, months, or years to a date |
| Dates | `interval()`, `%within%` | Creates and evaluates time intervals |
| Factors | `fct_relevel()` | Reorders levels manually |
| Factors | `fct_infreq()` | Orders levels by frequency |
| Factors | `fct_lump_n()` | Collapses the least frequent levels |
| Factors | `fct_collapse()` | Groups specific levels into a new one |
Name Cleaning: Given a vector of names with inconsistent formats (UPPERCASE, lowercase, with titles like "Dr."), normalize them to "First Last" with initial capitalization.
Domain Extraction: From a list of emails, extract the domain and count how many there are per domain. Then, collapse all domains with fewer than 5 occurrences into "Other".
Date Conversion: You have dates in "DD-MM-YYYY" and "MM/DD/YYYY" formats mixed together. Convert them all to R's standard Date format.
Category Reordering: In a sales dataset by product, reorder the products in the bar chart according to total sales amount (from highest to lowest).
Mini Project: Load a real dataset (e.g., from Kaggle) containing at least one text column, one date column, and one categorical column. Apply all techniques from this unit to clean it and prepare it for analysis.
With this unit, you have mastered the essential tools to clean and transform the most challenging data types, and you are ready to tackle real-world datasets with confidence.