In the real world, data rarely arrives perfectly structured. Much of a data scientist's work involves cleaning, transforming, and preparing data for analysis. Three data types that require special attention are strings (text), dates and times, and factors (categorical variables).
In this unit, you will learn to handle these three data types with tidyverse packages: stringr for strings, lubridate for dates, and forcats for factors. These tools will enable you to transform messy data into clear, consistent information ready for analysis.
## stringr

The stringr package provides a coherent, intuitive, and efficient interface for working with text strings in R. All of its functions start with str_, making them easy to discover and use.
library(stringr)
library(dplyr)
# Sample data
names <- c("Ana GarcĂa", "Carlos Ruiz", "MarĂa LĂłpez", "JUAN PEREZ")
# Detect patterns
str_detect(names, "a") # TRUE if contains "a" (case-sensitive)
str_detect(names, regex("a", ignore_case = TRUE)) # Case-insensitive
# Count occurrences
str_count(names, "a") # Number of "a"s in each string
# Locate position
str_locate(names, "a") # Position of first "a"
str_locate_all(names, "a") # Positions of all "a"s
# Extract substrings
str_extract(names, "[A-Z]+") # Extracts first uppercase sequence
str_extract_all(names, "[A-Z]+") # Extracts all uppercase sequences
# Replace text
str_replace(names, "a", "X") # Replaces first "a" with "X"
str_replace_all(names, "a", "X") # Replaces all "a"s with "X"
# Split strings
str_split(names, " ", simplify = TRUE) # Splits by space
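Beyond pattern matching, stringr also covers case conversion and whitespace handling, which the case study later in this unit relies on. A quick sketch with made-up values:
# Case and whitespace helpers (sample values invented for illustration)
messy <- c("  ana garcía ", "CARLOS RUIZ")
str_to_lower(messy) # all lowercase
str_to_upper(messy) # all uppercase
str_to_title(messy) # "  Ana García ", "Carlos Ruiz"
str_trim(messy) # removes leading/trailing whitespace
str_pad(messy, width = 15, side = "right", pad = ".") # pads to a fixed width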
Regular expressions are patterns that describe sets of strings. stringr fully supports them.
emails <- c("ana@gmail.com", "carlos@outlook.es", "invalido", "maria@empresa.org")
# Validate emails (basic)
str_detect(emails, ".+@.+\\..+")
# Extract domain
str_extract(emails, "@(.+)$") %>% str_replace("@", "")
# Clean text: only letters and spaces
dirty_texts <- c("Hola123!", "¿Qué tal?", "Precio: $50")
str_replace_all(dirty_texts, "[^A-Za-zÁÉÍÓÚáéíóúñ ]", "")
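Stripping characters can leave stray spaces behind (for example where "$50" sat between words). str_squish trims the ends and collapses internal runs of whitespace in one step:
# Follow-up: collapse leftover whitespace after the removal
str_replace_all(dirty_texts, "[^A-Za-zÁÉÍÓÚáéíóúñ ]", "") %>% str_squish()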
# Sample dataset
customers <- tibble(
full_name = c("GarcĂa, Ana", "Ruiz, Carlos", "LĂłpez, MarĂa"),
email = c("ana@gmail.com", "carlos@outlook.es", "maria@empresa.org"),
phone = c("(555) 123-4567", "555-987-6543", "555 321 7890")
)
# Separate first and last name
customers <- customers %>%
mutate(
last_name = str_extract(full_name, "^[^,]+"),
first_name = str_extract(full_name, "[^,]+$") %>% str_trim(),
email_domain = str_extract(email, "@(.+)$") %>% str_replace("@", ""),
clean_phone = str_replace_all(phone, "[^0-9]", "")
)
customers
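Capture groups, parentheses in the pattern referenced as \\1, \\2, ... in the replacement, let you reuse pieces of a match. As a sketch, reformatting the cleaned digits into one consistent style (the column name pretty_phone is made up):
# Rebuild a uniform (XXX) XXX-XXXX format from the bare digits
customers %>%
  mutate(pretty_phone = str_replace(clean_phone, "^(\\d{3})(\\d{3})(\\d{4})$", "(\\1) \\2-\\3"))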
## lubridate

The lubridate package simplifies working with dates and times in R. It provides intuitive functions to parse, manipulate, and format dates.
library(lubridate)
# Functions by order: y=year, m=month, d=day
text_dates <- c("2023-12-01", "01/12/2023", "Dec 1, 2023", "20231201")
ymd(text_dates[1]) # 2023-12-01
dmy(text_dates[2]) # 2023-12-01
mdy(text_dates[3]) # 2023-12-01
ymd(text_dates[4]) # 2023-12-01
# Flexible parsing
parse_date_time(text_dates, orders = c("ymd", "dmy", "mdy", "Ymd"))
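When a value matches none of the supplied orders, parse_date_time returns NA and raises a warning, so it is worth checking for failures after a flexible parse. A minimal sketch:
# Unparseable values come back as NA
parsed <- parse_date_time(c("2023-12-01", "not a date"), orders = "ymd")
sum(is.na(parsed)) # 1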
today <- ymd("2024-06-15")
# Extract components
year(today) # 2024
month(today) # 6
day(today) # 15
wday(today, label = TRUE) # "Sat"
# Modify components
today %>%
update(year = 2025, month = 12) # 2025-12-15
# Add/subtract time
today + days(10) # 2024-06-25
today + months(1) # 2024-07-15
today + years(1) # 2025-06-15
# Differences
difftime(ymd("2025-01-01"), today, units = "days")
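A related idiom computes the number of whole calendar years between two dates by integer-dividing an interval by a period, the usual way to derive an age. A sketch with a made-up birth date:
# Whole calendar years between two dates (e.g., an age)
birth <- ymd("1990-08-20")
interval(birth, today) %/% years(1) # 33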
start <- ymd_hms("2024-06-01 08:00:00")
end <- ymd_hms("2024-06-01 17:30:00")
# Duration (physical time); avoid calling the variable "duration", which masks lubridate::duration()
shift_length <- end - start
as.duration(shift_length) # 34200s (~9.5 hours)
# Period (calendar time)
one_month <- months(1)
start + one_month # 2024-07-01 08:00:00
# Intervals
work_interval <- interval(start, end)
int_start(work_interval)
int_end(work_interval)
int_length(work_interval) # in seconds
# Is a date within an interval?
ymd("2024-06-02") %within% work_interval # FALSE
# Create with time zone
ny_time <- ymd_hms("2024-06-15 12:00:00", tz = "America/New_York")
# Convert to another zone: same instant, different clock time
london_time <- with_tz(ny_time, "Europe/London") # 17:00:00
# Force a zone: same clock time, different instant
ny_forced <- force_tz(ny_time, "Europe/London") # still 12:00, but in London
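Time zone names come from the IANA database and follow the "Area/City" pattern; base R can list every zone your system knows about, which helps when you are unsure of the exact spelling:
# List valid IANA time zone names (base R)
head(OlsonNames())
length(OlsonNames()) # several hundred zones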
sales <- tibble(
sale_date = c("2024-01-15", "2024-02-20", "2024-03-10", "2024-04-05"),
sale_time = c("14:30:00", "09:15:00", "16:45:00", "11:20:00"),
amount = c(150.50, 200.00, 75.25, 300.75)
)
clean_sales <- sales %>%
mutate(
datetime = ymd_hms(paste(sale_date, sale_time)),
year = year(datetime),
month = month(datetime, label = TRUE),
weekday = wday(datetime, label = TRUE),
hour = hour(datetime)
) %>%
select(-sale_date, -sale_time)
clean_sales
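With the components extracted, aggregation is straightforward. A small sketch summarising these (made-up) sales by weekday:
# Total and count of sales per weekday
clean_sales %>%
  group_by(weekday) %>%
  summarise(total = sum(amount), sales = n())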
## forcats

Factors represent categorical variables in R. The forcats package provides tools to intuitively reorder, recode, and manipulate factor levels.
library(forcats)
# Sample data
countries <- c("México", "Argentina", "Brasil", "Chile", "Argentina", "México", "Brasil")
# Convert to factor
countries_f <- factor(countries)
# Reorder by frequency
fct_infreq(countries_f) # Levels ordered by count, most frequent first (Chile, with 1, goes last)
# Reorder manually
fct_relevel(countries_f, "Brasil", "Argentina", "México", "Chile")
# Reorder by another variable (e.g., GDP)
gdp <- c(México = 1.5, Argentina = 0.6, Brasil = 2.1, Chile = 0.3)
fct_reorder(countries_f, gdp[countries]) # default .fun = median; GDP is constant within each country
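fct_reorder pays off most in plots, where bars ordered by value read far better than alphabetical ones. A sketch (ggplot2 loaded explicitly here; the GDP figures are the illustrative ones above):
# Bar chart with countries ordered by GDP instead of alphabetically
library(ggplot2)
tibble(country = names(gdp), value = gdp) %>%
  ggplot(aes(x = fct_reorder(country, value), y = value)) +
  geom_col() +
  labs(x = "Country", y = "GDP (illustrative units)")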
# Recode manually
regions <- fct_recode(countries_f,
"North America" = "México",
"South America" = "Argentina",
"South America" = "Brasil",
"South America" = "Chile"
)
# Collapse infrequent levels
set.seed(123)
categories <- sample(c("A", "B", "C", "D", "E", "F"), 100, replace = TRUE)
categories_f <- factor(categories)
# Keep only top 3 most frequent, rest as "Other"
fct_lump_n(categories_f, n = 3)
# Collapse by minimum proportion
fct_lump_prop(categories_f, prop = 0.15)
# Collapse by minimum number of observations
fct_lump_min(categories_f, min = 15)
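For explicit control instead of a frequency rule, fct_other keeps exactly the levels you name and lumps everything else:
# Keep only the named levels; everything else becomes "Other"
fct_other(categories_f, keep = c("A", "B"))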
# Reverse levels
fct_rev(fct_infreq(countries_f))
# Expand levels (useful for plots)
fct_expand(countries_f, "Perú", "Colombia")
# Remove unused levels
countries_sub <- countries_f[countries_f != "Chile"]
countries_sub # Chile still in levels
fct_drop(countries_sub) # Chile removed from levels
# Anonymize levels
fct_anon(countries_f) # Replaces levels with anonymous numbers ("1", "2", ...) in random order
# Survey dataset
survey <- tibble(
satisfaction = c("Very Dissatisfied", "Dissatisfied", "Neutral", "Satisfied", "Very Satisfied"),
frequency = c(5, 15, 25, 40, 15)
)
# Convert to factor with logical order
survey <- survey %>%
mutate(
satisfaction_f = factor(satisfaction,
levels = c("Very Dissatisfied", "Dissatisfied",
"Neutral", "Satisfied", "Very Satisfied"))
)
# For models or plots, we may want levels ordered by how common they are;
# since the counts here are stored in the frequency column, reorder by it
survey %>%
  mutate(
    satisfaction_freq = fct_reorder(satisfaction_f, frequency, .desc = TRUE)
  )
# Or collapse extreme categories
survey %>%
mutate(
sat_collapsed = fct_collapse(satisfaction_f,
"Dissatisfied" = c("Very Dissatisfied", "Dissatisfied"),
"Satisfied" = c("Satisfied", "Very Satisfied"),
"Neutral" = "Neutral"
)
)
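The level order pays off in presentation: a bar chart of the survey follows the logical scale rather than alphabetical order. A sketch, assuming ggplot2 is available:
# The x-axis follows the factor's level order, not the alphabet
library(ggplot2)
survey %>%
  ggplot(aes(x = satisfaction_f, y = frequency)) +
  geom_col() +
  labs(x = "Satisfaction", y = "Responses")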
We will now apply everything we have learned to a realistic case study: customer satisfaction surveys, simulated here so the messiness is reproducible.
library(tidyverse)
library(lubridate)
library(forcats)
# Simulate a messy dataset
set.seed(123)
dirty_data <- tibble(
id = 1:100,
name = paste(sample(c("Juan", "MarĂa", "Carlos", "Ana", "Luis"), 100, replace = TRUE),
sample(c("GĂłmez", "PĂ©rez", "LĂłpez", "Ruiz", "DĂaz"), 100, replace = TRUE)),
email = paste0(tolower(sample(letters, 100, replace = TRUE)),
sample(100:999, 100, replace = TRUE), "@",
sample(c("gmail.com", "hotmail.com", "empresa.org", "univ.edu"), 100, replace = TRUE)),
survey_date = sample(seq(ymd("2023-01-01"), ymd("2024-06-15"), by = "day"), 100, replace = TRUE),
survey_time = sprintf("%02d:%02d", sample(9:18, 100, replace = TRUE), sample(0:59, 100, replace = TRUE)),
satisfaction = sample(c("Very Dissatisfied", "dissatisfied", "NEUTRAL", "Satisfied ", "Very Satisfied!"), 100, replace = TRUE),
comments = c(
rep("Excellent service, very fast", 20),
rep("Slow and unfriendly", 15),
rep("Good, but can be improved", 25),
rep("Horrible, never again", 10),
rep("Very good, I'll be back", 30)
)
)
# Complete cleaning
clean_data <- dirty_data %>%
# Clean satisfaction
mutate(
satisfaction = str_to_title(str_trim(str_replace_all(satisfaction, "[^A-Za-z ]", ""))),
satisfaction_f = factor(satisfaction,
levels = c("Very Dissatisfied", "Dissatisfied", "Neutral", "Satisfied", "Very Satisfied"))
) %>%
# Extract email domain
mutate(
domain = str_extract(email, "@(.+)$") %>% str_replace("@", "")
) %>%
# Combine date and time
mutate(
datetime = ymd_hm(paste(survey_date, survey_time)),
month = month(datetime, label = TRUE),
weekday = wday(datetime, label = TRUE)
) %>%
# Clean comments and create sentiment variable
mutate(
comments = str_to_sentence(str_trim(comments)),
sentiment = case_when(
str_detect(comments, regex("excellent|very good|I'll be back", ignore_case = TRUE)) ~ "Positive",
str_detect(comments, regex("slow|unfriendly|horrible|never", ignore_case = TRUE)) ~ "Negative",
TRUE ~ "Neutral"
),
sentiment_f = fct_relevel(factor(sentiment), "Negative", "Neutral", "Positive")
) %>%
# Select final columns
select(id, name, email, domain, datetime, month, weekday, satisfaction_f, sentiment_f, comments)
# View result
glimpse(clean_data)
head(clean_data)
# Exploratory analysis
clean_data %>%
count(satisfaction_f) %>%
mutate(pct = n / sum(n))
clean_data %>%
count(month, satisfaction_f) %>%
ggplot(aes(x = month, y = n, fill = satisfaction_f)) +
geom_col(position = "dodge") +
labs(title = "Satisfaction by Month", x = "Month", y = "Number of Surveys") +
theme_minimal()
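Because sentiment was derived from the comments independently of the satisfaction answers, a quick cross-tabulation of the two is a useful sanity check (here they are unrelated by construction, since the data is simulated):
# Cross-tab of derived sentiment vs. reported satisfaction
clean_data %>%
  count(satisfaction_f, sentiment_f)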
| Type | Function | Description |
|---|---|---|
| Strings | `str_detect()` | Detects whether a pattern matches |
| Strings | `str_replace_all()` | Replaces all matches |
| Strings | `str_extract()` | Extracts the first match |
| Strings | `str_split()` | Splits a string into parts |
| Dates | `ymd()`, `dmy()`, `mdy()` | Parse dates by component order |
| Dates | `year()`, `month()`, `day()` | Extract date components |
| Dates | `ymd() + days(10)` | Adds days, months, or years to a date |
| Dates | `interval()`, `%within%` | Creates and evaluates time intervals |
| Factors | `fct_relevel()` | Reorders levels manually |
| Factors | `fct_infreq()` | Orders levels by frequency |
| Factors | `fct_lump_n()` | Collapses the least frequent levels |
| Factors | `fct_collapse()` | Groups specific levels into a new one |
Name Cleaning: Given a vector of names with inconsistent formats (UPPERCASE, lowercase, with titles like "Dr."), normalize them to "First Last" with initial capitalization.
Domain Extraction: From a list of emails, extract the domain and count how many there are per domain. Then, collapse all domains with fewer than 5 occurrences into "Other".
Date Conversion: You have dates in "DD-MM-YYYY" and "MM/DD/YYYY" formats mixed together. Convert them all to R's standard Date format.
Category Reordering: In a sales dataset by product, reorder the products in the bar chart according to total sales amount (from highest to lowest).
Mini Project: Load a real dataset (e.g., from Kaggle) containing at least one text column, one date column, and one categorical column. Apply all techniques from this unit to clean it and prepare it for analysis.
With this unit, you have mastered the essential tools to clean and transform the most challenging data types, and you are ready to tackle real-world datasets with confidence.