📘 Unit 1.2: Data structures in R

“In R, everything is an object — and data structures are the containers that organize those objects.”

🎯 LEARNING OBJECTIVES

By the end of this unit, the student will be able to:

Identify and create the 5 fundamental data structures in R: vectors, matrices, arrays, lists, and data frames.
Understand the difference between atomic (homogeneous) and recursive (heterogeneous) structures.
Correctly apply indexing to access, modify, and extract elements.
Understand the concepts of coercion and recycling in vector operations.
Use basic inspection functions: class(), typeof(), length(), dim(), str(), names(), attributes().
Create numeric and character sequences with :, seq(), rep(), letters, LETTERS.
Generate random data with sample(), runif(), rnorm().

📚 1. INTRODUCTION: WHY DO DATA STRUCTURES MATTER?

R was originally designed for statistical analysis, so its data structures are optimized for manipulating sets of observations, variables, and relationships.

💡 Key philosophy: In R, everything is an object, and each object has a class and a structure. Mastering these structures is essential to avoid errors and write efficient code.

📦 2. TYPES OF DATA STRUCTURES

There are two main categories:

Category	Characteristic	Structures
Atomic	Store elements of the same type	Vector, Matrix, Array
Recursive	Can store elements of different types (even other structures)	List, Data Frame

➡️ 3. VECTORS — The Basic Unit of R

🔹 What is a vector?

A vector is a one-dimensional sequence of elements of the same type (numeric, character, logical, etc.).

🔹 Creating vectors

Use the c() function (combine):

# Numeric vector
numbers <- c(1, 2, 3, 4, 5)

# Character vector (named "people" so it does not mask the base function names())
people <- c("Ana", "Luis", "Pedro")

# Logical vector
logicals <- c(TRUE, FALSE, TRUE)

# Integer vector (note the L)
integers <- c(1L, 2L, 3L)

# Factor vector (categorical)
factors <- factor(c("low", "medium", "high"))

🔹 Useful functions for vectors

length(numbers)    # → 5
class(numbers)     # → "numeric"
typeof(numbers)    # → "double"
str(numbers)       # → num [1:5] 1 2 3 4 5
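
The objectives also mention names() and attributes(), which the calls above do not exercise. The following is a minimal sketch using a hypothetical named vector, scores, and a hypothetical factor, sizes; neither is part of the lesson's own examples.

```r
# Hypothetical named vector (names and values are illustrative only)
scores <- c(ana = 8.5, luis = 9.0, pedro = 7.5)

names(scores)        # → "ana" "luis" "pedro"
attributes(scores)   # a list with a single element, $names
scores["luis"]       # index by name instead of by position

# A factor is stored as integer codes plus "levels" and "class" attributes
sizes <- factor(c("low", "medium", "high"))
typeof(sizes)        # → "integer"
class(sizes)         # → "factor"
levels(sizes)        # → "high" "low" "medium" (sorted alphabetically by default)
```

This is why typeof() and class() can disagree: typeof() reports how the values are stored, while class() reports how R treats the object.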

🔹 Sequences and repetitions

# Simple sequence
1:10                 # → 1 2 3 ... 10

# Sequence with step
seq(from = 0, to = 10, by = 2)   # → 0 2 4 6 8 10

# Sequence by length
seq(from = 0, to = 1, length.out = 5)  # → 0.00 0.25 0.50 0.75 1.00

# Repetition
rep("Hello", times = 3)        # → "Hello" "Hello" "Hello"
rep(c(1,2), each = 2)         # → 1 1 2 2
rep(c(1,2), times = c(3,2))   # → 1 1 1 2 2
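
The built-in constants letters and LETTERS from the objectives can be combined with the functions above. A short sketch follows; the variable names n and groups are illustrative assumptions.

```r
# Built-in character constants
letters[1:5]          # → "a" "b" "c" "d" "e"
LETTERS[24:26]        # → "X" "Y" "Z"

# The colon operator counts down when the left value is larger
5:1                   # → 5 4 3 2 1

# seq_len() is a safer alternative to 1:n when n might be 0
n <- 0
length(1:n)           # → 2, because 1:0 is c(1, 0)
length(seq_len(n))    # → 0

# Combine rep() and LETTERS to build group labels
groups <- rep(LETTERS[1:3], each = 2)
groups                # → "A" "A" "B" "B" "C" "C"
```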

🔹 Automatic coercion

When you mix types in a vector, R converts all elements to the most flexible type:

mixed <- c(1, "two", TRUE)
mixed
# → "1"   "two" "TRUE"   ← All are now characters!

# Coercion hierarchy: logical < integer < double < complex < character
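
Coercion can also be requested explicitly, and implicit coercion of logicals is often useful rather than harmful. A minimal sketch, with illustrative variable names (flags, bad) that are not part of the lesson:

```r
# Logical values become 0/1 in arithmetic, so sum() counts the TRUEs
flags <- c(TRUE, FALSE, TRUE, TRUE)
sum(flags)                 # → 3
mean(flags)                # → 0.75 (proportion of TRUE values)

# Explicit coercion with the as.*() family
as.numeric("3.14")         # → 3.14
as.integer(3.9)            # → 3 (truncates toward zero, does not round)
as.character(c(1L, 2L))    # → "1" "2"

# Values that cannot be converted become NA (R also emits a warning,
# suppressed here so the sketch runs quietly)
bad <- suppressWarnings(as.numeric("two"))
is.na(bad)                 # → TRUE
```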

🔹 Vector recycling

If you operate on two vectors of different lengths, R recycles the shorter one:

c(1, 2, 3, 4) + c(10, 20)
# → 11 22 13 24   ← c(10,20) is recycled as c(10,20,10,20)

# Warning! If the longer length is not a multiple of the shorter one, R still recycles but issues a warning:
c(1,2,3) + c(10,20)
# → 11 22 13   ← and shows warning: "longer object length is not a multiple of shorter object length"
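
The most common case of recycling is a length-1 vector, which is what makes expressions such as x * 2 work. The sketch below uses hypothetical vectors (prices, x) and captures the warning message with tryCatch() instead of printing it.

```r
prices <- c(10, 20, 30)

# A length-1 vector is recycled across every element
prices * 2                  # → 20 40 60
prices > 15                 # → FALSE TRUE TRUE

# Recycling also applies to logical indices
x <- 1:6
x[c(TRUE, FALSE)]           # → 1 3 5 (the pattern TRUE, FALSE is repeated)

# Capture the recycling warning as text instead of letting it print
msg <- tryCatch(
  c(1, 2, 3) + c(10, 20),
  warning = function(w) conditionMessage(w)
)
msg                         # the warning text as a character string
```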

🔹 Generating random data

# Random sampling
sample(1:10, size = 5)           # → 5 numbers without replacement
sample(c("A","B","C"), 10, replace = TRUE)  # → 10 letters with replacement

# Uniform distribution
runif(5, min = 0, max = 1)       # → 5 numbers between 0 and 1

# Normal distribution
rnorm(5, mean = 100, sd = 15)    # → 5 numbers from a normal distribution with mean 100 and sd 15
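
Random functions return different values on every run, which makes results hard to reproduce. One common remedy, sketched here, is set.seed(); the seed value 123 and the variable names are arbitrary choices for illustration.

```r
# Fixing the seed makes the random draws repeatable
set.seed(123)
a <- rnorm(3)
set.seed(123)
b <- rnorm(3)
identical(a, b)                       # → TRUE

# runif() respects the requested range
u <- runif(1000, min = 50, max = 100)
range(u)                              # both values lie between 50 and 100

# sample() accepts unequal probabilities through the prob argument
coin <- sample(c("H", "T"), size = 10, replace = TRUE, prob = c(0.7, 0.3))
table(coin)                           # counts of each outcome
```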

🧮 4. MATRICES — 2-Dimensional Data (Homogeneous)

🔹 What is a matrix?

A matrix is a vector with 2 dimensions (rows and columns). All elements must be of the same type.

🔹 Creating matrices

# From a vector, specifying dimensions
mat <- matrix(1:6, nrow = 2, ncol = 3)
mat
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

# By default, R fills by COLUMNS. To fill by rows:
mat2 <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
mat2
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6

🔹 Attributes: row and column names

rownames(mat) <- c("Row1", "Row2")
colnames(mat) <- c("ColA", "ColB", "ColC")
mat
#       ColA ColB ColC
# Row1    1    3    5
# Row2    2    4    6

🔹 Matrix indexing

mat[1, 2]      # → element row 1, column 2 → 3
mat[1, ]       # → entire row 1 → 1 3 5
mat[, 2]       # → entire column 2 → 3 4
mat[mat > 3]   # → logical filtering → 4 5 6
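
Extracting a single row or column returns a plain vector, which can surprise code that expects a matrix. The sketch below rebuilds the same 2x3 matrix under a new name (m) and shows the drop argument, along with transpose and the two kinds of multiplication.

```r
m <- matrix(1:6, nrow = 2, ncol = 3)

dim(m)                       # → 2 3

# A single column collapses to a vector, so it has no dimensions
dim(m[, 2])                  # → NULL

# drop = FALSE keeps the result as a 2x1 matrix
col2 <- m[, 2, drop = FALSE]
dim(col2)                    # → 2 1

# t() transposes: rows become columns
dim(t(m))                    # → 3 2

# * is element-wise; %*% is the matrix product
doubled <- m * 2             # every element multiplied by 2
prod22 <- m %*% t(m)         # (2x3) %*% (3x2) gives a 2x2 matrix
prod22
#      [,1] [,2]
# [1,]   35   44
# [2,]   44   56
```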

🧩 5. ARRAYS — Multiple Dimensions (Rarely Used in Data Science)

An array is a generalization of a matrix to more than 2 dimensions.

# 2x3x2 Array
arr <- array(1:12, dim = c(2, 3, 2))
arr
# , , 1
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6
# 
# , , 2
#      [,1] [,2] [,3]
# [1,]    7    9   11
# [2,]    8   10   12

⚠️ In practice, arrays are rarely used directly in data analysis. Data frames or tibbles are preferred.
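
Array indexing extends matrix indexing with one index per dimension. A minimal sketch on the same 2x3x2 array; slice_sums is an illustrative name.

```r
arr <- array(1:12, dim = c(2, 3, 2))

dim(arr)              # → 2 3 2
arr[1, 2, 2]          # → 9 (row 1, column 2, second slice)
arr[, , 1]            # the first 2x3 slice, returned as a matrix

# apply() over dimension 3 computes one value per slice
slice_sums <- apply(arr, 3, sum)
slice_sums            # → 21 57
```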

📂 6. LISTS — Flexible Containers (Heterogeneous)

🔹 What is a list?

A list is a recursive vector: it can contain elements of different types... even other lists, data frames, or functions!

🔹 Creating lists

my_list <- list(
  name = "Carlos",
  age = 30,
  married = TRUE,
  grades = c(8.5, 9.0, 7.5),
  matrix = matrix(1:4, 2),
  fun = mean   # "function" is a reserved word and cannot be used as a name here
)

my_list
# $name
# [1] "Carlos"
# 
# $age
# [1] 30
# 
# $married
# [1] TRUE
# 
# $grades
# [1] 8.5 9.0 7.5
# 
# $matrix
#      [,1] [,2]
# [1,]    1    3
# [2,]    2    4
# 
# $fun
# function (x, ...) 
# UseMethod("mean")
# <bytecode: 0x...>
# <environment: namespace:base>

🔹 List indexing

my_list[1]        # → returns a LIST with the first element
my_list[[1]]      # → returns the VALUE of the first element ("Carlos")
my_list$name    # → equivalent to [[1]] → "Carlos"

# Access nested elements
my_list$matrix[1, 2]   # → 3
my_list[[5]][1, 2]     # → 3

# Add elements
my_list$city <- "Madrid"
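
The difference between [ ] and [[ ]] is easiest to see by checking the class of each result. The sketch below uses a small hypothetical list, person, so that it runs on its own.

```r
# Hypothetical list used only for this illustration
person <- list(name = "Carlos", age = 30, grades = c(8.5, 9.0, 7.5))

class(person[1])         # → "list" (still a container)
class(person[[1]])       # → "character" (the content itself)
person[["grades"]][2]    # → 9

# [[ ]] accepts a name stored in a variable; $ does not
field <- "age"
person[[field]]          # → 30

# Assigning NULL removes an element
person$age <- NULL
names(person)            # → "name" "grades"
length(person)           # → 2
```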

📊 7. DATA FRAMES — The King of Data Analysis Structures

🔹 What is a data frame?

A data frame is a list of vectors of equal length, where each vector represents a variable (column) and each position represents an observation (row). Columns may differ in type from one another, but each column is homogeneous. It is the most widely used structure for tabular data in R.

🔹 Creating data frames

df <- data.frame(
  name = c("Ana", "Luis", "Pedro"),
  age = c(25, 30, 35),
  salary = c(30000, 45000, 50000),
  active = c(TRUE, FALSE, TRUE)
)

df
#    name age salary active
# 1   Ana  25  30000   TRUE
# 2  Luis  30  45000  FALSE
# 3 Pedro  35  50000   TRUE
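
Because a data frame is a list of columns, list functions and matrix-style indexing both work on it. The sketch below rebuilds the same data frame so it is self-contained; stringsAsFactors = FALSE is spelled out only to make the behavior identical on R versions before 4.0, and the names older and salary_k are illustrative.

```r
df <- data.frame(
  name = c("Ana", "Luis", "Pedro"),
  age = c(25, 30, 35),
  salary = c(30000, 45000, 50000),
  active = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

is.list(df)          # → TRUE
length(df)           # → 4 (number of columns, as for any list)
nrow(df)             # → 3

# Filter rows with a logical condition; the empty column index keeps all columns
older <- df[df$age > 28, ]
older$name           # → "Luis" "Pedro"

# Add a derived column by assigning to a new name
df$salary_k <- df$salary / 1000
df$salary_k          # → 30 45 50
```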

🔹 Tibbles: The Modern Evolution

Tibbles (tibble::tibble()) are improved versions of data frames:

library(tibble)

tb <- tibble(
  name = c("Ana", "Luis", "Pedro"),
  age = c(25, 30, 35),
  salary = c(30000, 45000, 50000),
  active = c(TRUE, FALSE, TRUE)
)

tb
# # A tibble: 3 × 4
#   name    age salary active
#   <chr> <dbl>  <dbl> <lgl> 
# 1 Ana      25  30000 TRUE  
# 2 Luis     30  45000 FALSE 
# 3 Pedro    35  50000 TRUE

✅ Advantages of tibbles:

Never convert strings to factors (base data.frame() also stopped doing this by default in R 4.0.0).
Do not print the entire dataset if it is large.
Show the type of each column.
More predictable in subsetting.
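
The "more predictable subsetting" point can be illustrated by looking at what a base data frame does. This is a sketch under stated assumptions: the data frame df2 is hypothetical, and the tibble part is guarded with requireNamespace() so the code still runs when the tibble package is not installed.

```r
# Hypothetical data frame used only for this comparison
df2 <- data.frame(id = 1:3, age = c(25, 30, 35))

# Base data frame: selecting one column with [ , ] collapses to a vector
is.data.frame(df2[, "age"])                 # → FALSE
is.data.frame(df2[, "age", drop = FALSE])   # → TRUE

# Base data frame: $ matches partial names, which can hide typos
df2$ag                                      # → 25 30 35 (matched "age")

# Tibble: [ ] always returns a tibble, even for a single column
if (requireNamespace("tibble", quietly = TRUE)) {
  tb2 <- tibble::as_tibble(df2)
  print(is.data.frame(tb2[, "age"]))        # → TRUE
}
```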

🔹 Inspecting data frames

str(df)        # detailed structure
head(df)       # first 6 rows
tail(df)       # last 6 rows
dim(df)        # dimensions (rows, columns)
names(df)      # column names
View(df)       # opens interactive viewer in RStudio
glimpse(df)    # compact view (requires dplyr)

🔍 8. INDEXING — How to Access Data

Indexing is fundamental. Here are the most common forms:

Structure	Syntax	Result
Vector	`x[3]`	Third element
Vector	`x[c(1,3)]`	First and third elements
Vector	`x[x > 5]`	Elements that meet condition
Matrix	`m[2, 3]`	Element row 2, column 3
Matrix	`m[2, ]`	Entire row 2
List	`l[[2]]`	Value of the second element
List	`l[2]`	List containing the second element
List	`l$name`	Access by name
Data Frame	`df[2, 3]`	Element row 2, column 3
Data Frame	`df[, "age"]`	Column "age" as a vector (a tibble returns a one-column tibble instead)
Data Frame	`df[["age"]]`	Equivalent to the above
Data Frame	`df$age`	Most common form
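
The table covers access and extraction; the same syntax also supports exclusion and modification. A short sketch on a hypothetical numeric vector x:

```r
x <- c(4, 8, 15, 16, 23, 42)

# Negative indices exclude positions
x[-1]                  # → 8 15 16 23 42
x[-c(1, 2)]            # → 15 16 23 42

# Combine conditions with & (and) or | (or)
x[x > 10 & x < 30]     # → 15 16 23

# which() returns the positions where a condition is TRUE
which(x > 10)          # → 3 4 5 6

# Indexing on the left side of <- modifies elements in place
x[2:3] <- c(0, 0)
x                      # → 4 0 0 16 23 42
```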

⚠️ 9. COMMON ERRORS AND TIPS

❌ Frequent errors

# 1. Forgetting that R indexes from 1
x[0]   # → a zero-length vector such as numeric(0) (not an error, but not what you expect)

# 2. Confusing [ ] with [[ ]] in lists
my_list[1]    # → list of 1 element
my_list[[1]]  # → value of the first element

# 3. Assigning out of range (R allows it, fills with NA)
v <- c(1, 2, 3)
v[5] <- 10
v # → 1 2 3 NA 10

# 4. Not considering coercion
c(1, "2") # → "1" "2" (no longer numeric!)

✅ Professional tips

Use str() and glimpse() frequently to understand your data structure.
Prefer tibble over data.frame to avoid surprises.
Use typeof() and class() to debug type errors.
Name your vectors, matrices, and lists whenever possible; it improves readability.
For large datasets, consider data.table or arrow later on.

🧪 10. PRACTICAL EXERCISES

Exercise 1: Creation and coercion

# Create a vector with: 5, "hello", FALSE, 3.14
# What type results? Why?

Exercise 2: Sequences and sampling

# Create a vector with numbers from 10 to 1 (descending)
# Create a vector that repeats "R" 5 times
# Generate 10 random numbers between 50 and 100

Exercise 3: Matrix and attributes

# Create a 3x3 matrix with numbers 1 to 9, filled by rows
# Assign names: rows = c("A","B","C"), columns = c("X","Y","Z")
# Extract the second row and the first column

Exercise 4: Nested list

# Create a list containing:
# - Your name (character)
# - Your age (number)
# - A vector of your 3 favorite movies
# - A function that calculates the square of a number
# Access the name of the second movie and execute the function with 7

Exercise 5: Realistic data frame

# Create a data frame with 4 columns: product, price, stock, available
# 5 rows with invented data
# Use tibble
# Extract the "price" column as a vector
# Filter products with stock > 10

🧭 11. CONCEPTUAL DIAGRAM (Text Description)

DATA STRUCTURES IN R
│
├── ATOMIC (homogeneous)
│   ├── Vector → 1D → c(1,2,3)
│   ├── Matrix → 2D → matrix(1:6, 2, 3)
│   └── Array → 3D+ → array(1:12, c(2,3,2))
│
└── RECURSIVE (heterogeneous)
    ├── List → list(name="Ana", age=25, grades=c(8,9))
    └── Data Frame → data.frame(name, age, salary) → a data table!

📝 12. SUMMARY AND CHECKLIST

✅ I can create vectors with c(), seq(), rep(), sample().
✅ I understand coercion and recycling in vectors.
✅ I can create and index matrices and assign them names.
✅ I can differentiate between [ ] and [[ ]] in lists.
✅ I know how to create data frames and tibbles, and access their columns.
✅ I use str(), class(), dim(), names() to inspect structures.
✅ I avoid common indexing and coercion errors.
✅ I completed all practical exercises.

📚 ADDITIONAL RESOURCES

📘 R for Data Science, Ch. 20: “Vectors” → https://r4ds.had.co.nz/vectors.html
🎥 DataCamp: “Introduction to R” → Section “Data Structures”
🧩 RStudio Cheatsheet: “Base R” → https://rstudio.com/resources/cheatsheets/
🐦 Twitter: follow #rstats for daily data structure tips

✅ You have completed Unit 1.2!
You now have the foundation to manipulate any type of data in R. In the next unit, you will learn how to operate on these structures and control the flow of your programs.

Course Info

Course: R-zero-to-hero

Language: EN

Lesson: Module02