"In R, everything is an object, and data structures are the containers that organize those objects."
By the end of this unit, the student will be able to:
- Inspect objects with class(), typeof(), length(), dim(), str(), names(), attributes().
- Generate sequences with the : operator, seq(), rep(), and the built-in constants letters and LETTERS.
- Generate random data with sample(), runif(), rnorm().

R was originally designed for statistical analysis, so its data structures are optimized for manipulating sets of observations, variables, and relationships.
💡 Key philosophy: In R, everything is an object, and each object has a class and a structure. Mastering these structures is essential to avoid errors and write efficient code.
There are two main categories:
| Category | Characteristic | Structures |
|---|---|---|
| Atomic | Store elements of the same type | Vector, Matrix, Array |
| Recursive | Can store elements of different types (even other structures) | List, Data Frame |
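The two categories in the table can be checked directly with base R's is.atomic() and is.recursive(). The sketch below uses illustrative object names (v, m, l, d) that are not part of the unit.

```r
# Illustrative objects (hypothetical names, not from the unit)
v <- c(1.5, 2.5, 3.5)                        # vector
m <- matrix(1:6, nrow = 2)                   # matrix
l <- list(id = 1L, tag = "a")                # list
d <- data.frame(x = 1:2, y = c("a", "b"))    # data frame

is.atomic(v)      # TRUE
is.atomic(m)      # TRUE
is.recursive(l)   # TRUE
is.recursive(d)   # TRUE
is.atomic(l)      # FALSE
```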
A vector is a one-dimensional sequence of elements of the same type (numeric, character, logical, etc.).
Use the c() function (combine):
# Numeric vector
numbers <- c(1, 2, 3, 4, 5)
# Character vector
names <- c("Ana", "Luis", "Pedro")
# Logical vector
logicals <- c(TRUE, FALSE, TRUE)
# Integer vector (note the L)
integers <- c(1L, 2L, 3L)
# Factor vector (categorical)
factors <- factor(c("low", "medium", "high"))
length(numbers) # → 5
class(numbers) # → "numeric"
typeof(numbers) # → "double"
str(numbers) # → num [1:5] 1 2 3 4 5
# Simple sequence
1:10 # → 1 2 3 ... 10
# Sequence with step
seq(from = 0, to = 10, by = 2) # → 0 2 4 6 8 10
# Sequence by length
seq(from = 0, to = 1, length.out = 5) # → 0.00 0.25 0.50 0.75 1.00
# Repetition
rep("Hello", times = 3) # → "Hello" "Hello" "Hello"
rep(c(1, 2), each = 2) # → 1 1 2 2
rep(c(1, 2), times = c(3, 2)) # → 1 1 1 2 2
When you mix types in a vector, R converts all elements to the most flexible type:
mixed <- c(1, "two", TRUE)
mixed
# → "1" "two" "TRUE" (all are now character strings!)
# Coercion hierarchy: logical < integer < numeric < character
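The hierarchy above can be observed step by step with typeof(). The following is a small sketch; the variable names (answers, n_true, converted) are illustrative. It also shows explicit coercion with as.numeric(), which returns NA for text that cannot be read as a number.

```r
# Implicit coercion: the type higher in the hierarchy wins
typeof(c(TRUE, 2L))     # "integer"
typeof(c(2L, 3.5))      # "double"
typeof(c(3.5, "a"))     # "character"

# Logical values behave as 0/1 in arithmetic
answers <- c(TRUE, FALSE, TRUE, TRUE)   # hypothetical data
n_true <- sum(answers)                  # 3

# Explicit coercion: non-numeric text becomes NA (R also emits a warning,
# suppressed here to keep the output clean)
converted <- suppressWarnings(as.numeric(c("1", "2.5", "two")))
converted               # 1.0 2.5 NA
```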
If you operate on two vectors of different lengths, R recycles the shorter one:
c(1, 2, 3, 4) + c(10, 20)
# → 11 22 13 24 (c(10, 20) is recycled as c(10, 20, 10, 20))
# Warning! If the longer length is not a multiple of the shorter one, R warns you:
c(1, 2, 3) + c(10, 20)
# → 11 22 13, with the warning: "longer object length is not a multiple of shorter object length"
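The most common case of recycling is a length-1 vector combined with a longer one. A minimal sketch, with made-up values and names (prices, with_tax, expensive):

```r
prices <- c(10, 20, 30)        # hypothetical values

# A single number is recycled across every element
with_tax <- prices * 1.5       # 15 30 45

# Comparisons recycle the same way and return a logical vector
expensive <- prices > 15       # FALSE TRUE TRUE

# That logical vector can then be used to filter
prices[expensive]              # 20 30
```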
# Random sampling
sample(1:10, size = 5) # → 5 numbers without replacement
sample(c("A", "B", "C"), 10, replace = TRUE) # → 10 letters with replacement
# Uniform distribution
runif(5, min = 0, max = 1) # → 5 numbers between 0 and 1
# Normal distribution
rnorm(5, mean = 100, sd = 15) # → 5 numbers ~ N(mean = 100, sd = 15)
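Because these functions return different values on every call, results are usually made reproducible with set.seed(). The sketch below assumes an arbitrary seed (123) and illustrative variable names.

```r
set.seed(123)                        # the seed value is arbitrary
first_draw <- rnorm(3, mean = 100, sd = 15)

set.seed(123)                        # resetting the seed replays the same draws
second_draw <- rnorm(3, mean = 100, sd = 15)

identical(first_draw, second_draw)   # TRUE

# sample() without replacement never repeats a value
picked <- sample(1:10, size = 5)
anyDuplicated(picked)                # 0 means no duplicates
```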
A matrix is a vector with 2 dimensions (rows and columns). All elements must be of the same type.
# From a vector, specifying dimensions
mat <- matrix(1:6, nrow = 2, ncol = 3)
mat
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6
# By default, R fills by COLUMNS. To fill by rows:
mat2 <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
mat2
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6
rownames(mat) <- c("Row1", "Row2")
colnames(mat) <- c("ColA", "ColB", "ColC")
mat
#      ColA ColB ColC
# Row1    1    3    5
# Row2    2    4    6
mat[1, 2] # → element at row 1, column 2 → 3
mat[1, ] # → entire row 1 → 1 3 5
mat[, 2] # → entire column 2 → 3 4
mat[mat > 3] # → logical filtering → 4 5 6
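One detail worth knowing when extracting a single row or column: by default R drops the result to a plain vector. A short sketch of how the drop argument of the [ operator changes that, reusing the same 2x3 matrix:

```r
mat <- matrix(1:6, nrow = 2, ncol = 3)

dim(mat)                            # 2 3

row_vec <- mat[1, ]                 # plain vector: 1 3 5
is.matrix(row_vec)                  # FALSE

row_mat <- mat[1, , drop = FALSE]   # still a 1 x 3 matrix
dim(row_mat)                        # 1 3
```

Keeping the matrix shape with drop = FALSE can matter when later code expects two dimensions.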
An array is a generalization of a matrix to more than 2 dimensions.
# 2x3x2 Array
arr <- array(1:12, dim = c(2, 3, 2))
arr
# , , 1
#
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6
#
# , , 2
#
#      [,1] [,2] [,3]
# [1,]    7    9   11
# [2,]    8   10   12
⚠️ In practice, arrays are rarely used directly in data analysis. Data frames or tibbles are preferred.
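Indexing an array follows the same pattern as a matrix, with one index per dimension. A brief sketch using the same 2x3x2 array (the name layer1 is illustrative):

```r
arr <- array(1:12, dim = c(2, 3, 2))

dim(arr)               # 2 3 2
arr[1, 2, 2]           # 9 (row 1, column 2, layer 2)

layer1 <- arr[, , 1]   # the first layer is an ordinary 2 x 3 matrix
is.matrix(layer1)      # TRUE
```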
A list is a recursive vector: it can contain elements of different types... even other lists, data frames, or functions!
my_list <- list(
name = "Carlos",
age = 30,
married = TRUE,
grades = c(8.5, 9.0, 7.5),
matrix = matrix(1:4, 2),
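fun = mean # "function" is a reserved word in R and cannot be used as a bare name
)
my_list
# $name
# [1] "Carlos"
#
# $age
# [1] 30
#
# $married
# [1] TRUE
#
# $grades
# [1] 8.5 9.0 7.5
#
# $matrix
#      [,1] [,2]
# [1,]    1    3
# [2,]    2    4
#
# $fun
# function (x, ...)
# UseMethod("mean")
# <bytecode: ...>
# <environment: namespace:base>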
my_list[1] # → returns a LIST with the first element
my_list[[1]] # → returns the VALUE of the first element ("Carlos")
my_list$name # → equivalent to [[1]] → "Carlos"
# Access nested elements
my_list$matrix[1, 2] # → 3
my_list[[5]][1, 2] # → 3
# Add elements
my_list$city <- "Madrid"
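The difference between [ ] and [[ ]] becomes visible when checking the class of the result. A minimal sketch with a smaller, hypothetical list named person; it also shows that assigning NULL removes an element.

```r
person <- list(name = "Carlos", age = 30, grades = c(8.5, 9.0, 7.5))

class(person["grades"])       # "list": single brackets keep the container
class(person[["grades"]])     # "numeric": double brackets extract the content

avg <- mean(person[["grades"]])   # about 8.33

# Assigning NULL removes an element
person$age <- NULL
names(person)                 # "name" "grades"
```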
A data frame is a list of vectors of equal length, where each vector represents a variable (column) and each position represents an observation (row). It is the most used structure in data science.
df <- data.frame(
name = c("Ana", "Luis", "Pedro"),
age = c(25, 30, 35),
salary = c(30000, 45000, 50000),
active = c(TRUE, FALSE, TRUE)
)
df
#    name age salary active
# 1   Ana  25  30000   TRUE
# 2  Luis  30  45000  FALSE
# 3 Pedro  35  50000   TRUE
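Since a data frame is a list of columns, list functions work on it. The sketch below rebuilds the same data and inspects it. stringsAsFactors = FALSE is set explicitly here as an assumption, so that the name column is character regardless of the R version in use.

```r
df <- data.frame(
  name   = c("Ana", "Luis", "Pedro"),
  age    = c(25, 30, 35),
  salary = c(30000, 45000, 50000),
  active = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE   # explicit, so behavior does not depend on the R version
)

is.list(df)                      # TRUE
length(df)                       # 4 (the number of columns)
nrow(df)                         # 3
col_types <- sapply(df, class)   # class of each column, by name
col_types
```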
Tibbles (tibble::tibble()) are improved versions of data frames:
library(tibble)
tb <- tibble(
name = c("Ana", "Luis", "Pedro"),
age = c(25, 30, 35),
salary = c(30000, 45000, 50000),
active = c(TRUE, FALSE, TRUE)
)
tb
# # A tibble: 3 × 4
#   name    age salary active
#   <chr> <dbl>  <dbl> <lgl>
# 1 Ana      25  30000 TRUE
# 2 Luis     30  45000 FALSE
# 3 Pedro    35  50000 TRUE
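✅ Advantages of tibbles: they print only the first rows together with each column's type, they never convert strings to factors, and they do not partially match column names.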
str(df) # detailed structure
head(df) # first 6 rows
tail(df) # last 6 rows
dim(df) # dimensions (rows, columns)
names(df) # column names
View(df) # opens interactive viewer in RStudio
glimpse(df) # compact view (from dplyr; also exported by tibble)
Indexing is fundamental. Here are the most common forms:
| Structure | Syntax | Result |
|---|---|---|
| Vector | x[3] | Third element |
| Vector | x[c(1,3)] | First and third elements |
| Vector | x[x > 5] | Elements that meet the condition |
| Matrix | m[2, 3] | Element at row 2, column 3 |
| Matrix | m[2, ] | Entire row 2 |
| List | l[[2]] | Value of the second element |
| List | l[2] | List containing the second element |
| List | l$name | Access by name |
| Data Frame | df[2, 3] | Element at row 2, column 3 |
| Data Frame | df[, "age"] | Column "age" as a vector |
| Data Frame | df[["age"]] | Equivalent to the above |
| Data Frame | df$age | Most common form |
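The data frame forms in the table combine naturally with the logical filtering shown for vectors. The following sketch uses a reduced version of the unit's data; the names older, active_pay and one_col are illustrative.

```r
df <- data.frame(
  name   = c("Ana", "Luis", "Pedro"),
  age    = c(25, 30, 35),
  salary = c(30000, 45000, 50000),
  active = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Rows where the condition holds, all columns
older <- df[df$age > 28, ]
older$name                              # "Luis" "Pedro"

# Rows and columns selected at the same time
active_pay <- df[df$active, c("name", "salary")]
active_pay$salary                       # 30000 50000

# drop = FALSE keeps a single column as a data frame instead of a vector
one_col <- df[, "age", drop = FALSE]
is.data.frame(one_col)                  # TRUE
```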
# 1. Forgetting that R indexes from 1
x <- c(10, 20, 30)
x[0] # → numeric(0), an empty vector (not an error, but not what you expect)
# 2. Confusing [ ] with [[ ]] in lists
my_list[1] # → list of 1 element
my_list[[1]] # → value of the first element
# 3. Assigning out of range (R allows it, fills with NA)
v <- c(1, 2, 3)
v[5] <- 10
v # → 1 2 3 NA 10
# 4. Not considering coercion
c(1, "2") # → "1" "2" (no longer numeric!)
- Use str() and glimpse() frequently to understand your data structure.
- Prefer tibble over data.frame to avoid surprises.
- Use typeof() and class() to debug type errors.
- Consider data.table or arrow later on.

# Create a vector with: 5, "hello", FALSE, 3.14
# What type results? Why?
# Create a vector with numbers from 10 to 1 (descending)
# Create a vector that repeats "R" 5 times
# Generate 10 random numbers between 50 and 100
# Create a 3x3 matrix with numbers 1 to 9, filled by rows
# Assign names: rows = c("A","B","C"), columns = c("X","Y","Z")
# Extract the second row and the first column
# Create a list containing:
# - Your name (character)
# - Your age (number)
# - A vector of your 3 favorite movies
# - A function that calculates the square of a number
# Access the name of the second movie and execute the function with 7
# Create a data frame with 4 columns: product, price, stock, available
# 5 rows with invented data
# Use tibble
# Extract the "price" column as a vector
# Filter products with stock > 10
DATA STRUCTURES IN R
│
├── ATOMIC (homogeneous)
│   ├── Vector → 1D → c(1,2,3)
│   ├── Matrix → 2D → matrix(1:6, 2, 3)
│   └── Array → 3D+ → array(1:12, c(2,3,2))
│
└── RECURSIVE (heterogeneous)
    ├── List → list(name="Ana", age=25, grades=c(8,9))
    └── Data Frame → data.frame(name, age, salary) → Data table!
✅ I can create vectors with c(), seq(), rep(), sample().
✅ I understand coercion and recycling in vectors.
✅ I can create and index matrices and assign them names.
✅ I can differentiate between [ ] and [[ ]] in lists.
✅ I know how to create data frames and tibbles, and access their columns.
✅ I use str(), class(), dim(), names() to inspect structures.
✅ I avoid common indexing and coercion errors.
✅ I completed all practical exercises.

You have completed Unit 1.2!
You now have the foundation to manipulate any type of data in R. In the next unit, you will learn how to operate with them and control the flow of your programs.