📥 Unit 2.4: Data Import and Export

Estimated Duration: 1 week (4–6 hours)
Objective: Master modern techniques for importing, managing, and exporting data in multiple formats using the tidyverse ecosystem and complementary tools, ensuring reproducibility and efficiency.


1. Introduction: Why Import/Export is Critical

In real-world data science practice, up to 80% of time is spent preparing and managing data. The ability to read and write data reliably, quickly, and reproducibly is a fundamental skill.

R offers multiple ways to import data, but in this course we prioritize tidyverse tools due to their:

  • Consistency in syntax and behavior.
  • Speed and low memory consumption.
  • Robustness against malformed formats.
  • Integration with modern workflows.

2. Importing Flat Files (CSV, TSV, TXT)

2.1. Reading with readr

The readr package, part of the tidyverse, replaces base functions like read.csv() with faster and more predictable versions.

Main Functions:

library(readr)

# Standard CSV (comma as separator)
data <- read_csv("path/to/file.csv")

# TSV (tab as separator)
data <- read_tsv("path/to/file.tsv")

# Custom delimiter (semicolon, pipe, etc.)
data <- read_delim("path/to/file.txt", delim = ";")

# Read directly from URL
data <- read_csv("https://example.com/data.csv")

Key Features of readr:

  • Does not convert strings to factors (unlike base R).
  • Automatically detects data types, but allows explicit specification with col_types.
  • Displays a reading summary in the console (which columns were read, detected issues).
  • Handles missing values consistently (default: "", "NA").
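As a quick check of the missing-value behavior, readr can parse a literal string wrapped in I(); the sentinel -99 below is a hypothetical convention, extended via the na argument:

```r
library(readr)

# Parse a literal CSV string; "-99" is a hypothetical missing-value sentinel
raw <- I("id,score\n1,42\n2,-99\n3,NA\n")

data <- read_csv(raw, na = c("", "NA", "-99"), show_col_types = FALSE)

data$score  # 42 NA NA
```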

Specifying Column Types:

data <- read_csv("file.csv",
  col_types = cols(
    name = col_character(),
    age = col_integer(),
    salary = col_double(),
    date = col_date(format = "%d/%m/%Y")
  )
)

Handling Issues:

# View parsing problems detected during reading
problems(data)

# Suppress parsing warnings (not recommended for production;
# the problems are still recorded and retrievable with problems())
data <- suppressWarnings(read_csv("file.csv"))

3. Importing Excel Files

Files with .xls and .xlsx extensions are common in business environments. We use the readxl package.

library(readxl)

# Read sheet by name
data <- read_excel("file.xlsx", sheet = "Sheet1")

# Read sheet by index
data <- read_excel("file.xlsx", sheet = 1)

# Specify cell range
data <- read_excel("file.xlsx", range = "A1:D100")

# Skip initial rows (misplaced headers)
data <- read_excel("file.xlsx", skip = 3)

# Supply custom column names (skip = 1 so the file's own header row is not read as data)
data <- read_excel("file.xlsx", col_names = c("A", "C", "D"), skip = 1)

✅ Advantages:

  • Requires no Java or external dependencies.
  • Supports dates, formulas, and complex formats.
  • Very fast and stable.

4. Importing Statistical Formats (SPSS, Stata, SAS)

We use the haven package, also part of the tidyverse.

library(haven)

# SPSS (.sav)
data <- read_sav("file.sav")

# Stata (.dta)
data <- read_dta("file.dta")

# SAS (.sas7bdat, .xpt)
data <- read_sas("file.sas7bdat")
data_xpt <- read_xpt("file.xpt")

📌 Note: haven preserves variable and value labels, stored as attributes. To use them as factors:

data$variable <- as_factor(data$variable)
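A minimal sketch of that behavior, building a labelled vector by hand instead of reading it from a file:

```r
library(haven)

# A numeric vector with value labels, as haven creates when importing SPSS/Stata
x <- labelled(c(1, 2, 2, 1), labels = c(Low = 1, High = 2))

# Convert the value labels into factor levels
f <- as_factor(x)

levels(f)           # "Low" "High"
as.character(f[1])  # "Low"
```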

5. Connecting to SQL Databases

For large datasets or production environments, it's common to connect directly to databases using SQL.

5.1. With DBI + RSQLite (for local databases)

library(DBI)

# Connect to SQLite database
con <- dbConnect(RSQLite::SQLite(), "my_database.db")

# List tables
dbListTables(con)

# Read entire table
data <- dbReadTable(con, "table_name")

# Execute SQL query
data <- dbGetQuery(con, "SELECT * FROM table WHERE age > 30")

# Close connection
dbDisconnect(con)

5.2. With odbc (for SQL Server, PostgreSQL, MySQL, etc.)

library(odbc)

# Connect (requires installed driver)
con <- dbConnect(odbc(),
  driver = "SQL Server",
  server = "my-server",
  database = "my_db",
  uid = "user",
  pwd = "password"
)

# Read data
data <- dbGetQuery(con, "SELECT TOP 100 * FROM sales")

dbDisconnect(con)

💡 Tip: Use dbplyr to write queries with dplyr syntax that are automatically translated to SQL.

library(dbplyr)

remote_table <- tbl(con, "sales")

result <- remote_table %>%
  filter(sales > 1000) %>%
  group_by(region) %>%
  summarise(total = sum(sales)) %>%
  collect()  # Brings data into R
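The whole DBI + dbplyr workflow can be tried end to end against an in-memory SQLite database, assuming the RSQLite, dplyr, and dbplyr packages are installed; the built-in mtcars dataset stands in for a real table:

```r
library(DBI)
library(dplyr)

# In-memory SQLite database: nothing touches the disk
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Copy a built-in dataset into the database as table "mtcars"
dbWriteTable(con, "mtcars", mtcars)

# Lazy query built with dplyr verbs, translated to SQL by dbplyr
result <- tbl(con, "mtcars") %>%
  filter(cyl == 6) %>%
  summarise(n = n(), avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()  # Only now does data come into R

dbDisconnect(con)
```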

6. Importing from APIs and the Web (JSON, XML)

6.1. REST APIs with httr and jsonlite

library(httr)
library(jsonlite)

# Make GET request
response <- GET("https://api.example.com/data")

# Check status
response$status_code

# Convert body to JSON
json_data <- content(response, "text", encoding = "UTF-8")
data <- fromJSON(json_data, simplifyDataFrame = TRUE)

# Or directly:
data <- fromJSON("https://api.example.com/data")
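fromJSON() also parses literal JSON strings, which makes its simplification behavior easy to check without a live API:

```r
library(jsonlite)

# A JSON array of objects becomes a data frame when simplifyDataFrame = TRUE
txt <- '[{"id": 1, "name": "ana"}, {"id": 2, "name": "luis"}]'
data <- fromJSON(txt, simplifyDataFrame = TRUE)

data$name  # "ana" "luis"
```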

6.2. Basic Web Scraping with rvest

library(rvest)

# Read HTML page
page <- read_html("https://example.com/table")

# Extract HTML table
table <- page %>%
  html_element("table") %>%
  html_table()

# Extract text by CSS selectors
titles <- page %>%
  html_elements(".title") %>%
  html_text()

โš ๏ธ Important: Respect robots.txt and site terms of use. Do not perform aggressive scraping.


7. Data Export

7.1. Export to CSV/TSV with readr

# Export to CSV
write_csv(data, "clean_data.csv")

# Export to TSV
write_tsv(data, "clean_data.tsv")

# With compression
write_csv(data, "clean_data.csv.gz")

7.2. Export to RDS (R binary format)

Ideal for saving R objects exactly as they are (structure, classes, attributes).

# Save
saveRDS(data, "processed_data.rds")

# Load
data <- readRDS("processed_data.rds")

✅ Advantages:

  • Faster than CSV.
  • Takes up less space.
  • Preserves classes and attributes (dates, factors, etc.).
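A quick round trip illustrating the last point, using only base R: column classes survive an .rds round trip exactly, whereas a CSV round trip would turn the Date column back into plain text unless re-parsed.

```r
# Base R only: no packages needed
df <- data.frame(
  id = 1:3,
  when = as.Date(c("2023-01-01", "2023-06-15", "2023-12-31"))
)

path <- tempfile(fileext = ".rds")
saveRDS(df, path)
restored <- readRDS(path)

class(restored$when)     # "Date"
identical(df, restored)  # TRUE
```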

7.3. Export to Excel

With the writexl package (lightweight, no dependencies):

library(writexl)

write_xlsx(data, "report.xlsx")

# Multiple sheets
write_xlsx(list(Sheet1 = data1, Sheet2 = data2), "report.xlsx")

7.4. Export to Statistical Formats

With haven:

# To SPSS
write_sav(data, "data.sav")

# To Stata
write_dta(data, "data.dta")

8. Path Management and Reproducibility

8.1. Use here::here() for Relative Paths

Avoid absolute paths that break when sharing the project.

library(here)

# Path relative to project directory
data <- read_csv(here("data", "raw", "sales.csv"))

# Save
write_csv(clean_data, here("data", "clean", "sales_clean.csv"))

✅ Recommended Project Structure:

my_project/
├── data/
│   ├── raw/
│   └── clean/
├── scripts/
├── output/
├── reports/
└── my_project.Rproj

8.2. Use RStudio Projects

Creating an .Rproj file allows RStudio to automatically set the working directory when opening the project, eliminating the need for setwd().


9. Best Practices and Common Mistakes

✅ Recommendations:

  • Always use readr, readxl, haven instead of base functions.
  • Specify col_types when you know the data structure (avoids surprises).
  • Save clean copies in .rds to speed up future loads.
  • Use relative paths with here::here().
  • Document the original data source in comments or a README.

โŒ Common Mistakes:

  • Using setwd() โ€” breaks reproducibility.
  • Not specifying encoding โ€” causes errors with accents or special characters.
    read_csv("file.csv", locale = locale(encoding = "latin1"))
  • Reading Excel with read.csv() after saving it as CSV from Excel โ€” introduces formatting errors.
  • Not closing database connections โ€” can lock the database.
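One defensive pattern for the last mistake: wrap database work in a function and register the disconnect with on.exit(), so the connection is released even if the query fails partway. A sketch using an in-memory SQLite database:

```r
library(DBI)

# Run a query and guarantee the connection is closed, even on error
query_db <- function(db_path, sql) {
  con <- dbConnect(RSQLite::SQLite(), db_path)
  on.exit(dbDisconnect(con), add = TRUE)  # runs whenever the function exits
  dbGetQuery(con, sql)
}

res <- query_db(":memory:", "SELECT 1 + 1 AS two")
res$two  # 2
```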

10. Practical Exercise: Import-Clean-Export Pipeline

Objective: Create a reproducible script that:

  1. Downloads a dataset in CSV from a public URL.
  2. Imports it with read_csv(), specifying column types.
  3. Performs basic cleaning with dplyr (filter, rename, mutate).
  4. Exports the clean result to .rds and .csv.
  5. Uses relative paths with here::here().

# process_data.R

library(tidyverse)
library(here)

# 1. Import
url <- "https://raw.githubusercontent.com/data-example/sales/master/sales_2023.csv"
sales <- read_csv(url,
  col_types = cols(
    id = col_integer(),
    date = col_date(format = "%Y-%m-%d"),
    product = col_factor(),
    units = col_integer(),
    price = col_double()
  )
)

# 2. Clean
clean_sales <- sales %>%
  filter(!is.na(price), units > 0) %>%
  mutate(revenue = units * price) %>%
  rename(Product = product, Units = units, Unit_Price = price)

# 3. Export
write_rds(clean_sales, here("data", "clean", "sales_2023_clean.rds"))
write_csv(clean_sales, here("data", "clean", "sales_2023_clean.csv"))

# Success message
cat("✅ Data processed and saved to:", here("data", "clean"), "\n")

✅ Upon completing this unit, you will be able to import and export data to and from multiple sources with confidence, speed, and reproducibility, ready to integrate them into analysis and modeling workflows.

Course Info

Course: R-zero-to-hero

Language: EN

Lesson: Module08