📦 MODULE 1: “Data Quality: Your Model is Only as Good as Your Data”

Objective:

Understand why preprocessing is not optional—it’s the core of any successful ML model. Learn to diagnose dataset health before touching any algorithm.


1.1 Why Is Preprocessing 80% of the Work?

“Data science is 80% data cleaning, 20% complaining about cleaning data.” — Anonymous

In the real world, data never comes clean, ordered, and ready to use. It arrives with:

  • Missing values (NaN, None, "", ?)
  • Typographical errors (“Madird”, “Barcelonaa”)
  • Extreme outliers (a salary of 9999999 in a survey of average incomes)
  • Inconsistent formats (dates like “2024/05/01”, “01-May-2024”, “1 de mayo”)
  • Unstandardized categorical variables (“Sí”, “si”, “SI”, “yes”)

Direct consequence: Feeding dirty data into a model causes it to learn incorrect patterns. Garbage In → Garbage Out.
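
As a small illustration of the cleanup this implies, here is a minimal sketch (assuming a hypothetical respuesta column holding the yes/no variants listed above) that normalizes them to one canonical label:

import pandas as pd

# Hypothetical column with inconsistent yes/no answers
df = pd.DataFrame({'respuesta': ['Sí', 'si', 'SI', 'yes', None]})

# Normalize: strip whitespace, lowercase, remove accents, map to one canonical label
normalized = (df['respuesta']
              .str.strip()
              .str.lower()
              .str.normalize('NFKD')
              .str.encode('ascii', 'ignore')
              .str.decode('ascii'))
df['respuesta'] = normalized.map({'si': 'yes', 'yes': 'yes'})
print(df['respuesta'])  # all valid variants become 'yes'; NaN stays NaN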


1.2 Initial Dataset Diagnosis

Before doing anything, explore your dataset like a detective.

Key Pandas tools:

import pandas as pd

# Load dataset
df = pd.read_csv("datos_fraude.csv")

# Quick preview
print(df.head())
df.info()  # Data types and non-null counts (info() prints directly; no print() needed)
print(df.describe())  # Descriptive statistics (numeric only)

# View unique values in categorical columns
print(df['tipo_transaccion'].unique())
print(df['pais'].value_counts())

# Check for null values
print(df.isnull().sum())
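
Two further quick checks that often pay off (not shown above; a suggested addition): the share of missing values per column and the count of fully duplicated rows.

# Percentage of missing values per column (mean of the boolean null mask)
print(df.isnull().mean().sort_values(ascending=False) * 100)

# Fully duplicated rows
print(f"Duplicate rows: {df.duplicated().sum()}")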

Diagnostic visualization with Seaborn:

import seaborn as sns
import matplotlib.pyplot as plt

# Histogram of a numeric variable
sns.histplot(df['monto_transaccion'], bins=50, kde=True)
plt.title("Transaction Amount Distribution")
plt.show()

# Boxplot to detect outliers
sns.boxplot(x=df['monto_transaccion'])
plt.title("Boxplot: Finding Outliers in Amounts")
plt.show()

# Bar plot for categorical variables
sns.countplot(data=df, x='tipo_tarjeta')
plt.title("Distribution of Card Types")
plt.xticks(rotation=45)
plt.show()
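
One more optional view: a heatmap of the null mask makes missingness patterns, such as blocks of rows that are missing together, easy to spot.

# Heatmap of missing values: each light cell marks a NaN
sns.heatmap(df.isnull(), cbar=False)
plt.title("Missing Value Map")
plt.show()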

1.3 Handling Missing Values

What to do with NaNs?

Option 1: Delete rows or columns

# Delete rows with any NaN
df_clean = df.dropna()

# Drop columns with more than 50% NaN (thresh = minimum number of non-null values required to keep a column)
df_clean = df.dropna(axis=1, thresh=int(len(df) * 0.5))

⚠️ Caution: Only advisable when few data points are lost. If you delete 30% of your rows, the remaining sample may no longer represent the population, and your model may become biased.
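
Before committing to dropna(), it is worth measuring exactly how much data you would lose; a minimal sketch:

# Count rows that dropna() would remove
rows_before = len(df)
rows_after = len(df.dropna())
lost = rows_before - rows_after
print(f"dropna() would remove {lost} rows ({lost / rows_before:.1%} of the data)")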


Option 2: Imputation (fill with values)

from sklearn.impute import SimpleImputer

# Impute mean for numeric variables
imputer_num = SimpleImputer(strategy='mean')
df[['edad', 'monto_transaccion']] = imputer_num.fit_transform(df[['edad', 'monto_transaccion']])

# Impute mode for categorical variables
imputer_cat = SimpleImputer(strategy='most_frequent')
df[['tipo_tarjeta']] = imputer_cat.fit_transform(df[['tipo_tarjeta']])

Best practice: Use ColumnTransformer to apply different strategies to different columns (we’ll cover this in detail in Module 3).
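
As a preview of that pattern, here is a minimal sketch (column names follow the examples above) that bundles both imputers into one transformer. In a real project, fit it on the training split only, so test-set statistics do not leak into the imputation:

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

preprocessor = ColumnTransformer(transformers=[
    ('num', SimpleImputer(strategy='mean'), ['edad', 'monto_transaccion']),
    ('cat', SimpleImputer(strategy='most_frequent'), ['tipo_tarjeta']),
])

# fit_transform learns the means/modes and applies them in one step
X_imputed = preprocessor.fit_transform(df[['edad', 'monto_transaccion', 'tipo_tarjeta']])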


1.4 Handling Outliers

What is an outlier?

A value that significantly deviates from the rest of the data. It may be:

  • An input error (e.g., age = 999)
  • A real but extreme case (e.g., a $1,000,000 transaction in a dataset of average $50 purchases)

Detection method: the IQR rule (interquartile range). Any point more than 1.5 × IQR below Q1 or above Q3 is flagged as an outlier:
Q1 = df['monto_transaccion'].quantile(0.25)
Q3 = df['monto_transaccion'].quantile(0.75)
IQR = Q3 - Q1

limite_inferior = Q1 - 1.5 * IQR
limite_superior = Q3 + 1.5 * IQR

outliers = df[(df['monto_transaccion'] < limite_inferior) | (df['monto_transaccion'] > limite_superior)]
print(f"Outliers detected: {len(outliers)}")

What to do with them?

  • Delete them (if they are clearly errors)
  • Capping: replace them with the lower/upper limits computed above
df['monto_transaccion'] = df['monto_transaccion'].clip(lower=limite_inferior, upper=limite_superior)
  • Logarithmic transformation, if the distribution is highly skewed (a quick skew check follows this list)
import numpy as np
df['log_monto'] = np.log1p(df['monto_transaccion'])  # log(1 + x): defined at 0, unlike log(x)
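
To confirm that the log transform actually helps, compare the skewness before and after; values near 0 indicate rough symmetry, large positive values a long right tail:

# Quick skew check on the original and log-transformed columns
print(f"Skew before: {df['monto_transaccion'].skew():.2f}")
print(f"Skew after:  {df['log_monto'].skew():.2f}")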

📝 Exercise 1.1: Diagnosis and Cleaning

Suggested dataset: fraud_data.csv (simulated, with columns: user_id, monto, edad, pais, tipo_tarjeta, hora_dia, es_fraude)

Tasks (a starter skeleton follows the list):

  1. Load the dataset and display .info() and .describe().
  2. Identify which columns have missing values and decide how to impute them (justify your choice).
  3. Use a boxplot to identify outliers in monto. Apply IQR-based capping.
  4. Check the edad distribution. Are there impossible values (e.g., age < 0 or > 120)? Correct them.
  5. Save the cleaned dataset as fraud_clean.csv.
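
A possible starter skeleton (file and column names as suggested above; the choices in steps 2-4 are yours to justify):

import numpy as np
import pandas as pd

df = pd.read_csv("fraud_data.csv")

# 1. Overview
df.info()
print(df.describe())

# 2. Missing values per column (choose and justify an imputation strategy)
print(df.isnull().sum())

# 3. IQR-based capping on 'monto'
Q1, Q3 = df['monto'].quantile([0.25, 0.75])
IQR = Q3 - Q1
df['monto'] = df['monto'].clip(Q1 - 1.5 * IQR, Q3 + 1.5 * IQR)

# 4. Treat impossible ages as missing, then impute with the strategy from step 2
df.loc[(df['edad'] < 0) | (df['edad'] > 120), 'edad'] = np.nan

# 5. Save the cleaned dataset
df.to_csv("fraud_clean.csv", index=False)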

💡 Additional Notes:

  • Never assume data is clean. Always explore first.
  • Document every change. Use comments or markdown cells to explain why you deleted or imputed something.
  • Outliers aren’t always bad. In fraud detection, the outlier may be exactly what you’re looking for!
