📦 MODULE 1: “Data Quality: Your Model is Only as Good as Your Data”

Objective:

Understand why preprocessing is not optional—it’s the core of any successful ML model. Learn to diagnose dataset health before touching any algorithm.


1.1 Why Is Preprocessing 80% of the Work?

“Data science is 80% data cleaning, 20% complaining about cleaning data.” — Anonymous

In the real world, data never comes clean, ordered, and ready to use. It arrives with:

  • Missing values (NaN, None, "", ?)
  • Typographical errors (“Madird”, “Barcelonaa”)
  • Extreme outliers (a salary of 9999999 in a survey of average incomes)
  • Inconsistent formats (dates like “2024/05/01”, “01-May-2024”, “1 de mayo”)
  • Unstandardized categorical variables (“Sí”, “si”, “SI”, “yes”)

Direct consequence: Feeding dirty data into a model causes it to learn incorrect patterns. Garbage In → Garbage Out.
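
As a small illustration of the cleanup this implies, here is a minimal sketch (assuming a hypothetical respuesta column holding the yes/no variants listed above) that normalizes them to one canonical label:

import pandas as pd

# Hypothetical column with inconsistent yes/no answers
df = pd.DataFrame({'respuesta': ['Sí', 'si', 'SI', 'yes', None]})

# Normalize: strip whitespace, lowercase, remove accents, map to one canonical label
normalized = (df['respuesta']
              .str.strip()
              .str.lower()
              .str.normalize('NFKD')
              .str.encode('ascii', 'ignore')
              .str.decode('ascii'))
df['respuesta'] = normalized.map({'si': 'yes', 'yes': 'yes'})
print(df['respuesta'])  # all valid variants become 'yes'; NaN stays NaN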


1.2 Initial Dataset Diagnosis

Before doing anything, explore your dataset like a detective.

Key Pandas tools:

import pandas as pd

# Load dataset
df = pd.read_csv("datos_fraude.csv")

# Quick preview
print(df.head())
df.info()  # Data types and non-null counts (info() prints directly; no print() needed)
print(df.describe())  # Descriptive statistics (numeric only)

# View unique values in categorical columns
print(df['tipo_transaccion'].unique())
print(df['pais'].value_counts())

# Check for null values
print(df.isnull().sum())
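
Two further quick checks that often pay off (not shown above; a suggested addition): the share of missing values per column and the count of fully duplicated rows.

# Percentage of missing values per column (mean of the boolean null mask)
print(df.isnull().mean().sort_values(ascending=False) * 100)

# Fully duplicated rows
print(f"Duplicate rows: {df.duplicated().sum()}")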

Diagnostic visualization with Seaborn:

import seaborn as sns
import matplotlib.pyplot as plt

# Histogram of a numeric variable
sns.histplot(df['monto_transaccion'], bins=50, kde=True)
plt.title("Transaction Amount Distribution")
plt.show()

# Boxplot to detect outliers
sns.boxplot(x=df['monto_transaccion'])
plt.title("Boxplot: Finding Outliers in Amounts")
plt.show()

# Bar plot for categorical variables
sns.countplot(data=df, x='tipo_tarjeta')
plt.title("Distribution of Card Types")
plt.xticks(rotation=45)
plt.show()
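
One more optional view: a heatmap of the null mask makes missingness patterns, such as blocks of rows that are missing together, easy to spot.

# Heatmap of missing values: each light cell marks a NaN
sns.heatmap(df.isnull(), cbar=False)
plt.title("Missing Value Map")
plt.show()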

1.3 Handling Missing Values

What to do with NaNs?

Option 1: Delete rows or columns

# Delete rows with any NaN
df_clean = df.dropna()

# Drop columns with more than 50% NaN (thresh = minimum number of non-null values required to keep a column)
df_clean = df.dropna(axis=1, thresh=int(len(df) * 0.5))

⚠️ Caution: Only advisable when few data points are lost. If you delete 30% of your rows, the remaining sample may no longer represent the population, and your model may become biased.
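
Before committing to dropna(), it is worth measuring exactly how much data you would lose; a minimal sketch:

# Count rows that dropna() would remove
rows_before = len(df)
rows_after = len(df.dropna())
lost = rows_before - rows_after
print(f"dropna() would remove {lost} rows ({lost / rows_before:.1%} of the data)")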


Option 2: Imputation (fill with values)

from sklearn.impute import SimpleImputer

# Impute mean for numeric variables
imputer_num = SimpleImputer(strategy='mean')
df[['edad', 'monto_transaccion']] = imputer_num.fit_transform(df[['edad', 'monto_transaccion']])

# Impute mode for categorical variables
imputer_cat = SimpleImputer(strategy='most_frequent')
df[['tipo_tarjeta']] = imputer_cat.fit_transform(df[['tipo_tarjeta']])

Best practice: Use ColumnTransformer to apply different strategies to different columns (we’ll cover this in detail in Module 3).
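
As a preview of that pattern, here is a minimal sketch (column names follow the examples above) that bundles both imputers into one transformer. In a real project, fit it on the training split only, so test-set statistics do not leak into the imputation:

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

preprocessor = ColumnTransformer(transformers=[
    ('num', SimpleImputer(strategy='mean'), ['edad', 'monto_transaccion']),
    ('cat', SimpleImputer(strategy='most_frequent'), ['tipo_tarjeta']),
])

# fit_transform learns the means/modes and applies them in one step
X_imputed = preprocessor.fit_transform(df[['edad', 'monto_transaccion', 'tipo_tarjeta']])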


1.4 Handling Outliers

What is an outlier?

A value that significantly deviates from the rest of the data. It may be:

  • An input error (e.g., age = 999)
  • A real but extreme case (e.g., a $1,000,000 transaction in a dataset of average $50 purchases)

Detection method: the IQR rule (interquartile range). Any point more than 1.5 × IQR below Q1 or above Q3 is flagged as an outlier:
Q1 = df['monto_transaccion'].quantile(0.25)
Q3 = df['monto_transaccion'].quantile(0.75)
IQR = Q3 - Q1

limite_inferior = Q1 - 1.5 * IQR
limite_superior = Q3 + 1.5 * IQR

outliers = df[(df['monto_transaccion'] < limite_inferior) | (df['monto_transaccion'] > limite_superior)]
print(f"Outliers detected: {len(outliers)}")

What to do with them?

  • Delete them (if they are clearly errors)
  • Capping: replace them with the lower/upper limits computed above
df['monto_transaccion'] = df['monto_transaccion'].clip(lower=limite_inferior, upper=limite_superior)
  • Logarithmic transformation, if the distribution is highly skewed (a quick skew check follows this list)
import numpy as np
df['log_monto'] = np.log1p(df['monto_transaccion'])  # log(1 + x): defined at 0, unlike log(x)
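
To confirm that the log transform actually helps, compare the skewness before and after; values near 0 indicate rough symmetry, large positive values a long right tail:

# Quick skew check on the original and log-transformed columns
print(f"Skew before: {df['monto_transaccion'].skew():.2f}")
print(f"Skew after:  {df['log_monto'].skew():.2f}")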

📝 Exercise 1.1: Diagnosis and Cleaning

Suggested dataset: fraud_data.csv (simulated, with columns: user_id, monto, edad, pais, tipo_tarjeta, hora_dia, es_fraude)

Tasks (a starter skeleton follows the list):

  1. Load the dataset and display .info() and .describe().
  2. Identify which columns have missing values and decide how to impute them (justify your choice).
  3. Use a boxplot to identify outliers in monto. Apply IQR-based capping.
  4. Check the edad distribution. Are there impossible values (e.g., age < 0 or > 120)? Correct them.
  5. Save the cleaned dataset as fraud_clean.csv.
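
A possible starter skeleton (file and column names as suggested above; the choices in steps 2-4 are yours to justify):

import numpy as np
import pandas as pd

df = pd.read_csv("fraud_data.csv")

# 1. Overview
df.info()
print(df.describe())

# 2. Missing values per column (choose and justify an imputation strategy)
print(df.isnull().sum())

# 3. IQR-based capping on 'monto'
Q1, Q3 = df['monto'].quantile([0.25, 0.75])
IQR = Q3 - Q1
df['monto'] = df['monto'].clip(Q1 - 1.5 * IQR, Q3 + 1.5 * IQR)

# 4. Treat impossible ages as missing, then impute with the strategy from step 2
df.loc[(df['edad'] < 0) | (df['edad'] > 120), 'edad'] = np.nan

# 5. Save the cleaned dataset
df.to_csv("fraud_clean.csv", index=False)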

💡 Additional Notes:

  • Never assume data is clean. Always explore first.
  • Document every change. Use comments or markdown cells to explain why you deleted or imputed something.
  • Outliers aren’t always bad. In fraud detection, the outlier may be exactly what you’re looking for!
