🔄 MODULE 5: “Cross-Validation and Overfitting: Building Models That Generalize”

Objective:

Understand why a model performing well on training data may fail in the real world. Learn to detect, prevent, and measure overfitting using robust validation techniques.


5.1 What is Overfitting?

Overfitting occurs when a model learns the training data too well—including noise and random patterns—instead of general patterns. Result:

  • ✅ Excellent performance on training data.
  • ❌ Poor performance on new data (test or production).

Imagine a student who memorizes the answers to past exams instead of understanding the concepts. When the questions change, they fail.


5.2 How to Detect Overfitting?

The simplest method: compare performance on the training set against a held-out validation or test set.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

modelo = LogisticRegression()
modelo.fit(X_train, y_train)

# Training performance
y_train_pred = modelo.predict(X_train)
train_acc = accuracy_score(y_train, y_train_pred)

# Test performance
y_test_pred = modelo.predict(X_test)
test_acc = accuracy_score(y_test, y_test_pred)

print(f"Training Accuracy: {train_acc:.4f}")
print(f"Test Accuracy:     {test_acc:.4f}")

# If train_acc >> test_acc → Overfitting!

⚠️ Typical example:

  • Train Accuracy: 0.998
  • Test Accuracy: 0.821
    → Model is overfitted.

5.3 What Causes Overfitting?

  • Model too complex (e.g., a decision tree with depth 20 on a small dataset; see the sketch after this list).
  • Too little training data.
  • Too many features (curse of dimensionality).
  • Noise in data (incorrect labels, untreated outliers).
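
To see the first cause in action, here is a minimal sketch on a small synthetic dataset (all names below are illustrative and not part of the course data):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Small synthetic dataset: 300 samples, 20 features, only 5 of them informative
X_demo, y_demo = make_classification(n_samples=300, n_features=20,
                                     n_informative=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          random_state=42, stratify=y_demo)

# A very deep tree has enough capacity to memorize the training set
arbol_profundo = DecisionTreeClassifier(max_depth=20, random_state=42)
arbol_profundo.fit(X_tr, y_tr)

print("Train accuracy:", accuracy_score(y_tr, arbol_profundo.predict(X_tr)))
print("Test accuracy: ", accuracy_score(y_te, arbol_profundo.predict(X_te)))
# Expect a noticeable gap between the two scores → overfitting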

5.4 Solutions to Avoid Overfitting

➤ Regularization

Adds a penalty to the loss function to prevent coefficients from becoming too large.

# Logistic regression with L2 regularization (Ridge)
modelo_l2 = LogisticRegression(penalty='l2', C=1.0)  # smaller C = more regularization

# With L1 (Lasso) for automatic feature selection
modelo_l1 = LogisticRegression(penalty='l1', solver='liblinear', C=0.1)

L1 (Lasso): Can zero out coefficients → feature selection.
L2 (Ridge): Shrinks coefficients but doesn’t zero them → more stable.
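
To see the difference in practice, you can count how many coefficients each model sets exactly to zero (a minimal sketch, assuming X_train and y_train already exist and the features are scaled):

import numpy as np

modelo_l1.fit(X_train, y_train)
modelo_l2.fit(X_train, y_train)

# L1 tends to produce sparse solutions: some coefficients end up exactly at 0
print("Zeroed coefficients (L1):", np.sum(modelo_l1.coef_ == 0))
print("Zeroed coefficients (L2):", np.sum(modelo_l2.coef_ == 0))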


➤ Simplify the Model

  • Reduce tree depth (see the sketch after this list).
  • Reduce layers/neurons in networks.
  • Use fewer features (feature selection).
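
For instance, capping tree depth and keeping only the k most informative features are both one-line changes (a sketch; k=10 is an arbitrary choice, adjust it to your data):

from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import SelectKBest, f_classif

# Shallower tree: fewer splits, less capacity to memorize noise
arbol_simple = DecisionTreeClassifier(max_depth=5, random_state=42)

# Keep only the 10 features most associated with the target
selector = SelectKBest(score_func=f_classif, k=10)
X_train_reducido = selector.fit_transform(X_train, y_train)
X_test_reducido = selector.transform(X_test)  # reuse the same selection on test data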

➤ More Data

Simple but powerful. More representative data helps the model learn more general patterns.


➤ Cross-Validation

The best tool to evaluate a model’s generalization ability before seeing test data.


5.5 Cross-Validation: Your Shield Against Optimism

Simple validation (train/test split) can be misleading if the split is lucky or unlucky.

Cross-validation (CV) splits data into K folds, trains K times, each time using a different fold as validation.

Result: a more robust, reliable performance estimate.

from sklearn.model_selection import cross_val_score

modelo = LogisticRegression()

# 5-fold cross-validation
scores = cross_val_score(modelo, X_train, y_train, cv=5, scoring='roc_auc')

print(f"AUC-ROC per fold: {scores}")
print(f"Average AUC-ROC: {scores.mean():.4f} (+/- {scores.std() * 2:.4f})")

Advantages:

  • Every observation is used for both training and validation (in different folds).
  • Reduces the variance of the performance estimate.
  • Ideal for comparing models or tuning hyperparameters (see the sketch below).
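
As a sketch of the last point, GridSearchCV combines cross-validation with a hyperparameter search (the grid of C values below is just an example):

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

# Each candidate C is evaluated with 5-fold CV; the best one is refit on all of X_train
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid={'C': [0.01, 0.1, 1, 10]},
                    cv=5, scoring='roc_auc')
grid.fit(X_train, y_train)

print("Best C:", grid.best_params_['C'])
print(f"Best CV AUC-ROC: {grid.best_score_:.4f}")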

5.6 Types of Cross-Validation

➤ K-Fold CV (standard)

Splits into K equal parts. Each fold is used once as validation.

from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)

➤ Stratified K-Fold CV

Maintains class proportions in each fold. ESPECIALLY IMPORTANT FOR IMBALANCED DATASETS.

from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(modelo, X_train, y_train, cv=skf, scoring='roc_auc')

➤ Leave-One-Out CV (LOO)

Each fold is a single observation. Very computationally expensive; only for very small datasets.
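
A minimal sketch with LeaveOneOut (accuracy is used here because AUC-ROC is not defined on a single-observation fold):

from sklearn.model_selection import LeaveOneOut, cross_val_score

loo = LeaveOneOut()  # as many folds as observations

# Trains one model per observation: affordable only on very small datasets
scores_loo = cross_val_score(modelo, X_train, y_train, cv=loo, scoring='accuracy')
print(f"LOO accuracy: {scores_loo.mean():.4f}")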


📈 Visualization: Comparing Train/Test vs CV

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score

# Train and evaluate with a simple train/test split
modelo.fit(X_train, y_train)
test_score = roc_auc_score(y_test, modelo.predict_proba(X_test)[:, 1])

# Evaluate with CV on training set
cv_scores = cross_val_score(modelo, X_train, y_train, cv=5, scoring='roc_auc')

plt.figure(figsize=(8,5))
plt.axhline(y=test_score, color='red', linestyle='--', label=f'Test Score: {test_score:.4f}')
plt.plot(range(1,6), cv_scores, 'bo-', label='CV Scores per Fold')
plt.axhline(y=cv_scores.mean(), color='blue', linestyle='-', label=f'CV Average: {cv_scores.mean():.4f}')
plt.title("Comparison: Cross-Validation vs Final Test")
plt.xlabel("Fold")
plt.ylabel("AUC-ROC")
plt.legend()
plt.grid()
plt.show()

📝 Exercise 5.1: Overfitting Diagnosis and Prevention

Dataset: fraud_features.csv (preprocessed with selected features)

Tasks:

  1. Train a Random Forest model with max depth = 20 (see the starter sketch after this list).
  2. Calculate accuracy and AUC-ROC on training and test sets. Is there overfitting?
  3. Apply stratified 5-fold cross-validation on the training set. Compare mean AUC-ROC with test score.
  4. Now train a Random Forest with max depth = 5. Repeat steps 2 and 3.
  5. Compare both models: which generalizes better? Why?
  6. (Optional) Try Logistic Regression with regularization (C=0.01) and compare.
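
A possible starting point for tasks 1-3 (a sketch only; the target column name 'is_fraud' is an assumption, replace it with the actual column in fraud_features.csv):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import roc_auc_score

df = pd.read_csv("fraud_features.csv")
X = df.drop(columns=["is_fraud"])  # 'is_fraud' is a placeholder target name
y = df["is_fraud"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Tasks 1-2: deep forest, compare train vs. test AUC-ROC
rf_profundo = RandomForestClassifier(max_depth=20, random_state=42)
rf_profundo.fit(X_train, y_train)
print("Train AUC:", roc_auc_score(y_train, rf_profundo.predict_proba(X_train)[:, 1]))
print("Test AUC: ", roc_auc_score(y_test, rf_profundo.predict_proba(X_test)[:, 1]))

# Task 3: stratified 5-fold CV on the training set
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_auc = cross_val_score(rf_profundo, X_train, y_train, cv=skf, scoring='roc_auc')
print(f"CV AUC (mean): {cv_auc.mean():.4f}")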

💡 Additional Notes:

  • Always use StratifiedKFold for imbalanced classification problems.
  • Cross-validation MUST NOT include the final test set. The test set is sacred: use it only once, at the end.
  • Some degree of overfitting is often unavoidable in practice. What matters is detecting it and keeping it under control.
  • In Kaggle competitions, winners use stratified cross-validation with multiple seeds to ensure stability (see the sketch below).
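
A sketch of that idea using RepeatedStratifiedKFold, which repeats stratified K-fold with different random shuffles:

from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# 5 folds repeated 3 times with different shuffles → 15 scores in total
rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)
scores = cross_val_score(modelo, X_train, y_train, cv=rskf, scoring='roc_auc')
print(f"AUC-ROC: {scores.mean():.4f} (+/- {scores.std() * 2:.4f})")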

Course Info

Course: AI-course1

Language: EN

Lesson: Module5