🔄 MODULE 5: “Cross-Validation and Overfitting: Building Models That Generalize”

Objective:

Understand why a model performing well on training data may fail in the real world. Learn to detect, prevent, and measure overfitting using robust validation techniques.


5.1 What is Overfitting?

Overfitting occurs when a model learns the training data too well—including noise and random patterns—instead of general patterns. Result:

  • ✅ Excellent performance on training data.
  • ❌ Poor performance on new data (test or production).

Imagine a student who memorizes the answers to past exams instead of understanding the concepts. When the questions change, they fail.


5.2 How to Detect Overfitting?

The simplest method: compare performance on the training set against a held-out validation or test set.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

modelo = LogisticRegression()
modelo.fit(X_train, y_train)

# Training performance
y_train_pred = modelo.predict(X_train)
train_acc = accuracy_score(y_train, y_train_pred)

# Test performance
y_test_pred = modelo.predict(X_test)
test_acc = accuracy_score(y_test, y_test_pred)

print(f"Training Accuracy: {train_acc:.4f}")
print(f"Test Accuracy:     {test_acc:.4f}")

# If train_acc >> test_acc → Overfitting!

⚠️ Typical example:

  • Train Accuracy: 0.998
  • Test Accuracy: 0.821
    → Model is overfitted.

5.3 What Causes Overfitting?

  • Model too complex (e.g., a decision tree with depth 20 on a small dataset; see the sketch after this list).
  • Too little training data.
  • Too many features (curse of dimensionality).
  • Noise in data (incorrect labels, untreated outliers).
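
To see the first cause in action, here is a minimal sketch on a small synthetic dataset (all names below are illustrative and not part of the course data):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Small synthetic dataset: 300 samples, 20 features, only 5 of them informative
X_demo, y_demo = make_classification(n_samples=300, n_features=20,
                                     n_informative=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          random_state=42, stratify=y_demo)

# A very deep tree has enough capacity to memorize the training set
arbol_profundo = DecisionTreeClassifier(max_depth=20, random_state=42)
arbol_profundo.fit(X_tr, y_tr)

print("Train accuracy:", accuracy_score(y_tr, arbol_profundo.predict(X_tr)))
print("Test accuracy: ", accuracy_score(y_te, arbol_profundo.predict(X_te)))
# Expect a noticeable gap between the two scores → overfitting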

5.4 Solutions to Avoid Overfitting

➤ Regularization

Adds a penalty to the loss function to prevent coefficients from becoming too large.

# Logistic regression with L2 regularization (Ridge)
modelo_l2 = LogisticRegression(penalty='l2', C=1.0)  # smaller C = more regularization

# With L1 (Lasso) for automatic feature selection
modelo_l1 = LogisticRegression(penalty='l1', solver='liblinear', C=0.1)

L1 (Lasso): Can zero out coefficients → feature selection.
L2 (Ridge): Shrinks coefficients but doesn’t zero them → more stable.
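
To see the difference in practice, you can count how many coefficients each model sets exactly to zero (a minimal sketch, assuming X_train and y_train already exist and the features are scaled):

import numpy as np

modelo_l1.fit(X_train, y_train)
modelo_l2.fit(X_train, y_train)

# L1 tends to produce sparse solutions: some coefficients end up exactly at 0
print("Zeroed coefficients (L1):", np.sum(modelo_l1.coef_ == 0))
print("Zeroed coefficients (L2):", np.sum(modelo_l2.coef_ == 0))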


➤ Simplify the Model

  • Reduce tree depth (see the sketch after this list).
  • Reduce layers/neurons in networks.
  • Use fewer features (feature selection).
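
For instance, capping tree depth and keeping only the k most informative features are both one-line changes (a sketch; k=10 is an arbitrary choice, adjust it to your data):

from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import SelectKBest, f_classif

# Shallower tree: fewer splits, less capacity to memorize noise
arbol_simple = DecisionTreeClassifier(max_depth=5, random_state=42)

# Keep only the 10 features most associated with the target
selector = SelectKBest(score_func=f_classif, k=10)
X_train_reducido = selector.fit_transform(X_train, y_train)
X_test_reducido = selector.transform(X_test)  # reuse the same selection on test data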

➤ More Data

Simple but powerful. More representative data helps the model learn more general patterns.


➤ Cross-Validation

The best tool to evaluate a model’s generalization ability before seeing test data.


5.5 Cross-Validation: Your Shield Against Optimism

Simple validation (train/test split) can be misleading if the split is lucky or unlucky.

Cross-validation (CV) splits data into K folds, trains K times, each time using a different fold as validation.

Result: a more robust, reliable performance estimate.

from sklearn.model_selection import cross_val_score

modelo = LogisticRegression()

# 5-fold cross-validation
scores = cross_val_score(modelo, X_train, y_train, cv=5, scoring='roc_auc')

print(f"AUC-ROC per fold: {scores}")
print(f"Average AUC-ROC: {scores.mean():.4f} (+/- {scores.std() * 2:.4f})")

Advantages:

  • Every observation is used for both training and validation (in different folds).
  • Reduces the variance of the performance estimate.
  • Ideal for comparing models or tuning hyperparameters (see the sketch below).
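
As a sketch of the last point, GridSearchCV combines cross-validation with a hyperparameter search (the grid of C values below is just an example):

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

# Each candidate C is evaluated with 5-fold CV; the best one is refit on all of X_train
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid={'C': [0.01, 0.1, 1, 10]},
                    cv=5, scoring='roc_auc')
grid.fit(X_train, y_train)

print("Best C:", grid.best_params_['C'])
print(f"Best CV AUC-ROC: {grid.best_score_:.4f}")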

5.6 Types of Cross-Validation

➤ K-Fold CV (standard)

Splits into K equal parts. Each fold is used once as validation.

from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)

➤ Stratified K-Fold CV

Maintains class proportions in each fold. ESPECIALLY IMPORTANT FOR IMBALANCED DATASETS.

from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(modelo, X_train, y_train, cv=skf, scoring='roc_auc')

➤ Leave-One-Out CV (LOO)

Each fold is a single observation. Very computationally expensive; only for very small datasets.
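
A minimal sketch with LeaveOneOut (accuracy is used here because AUC-ROC is not defined on a single-observation fold):

from sklearn.model_selection import LeaveOneOut, cross_val_score

loo = LeaveOneOut()  # as many folds as observations

# Trains one model per observation: affordable only on very small datasets
scores_loo = cross_val_score(modelo, X_train, y_train, cv=loo, scoring='accuracy')
print(f"LOO accuracy: {scores_loo.mean():.4f}")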


📈 Visualization: Comparing Train/Test vs CV

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score

# Train and evaluate with a simple train/test split
modelo.fit(X_train, y_train)
test_score = roc_auc_score(y_test, modelo.predict_proba(X_test)[:, 1])

# Evaluate with CV on training set
cv_scores = cross_val_score(modelo, X_train, y_train, cv=5, scoring='roc_auc')

plt.figure(figsize=(8,5))
plt.axhline(y=test_score, color='red', linestyle='--', label=f'Test Score: {test_score:.4f}')
plt.plot(range(1,6), cv_scores, 'bo-', label='CV Scores per Fold')
plt.axhline(y=cv_scores.mean(), color='blue', linestyle='-', label=f'CV Average: {cv_scores.mean():.4f}')
plt.title("Comparison: Cross-Validation vs Final Test")
plt.xlabel("Fold")
plt.ylabel("AUC-ROC")
plt.legend()
plt.grid()
plt.show()

📝 Exercise 5.1: Overfitting Diagnosis and Prevention

Dataset: fraud_features.csv (preprocessed with selected features)

Tasks:

  1. Train a Random Forest model with max depth = 20 (see the starter sketch after this list).
  2. Calculate accuracy and AUC-ROC on training and test sets. Is there overfitting?
  3. Apply stratified 5-fold cross-validation on the training set. Compare mean AUC-ROC with test score.
  4. Now train a Random Forest with max depth = 5. Repeat steps 2 and 3.
  5. Compare both models: which generalizes better? Why?
  6. (Optional) Try Logistic Regression with regularization (C=0.01) and compare.
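
A possible starting point for tasks 1-3 (a sketch only; the target column name 'is_fraud' is an assumption, replace it with the actual column in fraud_features.csv):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import roc_auc_score

df = pd.read_csv("fraud_features.csv")
X = df.drop(columns=["is_fraud"])  # 'is_fraud' is a placeholder target name
y = df["is_fraud"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Tasks 1-2: deep forest, compare train vs. test AUC-ROC
rf_profundo = RandomForestClassifier(max_depth=20, random_state=42)
rf_profundo.fit(X_train, y_train)
print("Train AUC:", roc_auc_score(y_train, rf_profundo.predict_proba(X_train)[:, 1]))
print("Test AUC: ", roc_auc_score(y_test, rf_profundo.predict_proba(X_test)[:, 1]))

# Task 3: stratified 5-fold CV on the training set
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_auc = cross_val_score(rf_profundo, X_train, y_train, cv=skf, scoring='roc_auc')
print(f"CV AUC (mean): {cv_auc.mean():.4f}")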

💡 Additional Notes:

  • Always use StratifiedKFold for imbalanced classification problems.
  • Cross-validation MUST NOT include the final test set. The test set is sacred: use it only once, at the end.
  • Some degree of overfitting is often unavoidable in practice. What matters is detecting it and keeping it under control.
  • In Kaggle competitions, winners use stratified cross-validation with multiple seeds to ensure stability (see the sketch below).
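
A sketch of that idea using RepeatedStratifiedKFold, which repeats stratified K-fold with different random shuffles:

from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# 5 folds repeated 3 times with different shuffles → 15 scores in total
rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)
scores = cross_val_score(modelo, X_train, y_train, cv=rskf, scoring='roc_auc')
print(f"AUC-ROC: {scores.mean():.4f} (+/- {scores.std() * 2:.4f})")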

Course Info

Course: AI-course1

Language: EN

Lesson: Module5