📊 MODULE 3: “Scaling, Selection, and Feature Extraction: The Art of Preparing Inputs”

Objective:

Master how to transform, select, and create features so models learn efficiently, stably, and without numerical bias. Learn why not all variables are useful—and how to make the useful ones shine.


3.1 Why Scale Features?

Many ML algorithms (SVM, KNN, logistic regression, neural networks) are sensitive to feature scales because they rely on distances between points or on gradient-based optimization.

Imagine:

  • Feature 1: edad → range 18 to 90
  • Feature 2: ingreso_anual → range 20,000 to 500,000

Without scaling, the algorithm will give MUCH more weight to ingreso_anual simply because its numbers are larger—even if edad is more predictive!
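To make this concrete, here is a minimal sketch with made-up numbers showing how the raw income column dominates a Euclidean distance (the quantity KNN and many kernels rely on), and how standardization (introduced in the next section) restores balance:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Two clients: very different in edad, almost identical in ingreso_anual (made-up numbers)
X = np.array([
    [20,  50_000],   # [edad, ingreso_anual]
    [70,  51_000],
    [40, 200_000],
    [55,  35_000],
])

# Raw distance between the first two clients: the 1,000 income gap swamps the 50-year age gap
print(np.linalg.norm(X[0] - X[1]))          # ≈ 1001

# After standardization both features contribute on a comparable scale
X_std = StandardScaler().fit_transform(X)
print(np.linalg.norm(X_std[0] - X_std[1]))  # ≈ 2.7, now driven mostly by the age gap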


3.2 Scaling Methods

➤ StandardScaler (Standardization)

Transforms data to have mean = 0 and standard deviation = 1.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df[['edad', 'ingreso_anual']] = scaler.fit_transform(df[['edad', 'ingreso_anual']])

When to use: When data approximately follows a normal distribution. Ideal for linear models, SVM, neural networks.


➤ MinMaxScaler (Normalization)

Transforms data to a fixed range, typically [0, 1].

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df[['monto_transaccion', 'antiguedad_cliente']] = scaler.fit_transform(df[['monto_transaccion', 'antiguedad_cliente']])

When to use: When you know min/max bounds, or when using neural networks with activation functions like sigmoid or tanh.

⚠️ Beware of outliers: A single extreme value can compress the entire rest of the range.
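A quick sketch with made-up transaction amounts shows the effect: one extreme value forces every ordinary amount into a tiny slice of the [0, 1] range.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Nine ordinary amounts plus one extreme outlier (made-up values)
montos = np.array([[10], [25], [40], [55], [70], [85], [100], [115], [130], [100_000]])

print(MinMaxScaler().fit_transform(montos).ravel())
# The nine normal amounts all land below ~0.0013, while the outlier alone
# occupies the rest of the [0, 1] range.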


➤ RobustScaler

Uses the median and interquartile range (IQR). Robust to outliers.

from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
df[['monto_transaccion']] = scaler.fit_transform(df[['monto_transaccion']])

When to use: When you have many outliers and don’t want to remove or transform them.
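For intuition, with its default settings RobustScaler computes (x − median) / IQR, where the IQR is the 25th-to-75th percentile range. A small sketch verifying that by hand, using made-up values:

import numpy as np
from sklearn.preprocessing import RobustScaler

x = np.array([[5.0], [7.0], [8.0], [9.0], [12.0], [300.0]])  # made-up values, one outlier

# scikit-learn's version (defaults: center on the median, scale by the IQR)
print(RobustScaler().fit_transform(x).ravel())

# Manual equivalent
mediana = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
print(((x - mediana) / (q3 - q1)).ravel())  # same numbers as above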


📈 Visualization: Comparing Scalings

import matplotlib.pyplot as plt
import seaborn as sns

# Original (unscaled) data: df_original is a copy of the DataFrame kept before any scaling
original = df_original['monto_transaccion'].values.reshape(-1, 1)
standard_scaled = StandardScaler().fit_transform(original)
minmax_scaled = MinMaxScaler().fit_transform(original)
robust_scaled = RobustScaler().fit_transform(original)

fig, ax = plt.subplots(2, 2, figsize=(12, 8))

sns.histplot(original.ravel(), bins=30, ax=ax[0, 0], kde=True)
ax[0, 0].set_title("Original")

sns.histplot(standard_scaled.ravel(), bins=30, ax=ax[0, 1], kde=True)
ax[0, 1].set_title("StandardScaler")

sns.histplot(minmax_scaled.ravel(), bins=30, ax=ax[1, 0], kde=True)
ax[1, 0].set_title("MinMaxScaler")

sns.histplot(robust_scaled.ravel(), bins=30, ax=ax[1, 1], kde=True)
ax[1, 1].set_title("RobustScaler")

plt.tight_layout()
plt.show()

3.3 Feature Selection

Not all variables are useful. Some are redundant, irrelevant, or noisy. Keeping them:

  • Increases training time.
  • May cause overfitting.
  • Makes interpretation harder.

➤ Variance-based elimination

If a variable barely changes (e.g., 99% of values are 0), it adds no information.

from sklearn.feature_selection import VarianceThreshold

selector = VarianceThreshold(threshold=0.01)  # drops features whose variance is at or below 0.01
X_high_variance = selector.fit_transform(X)
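Note that fit_transform returns a plain NumPy array and discards the column names. Assuming X is a pandas DataFrame, get_support() recovers which columns survived:

kept_columns = X.columns[selector.get_support()]  # boolean mask of retained features
print("Kept features:", kept_columns.tolist())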

➤ Univariate Selection

Uses statistical tests to measure the relationship between each feature and the target variable.

from sklearn.feature_selection import SelectKBest, f_classif

# Select top 10 features using ANOVA F-test
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)

# View scores
scores = selector.scores_
feature_names = X.columns
plt.figure(figsize=(10,6))
sns.barplot(x=scores, y=feature_names)
plt.title("Feature Importance (ANOVA F-test)")
plt.show()

➤ Recursive Feature Elimination (RFE)

Trains a model and iteratively removes the least important features until the desired number remains.

from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

modelo_base = RandomForestClassifier(n_estimators=10, random_state=42)
rfe = RFE(estimator=modelo_base, n_features_to_select=8)
X_rfe = rfe.fit_transform(X, y)

# View selected features
selected_features = X.columns[rfe.support_]
print("Selected features:", selected_features.tolist())

3.4 Feature Extraction

Instead of selecting, create new features from existing ones.

➤ Principal Components (PCA)

Reduces dimensionality by transforming original variables into a smaller set of uncorrelated variables (principal components).

from sklearn.decomposition import PCA

pca = PCA(n_components=5)  # reduce to 5 components
X_pca = pca.fit_transform(X_scaled)  # X_scaled: features standardized beforehand (PCA is scale-sensitive)

# View variance explained per component
plt.figure(figsize=(8,5))
plt.plot(range(1,6), pca.explained_variance_ratio_.cumsum(), marker='o')
plt.title("Cumulative Variance Explained by PCA Components")
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Variance")
plt.grid()
plt.show()

When to use PCA: When you have many correlated variables, or for visualization (reduce to 2D/3D).

⚠️ Disadvantage: You lose interpretability. “Component 1” has no clear business meaning.
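Two practical follow-ups, sketched below under the assumption that X_scaled comes from the previous snippet and X is the original DataFrame: you can ask PCA for a variance fraction instead of a fixed number of components, and the loadings in pca.components_ at least show which original features each component mixes.

import pandas as pd
from sklearn.decomposition import PCA

# Keep however many components are needed to explain 95% of the variance
pca_95 = PCA(n_components=0.95)
X_pca_95 = pca_95.fit_transform(X_scaled)
print("Components kept:", pca_95.n_components_)

# Loadings: contribution of each original feature to each component
loadings = pd.DataFrame(
    pca_95.components_,
    columns=X.columns,  # assumes X is a DataFrame with named columns
    index=[f"PC{i + 1}" for i in range(pca_95.n_components_)],
)
print(loadings.round(2))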


📝 Exercise 3.1: Transformation and Feature Selection

Dataset: fraud_encoded.csv (from previous module, already encoded)

Tasks:

  1. Separate features (X) from target variable (y = es_fraude).
  2. Split into train/test (80/20, stratify=y).
  3. Apply StandardScaler to continuous numeric variables (e.g., edad, ingreso, monto).
  4. Use VarianceThreshold to remove features with variance < 0.01.
  5. Use SelectKBest with f_classif to select the 12 most relevant features.
  6. Train a simple model (LogisticRegression) with selected features.
  7. Compare performance (accuracy) with a model trained on all features (without selection).
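A minimal sketch of one possible workflow (the continuous column names edad, ingreso, and monto are taken from task 3 and may differ in your file):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

df = pd.read_csv("fraud_encoded.csv")
X = df.drop(columns=["es_fraude"])
y = df["es_fraude"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_train, X_test = X_train.copy(), X_test.copy()  # avoid modifying views

# Scale continuous columns: fit on train only, then transform test
continuas = ["edad", "ingreso", "monto"]  # adjust to your dataset
scaler = StandardScaler()
X_train[continuas] = scaler.fit_transform(X_train[continuas])
X_test[continuas] = scaler.transform(X_test[continuas])

# Drop near-constant features, then keep the 12 best by ANOVA F-test
# (assumes at least 12 features remain after the variance filter)
vt = VarianceThreshold(threshold=0.01)
X_train_vt = vt.fit_transform(X_train)
X_test_vt = vt.transform(X_test)

skb = SelectKBest(score_func=f_classif, k=12)
X_train_sel = skb.fit_transform(X_train_vt, y_train)
X_test_sel = skb.transform(X_test_vt)

# Model on selected features vs. model on all features
model_sel = LogisticRegression(max_iter=1000).fit(X_train_sel, y_train)
model_all = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("Accuracy (selected):", accuracy_score(y_test, model_sel.predict(X_test_sel)))
print("Accuracy (all features):", accuracy_score(y_test, model_all.predict(X_test)))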

💡 Additional Notes:

  • Always fit the scaler ONLY on the training set, then apply transform() (not fit_transform()) to the test set.
  • Feature selection must be done WITHIN cross-validation to avoid optimistically biased metrics; see the Pipeline sketch after this list.
  • PCA is not magic. If your variables are already few and uncorrelated, PCA may worsen performance.
  • In Kaggle competitions, feature engineering often separates the top 10% from the rest.
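The first two points can be handled automatically by wrapping the scaler, the selector, and the model in a Pipeline, so each cross-validation fold re-fits the preprocessing on its own training portion only. A minimal sketch, assuming X and y are the full feature matrix and target:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("selector", SelectKBest(score_func=f_classif, k=12)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Scaling and selection are re-fit inside every fold, so nothing from the
# validation fold leaks into the preprocessing
scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
print("CV accuracy:", round(scores.mean(), 3))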
