Learn to convert text, labels, and categories into numbers that models can understand—without introducing bias or distortion.
ML algorithms (regression, SVM, neural networks) only understand numbers. They cannot process “Visa”, “Mastercard”, or “Spain” directly.
But… beware! It’s not just about assigning arbitrary numbers. How you encode affects model performance and interpretation.
Label encoding assigns a unique integer to each category.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['tipo_tarjeta_encoded'] = le.fit_transform(df['tipo_tarjeta'])
# Classes are sorted alphabetically: ["Visa", "Mastercard", "Amex"] → [2, 1, 0]
⚠️ Serious problem: Introduces artificial order. The model may interpret “2” > “1” > “0”, as if “Visa” were “better” than “Amex”. This is incorrect if no real order exists.
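✅ When to use: Only for ORDINAL variables (e.g., “Low”, “Medium”, “High”), and only when the integers follow the real order. LabelEncoder sorts classes alphabetically (High=0, Low=1, Medium=2) and is intended by scikit-learn for target labels, so set the order explicitly for features, for example with OrdinalEncoder. Tree-based models are the other acceptable case: they split on thresholds rather than assuming a linear relationship, so arbitrary integer codes hurt them less.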
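To make the ordinal case concrete, here is a minimal sketch on toy data. The column name `nivel` and its values are made up for illustration. It contrasts the alphabetical codes produced by LabelEncoder with the codes produced by OrdinalEncoder when the intended order is passed explicitly.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Toy data; the column name "nivel" is hypothetical.
df_demo = pd.DataFrame({"nivel": ["Low", "High", "Medium", "Low"]})

# LabelEncoder sorts classes alphabetically: High=0, Low=1, Medium=2.
alphabetical = LabelEncoder().fit_transform(df_demo["nivel"])

# OrdinalEncoder takes the intended order explicitly: Low=0, Medium=1, High=2.
encoder = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
ordered = encoder.fit_transform(df_demo[["nivel"]]).ravel().astype(int)

print(alphabetical.tolist())  # [1, 0, 2, 1]
print(ordered.tolist())       # [0, 2, 1, 0]
```

Only the second result preserves Low < Medium < High, which is what a model that reads the integers as magnitudes needs.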
One-hot encoding creates a binary column (0/1) for each category.
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
ohe = OneHotEncoder(sparse_output=False, drop='first') # drop='first' avoids multicollinearity
categorias_encoded = ohe.fit_transform(df[['tipo_tarjeta']])
# Convert to DataFrame with names
df_ohe = pd.DataFrame(categorias_encoded, columns=ohe.get_feature_names_out(['tipo_tarjeta']))
df = pd.concat([df.reset_index(drop=True), df_ohe], axis=1)
✅ Advantages:
⚠️ Disadvantages:
Target encoding replaces each category with the mean of the target variable for that category.
# Naive example: tipo_tarjeta → mean of "es_fraude" for that card type, computed on all rows
target_mean = df.groupby('tipo_tarjeta')['es_fraude'].mean()
df['tipo_tarjeta_target_encoded'] = df['tipo_tarjeta'].map(target_mean)
✅ Advantages:
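⚠️ Risks: Target leakage and overfitting. Each row's encoding includes its own target value, and categories with few rows get unreliable means.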
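✅ Solution: Compute the means out-of-fold (cross-validation) so that no row is encoded with its own target, or add noise to the encoded values. In both cases, compute the means on the training set only.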
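The cross-validation idea can be sketched as follows. This is one possible implementation, not a definitive one: the helper name `target_encode_oof`, its parameters, and the toy data are all assumptions made for illustration. Each row is encoded with category means computed on the other folds, and categories unseen in those folds fall back to the overall mean of those folds.

```python
import pandas as pd
from sklearn.model_selection import KFold

def target_encode_oof(df, cat_col, target_col, n_splits=5, seed=0):
    """Hypothetical helper: encode cat_col with target means from the other folds only."""
    encoded = pd.Series(index=df.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(df):
        fit_part = df.iloc[fit_idx]
        fold_means = fit_part.groupby(cat_col)[target_col].mean()
        # Categories unseen in the fitting folds fall back to those folds' overall mean.
        values = df.iloc[enc_idx][cat_col].map(fold_means).fillna(fit_part[target_col].mean())
        encoded.iloc[enc_idx] = values.to_numpy()
    return encoded

# Toy data for illustration only; "Diners" appears once on purpose.
toy = pd.DataFrame({
    "tipo_tarjeta": ["Visa", "Visa", "Visa", "Visa", "Mastercard",
                     "Mastercard", "Mastercard", "Amex", "Amex", "Diners"],
    "es_fraude": [0, 0, 1, 0, 0, 1, 0, 0, 0, 1],
})
toy["tipo_tarjeta_target_encoded"] = target_encode_oof(toy, "tipo_tarjeta", "es_fraude")
print(toy)
```

With the naive approach, the single “Diners” row would be encoded as 1.0, which is exactly its own label. Out-of-fold, it receives the fallback mean instead, so the feature no longer reveals the target.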
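import matplotlib.pyplot as plt
import seaborn as sns
# Assumes df already contains the encoded columns created in the previous steps
fig, ax = plt.subplots(1, 3, figsize=(15, 5))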
# Original
sns.countplot(data=df, x='tipo_tarjeta', ax=ax[0])
ax[0].set_title("Original")
# Label Encoded
sns.histplot(df['tipo_tarjeta_encoded'], bins=3, ax=ax[1])
ax[1].set_title("Label Encoding")
# One-Hot (show one column)
sns.histplot(df['tipo_tarjeta_Mastercard'], bins=2, ax=ax[2])
ax[2].set_title("One-Hot (e.g., Mastercard)")
plt.tight_layout()
plt.show()
Dataset: fraud_clean.csv (from previous module)
Tasks:
1. Encode the nivel_riesgo column as an ordinal variable (assume the order "Bajo" < "Medio" < "Alto").
2. One-hot encode tipo_tarjeta and pais. Use drop='first'.
3. Target encode ciudad (use only the training set to compute the means, to avoid data leakage).
4. Tip: use OrdinalEncoder for ordinal variables (better than LabelEncoder for multiple columns).
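For the ciudad task, the train-only computation might look like the sketch below. It uses made-up toy rows rather than fraud_clean.csv, and it assumes the target column is es_fraude, as in the earlier example. The means are computed on the training split and then mapped onto both splits, with the training mean as a fallback for cities that appear only in the test split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy rows for illustration only; the real exercise uses fraud_clean.csv.
toy = pd.DataFrame({
    "ciudad": ["Madrid", "Madrid", "Sevilla", "Sevilla",
               "Bilbao", "Madrid", "Sevilla", "Valencia"],
    "es_fraude": [0, 1, 0, 0, 1, 0, 1, 0],
})

train, test = train_test_split(toy, test_size=0.25, random_state=0)
train, test = train.copy(), test.copy()

# Fit the encoding on the training split only.
city_means = train.groupby("ciudad")["es_fraude"].mean()
fallback = train["es_fraude"].mean()

train["ciudad_target_encoded"] = train["ciudad"].map(city_means)
# Cities never seen in training get the training-set mean.
test["ciudad_target_encoded"] = test["ciudad"].map(city_means).fillna(fallback)

print(test)
```

The test split never contributes to the means, which is the leakage the task warns about. Inside the training split, each row still sees its own label, so an out-of-fold scheme can be layered on top if overfitting is a concern.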