Apply EVERYTHING learned in a realistic integrative project: from raw data loading to a model evaluated with business metrics, including preprocessing, encoding, scaling, feature selection, and robust validation.
Dataset: transacciones_fraude.csv
Simulated features:
- monto_transaccion (float)
- edad_cliente (int)
- tipo_tarjeta (categorical: “Visa”, “Mastercard”, “Amex”)
- pais_origen (categorical)
- hora_del_dia (int, 0-23)
- dias_desde_ultima_transaccion (int)
- es_fraude (bool: 0 or 1) → only 1.5% fraud!

import pandas as pd

# Load and explore
df = pd.read_csv("transacciones_fraude.csv")
print(df.info())
print(df.isnull().sum())
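Because only about 1.5% of transactions are fraud, it is worth confirming the class imbalance right away. A quick check, assuming the column names listed above:

# Confirm the class imbalance on the target
print(df['es_fraude'].value_counts(normalize=True))  # expect roughly 98.5% legit / 1.5% fraud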
# Impute age with median, card type with mode
df['edad_cliente'] = df['edad_cliente'].fillna(df['edad_cliente'].median())
df['tipo_tarjeta'] = df['tipo_tarjeta'].fillna(df['tipo_tarjeta'].mode()[0])
# Cap transaction amount using IQR
Q1 = df['monto_transaccion'].quantile(0.25)
Q3 = df['monto_transaccion'].quantile(0.75)
IQR = Q3 - Q1
lim_inf, lim_sup = Q1 - 1.5*IQR, Q3 + 1.5*IQR
df['monto_transaccion'] = df['monto_transaccion'].clip(lim_inf, lim_sup)
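An optional sanity check after the capping: the minimum and maximum of the clipped column should now sit inside the IQR fences.

# Optional sanity check: after .clip(), min/max are bounded by the fences
print(df['monto_transaccion'].describe()[['min', 'max']])
print(f"Fences: [{lim_inf:.2f}, {lim_sup:.2f}]")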
# Encoding
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
# Ordinal variables would use OrdinalEncoder (none here; LabelEncoder is meant for targets)
# One-Hot for nominal categorical variables
categorical_features = ['tipo_tarjeta', 'pais_origen']
numeric_features = ['monto_transaccion', 'edad_cliente', 'hora_del_dia', 'dias_desde_ultima_transaccion']
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(drop='first', handle_unknown='ignore'), categorical_features)
    ])
X = df.drop('es_fraude', axis=1)
y = df['es_fraude']
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
# Split first, then fit the preprocessor on the training data only (avoids data leakage)
X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
X_train = preprocessor.fit_transform(X_train_raw)
X_test = preprocessor.transform(X_test_raw)
# Feature selection
selector = SelectKBest(score_func=f_classif, k=10)
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)
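To see which of the scaled/one-hot columns SelectKBest actually kept, the selector's mask can be mapped back to the transformer's output names. A small sketch, assuming scikit-learn ≥ 1.0 for get_feature_names_out:

# Recover the names of the k features that survived selection
feature_names = preprocessor.get_feature_names_out()
selected_names = feature_names[selector.get_support()]
print("Selected features:", list(selected_names))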
# Train model
modelo = LogisticRegression(penalty='l2', C=1.0, random_state=42)
modelo.fit(X_train_selected, y_train)
from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
y_pred = modelo.predict(X_test_selected)
y_proba = modelo.predict_proba(X_test_selected)[:,1]
print("=== CLASSIFICATION REPORT ===")
print(classification_report(y_test, y_pred, target_names=['Legítimo', 'Fraude']))
print(f"\nAUC-ROC: {roc_auc_score(y_test, y_proba):.4f}")
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Legítimo', 'Fraude'],
            yticklabels=['Legítimo', 'Fraude'])
plt.title("Confusion Matrix - Fraud Detection")
plt.ylabel("Actual")
plt.xlabel("Predicted")
plt.show()
# Cross-validation for confidence
cv_scores = cross_val_score(modelo, X_train_selected, y_train, cv=StratifiedKFold(5), scoring='roc_auc')
print(f"\nCross-Validation AUC-ROC: {cv_scores.mean():.4f} (+/- {cv_scores.std()*2:.4f})")
from sklearn.metrics import precision_recall_curve
# Find threshold maximizing F1 or recall
precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
# Calculate F1 for each threshold (precision/recall have one more point than thresholds; guard against 0/0)
f1_scores = 2 * (precision[:-1] * recall[:-1]) / (precision[:-1] + recall[:-1] + 1e-12)
best_threshold = thresholds[np.argmax(f1_scores)]
print(f"Best threshold for F1: {best_threshold:.3f}")
# Predict with new threshold
y_pred_new = (y_proba >= best_threshold).astype(int)
print("\n=== WITH ADJUSTED THRESHOLD ===")
print(classification_report(y_test, y_pred_new, target_names=['Legítimo', 'Fraude']))
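Because the goal is a model judged on business impact, the threshold can also be chosen to minimize expected cost rather than to maximize F1. A minimal sketch with hypothetical costs (a missed fraud at 100, a false alarm at 5; both numbers are illustrative, not derived from the dataset):

# Hypothetical business costs: a missed fraud hurts far more than a manual review
cost_fn, cost_fp = 100, 5
costs = []
for t in thresholds:
    pred_t = (y_proba >= t).astype(int)
    fn = ((y_test == 1) & (pred_t == 0)).sum()
    fp = ((y_test == 0) & (pred_t == 1)).sum()
    costs.append(fn * cost_fn + fp * cost_fp)
best_cost_threshold = thresholds[np.argmin(costs)]
print(f"Threshold minimizing expected cost: {best_cost_threshold:.3f}")

In practice the threshold (whether F1-based or cost-based) should be picked on a validation split, not on the test set used for the final report.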
✅ Preprocessing isn’t a necessary evil—it’s your superpower.
✅ Never trust accuracy in imbalanced problems. Use AUC-ROC, Recall, F1.
✅ Stratified cross-validation is your ally for building robust models.
✅ Document every step. Your future self (and your team) will thank you.
✅ In production, monitor not just accuracy, but your data distribution. It can drift over time! (A simple drift check is sketched below.)
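One lightweight way to watch for drift is the Population Stability Index (PSI) between the training distribution of a feature and incoming production data. A minimal sketch; new_batch is a hypothetical DataFrame of recent transactions, and the 0.2 alert level is a common rule of thumb, not a hard threshold:

import numpy as np

def psi(reference, current, bins=10):
    """Population Stability Index between two numeric samples."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    ref_pct = np.clip(ref_pct, 1e-6, None)  # avoid log(0) on empty bins
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct))

# new_batch is a hypothetical DataFrame of recent production transactions
drift = psi(df['monto_transaccion'], new_batch['monto_transaccion'])
print(f"PSI for monto_transaccion: {drift:.3f}  (above ~0.2 usually warrants investigation)")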