Apply EVERYTHING learned in a realistic integrative project: from raw data loading to a model evaluated with business metrics, including preprocessing, encoding, scaling, feature selection, and robust validation.
Dataset: transacciones_fraude.csv
Simulated features:
- monto_transaccion (float)
- edad_cliente (int)
- tipo_tarjeta (categorical: “Visa”, “Mastercard”, “Amex”)
- pais_origen (categorical)
- hora_del_dia (int, 0-23)
- dias_desde_ultima_transaccion (int)
- es_fraude (bool: 0 or 1) → only 1.5% fraud!

import pandas as pd

# Load and explore
df = pd.read_csv("transacciones_fraude.csv")
print(df.info())
print(df.isnull().sum())
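Because only about 1.5% of transactions are fraud, it is worth confirming the class imbalance right away. A quick check, assuming the column names listed above:

# Confirm the class imbalance on the target
print(df['es_fraude'].value_counts(normalize=True))  # expect roughly 98.5% legit / 1.5% fraud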
# Impute age with median, card type with mode
df['edad_cliente'] = df['edad_cliente'].fillna(df['edad_cliente'].median())
df['tipo_tarjeta'] = df['tipo_tarjeta'].fillna(df['tipo_tarjeta'].mode()[0])
# Cap transaction amount using IQR
Q1 = df['monto_transaccion'].quantile(0.25)
Q3 = df['monto_transaccion'].quantile(0.75)
IQR = Q3 - Q1
lim_inf, lim_sup = Q1 - 1.5*IQR, Q3 + 1.5*IQR
df['monto_transaccion'] = df['monto_transaccion'].clip(lim_inf, lim_sup)
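An optional sanity check after the capping: the minimum and maximum of the clipped column should now sit inside the IQR fences.

# Optional sanity check: after .clip(), min/max are bounded by the fences
print(df['monto_transaccion'].describe()[['min', 'max']])
print(f"Fences: [{lim_inf:.2f}, {lim_sup:.2f}]")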
# Encoding
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
# Ordinal variables would use OrdinalEncoder (none here; LabelEncoder is meant for targets)
# One-Hot for nominal categorical variables
categorical_features = ['tipo_tarjeta', 'pais_origen']
numeric_features = ['monto_transaccion', 'edad_cliente', 'hora_del_dia', 'dias_desde_ultima_transaccion']
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(drop='first', handle_unknown='ignore'), categorical_features)
    ])
X = df.drop('es_fraude', axis=1)
y = df['es_fraude']
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
# Split first, then fit the preprocessor on the training data only (avoids data leakage)
X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
X_train = preprocessor.fit_transform(X_train_raw)
X_test = preprocessor.transform(X_test_raw)
# Feature selection
selector = SelectKBest(score_func=f_classif, k=10)
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)
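To see which of the scaled/one-hot columns SelectKBest actually kept, the selector's mask can be mapped back to the transformer's output names. A small sketch, assuming scikit-learn ≥ 1.0 for get_feature_names_out:

# Recover the names of the k features that survived selection
feature_names = preprocessor.get_feature_names_out()
selected_names = feature_names[selector.get_support()]
print("Selected features:", list(selected_names))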
# Train model
modelo = LogisticRegression(penalty='l2', C=1.0, random_state=42)
modelo.fit(X_train_selected, y_train)
from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
y_pred = modelo.predict(X_test_selected)
y_proba = modelo.predict_proba(X_test_selected)[:,1]
print("=== CLASSIFICATION REPORT ===")
print(classification_report(y_test, y_pred, target_names=['Legítimo', 'Fraude']))
print(f"\nAUC-ROC: {roc_auc_score(y_test, y_proba):.4f}")
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Legítimo', 'Fraude'],
            yticklabels=['Legítimo', 'Fraude'])
plt.title("Confusion Matrix - Fraud Detection")
plt.ylabel("Actual")
plt.xlabel("Predicted")
plt.show()
# Cross-validation for confidence
cv_scores = cross_val_score(modelo, X_train_selected, y_train, cv=StratifiedKFold(5), scoring='roc_auc')
print(f"\nCross-Validation AUC-ROC: {cv_scores.mean():.4f} (+/- {cv_scores.std()*2:.4f})")
from sklearn.metrics import precision_recall_curve
# Find threshold maximizing F1 or recall
precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
# Calculate F1 for each threshold (precision/recall have one more point than thresholds; guard against 0/0)
f1_scores = 2 * (precision[:-1] * recall[:-1]) / (precision[:-1] + recall[:-1] + 1e-12)
best_threshold = thresholds[np.argmax(f1_scores)]
print(f"Best threshold for F1: {best_threshold:.3f}")
# Predict with new threshold
y_pred_new = (y_proba >= best_threshold).astype(int)
print("\n=== WITH ADJUSTED THRESHOLD ===")
print(classification_report(y_test, y_pred_new, target_names=['Legítimo', 'Fraude']))
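Because the goal is a model judged on business impact, the threshold can also be chosen to minimize expected cost rather than to maximize F1. A minimal sketch with hypothetical costs (a missed fraud at 100, a false alarm at 5; both numbers are illustrative, not derived from the dataset):

# Hypothetical business costs: a missed fraud hurts far more than a manual review
cost_fn, cost_fp = 100, 5
costs = []
for t in thresholds:
    pred_t = (y_proba >= t).astype(int)
    fn = ((y_test == 1) & (pred_t == 0)).sum()
    fp = ((y_test == 0) & (pred_t == 1)).sum()
    costs.append(fn * cost_fn + fp * cost_fp)
best_cost_threshold = thresholds[np.argmin(costs)]
print(f"Threshold minimizing expected cost: {best_cost_threshold:.3f}")

In practice the threshold (whether F1-based or cost-based) should be picked on a validation split, not on the test set used for the final report.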
✅ Preprocessing isn’t a necessary evil—it’s your superpower.
✅ Never trust accuracy in imbalanced problems. Use AUC-ROC, Recall, F1.
✅ Stratified cross-validation is your ally for building robust models.
✅ Document every step. Your future self (and your team) will thank you.
✅ In production, monitor not just accuracy, but your data distribution. It can drift over time! (A simple drift check is sketched below.)
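One lightweight way to watch for drift is the Population Stability Index (PSI) between the training distribution of a feature and incoming production data. A minimal sketch; new_batch is a hypothetical DataFrame of recent transactions, and the 0.2 alert level is a common rule of thumb, not a hard threshold:

import numpy as np

def psi(reference, current, bins=10):
    """Population Stability Index between two numeric samples."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    ref_pct = np.clip(ref_pct, 1e-6, None)  # avoid log(0) on empty bins
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct))

# new_batch is a hypothetical DataFrame of recent production transactions
drift = psi(df['monto_transaccion'], new_batch['monto_transaccion'])
print(f"PSI for monto_transaccion: {drift:.3f}  (above ~0.2 usually warrants investigation)")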