📘 Lesson 4: Train Your First Model! – From Theory to Practice with Scikit-learn

"A model without evaluation is like an exam without a grade: you don't know if you passed... or if you just got lucky."


โฑ๏ธ Estimated duration of this lesson: 75-90 minutes


🧭 Why is this lesson so important?

Because this is where you stop trusting and start verifying.

You trained a model.
It made predictions.
But that doesn't mean it's good!

Many beginners get excited about a "98% accuracy"... and then discover that their model fails in the most important cases.

In this lesson, you'll learn:

  • Why accuracy isn't everything (especially with imbalanced data!).
  • What a confusion matrix is and how to read it like a professional.
  • What precision, recall, and F1-score mean... and when to use each one.
  • How to interpret probabilities, not just labels.
  • How to avoid fooling yourself with superficial metrics.

โš ๏ธ Friendly warning: This lesson will make you question everything you thought you knew about "good models." But that's good. Humility is the mother of improvement.


🎯 Objectives of this lesson

By the end, you'll be able to:

✅ Calculate and interpret your model's accuracy.
✅ Build and understand a confusion matrix.
✅ Calculate and interpret precision, recall, and F1-score.
✅ Know when to use each metric according to the problem.
✅ Evaluate not just predictions, but probabilities.
✅ Detect if your model is "dumb" or truly intelligent.
✅ Feel comfortable making decisions based on metrics, not intuition.


๐Ÿ› ๏ธ Tools you'll use

  • Scikit-learn โ†’ accuracy_score, confusion_matrix, classification_report, precision_recall_curve.
  • Matplotlib / Seaborn โ†’ To visualize metrics.
  • Pandas โ†’ To manipulate results.

💡 Make sure you have your trained model and your predictions (y_test, y_pred) ready. If not, here's the quick code to catch up:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Load and prepare data
url = "https://raw.githubusercontent.com/justmarkham/DAT8/master/data/sms.tsv"
data = pd.read_csv(url, sep='\t', names=['label', 'message'])
data['label_encoded'] = data['label'].map({'ham': 0, 'spam': 1})

X = data['message']
y = data['label_encoded']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Train model
model = MultinomialNB()
model.fit(X_train_vec, y_train)

# Predict
y_pred = model.predict(X_test_vec)

📊 Part 1: The Accuracy Illusion – Is it Really 98%?

Let's start with the most popular... and most dangerous metric.


🔹 Step 1: Calculate the accuracy

from sklearn.metrics import accuracy_score

acc = accuracy_score(y_test, y_pred)
print(f"Accuracy: {acc:.4f} → {acc*100:.2f}%")

📌 Typical output:

Accuracy: 0.9821 → 98.21%

→ Wow! 98% correct. Does this mean the model is excellent?

NO! And here's why.


๐Ÿ” The problem with imbalanced data

Remember: in our dataset, only 13.4% are spam. 86.6% are ham.

Imagine a dumb model that always predicts "ham".
What would its accuracy be?

Accuracy = (True Hams) / (Total) = 955 / 1115 ≈ 85.65%

→ A model that never detects spam would have 85.65% accuracy!

Your model has 98.21% → it's better than the dumb one... but how much better at what really matters: detecting spam?

📌 Conclusion: Accuracy deceives when there's imbalance. You need smarter metrics.
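
You can verify this baseline yourself. Here's a minimal sketch (reusing the training data and y_test from the catch-up code above) with scikit-learn's DummyClassifier, which reproduces the "always predict ham" strategy:

from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# A "dumb" baseline that always predicts the most frequent class (ham)
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train_vec, y_train)

y_pred_dummy = dummy.predict(X_test_vec)
print(f"Baseline accuracy: {accuracy_score(y_test, y_pred_dummy):.4f}")
# On this split it lands around 0.8565, i.e. the share of ham messages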


🧩 Part 2: The Confusion Matrix – Your Error Microscope

This is where you see exactly which cases your model gets right and exactly where it goes wrong.


🔹 Step 2: Build the confusion matrix

from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

cm = confusion_matrix(y_test, y_pred)

# Visualize
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Ham (Pred)', 'Spam (Pred)'], 
            yticklabels=['Ham (Real)', 'Spam (Real)'])
plt.title("Confusion Matrix - Spam Classifier", fontsize=16)
plt.ylabel("True Label", fontsize=12)
plt.xlabel("Predicted Label", fontsize=12)
plt.show()

📌 Typical output (approximate values):

          Predicted
          Ham  Spam
Real Ham   950    5
Real Spam   15  145

๐Ÿ” How to read this matrix?

  • True Negatives (TN): 950
    โ†’ Ham messages that the model said were ham. โœ… Perfect!

  • False Positives (FP): 5
    โ†’ Ham messages that the model said were spam. โŒ Serious error! (Marking an important message as spam).

  • False Negatives (FN): 15
    โ†’ Spam messages that the model said were ham. โŒ Serious error! (Letting spam through).

  • True Positives (TP): 145
    โ†’ Spam messages that the model said were spam. โœ… Perfect!

๐Ÿ“Œ This is gold! Now you know where your model fails. It's not an abstract number... these are concrete errors you can improve.
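
If you prefer the raw numbers over a plot, here's a quick sketch using the cm computed above. For a binary problem, scikit-learn lays the matrix out row by row as TN, FP, FN, TP:

# Unpack the four cells of the binary confusion matrix
tn, fp, fn, tp = cm.ravel()
print(f"TN={tn}  FP={fp}  FN={fn}  TP={tp}")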


🎯 Part 3: Precision, Recall, and F1-Score – The 3 Metrics That Matter

Now, let's quantify those errors with professional metrics.


🔹 Step 3: Calculate the classification report

from sklearn.metrics import classification_report

report = classification_report(y_test, y_pred, 
                               target_names=['Ham', 'Spam'], 
                               output_dict=False)
print(report)

📌 Typical output:

              precision    recall  f1-score   support

         Ham       0.98      0.99      0.99       955
        Spam       0.97      0.91      0.94       160

    accuracy                           0.98      1115
   macro avg       0.98      0.95      0.96      1115
weighted avg       0.98      0.98      0.98      1115

๐Ÿ” What do these metrics mean?

1. Precision

"Of all those I said were spam, how many really were?"

Precision (Spam) = TP / (TP + FP) = 145 / (145 + 5) = 145/150 ≈ 0.97

→ 97% precision for spam: when the model says "spam", it's right 97% of the time. Excellent!

📌 When does precision matter?
When the cost of a false positive is high.
Example: Marking an important email as spam → the user might lose critical information.
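
To double-check the arithmetic, here's a short sketch (reusing y_test and y_pred from above) with scikit-learn's precision_score:

from sklearn.metrics import precision_score

# Precision for the spam class (label 1): TP / (TP + FP)
print(f"Spam precision: {precision_score(y_test, y_pred, pos_label=1):.2f}")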


2. Recall (Sensitivity, True Positive Rate)

"Of all the spam that existed, how many did I detect?"

Recall (Spam) = TP / (TP + FN) = 145 / (145 + 15) = 145/160 ≈ 0.91

→ 91% recall for spam: it detected 91% of all the spam. Very good!

📌 When does recall matter?
When the cost of a false negative is high.
Example: Letting fraudulent spam through → the user might click and lose money.
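
The same check works for recall. A small sketch with recall_score:

from sklearn.metrics import recall_score

# Recall for the spam class (label 1): TP / (TP + FN)
print(f"Spam recall: {recall_score(y_test, y_pred, pos_label=1):.2f}")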


3. F1-Score

"The harmonic mean of precision and recall. Ideal when you want balance."

F1 = 2 * (Precision * Recall) / (Precision + Recall)
   = 2 * (0.97 * 0.91) / (0.97 + 0.91) ≈ 0.94

→ 94% F1-score: a good balance between not bothering the user (precision) and protecting them (recall).

📌 When to use F1?
When you don't know which of the two matters more, or when you want a single metric that summarizes performance on imbalanced classes.
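
And a quick sketch to confirm the harmonic-mean arithmetic against scikit-learn's f1_score:

from sklearn.metrics import f1_score

# F1 for the spam class: harmonic mean of precision and recall
p, r = 0.97, 0.91  # rounded values from the report above
print(f"Manual F1:  {2 * p * r / (p + r):.2f}")
print(f"Sklearn F1: {f1_score(y_test, y_pred, pos_label=1):.2f}")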


📈 Part 4: Beyond Labels – Evaluating Probabilities

Your model doesn't just predict "spam" or "ham." It also gives you probabilities.

This is powerful. Because sometimes, you don't want a binary decision... you want to know how sure the model is.


🔹 Step 4: Get probabilities

# Get probabilities for each class
y_proba = model.predict_proba(X_test_vec)

# For spam (class 1), it's the second column
y_proba_spam = y_proba[:, 1]

# View the first 10 probabilities
for i in range(10):
    print(f"Message {i+1}: Spam probability = {y_proba_spam[i]:.4f} → Prediction: {'Spam' if y_pred[i] == 1 else 'Ham'}")

📌 Typical output:

Message 1: Spam probability = 0.0002 → Prediction: Ham
Message 2: Spam probability = 0.9998 → Prediction: Spam
Message 3: Spam probability = 0.0015 → Prediction: Ham
...

→ Amazing! The model doesn't just say "spam," it tells you "I'm 99.98% sure."
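
One practical use of these probabilities is to flag the messages the model is unsure about. A minimal sketch (the 0.3–0.7 band is an arbitrary choice for illustration):

import numpy as np

# Messages where the model is "on the fence" between ham and spam
uncertain = np.where((y_proba_spam > 0.3) & (y_proba_spam < 0.7))[0]
print(f"Uncertain predictions: {len(uncertain)} of {len(y_proba_spam)}")
print(X_test.iloc[uncertain[:5]])  # peek at a few borderline messages, if any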


🔹 Step 5: Precision-Recall Curve (optional, but revealing)

What happens if you change the decision threshold?

By default, if probability > 0.5 → spam.
But what if you use 0.7? Or 0.3?

from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt

precision, recall, thresholds = precision_recall_curve(y_test, y_proba_spam)

plt.figure(figsize=(10, 6))
plt.plot(recall, precision, marker='.', label='Naive Bayes')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend()
plt.grid(True)
plt.show()

📌 What do you see?
A curve that shows the trade-off between precision and recall for different thresholds.
Ideal for choosing a threshold that fits your needs (more precision or more recall).
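
Once you've chosen a threshold from the curve, applying it takes one line. Here's a sketch (0.7 is just an example value) that typically trades some recall for extra precision:

from sklearn.metrics import precision_score, recall_score

# Apply a stricter threshold than the default 0.5
threshold = 0.7
y_pred_strict = (y_proba_spam >= threshold).astype(int)

print(f"Precision at {threshold}: {precision_score(y_test, y_pred_strict):.2f}")
print(f"Recall at {threshold}:    {recall_score(y_test, y_pred_strict):.2f}")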


โŒ Common Errors in this Lesson (Avoid Them!)

  1. Trusting only accuracy → You miss critical errors.
  2. Ignoring the confusion matrix → You don't see where the model fails.
  3. Not understanding the difference between precision and recall → You make wrong decisions.
  4. Forgetting that metrics depend on the problem → In medicine, recall is vital; in advertising, precision.
  5. Not using probabilities → You lose valuable information about model confidence.

✅ Checklist for this lesson – What should you know how to do now?

โ˜ Calculate and interpret accuracy.
โ˜ Build and read a confusion matrix.
โ˜ Calculate and interpret precision, recall, and F1-score.
โ˜ Know when to prioritize precision vs recall according to the problem.
โ˜ Get and analyze prediction probabilities.
โ˜ Understand that a "good" model depends on context, not just a number.
โ˜ Feel comfortable evaluating models with professional rigor.


🎯 Quote to remember:

"Don't measure your model by how much it gets right... measure it by how much what it gets right matters."


โ† Previous: Lesson 3: Data Exploration | Next: Lesson 5: Evaluate Your Model โ†’
