"Don't measure your model by how much it gets right... measure it by how much what it gets right matters."
Because this is where you stop trusting and start verifying.
You trained a model.
It made predictions.
But that doesn't mean it's good!
Many beginners get excited about a "98% accuracy"... and then discover that their model fails in the cases that matter most.
⚠️ Friendly warning: This lesson will make you question everything you thought you knew about "good models." But that's good. Humility is the mother of improvement.
In this lesson, you'll learn to evaluate your model with real metrics instead of gut feeling. By the end, you'll be able to:
✅ Calculate and interpret your model's accuracy.
✅ Build and understand a confusion matrix.
✅ Calculate and interpret precision, recall, and F1-score.
✅ Know when to use each metric according to the problem.
✅ Evaluate not just predictions, but probabilities.
✅ Detect if your model is "dumb" or truly intelligent.
✅ Feel comfortable making decisions based on metrics, not intuition.
You'll use these scikit-learn tools: accuracy_score, confusion_matrix, classification_report, and precision_recall_curve.
💡 Make sure you have your model trained and your predictions (y_test, y_pred) ready from Lesson 4. If not, here's the quick code to catch up:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
# Load and prepare data
url = "https://raw.githubusercontent.com/justmarkham/DAT8/master/data/sms.tsv"
data = pd.read_csv(url, sep='\t', names=['label', 'message'])
data['label_encoded'] = data['label'].map({'ham': 0, 'spam': 1})
X = data['message']
y = data['label_encoded']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)
# Train model
model = MultinomialNB()
model.fit(X_train_vec, y_train)
# Predict
y_pred = model.predict(X_test_vec)
Let's start with the most popular... and most dangerous metric.
from sklearn.metrics import accuracy_score
acc = accuracy_score(y_test, y_pred)
print(f"Accuracy: {acc:.4f} โ {acc*100:.2f}%")
📊 Typical output:
Accuracy: 0.9821 → 98.21%
→ Wow! 98% correct. Does this mean the model is excellent?
NO! And here's why.
Remember: in our dataset, only 13.4% are spam. 86.6% are ham.
Imagine a dumb model that always predicts "ham".
What would its accuracy be?
Accuracy = (True Hams) / (Total) = 955 / 1115 ≈ 85.65%
→ A model that never detects spam would still have 85.65% accuracy!
Your model has 98.21% → better than the dumb one... but how much better at what really matters: detecting spam?
📌 Conclusion: accuracy deceives when classes are imbalanced. You need smarter metrics.
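To see this in code, here's a minimal sketch (reusing the variables from the catch-up code above) that compares your classifier against a baseline that always predicts the majority class, using scikit-learn's DummyClassifier:
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
# Baseline that always predicts the majority class ("ham")
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train_vec, y_train)
dummy_pred = dummy.predict(X_test_vec)
print(f"Dumb baseline accuracy: {accuracy_score(y_test, dummy_pred):.4f}")
print(f"Your model's accuracy:  {accuracy_score(y_test, y_pred):.4f}")
If the two numbers are close, accuracy alone is telling you very little.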
This is where you see exactly what successes and what errors your model is making.
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
cm = confusion_matrix(y_test, y_pred)
# Visualize
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Ham (Pred)', 'Spam (Pred)'],
            yticklabels=['Ham (Real)', 'Spam (Real)'])
plt.title("Confusion Matrix - Spam Classifier", fontsize=16)
plt.ylabel("True Label", fontsize=12)
plt.xlabel("Predicted Label", fontsize=12)
plt.show()
📊 Typical output (approximate values):
               Predicted
               Ham    Spam
Real Ham       950       5
Real Spam       15     145
True Negatives (TN): 950 → Ham messages that the model said were ham. ✅ Perfect!
False Positives (FP): 5 → Ham messages that the model said were spam. ❌ Serious error! (Marking an important message as spam.)
False Negatives (FN): 15 → Spam messages that the model said were ham. ❌ Serious error! (Letting spam through.)
True Positives (TP): 145 → Spam messages that the model said were spam. ✅ Perfect!
📌 This is gold! Now you know where your model fails. It's not an abstract number... these are concrete errors you can improve.
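If you prefer the raw numbers over a plot, a small sketch (using the cm computed above): for binary labels 0/1, scikit-learn lays the matrix out as [[TN, FP], [FN, TP]], so you can unpack the four cells directly.
# Unpack the four cells: TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()
print(f"TN={tn}  FP={fp}  FN={fn}  TP={tp}")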
Now, let's quantify those errors with professional metrics.
from sklearn.metrics import classification_report
report = classification_report(y_test, y_pred,
                               target_names=['Ham', 'Spam'],
                               output_dict=False)
print(report)
📊 Typical output:
              precision    recall  f1-score   support

         Ham       0.98      0.99      0.99       955
        Spam       0.97      0.91      0.94       160

    accuracy                           0.98      1115
   macro avg       0.98      0.95      0.96      1115
weighted avg       0.98      0.98      0.98      1115
"Of all those I said were spam, how many really were?"
Precision (Spam) = TP / (TP + FP) = 145 / (145 + 5) = 145/150 โ 0.97
โ 97% precision in spam: when the model says "spam", it's right 97% of the time. Excellent!
📌 When does precision matter?
When the cost of a false positive is high.
Example: marking an important email as spam → the user might lose critical information.
"Of all the spam that existed, how many did I detect?"
Recall (Spam) = TP / (TP + FN) = 145 / (145 + 15) = 145/160 โ 0.91
โ 91% recall in spam: it detected 91% of all spam. Very good!
📌 When does recall matter?
When the cost of a false negative is high.
Example: letting fraudulent spam through → the user might click and lose money.
"Harmonic average between precision and recall. Ideal when you want balance."
F1 = 2 * (Precision * Recall) / (Precision + Recall)
= 2 * (0.97 * 0.91) / (0.97 + 0.91) โ 0.94
โ 94% F1-score: a good balance between not bothering the user (precision) and protecting them (recall).
📌 When to use F1?
When you don't know what's more important, or when you want a single metric that summarizes performance on imbalanced classes.
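As a quick sanity check, here's a sketch that computes the three metrics both by hand (straight from the definitions above, reusing the tn, fp, fn, tp unpacked from the confusion matrix earlier) and with scikit-learn's helpers. The two columns should match the classification report.
from sklearn.metrics import precision_score, recall_score, f1_score
# By hand, straight from the definitions
precision_manual = tp / (tp + fp)
recall_manual = tp / (tp + fn)
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)
# With scikit-learn (the default pos_label=1 is our "spam" class)
print(f"Precision: {precision_score(y_test, y_pred):.2f}  (manual: {precision_manual:.2f})")
print(f"Recall:    {recall_score(y_test, y_pred):.2f}  (manual: {recall_manual:.2f})")
print(f"F1-score:  {f1_score(y_test, y_pred):.2f}  (manual: {f1_manual:.2f})")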
Your model doesn't just predict "spam" or "ham." It also gives you probabilities.
This is powerful. Because sometimes, you don't want a binary decision... you want to know how sure the model is.
# Get probabilities for each class
y_proba = model.predict_proba(X_test_vec)
# For spam (class 1), it's the second column
y_proba_spam = y_proba[:, 1]
# View the first 10 probabilities
for i in range(10):
    print(f"Message {i+1}: Spam probability = {y_proba_spam[i]:.4f} → Prediction: {'Spam' if y_pred[i] == 1 else 'Ham'}")
📊 Typical output:
Message 1: Spam probability = 0.0002 → Prediction: Ham
Message 2: Spam probability = 0.9998 → Prediction: Spam
Message 3: Spam probability = 0.0015 → Prediction: Ham
...
→ Amazing! The model doesn't just say "spam," it tells you "I'm 99.98% sure."
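One useful trick with these probabilities (a sketch, assuming the y_proba_spam and X_test variables from above): look at the messages where the model is least sure, i.e. where the spam probability sits closest to 0.5. Those borderline cases are the ones worth reading by hand.
import numpy as np
# Distance from the 0.5 decision boundary: small = uncertain
uncertainty = np.abs(y_proba_spam - 0.5)
most_uncertain = np.argsort(uncertainty)[:5]
for i in most_uncertain:
    print(f"p(spam)={y_proba_spam[i]:.3f} | {X_test.iloc[i][:60]}")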
What happens if you change the decision threshold?
By default, if probability > 0.5 → spam.
But what if you use 0.7? Or 0.3?
from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt
precision, recall, thresholds = precision_recall_curve(y_test, y_proba_spam)
plt.figure(figsize=(10, 6))
plt.plot(recall, precision, marker='.', label='Naive Bayes')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend()
plt.grid(True)
plt.show()
📊 What do you see?
A curve that shows the trade-off between precision and recall for different thresholds.
Ideal for choosing a threshold that fits your needs (more precision or more recall).
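To answer the question above concretely, here's a sketch that applies a custom threshold (0.3 here, chosen purely for illustration) instead of the default 0.5 and recomputes precision and recall. Lowering the threshold flags more messages as spam, which typically raises recall and lowers precision.
from sklearn.metrics import precision_score, recall_score
threshold = 0.3  # lower threshold -> more messages flagged as spam
y_pred_custom = (y_proba_spam >= threshold).astype(int)
print(f"Threshold {threshold}:")
print(f"  Precision: {precision_score(y_test, y_pred_custom):.3f}")
print(f"  Recall:    {recall_score(y_test, y_pred_custom):.3f}")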
You can now:
✅ Calculate and interpret accuracy.
✅ Build and read a confusion matrix.
✅ Calculate and interpret precision, recall, and F1-score.
✅ Know when to prioritize precision vs. recall according to the problem.
✅ Get and analyze prediction probabilities.
✅ Understand that a "good" model depends on context, not just a number.
✅ Feel comfortable evaluating models with professional rigor.
"Don't measure your model by how much it gets right... measure it by how much what it gets right matters."
← Previous: Lesson 4: Train Your First Model | Next: Final Project →