"Don't measure your model by how much it gets right... measure it by how much what it gets right matters."
Because this is where you stop trusting and start verifying.
You trained a model.
It made predictions.
But that doesn't mean it's good!
Many beginners get excited about a "98% accuracy"... and then discover that their model fails in the cases that matter most.
⚠️ Friendly warning: This lesson will make you question everything you thought you knew about "good models." But that's good. Humility is the mother of improvement.
In this lesson, you'll learn to evaluate your model with real metrics instead of gut feeling. By the end, you'll be able to:
✅ Calculate and interpret your model's accuracy.
✅ Build and understand a confusion matrix.
✅ Calculate and interpret precision, recall, and F1-score.
✅ Know when to use each metric according to the problem.
✅ Evaluate not just predictions, but probabilities.
✅ Detect if your model is "dumb" or truly intelligent.
✅ Feel comfortable making decisions based on metrics, not intuition.
You'll use these scikit-learn tools: accuracy_score, confusion_matrix, classification_report, and precision_recall_curve.
💡 Make sure you have your model trained and your predictions (y_test, y_pred) ready from Lesson 4. If not, here's the quick code to catch up:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
# Load and prepare data
url = "https://raw.githubusercontent.com/justmarkham/DAT8/master/data/sms.tsv"
data = pd.read_csv(url, sep='\t', names=['label', 'message'])
data['label_encoded'] = data['label'].map({'ham': 0, 'spam': 1})
X = data['message']
y = data['label_encoded']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)
# Train model
model = MultinomialNB()
model.fit(X_train_vec, y_train)
# Predict
y_pred = model.predict(X_test_vec)
Let's start with the most popular... and most dangerous metric.
from sklearn.metrics import accuracy_score
acc = accuracy_score(y_test, y_pred)
print(f"Accuracy: {acc:.4f} โ {acc*100:.2f}%")
📊 Typical output:
Accuracy: 0.9821 → 98.21%
→ Wow! 98% correct. Does this mean the model is excellent?
NO! And here's why.
Remember: in our dataset, only 13.4% are spam. 86.6% are ham.
Imagine a dumb model that always predicts "ham".
What would its accuracy be?
Accuracy = (True Hams) / (Total) = 955 / 1115 ≈ 85.65%
→ A model that never detects spam would still have 85.65% accuracy!
Your model has 98.21% → better than the dumb one... but how much better at what really matters: detecting spam?
📌 Conclusion: accuracy deceives when classes are imbalanced. You need smarter metrics.
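To see this in code, here's a minimal sketch (reusing the variables from the catch-up code above) that compares your classifier against a baseline that always predicts the majority class, using scikit-learn's DummyClassifier:
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
# Baseline that always predicts the majority class ("ham")
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train_vec, y_train)
dummy_pred = dummy.predict(X_test_vec)
print(f"Dumb baseline accuracy: {accuracy_score(y_test, dummy_pred):.4f}")
print(f"Your model's accuracy:  {accuracy_score(y_test, y_pred):.4f}")
If the two numbers are close, accuracy alone is telling you very little.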
This is where you see exactly what successes and what errors your model is making.
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
cm = confusion_matrix(y_test, y_pred)
# Visualize
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Ham (Pred)', 'Spam (Pred)'],
            yticklabels=['Ham (Real)', 'Spam (Real)'])
plt.title("Confusion Matrix - Spam Classifier", fontsize=16)
plt.ylabel("True Label", fontsize=12)
plt.xlabel("Predicted Label", fontsize=12)
plt.show()
📊 Typical output (approximate values):
               Predicted
               Ham    Spam
Real Ham       950       5
Real Spam       15     145
True Negatives (TN): 950 → Ham messages that the model said were ham. ✅ Perfect!
False Positives (FP): 5 → Ham messages that the model said were spam. ❌ Serious error! (Marking an important message as spam.)
False Negatives (FN): 15 → Spam messages that the model said were ham. ❌ Serious error! (Letting spam through.)
True Positives (TP): 145 → Spam messages that the model said were spam. ✅ Perfect!
📌 This is gold! Now you know where your model fails. It's not an abstract number... these are concrete errors you can improve.
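If you prefer the raw numbers over a plot, a small sketch (using the cm computed above): for binary labels 0/1, scikit-learn lays the matrix out as [[TN, FP], [FN, TP]], so you can unpack the four cells directly.
# Unpack the four cells: TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()
print(f"TN={tn}  FP={fp}  FN={fn}  TP={tp}")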
Now, let's quantify those errors with professional metrics.
from sklearn.metrics import classification_report
report = classification_report(y_test, y_pred,
                               target_names=['Ham', 'Spam'],
                               output_dict=False)
print(report)
📊 Typical output:
              precision    recall  f1-score   support

         Ham       0.98      0.99      0.99       955
        Spam       0.97      0.91      0.94       160

    accuracy                           0.98      1115
   macro avg       0.98      0.95      0.96      1115
weighted avg       0.98      0.98      0.98      1115
"Of all those I said were spam, how many really were?"
Precision (Spam) = TP / (TP + FP) = 145 / (145 + 5) = 145/150 โ 0.97
โ 97% precision in spam: when the model says "spam", it's right 97% of the time. Excellent!
📌 When does precision matter?
When the cost of a false positive is high.
Example: marking an important email as spam → the user might lose critical information.
"Of all the spam that existed, how many did I detect?"
Recall (Spam) = TP / (TP + FN) = 145 / (145 + 15) = 145/160 โ 0.91
โ 91% recall in spam: it detected 91% of all spam. Very good!
📌 When does recall matter?
When the cost of a false negative is high.
Example: letting fraudulent spam through → the user might click and lose money.
"Harmonic average between precision and recall. Ideal when you want balance."
F1 = 2 * (Precision * Recall) / (Precision + Recall)
= 2 * (0.97 * 0.91) / (0.97 + 0.91) โ 0.94
โ 94% F1-score: a good balance between not bothering the user (precision) and protecting them (recall).
📌 When to use F1?
When you don't know what's more important, or when you want a single metric that summarizes performance on imbalanced classes.
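As a quick sanity check, here's a sketch that computes the three metrics both by hand (straight from the definitions above, reusing the tn, fp, fn, tp unpacked from the confusion matrix earlier) and with scikit-learn's helpers. The two columns should match the classification report.
from sklearn.metrics import precision_score, recall_score, f1_score
# By hand, straight from the definitions
precision_manual = tp / (tp + fp)
recall_manual = tp / (tp + fn)
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)
# With scikit-learn (the default pos_label=1 is our "spam" class)
print(f"Precision: {precision_score(y_test, y_pred):.2f}  (manual: {precision_manual:.2f})")
print(f"Recall:    {recall_score(y_test, y_pred):.2f}  (manual: {recall_manual:.2f})")
print(f"F1-score:  {f1_score(y_test, y_pred):.2f}  (manual: {f1_manual:.2f})")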
Your model doesn't just predict "spam" or "ham." It also gives you probabilities.
This is powerful. Because sometimes, you don't want a binary decision... you want to know how sure the model is.
# Get probabilities for each class
y_proba = model.predict_proba(X_test_vec)
# For spam (class 1), it's the second column
y_proba_spam = y_proba[:, 1]
# View the first 10 probabilities
for i in range(10):
    print(f"Message {i+1}: Spam probability = {y_proba_spam[i]:.4f} → Prediction: {'Spam' if y_pred[i] == 1 else 'Ham'}")
📊 Typical output:
Message 1: Spam probability = 0.0002 → Prediction: Ham
Message 2: Spam probability = 0.9998 → Prediction: Spam
Message 3: Spam probability = 0.0015 → Prediction: Ham
...
→ Amazing! The model doesn't just say "spam," it tells you "I'm 99.98% sure."
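One useful trick with these probabilities (a sketch, assuming the y_proba_spam and X_test variables from above): look at the messages where the model is least sure, i.e. where the spam probability sits closest to 0.5. Those borderline cases are the ones worth reading by hand.
import numpy as np
# Distance from the 0.5 decision boundary: small = uncertain
uncertainty = np.abs(y_proba_spam - 0.5)
most_uncertain = np.argsort(uncertainty)[:5]
for i in most_uncertain:
    print(f"p(spam)={y_proba_spam[i]:.3f} | {X_test.iloc[i][:60]}")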
What happens if you change the decision threshold?
By default, if probability > 0.5 → spam.
But what if you use 0.7? Or 0.3?
from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt
precision, recall, thresholds = precision_recall_curve(y_test, y_proba_spam)
plt.figure(figsize=(10, 6))
plt.plot(recall, precision, marker='.', label='Naive Bayes')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend()
plt.grid(True)
plt.show()
📊 What do you see?
A curve that shows the trade-off between precision and recall for different thresholds.
Ideal for choosing a threshold that fits your needs (more precision or more recall).
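To answer the question above concretely, here's a sketch that applies a custom threshold (0.3 here, chosen purely for illustration) instead of the default 0.5 and recomputes precision and recall. Lowering the threshold flags more messages as spam, which typically raises recall and lowers precision.
from sklearn.metrics import precision_score, recall_score
threshold = 0.3  # lower threshold -> more messages flagged as spam
y_pred_custom = (y_proba_spam >= threshold).astype(int)
print(f"Threshold {threshold}:")
print(f"  Precision: {precision_score(y_test, y_pred_custom):.3f}")
print(f"  Recall:    {recall_score(y_test, y_pred_custom):.3f}")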
You can now:
✅ Calculate and interpret accuracy.
✅ Build and read a confusion matrix.
✅ Calculate and interpret precision, recall, and F1-score.
✅ Know when to prioritize precision vs. recall according to the problem.
✅ Get and analyze prediction probabilities.
✅ Understand that a "good" model depends on context, not just a number.
✅ Feel comfortable evaluating models with professional rigor.
"Don't measure your model by how much it gets right... measure it by how much what it gets right matters."
← Previous: Lesson 4: Train Your First Model | Next: Final Project →