"Don't start cooking without a recipe. Don't build a model without a plan."
Many beginners make the same mistake: they want to jump directly to the model.
"I want to train an AI like ChatGPT NOW!"
But that's like wanting to build an airplane without knowing what a screw is.
In this lesson, you won't just learn the steps of an ML project. You'll learn why each step exists, what happens if you skip it, and how to think like a data scientist from minute one.
At the end, you'll have a clear mental map that you can apply to any project: spam classification, price prediction, fraud detection, medical diagnosis, anything!
Imagine you're a pirate looking for buried treasure. You can't just start digging anywhere. You need:
That's exactly what we'll do in ML!
Let's break down each step in detail, with real examples, common mistakes, and expert tips.
"A well-defined problem is half solved."
Before touching code, before looking for data... stop and think.
What do I want to predict?
Who will use this prediction?
Why is it important to solve this?
"I'm going to use this Titanic dataset because it's cool."
No! The dataset is not the goal. The problem is the goal. The dataset is just a tool to solve it.
"Data is the new oil... but sometimes it comes full of mud."
Once the problem is defined, you need data. Without data, there's no ML.
In this course we'll use the SMS Spam Collection dataset. It's available on Kaggle and is small, clean, and perfect for starting.
import pandas as pd
url = "https://raw.githubusercontent.com/justmarkham/DAT8/master/data/sms.tsv"
data = pd.read_csv(url, sep='\t', names=['label', 'message'])
Never assume the data is clean. Always explore it first.
Ask yourself these questions:
print(data.shape)  # (5572, 2) → 5572 messages, 2 columns
print(data.head())
  label  message
0   ham  Go until jurong point, crazy.. Available only ...
1   ham  Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...
print(data['label'].value_counts())
ham 4825
spam 747
Name: label, dtype: int64
→ We have many more "ham" than "spam" messages! This imbalance is important (we'll see why in evaluation).
print(data.isnull().sum())
→ In this case, there are no missing values. But in real life, there almost always are!
data['length'] = data['message'].apply(len)
print(data['length'].describe())
count    5572.000000
mean       80.489052
std        59.942492
min         2.000000
25%        36.000000
50%        61.000000
75%       111.000000
max       910.000000
→ There are messages up to 910 characters long! Are they spam, or normal messages?
import matplotlib.pyplot as plt
import seaborn as sns
sns.histplot(data=data, x='length', hue='label', bins=50)
plt.title("Message length distribution by type")
plt.show()
→ You'll see that spam messages tend to be longer. That's a valuable clue!
"Garbage in, garbage out." (the GIGO law)
ML models are like Formula 1 engines: very powerful, but very sensitive to fuel quality.
# If there were nulls, you could:
# data = data.dropna() # Remove rows with nulls
# or
# data['column'] = data['column'].fillna(data['column'].mean()) # Fill with mean
# Convert 'ham'/'spam' to 0/1
data['label'] = data['label'].map({'ham': 0, 'spam': 1})
Never, never, never train and evaluate with the same data!
from sklearn.model_selection import train_test_split
X = data['message'] # Features
y = data['label'] # Target
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,    # 20% for test
    random_state=42   # For reproducible results
)
print(f"Train: {len(X_train)} messages")
print(f"Test: {len(X_test)} messages")
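Because only about 13% of the messages are spam, a purely random split can leave train and test with slightly different spam ratios. The following is a minimal sketch of one way to keep the ratio identical in both sets, using the `stratify` parameter of `train_test_split`. The toy messages and labels are made up for illustration and are not the course dataset.

```python
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 80 ham (0) and 20 spam (1) messages
messages = [f"message {i}" for i in range(100)]
labels = [0] * 80 + [1] * 20

# stratify=labels keeps the ham/spam ratio the same in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    messages, labels,
    test_size=0.2,
    random_state=42,
    stratify=labels,
)

train_spam_ratio = sum(y_tr) / len(y_tr)
test_spam_ratio = sum(y_te) / len(y_te)
print(f"Spam ratio in train: {train_spam_ratio:.2f}")
print(f"Spam ratio in test:  {test_spam_ratio:.2f}")
```

In the lesson's own split, this would amount to adding `stratify=y` to the call. Note that doing so changes which messages land in each set, so the numbers shown later may differ slightly.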
Models don't understand text. They understand numbers.
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train) # Learn vocabulary + transform
X_test_vec = vectorizer.transform(X_test) # Only transform (don't learn!)
print(f"Vocabulary: {len(vectorizer.vocabulary_)} unique words")
print(f"Shape of X_train_vec: {X_train_vec.shape}")  # (4457, 7358) → 4457 messages, 7358 words
What does CountVectorizer do?
It learns a vocabulary from the words in X_train, then turns each message into a vector of word counts.
Example:
Message: "free money now"
Vocabulary: ['free', 'money', 'now', 'click', 'here', ...]
Vector: [1, 1, 1, 0, 0, ...] → "free" appears 1 time, "money" 1 time, etc.
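To make the idea concrete, here is a hand-rolled sketch of the same bag-of-words logic in plain Python. It is a simplified illustration, not CountVectorizer's actual implementation (the real class tokenizes with a regular expression and returns a sparse matrix). The mini-corpus and the helper `to_count_vector` are hypothetical.

```python
from collections import Counter

# Hypothetical mini-corpus; in the lesson this role is played by X_train
train_messages = ["free money now", "click here now"]

# Step 1 ("fit"): collect every distinct word and give it a fixed position
vocabulary = sorted({word for msg in train_messages for word in msg.split()})

def to_count_vector(message, vocabulary):
    """Count how often each vocabulary word occurs in the message."""
    counts = Counter(message.lower().split())
    return [counts[word] for word in vocabulary]

# Step 2 ("transform"): turn messages into count vectors
print(vocabulary)                                      # ['click', 'free', 'here', 'money', 'now']
print(to_count_vector("free money now", vocabulary))   # [0, 1, 0, 1, 1]
print(to_count_vector("free free prize", vocabulary))  # [0, 2, 0, 0, 0]
```

The last line shows why the vocabulary is learned from the training messages only: "prize" was never seen during the fit step, so it has no column and is ignored at transform time.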
"This is where the computer learns... but you give it the tools."
Now, it's time to train!
For text classification, a good starting point is Multinomial Naive Bayes.
Why?
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(X_train_vec, y_train) # Train the model!
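What does fit() do?
It estimates from the training data how common each class is and how likely each word is to appear in spam versus ham.
That's it! Your model now "knows" how to distinguish spam from ham.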
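For intuition about what training a Naive Bayes classifier involves, here is a simplified sketch in plain Python. It is an illustration of the general technique under stated assumptions (whitespace tokenization, add-one smoothing), not scikit-learn's implementation. The toy training set and the helpers `fit_naive_bayes` and `predict` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy training set: (message, label) with 1 = spam, 0 = ham
train = [
    ("free money now", 1),
    ("win free prize", 1),
    ("see you at lunch", 0),
    ("call me now", 0),
]

def fit_naive_bayes(examples, alpha=1.0):
    """Estimate log priors and smoothed log word probabilities per class."""
    class_counts = Counter(label for _, label in examples)
    word_counts = {label: Counter() for label in class_counts}
    for message, label in examples:
        word_counts[label].update(message.split())
    vocab = {w for counts in word_counts.values() for w in counts}

    log_prior = {c: math.log(n / len(examples)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c, counts in word_counts.items():
        # alpha (add-one smoothing) avoids zero probabilities for unseen pairs
        total = sum(counts.values()) + alpha * len(vocab)
        log_likelihood[c] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return log_prior, log_likelihood

def predict(message, log_prior, log_likelihood):
    """Pick the class with the highest log score; unknown words are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in message.split() if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)

log_prior, log_likelihood = fit_naive_bayes(train)
print(predict("free prize now", log_prior, log_likelihood))  # 1 (spam)
print(predict("see you now", log_prior, log_likelihood))     # 0 (ham)
```

"Learning" here is nothing more than counting: how often each class occurs, and how often each word occurs within each class.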
"Don't trust your model. Put it to the test."
Training is easy. Evaluating well is what separates amateurs from professionals.
y_pred = model.predict(X_test_vec)
from sklearn.metrics import accuracy_score
acc = accuracy_score(y_test, y_pred)
print(f"Accuracy: {acc:.4f}")  # E.g.: 0.9825 → 98.25% correct
→ Looks excellent! But is it enough?
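One way to judge an accuracy figure is to compare it with a trivial baseline. The sketch below uses the class counts printed earlier for the full dataset (4825 ham, 747 spam) to show what a "model" that always answers ham would score. The exact baseline on the test split may differ slightly, since the counts here cover all 5572 messages.

```python
# Class counts taken from data['label'].value_counts() earlier in the lesson
ham_count = 4825
spam_count = 747
total = ham_count + spam_count

# A "model" that always predicts ham gets every ham message right
baseline_accuracy = ham_count / total
print(f"Always-ham baseline accuracy: {baseline_accuracy:.4f}")  # 0.8659

# ...but it never catches a single spam message
baseline_spam_recall = 0 / spam_count
print(f"Always-ham spam recall: {baseline_spam_recall:.4f}")  # 0.0000
```

With imbalanced classes, a useless model can already reach about 87% accuracy, which is why accuracy alone is not enough.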
from sklearn.metrics import confusion_matrix
import seaborn as sns
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Ham', 'Spam'],
            yticklabels=['Ham', 'Spam'])
plt.title("Confusion Matrix")
plt.ylabel("True")
plt.xlabel("Predicted")
plt.show()
→ It will show you something like:
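             Predicted
             Ham   Spam
True  Ham    950      5
      Spam    10    150
→ What does this mean?
Of the 955 real ham messages, 950 were correctly passed and 5 were wrongly flagged as spam (false positives). Of the 160 real spam messages, 150 were caught and 10 slipped through as ham (false negatives).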
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred, target_names=['Ham', 'Spam']))
              precision    recall  f1-score   support
         Ham       0.99      0.99      0.99       955
        Spam       0.97      0.94      0.95       160
    accuracy                           0.98      1115
   macro avg       0.98      0.97      0.97      1115
weighted avg       0.98      0.98      0.98      1115
What do these metrics mean?
Precision: Of all the messages I flagged as spam, how many really were spam?
→ Spam: 0.97 → 97% of messages marked as spam were spam. Good!
Recall (Sensitivity): Of all the spam that existed, how many did I detect?
→ Spam: 0.94 → I detected 94% of spam. Very good!
F1-Score: Harmonic mean of precision and recall. Ideal for imbalanced data.
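These three numbers can be checked by hand. The sketch below recomputes the spam row from the illustrative confusion matrix shown earlier (950, 5, 10, 150). The variable names `tn`, `fp`, `fn`, `tp` are the usual shorthand for true negatives, false positives, false negatives, and true positives, with spam treated as the positive class.

```python
# Counts from the illustrative confusion matrix (rows = true, columns = predicted)
tn, fp = 950, 5    # true ham:  predicted ham, predicted spam
fn, tp = 10, 150   # true spam: predicted ham, predicted spam

precision = tp / (tp + fp)  # of everything flagged as spam, how much was spam
recall = tp / (tp + fn)     # of all real spam, how much was caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"Precision: {precision:.2f}")  # 0.97
print(f"Recall:    {recall:.2f}")     # 0.94
print(f"F1-score:  {f1:.2f}")         # 0.95
```

The results match the Spam row of the classification report.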
Don't stop at the first version!
Now that you have a base, you can:
Try a different vectorizer (for example, TfidfVectorizer).
Try other models (for example, LogisticRegression or an SVM).
Data science is iterative. There's never a "final version." There's always room for improvement.
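As one possible next iteration, here is a minimal sketch that swaps in TfidfVectorizer and LogisticRegression, chained with scikit-learn's `make_pipeline`. The toy messages and the name `improved_model` are hypothetical. In the lesson you would fit on X_train and y_train, and whether this beats the Naive Bayes baseline is something to measure, not assume.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; in the lesson you would pass X_train and y_train instead
train_messages = [
    "free money now", "win a free prize today", "claim your free cash prize",
    "see you at lunch", "call me when you get home", "are we still meeting today",
]
train_labels = [1, 1, 1, 0, 0, 0]  # 1 = spam, 0 = ham

# The pipeline fits the vectorizer on the training text only, then trains the classifier
improved_model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
improved_model.fit(train_messages, train_labels)

# Raw text goes in; the pipeline vectorizes it with the already-learned vocabulary
predictions = improved_model.predict(["free prize now", "see you at home"])
print(predictions)
```

Wrapping both steps in a pipeline also removes the risk of accidentally calling fit_transform on the test messages.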
✅ The 5 steps of an ML project and why each is crucial.
✅ How to explore a dataset before using it.
✅ Why splitting into train/test is mandatory.
✅ How to convert text to numbers (vectorization).
✅ How to train a model with fit().
✅ How to evaluate it with accuracy, confusion matrix, and classification report.
✅ That the first model is never the last. There's always room for improvement!
"In ML, the most important thing is not the model... it's the process."