"Data are not cold numbers. They are stories, patterns, errors, and opportunities. Learn to listen to them."
Why does this lesson matter? Because this is where theory becomes practice.
In Lesson 2 you learned the map. Now you're going to walk the path.
⚠️ Friendly warning: this lesson has more code than the previous ones, but don't be afraid. We'll go step by step, with detailed explanations, common errors, and expert tips. You won't be alone.
By the end, you'll be able to:
✅ Load a dataset from a URL or local file using Pandas.
✅ Explore its structure, content, and possible problems (nulls, duplicates, strange values).
✅ Create simple visualizations to understand patterns.
✅ Prepare the data for the model: encode labels, split into train/test, vectorize text.
✅ Understand why each preparation step is necessary.
✅ Feel comfortable manipulating data… your new raw material!
💡 If you haven't done so yet, open Google Colab now: https://colab.research.google.com
Create a new notebook and let's get started!
We'll use the SMS Spam Collection dataset. It's small, clean, and perfect for starting.
# Always start by importing what you need
import pandas as pd
📌 What is Pandas?
It's a Python library for manipulating and analyzing data. Think of it as Excel, but more powerful and programmable.
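To get a feel for it, here is a tiny, self-contained sketch (the two rows are made up, purely for illustration):
import pandas as pd  # already imported above
# A DataFrame is a table: rows plus named columns, like a mini spreadsheet you can program
mini = pd.DataFrame({
    'label': ['ham', 'spam'],
    'message': ['See you at 8', 'WIN a FREE prize now!']
})
print(mini)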
# Dataset URL (hosted on GitHub)
url = "https://raw.githubusercontent.com/justmarkham/DAT8/master/data/sms.tsv "
# Load with pandas
# The file is tab-separated (\t), and has no header
data = pd.read_csv(url, sep='\t', names=['label', 'message'])
# Show the first 5 rows
print(data.head())
📊 Expected output:
label message
0 ham Go until jurong point, crazy.. Available only ...
1 ham Ok lar... Joking wif u oni...
2 spam Free entry in 2 a wkly comp to win FA Cup fina...
3 ham U dun say so early hor... U c already then say...
4 ham Nah I don't think he goes to usf, he lives aro...
✅ Data loaded! You now have a Pandas DataFrame.
# How many rows and columns?
print(f"Dataset shape: {data.shape}") # (5572, 2)
# Column names
print(f"Columns: {data.columns.tolist()}") # ['label', 'message']
# Data types
print(data.dtypes)
📊 Output:
label object
message object
dtype: object
👉 Both columns are of type object (text in Pandas).
# Summary of the columns (describe() defaults to numeric columns; include='all' also summarizes text columns)
print(data.describe(include='all'))
📊 Key output:
label message
count 5572 5572
unique 2 5169
top ham Sorry, I'll call later
freq 4825 30
👉 unique=2 in label: there are only two values, 'ham' and 'spam'.
👉 top=ham: the most frequent value is 'ham'.
👉 freq=4825: 'ham' appears 4825 times.
Now, let's dig deeper. Don't assume anything. Explore everything.
print(data['label'].value_counts())
📊 Output:
ham 4825
spam 747
Name: label, dtype: int64
⚠️ We have an imbalanced dataset! There are about 6.5 times more ham than spam.
👉 This is normal in spam detection… but it will affect how we evaluate the model. We'll see this in Lesson 6!
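To see the imbalance as proportions instead of raw counts, value_counts can normalize for you:
# Same counts, expressed as fractions of the whole dataset
print(data['label'].value_counts(normalize=True))
# ham ≈ 0.87, spam ≈ 0.13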
print(data.isnull().sum())
📊 Output:
label 0
message 0
dtype: int64
✅ Perfect! No null values. In real life, this is rare. Always check this.
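The check passes here, so nothing needs fixing. For datasets that do have gaps, here is a minimal sketch of the usual first response:
# Only needed when isnull() reports missing values (a no-op on this dataset)
data = data.dropna(subset=['label', 'message'])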
# Create a new column: message length
data['length'] = data['message'].apply(len)
# Descriptive statistics
print(data['length'].describe())
📊 Output:
count 5572.000000
mean 80.489052
std 59.942492
min 2.000000
25% 36.000000
50% 61.000000
75% 111.000000
max 910.000000
Name: length, dtype: float64
👉 There are messages of up to 910 characters! Are they spam, or normal messages? Let's visualize it to find out.
import matplotlib.pyplot as plt
import seaborn as sns
# Set style
sns.set_style("whitegrid")
# Histogram of lengths, colored by label
plt.figure(figsize=(12, 6))
sns.histplot(data=data, x='length', hue='label', bins=50, kde=False)
plt.title("Message length distribution by type (Spam vs Ham)", fontsize=16)
plt.xlabel("Message length (characters)", fontsize=12)
plt.ylabel("Frequency", fontsize=12)
# Seaborn adds a legend for the hue ('ham'/'spam') automatically, so no manual plt.legend() is needed
plt.show()
📊 What do you see?
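If you want to back the picture up with numbers, group the lengths by label (this is the same data the histogram is drawn from):
# Length statistics computed separately for ham and spam
print(data.groupby('label')['length'].describe())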
Let's do a very basic text analysis.
# Filter only spam
spam_messages = data[data['label'] == 'spam']['message']
# Convert to lowercase and split into words
words = ' '.join(spam_messages).lower().split()
# Count word frequency
from collections import Counter
word_freq = Counter(words)
# Show the 20 most common words in spam
print("Most frequent words in SPAM:")
for word, freq in word_freq.most_common(20):
    print(f"{word}: {freq}")
📊 Typical output:
free: 167
to: 137
you: 117
call: 90
txt: 89
now: 87
...
👉 Of course! Words like "free", "call", "now" are very common in spam.
👉 This confirms that the model will be able to learn from these clues.
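As a contrast, you can reuse exactly the same counting code on the ham messages; their top words are likely to look much more "everyday":
from collections import Counter  # already imported above
# Same idea, but for ham messages
ham_messages = data[data['label'] == 'ham']['message']
ham_freq = Counter(' '.join(ham_messages).lower().split())
print("Most frequent words in HAM:")
for word, freq in ham_freq.most_common(10):
    print(f"{word}: {freq}")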
Now, we'll prepare the data for the model. Remember: models understand numbers, not text.
We'll convert 'ham' and 'spam' to 0 and 1.
# Create a mapping
label_map = {'ham': 0, 'spam': 1}
# Apply the mapping
data['label_encoded'] = data['label'].map(label_map)
# Verify
print(data[['label', 'label_encoded']].head())
📊 Output:
label label_encoded
0 ham 0
1 ham 0
2 spam 1
3 ham 0
4 ham 0
✅ Done! Now the label is numeric.
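One defensive check worth adding (a small sketch): map() silently turns any value it doesn't recognize into NaN, so confirm nothing was left unmapped.
# If this prints 0, every label was mapped correctly
print(data['label_encoded'].isnull().sum())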
Never train and evaluate with the same data!
from sklearn.model_selection import train_test_split
# Features (X) = messages
# Label (y) = label_encoded
X = data['message']
y = data['label_encoded']
# Split: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.2,
random_state=42, # For reproducibility
stratify=y # Maintains the spam/ham proportion in train and test!
)
print(f"Train size: {len(X_train)} messages")
print(f"Test size: {len(X_test)} messages")
print(f"Spam proportion in train: {y_train.mean():.2%}")
print(f"Spam proportion in test: {y_test.mean():.2%}")
📊 Output:
Train size: 4457 messages
Test size: 1115 messages
Spam proportion in train: 13.42%
Spam proportion in test: 13.41%
✅ Perfect! The proportion is maintained thanks to stratify=y.
We'll use Scikit-learn's CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer
# Create the vectorizer
vectorizer = CountVectorizer()
# Learn the vocabulary and transform X_train
X_train_vec = vectorizer.fit_transform(X_train)
# Only transform X_test (don't learn from it!)
X_test_vec = vectorizer.transform(X_test)
# View the size
print(f"Vocabulary: {len(vectorizer.vocabulary_)} unique words")
print(f"X_train_vec shape: {X_train_vec.shape}") # (4457, 7358)
print(f"X_test_vec shape: {X_test_vec.shape}") # (1115, 7358)
📊 What does (4457, 7358) mean?
👉 4457 rows (one per training message) and 7358 columns (one per word in the vocabulary). Each message is now a vector of 7358 numbers, most of them 0, because not every word appears in every message.
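To make that concrete, here is a sketch that prints only the nonzero entries of the first training message (the exact words depend on the random split):
# Inspect one message as a bag-of-words vector
row = X_train_vec[0]                              # sparse row: 1 x vocabulary size
feature_names = vectorizer.get_feature_names_out()
for idx, count in zip(row.indices, row.data):
    print(f"{feature_names[idx]}: {count}")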
Want to see what words the vectorizer learned?
# Get the first 20 words from the vocabulary
vocab = vectorizer.get_feature_names_out()
print("First 20 words in the vocabulary:")
print(vocab[:20])
📊 Output:
['00', '000', '0000', '00000', '000000', '00001', '0001', '00011', '00012', '00015', '0002', '0003', '0004', '0005', '0006', '0007', '00080', '0009', '001', '0010']
⚠️ Oops! There are lots of numbers. Why? Because CountVectorizer's default tokenizer accepts any run of 2+ word characters as a token, so pure numbers count as "words" too.
💡 Professional tip: later, you could improve this with:
stop_words='english' → remove very common words ("the", "and", "is").
lowercase=True → convert to lowercase (it already does this by default).
token_pattern=r'\b[a-zA-Z]{2,}\b' → only words of 2+ letters, no numbers or signs.
But for now, it's fine! We're learning.
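If you want to try those options later, the call would look roughly like this (a sketch; in this lesson we keep the default vectorizer):
# A stricter vectorizer, for later experiments
vectorizer_clean = CountVectorizer(
    stop_words='english',                  # drop very common English words
    lowercase=True,                        # already the default
    token_pattern=r'\b[a-zA-Z]{2,}\b'      # only words of 2+ letters, no numbers or signs
)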
Now, it's your turn to explore.
# Find the index of the longest message
idx_max = data['length'].idxmax()
longest_message = data.loc[idx_max]
print(f"Length: {longest_message['length']} characters")
print(f"Type: {longest_message['label']}")
print(f"Message: {longest_message['message']}")
📊 Typical output:
Length: 910 characters
Type: spam
Message: "I HAVE A DATE ON SUNDAY WITH WILL!!..." (A VERY long spam!)
# Filter long messages
long_messages = data[data['length'] > 200]
total_long = len(long_messages)
spam_long = long_messages[long_messages['label'] == 'spam'].shape[0]
print(f"Messages > 200 characters: {total_long}")
print(f"Of these, spam: {spam_long} ({spam_long/total_long:.1%})")
📊 Typical output:
Messages > 200 characters: 45
Of these, spam: 43 (95.6%)
👉 Almost all long messages are spam! This confirms our visual hypothesis.
Look at some random messages. Do you see signs of punctuation, capital letters, numbers, spelling errors?
# Show 5 random messages
sample = data.sample(5, random_state=1)
for i, row in sample.iterrows():
    print(f"[{row['label']}] {row['message'][:100]}...")  # Only the first 100 characters
👉 You'll likely see things like odd punctuation, ALL-CAPS words, numbers, and spelling errors.
💡 Reflection: do you think this will affect the model? How could you improve it? (Hint: text cleaning, lemmatization, etc.; we'll see this in advanced courses.)
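As a first taste of what "text cleaning" can mean, here is a minimal sketch (real pipelines use more robust tools):
import re
def clean_text(text):
    text = text.lower()                          # lowercase everything
    text = re.sub(r'[^a-z\s]', ' ', text)        # replace digits and punctuation with spaces
    return re.sub(r'\s+', ' ', text).strip()     # collapse repeated whitespace
print(clean_text("FREE entry!! Call 0800-123 now..."))   # -> free entry call now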
⚠️ Common mistakes to avoid:
Forgetting stratify in train_test_split → the spam/ham proportion gets imbalanced between train and test.
Using fit_transform on the test set → data leakage. Only transform!
Not keeping y_train and y_test as separate variables → later you can't train or evaluate.
In this lesson, you learned to:
✅ Load a dataset from a URL with Pandas.
✅ Explore its structure, unique values, nulls, and statistics.
✅ Create visualizations to understand patterns (length, frequent words).
✅ Encode text labels to numbers.
✅ Split data into train/test maintaining proportions (stratify).
✅ Vectorize text with CountVectorizer.
✅ Understand the shape of the resulting matrices.
✅ Ask exploratory questions and answer them with code.
✅ Feel comfortable with basic data manipulation.
"Before training a model, train your eyes. Learn to see what the data is telling you."
← Previous: Lesson 2: The Treasure Map | Next: Lesson 4: Train Your First Model →