📘 Lesson 3: Your First Interaction with Data – Loading, Exploring, and Basic Preparation

"Data are not cold numbers. They are stories, patterns, errors, and opportunities. Learn to listen to them."


โฑ๏ธ Estimated duration of this lesson: 75-90 minutes


🧭 Why is this lesson so important?

Because this is where theory becomes practice.

In Lesson 2 you learned the map. Now, you're going to walk the path.

You will:

  • Load a real dataset.
  • Explore it like a detective.
  • Clean it like a surgeon.
  • Prepare it like a chef.
  • And leave it ready for a model to understand.

โš ๏ธ Friendly warning: This lesson has more code than the previous ones. But don't be afraid. We'll do it step by step, with detailed explanations, common errors, and expert tips. You won't be alone.


🎯 Objectives of this lesson

By the end, you'll be able to:

✅ Load a dataset from a URL or local file using Pandas.
✅ Explore its structure, content, and possible problems (nulls, duplicates, strange values).
✅ Create simple visualizations to understand patterns.
✅ Prepare the data for the model: encode labels, split into train/test, vectorize text.
✅ Understand why each preparation step is necessary.
✅ Feel comfortable manipulating data… your new raw material!


🛠️ Tools you'll use

  • Pandas → The Swiss Army knife for data manipulation.
  • Matplotlib / Seaborn → For basic visualizations.
  • Scikit-learn → For splitting data and vectorizing text.
  • Jupyter Notebook or Google Colab → Your experimentation lab.

💡 If you haven't done so yet, open Google Colab now: https://colab.research.google.com
Create a new notebook and let's get started!


📥 Part 1: Loading the Data – Your First Import

We'll use the SMS Spam Collection dataset. It's small, clean, and perfect for getting started.

🔹 Step 1: Import the necessary libraries

# Always start by importing what you need
import pandas as pd

📌 What is Pandas?
It's a Python library for manipulating and analyzing data. Think of it as Excel, but more powerful and programmable.
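
💡 To see the analogy in action, here's a tiny self-contained sketch (the two toy rows are invented just for illustration):

# A DataFrame is a table: rows plus named columns (hypothetical toy data)
import pandas as pd

toy = pd.DataFrame({
    'label': ['ham', 'spam'],
    'message': ['See you at 5', 'WIN a FREE prize now!']
})
print(toy)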


🔹 Step 2: Load the dataset from a URL

# Dataset URL (hosted on GitHub)
url = "https://raw.githubusercontent.com/justmarkham/DAT8/master/data/sms.tsv"

# Load with pandas
# The file is tab-separated (\t), and has no header
data = pd.read_csv(url, sep='\t', names=['label', 'message'])

# Show the first 5 rows
print(data.head())

📌 Expected output:

  label                                            message
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...

✅ Data loaded! You now have a Pandas DataFrame.


🔹 Step 3: Understand the dataset structure

# How many rows and columns?
print(f"Dataset shape: {data.shape}")  # (5572, 2)

# Column names
print(f"Columns: {data.columns.tolist()}")  # ['label', 'message']

# Data types
print(data.dtypes)

📌 Output:

label      object
message    object
dtype: object

→ Both columns are of type object (text in Pandas).


🔹 Step 4: View basic statistics

# Statistical summary: describe() covers only numeric columns by default; include='all' summarizes text columns too
print(data.describe(include='all'))

📌 Key output:

       label           message
count   5572              5572
unique     2              5169
top      ham  Sorry, I'll call later
freq    4825                30

→ unique=2 in label: there are only two values: 'ham' and 'spam'.
→ top=ham: the most frequent value is 'ham'.
→ freq=4825: 'ham' appears 4825 times.
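
💡 The objectives also mentioned duplicates. describe() already hinted at them (unique=5169 messages out of 5572 rows), and you can count them directly; this quick check is worth running on any dataset:

# How many rows are exact copies of an earlier row?
print(f"Duplicated rows: {data.duplicated().sum()}")

# Inspect some of them (keep=False marks every copy, including the first occurrence)
print(data[data.duplicated(keep=False)].sort_values('message').head())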


๐Ÿ” Part 2: Exploring the Data โ€” Be a Data Detective

Now, let's dig deeper. Don't assume anything. Explore everything.


🔸 Question 1: How many spam and how many ham are there?

print(data['label'].value_counts())

📌 Output:

ham     4825
spam     747
Name: label, dtype: int64

→ We have an imbalanced dataset! There are about 6.5 times more ham than spam.
→ This is normal in spam detection… but it will affect how we evaluate the model. We'll see this in Lesson 6!
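
💡 To see the imbalance as percentages (handy when we discuss evaluation baselines later), value_counts accepts normalize=True:

# Class proportions instead of raw counts
print(data['label'].value_counts(normalize=True))
# ham ≈ 0.87, spam ≈ 0.13 → a model that always answers "ham" would already be ~87% "accurate"!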


🔸 Question 2: Are there null values?

print(data.isnull().sum())

📌 Output:

label      0
message    0
dtype: int64

→ Perfect! No null values. In real life, this is rare. Always check this.
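
💡 If you do find nulls someday, here's a minimal sketch of the two most common reactions (which one is right depends on your data):

# Option 1: drop any row that has a null value
data_clean = data.dropna()

# Option 2: fill nulls in a text column with an empty string
data_filled = data.fillna({'message': ''})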


🔸 Question 3: How is the message length distributed?

# Create a new column: message length
data['length'] = data['message'].apply(len)

# Descriptive statistics
print(data['length'].describe())

📌 Output:

count    5572.000000
mean       80.489052
std        59.942492
min         2.000000
25%        36.000000
50%        61.000000
75%       111.000000
max       910.000000
Name: length, dtype: float64

→ There are messages up to 910 characters! Are those spam, or just unusually long normal messages?


🔸 Question 4: Visualize the length distribution

import matplotlib.pyplot as plt
import seaborn as sns

# Set style
sns.set_style("whitegrid")

# Histogram of lengths, colored by label
plt.figure(figsize=(12, 6))
sns.histplot(data=data, x='length', hue='label', bins=50, kde=False)
plt.title("Message length distribution by type (Spam vs Ham)", fontsize=16)
plt.xlabel("Message length (characters)", fontsize=12)
plt.ylabel("Frequency", fontsize=12)
# seaborn already adds a legend for 'label'; overriding its labels by hand can swap Spam and Ham
plt.show()

📌 What do you see?

  • Spam messages tend to be longer (many around 150-200 characters).
  • Ham messages are shorter (concentrated between 20 and 100 characters).
  • This is a valuable clue! Length could be a useful feature for the model.
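
You can back up the visual impression with numbers; a one-liner like this (a sketch, reusing the 'length' column we created above) should show spam averaging roughly double the length of ham:

# Length statistics per class: compare mean, median, and max side by side
print(data.groupby('label')['length'].describe())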

🔸 Question 5: What words appear in spam?

Let's do a very basic text analysis.

# Filter only spam
spam_messages = data[data['label'] == 'spam']['message']

# Convert to lowercase and split into words
words = ' '.join(spam_messages).lower().split()

# Count word frequency
from collections import Counter
word_freq = Counter(words)

# Show the 20 most common words in spam
print("Most frequent words in SPAM:")
for word, freq in word_freq.most_common(20):
    print(f"{word}: {freq}")

📌 Typical output:

free: 167
to: 137
you: 117
call: 90
txt: 89
now: 87
...

→ Of course! Words like "free", "call", "now" are very common in spam.
→ This confirms that the model will be able to learn from these clues.
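
💡 For contrast, you can run the same count on ham. Words that are frequent in spam but nearly absent in ham are the most telling; here's a minimal sketch (the thresholds 30 and 2 are arbitrary choices for illustration, and Counter / word_freq come from the block above):

# Same frequency count, but for ham messages
ham_messages = data[data['label'] == 'ham']['message']
ham_freq = Counter(' '.join(ham_messages).lower().split())

# Words common in spam (30+ times) but rare in ham (2 or fewer times)
spam_markers = {w: f for w, f in word_freq.items() if f >= 30 and ham_freq[w] <= 2}
print(sorted(spam_markers.items(), key=lambda kv: -kv[1])[:10])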


🧹 Part 3: Preparing the Data – Cleaning and Transformation

Now, we'll prepare the data for the model. Remember: models understand numbers, not text.


🔹 Step 1: Encode the labels (label encoding)

We'll convert 'ham' and 'spam' to 0 and 1.

# Create a mapping
label_map = {'ham': 0, 'spam': 1}

# Apply the mapping
data['label_encoded'] = data['label'].map(label_map)

# Verify
print(data[['label', 'label_encoded']].head())

📌 Output:

  label  label_encoded
0   ham              0
1   ham              0
2  spam              1
3   ham              0
4   ham              0

→ Done! Now the label is numeric.
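
💡 By the way, Scikit-learn has a LabelEncoder that does the same job; the manual map above is nice because you choose explicitly which class becomes 1. A sketch of the alternative:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# LabelEncoder assigns integers alphabetically: 'ham' → 0, 'spam' → 1 (same as our map, by coincidence)
encoded = le.fit_transform(data['label'])
print(le.classes_)  # ['ham' 'spam']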


🔹 Step 2: Split into Train and Test

Never train and evaluate with the same data!

from sklearn.model_selection import train_test_split

# Features (X) = messages
# Label (y) = label_encoded
X = data['message']
y = data['label_encoded']

# Split: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,  # For reproducibility
    stratify=y        # Maintains the spam/ham proportion in train and test!
)

print(f"Train size: {len(X_train)} messages")
print(f"Test size: {len(X_test)} messages")
print(f"Spam proportion in train: {y_train.mean():.2%}")
print(f"Spam proportion in test: {y_test.mean():.2%}")

📌 Output:

Train size: 4457 messages
Test size: 1115 messages
Spam proportion in train: 13.42%
Spam proportion in test: 13.41%

→ Perfect! The proportion is maintained thanks to stratify=y.


🔹 Step 3: Vectorize the text – Convert words to numbers

We'll use Scikit-learn's CountVectorizer.

from sklearn.feature_extraction.text import CountVectorizer

# Create the vectorizer
vectorizer = CountVectorizer()

# Learn the vocabulary and transform X_train
X_train_vec = vectorizer.fit_transform(X_train)

# Only transform X_test (don't learn from it!)
X_test_vec = vectorizer.transform(X_test)

# View the size
print(f"Vocabulary: {len(vectorizer.vocabulary_)} unique words")
print(f"X_train_vec shape: {X_train_vec.shape}")  # (4457, 7358)
print(f"X_test_vec shape: {X_test_vec.shape}")    # (1115, 7358)

📌 What does (4457, 7358) mean?

  • 4457 → number of messages in train.
  • 7358 → number of unique words in the vocabulary (learned from train).

→ Each message is now a vector of 7358 numbers (most are 0, because not all words appear in all messages).
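
💡 You can measure exactly how sparse these vectors are; nnz counts the non-zero entries actually stored in the sparse matrix:

# What fraction of the matrix is actually non-zero?
n_cells = X_train_vec.shape[0] * X_train_vec.shape[1]
print(f"Non-zero entries: {X_train_vec.nnz} ({X_train_vec.nnz / n_cells:.4%} of all cells)")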


🔹 Step 4 (Optional): Inspect the vocabulary

Want to see what words the vectorizer learned?

# Get the first 20 words from the vocabulary
vocab = vectorizer.get_feature_names_out()
print("First 20 words in the vocabulary:")
print(vocab[:20])

📌 Output:

['00', '000', '0000', '00000', '000000', '00001', '0001', '00011', '00012', '00015', '0002', '0003', '0004', '0005', '0006', '0007', '00080', '0009', '001', '0010']

→ Oops! There are many numbers. Why? Because CountVectorizer's default tokenizer treats any run of two or more letters or digits as a word, so digit strings like '000' end up in the vocabulary (punctuation, at least, is dropped).

💡 Professional tip: Later, you could improve this with:

  • stop_words='english' → remove common words ("the", "and", "is").
  • lowercase=True → convert to lowercase (it already does this by default).
  • token_pattern=r'\b[a-zA-Z]{2,}\b' → only words of 2+ letters, no numbers or signs.

But for now, it's fine! We're learning.
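
💡 If you're curious anyway, here's a hedged sketch combining those options (your vocabulary will shrink, so its size will differ from the numbers above):

# A stricter vectorizer: English stop words removed, tokens must be 2+ letters
vectorizer_clean = CountVectorizer(
    stop_words='english',
    token_pattern=r'\b[a-zA-Z]{2,}\b'
)
X_train_clean = vectorizer_clean.fit_transform(X_train)
print(f"Vocabulary without numbers or stop words: {len(vectorizer_clean.vocabulary_)} words")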


🧪 Part 4: Mini-Exploratory Project – Make it Your Own!

Now, it's your turn to explore.

🔸 Exercise 1: What's the longest message? Is it spam or ham?

# Find the index of the longest message
idx_max = data['length'].idxmax()
longest_message = data.loc[idx_max]

print(f"Length: {longest_message['length']} characters")
print(f"Type: {longest_message['label']}")
print(f"Message: {longest_message['message']}")

📌 Typical output:

Length: 910 characters
Type: ham
Message: "For me the love should start with attraction..." (a VERY long message!)

→ Surprise! The single longest message is actually ham. Don't worry: as the next exercise shows, most long messages really are spam.

🔸 Exercise 2: How many messages have more than 200 characters? What percentage are spam?

# Filter long messages
long_messages = data[data['length'] > 200]
total_long = len(long_messages)
spam_long = long_messages[long_messages['label'] == 'spam'].shape[0]

print(f"Messages > 200 characters: {total_long}")
print(f"Of these, spam: {spam_long} ({spam_long/total_long:.1%})")

📌 Typical output:

Messages > 200 characters: 45
Of these, spam: 43 (95.6%)

→ Almost all long messages are spam! This confirms our visual hypothesis.


🔸 Exercise 3: How clean is the text?

Look at some random messages. Do you see signs of punctuation, capital letters, numbers, spelling errors?

# Show 5 random messages
sample = data.sample(5, random_state=1)
for i, row in sample.iterrows():
    print(f"[{row['label']}] {row['message'][:100]}...")  # Only first 100 characters

→ You'll see things like:

  • "U dun say so early hor..." โ†’ informal language, abbreviations.
  • "FreeMsg Hey there darling..." โ†’ mix of capital letters, signs, numbers.

💡 Reflection: Do you think this will affect the model? How could you improve it? (Hint: text cleaning, lemmatization, etc.; we'll see this in advanced courses).
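
As a small taste of what's coming, here's a minimal cleaning sketch using only the standard library (real pipelines rely on tools like NLTK or spaCy for lemmatization):

import re

def basic_clean(text):
    # Lowercase, replace everything except letters with spaces, collapse whitespace
    text = re.sub(r'[^a-z\s]', ' ', text.lower())
    return re.sub(r'\s+', ' ', text).strip()

print(basic_clean("FreeMsg Hey there darling, it's been 3 weeks!!"))
# → "freemsg hey there darling it s been weeks"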


โŒ Common Errors in this Lesson (Avoid Them!)

  1. Not using stratify in train_test_split โ†’ Imbalances train/test.
  2. Applying fit_transform on test โ†’ Data leakage. Only transform!
  3. Not exploring data before vectorizing โ†’ You miss patterns and errors.
  4. Being scared by high dimensionality (7358 columns) โ†’ It's normal in text! It's called "high-dimensional space".
  5. Not saving y_train and y_test as separate variables โ†’ Later you can't train or evaluate.
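
For error 2, here's a minimal sketch of the correct pattern (reusing the vectorizer and split from Part 3) and why it matters:

# RIGHT: the vocabulary is learned from train only; test is merely transformed
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Words never seen during training are simply ignored at transform time
demo = vectorizer.transform(["totally_unseen_token call now"])
print(demo.sum())  # counts only the words that already exist in the train vocabulary

# WRONG: vectorizer.fit_transform(X_test) would let the test set define the features (leakage!)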

✅ Checklist for this lesson – What should you know how to do now?

โ˜ Load a dataset from a URL with Pandas.
โ˜ Explore its structure, unique values, nulls, and statistics.
โ˜ Create visualizations to understand patterns (length, frequent words).
โ˜ Encode text labels to numbers.
โ˜ Split data into train/test maintaining proportions (stratify).
โ˜ Vectorize text with CountVectorizer.
โ˜ Understand the shape of the resulting matrices.
โ˜ Ask exploratory questions and answer them with code.
โ˜ Feel comfortable with basic data manipulation.


🎯 Quote to remember:

"Before training a model, train your eyes. Learn to see what the data is telling you."


โ† Previous: Lesson 2: The Treasure Map | Next: Lesson 4: Train Your First Model โ†’
