TRL (Transformer Reinforcement Learning) is a Hugging Face library specifically designed for training language models with modern approaches: from Supervised Fine-Tuning (SFT) to advanced techniques like RLHF (Reinforcement Learning from Human Feedback) and DPO (Direct Preference Optimization).
For our case, supervised fine-tuning with LoRA/QLoRA, we'll use the SFTTrainer, a class that extends Hugging Face's standard Trainer and is optimized for text-generation tasks. Key advantages include built-in prompt formatting (via a formatting function or a text field), optional sequence packing, and direct integration with PEFT adapters such as LoRA.
# Install TRL (if not done before)
!pip install -q trl
# Import key components
from trl import SFTTrainer
from transformers import TrainingArguments
TrainingArguments defines all training hyperparameters: batch size, epochs, learning rate, checkpointing, logging, etc.
training_args = TrainingArguments(
output_dir="./results", # Directory to save checkpoints and logs
num_train_epochs=3, # Number of full training epochs
per_device_train_batch_size=4, # Batch size per GPU (adjust based on memory)
gradient_accumulation_steps=4, # Accumulate gradients to simulate larger batches
optim="paged_adamw_8bit", # Memory-efficient optimizer (essential for QLoRA)
save_steps=500, # Save checkpoint every 500 steps
logging_steps=100, # Log metrics every 100 steps
learning_rate=2e-4, # LoRA learning rate (typical: 1e-4 to 3e-4)
weight_decay=0.01, # L2 regularization
fp16=True, # Mixed-precision training (FP16)
bf16=False, # Disabled unless GPU supports BF16 (A100, H100)
max_grad_norm=0.3, # Gradient clipping for stability
warmup_ratio=0.03, # Linear learning rate warmup
lr_scheduler_type="cosine", # Cosine learning rate decay
report_to="wandb", # Report metrics to Weights & Biases (optional)
evaluation_strategy="steps", # Evaluate during training (requires an eval_dataset; see the note at the end)
eval_steps=500, # Evaluate every 500 steps
save_total_limit=2, # Keep only the 2 latest checkpoints
load_best_model_at_end=True, # Load best model at end (by evaluation metric)
metric_for_best_model="eval_loss", # Metric defining "best model"
greater_is_better=False, # Lower loss is better
push_to_hub=False, # Do not upload to Hugging Face Hub (optional)
)
Key Notes:
- per_device_train_batch_size=4 combined with gradient_accumulation_steps=4 gives an effective batch size of 16.
- optim="paged_adamw_8bit" is essential to avoid out-of-memory errors with QLoRA.
- fp16=True speeds up training and reduces memory use. If your GPU supports BF16 (Ampere or newer), use bf16=True and fp16=False, as in the sketch below.
- report_to="wandb" requires a free Weights & Biases account; otherwise, use report_to="none".
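If you are unsure which precision flag applies to your hardware, you can select it programmatically. A minimal sketch, assuming PyTorch and a CUDA GPU; only the two precision arguments are shown, the rest stay as above:
# Prefer BF16 on GPUs that support it (Ampere or newer), otherwise fall back to FP16
import torch
use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
training_args = TrainingArguments(
    output_dir="./results",
    bf16=use_bf16,
    fp16=not use_bf16,
    # ... all other arguments as shown above
)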
The SFTTrainer needs each example rendered as a single formatted text string, either from a dataset text field or from a formatting function. Here we define a formatting function that applies the same Alpaca template as the format_instruction function from Module 6.
from datasets import Dataset
# Assume we have Alpaca-formatted examples
dataset_dict = {
"instruction": [
"Write a short description for a technology product.",
"Summarize the following text in one sentence.",
],
"input": [
"Product: Wireless headphones with noise cancellation. Price: $129.99.",
"Generative AI is transforming industries like education, entertainment, and healthcare by enabling automated creation of high-quality content.",
],
"output": [
"Enjoy your music without distractions with these high-fidelity wireless headphones. With active noise cancellation and up to 30 hours of battery life, they’re ideal for travel, work, or simply relaxing. Just $129.99.",
"Generative AI is revolutionizing key sectors by automating the creation of high-quality content.",
]
}
# Convert to Hugging Face Dataset
dataset = Dataset.from_dict(dataset_dict)
# Apply formatting
def formatting_prompts_func(examples):
instructions = examples["instruction"]
inputs = examples["input"]
outputs = examples["output"]
texts = []
for instruction, input_text, output in zip(instructions, inputs, outputs):
text = f"### Instruction:\n{instruction}\n\n### Input:\n{input_text}\n\n### Response:\n{output}"
texts.append(text)
return texts
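Before wiring the function into the trainer, it can be worth printing one formatted example to verify the prompt template (a quick sanity check, not required for training):
# Sanity check: format the first example and inspect the resulting prompt
print(formatting_prompts_func(dataset[:1])[0])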
# The SFTTrainer will use this function to format examples
trainer = SFTTrainer(
model=model,
args=training_args,
train_dataset=dataset,
formatting_func=formatting_prompts_func, # Function to format prompts
max_seq_length=512, # Maximum sequence length
tokenizer=tokenizer,
packing=False, # Do not pack sequences (better for instruction tuning)
# dataset_text_field is omitted: formatting_func already produces the training text
)
# Start training
trainer.train()
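Once training finishes, you will typically want to persist the adapter. A minimal sketch (the output path is illustrative; with a PEFT/LoRA model, save_model should write only the adapter weights, not the full base model):
# Save the trained LoRA adapter and the tokenizer
trainer.save_model("./results/final-adapter")
tokenizer.save_pretrained("./results/final-adapter")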
Important: The evaluation settings above (evaluation_strategy, eval_steps, load_best_model_at_end, metric_for_best_model) only take effect if an eval_dataset is provided. If your dataset is large, split it into train_dataset and eval_dataset and pass both to the trainer, as sketched below. Here, for simplicity, we train without evaluation; in that case set evaluation_strategy="no" and load_best_model_at_end=False.
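A sketch of that split using Dataset.train_test_split (the 10% evaluation fraction and the seed are arbitrary choices):
# Hold out 10% of the examples for evaluation
split = dataset.train_test_split(test_size=0.1, seed=42)
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=split["train"],
    eval_dataset=split["test"], # enables evaluation_strategy="steps" above
    formatting_func=formatting_prompts_func,
    max_seq_length=512,
    tokenizer=tokenizer,
    packing=False,
)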