TRL (Transformer Reinforcement Learning) is a Hugging Face library specifically designed for training language models with modern approaches: from Supervised Fine-Tuning (SFT) to advanced techniques like RLHF (Reinforcement Learning from Human Feedback) and DPO (Direct Preference Optimization).
For our case, supervised fine-tuning with LoRA/QLoRA, we'll use the SFTTrainer, a class that extends Hugging Face's standard Trainer and is optimized for text-generation tasks. Key advantages include built-in prompt formatting (via a formatting function or a text field), optional sequence packing, and direct integration with PEFT adapters such as LoRA.
# Install TRL (if not done before)
!pip install -q trl
# Import key components
from trl import SFTTrainer
from transformers import TrainingArguments
TrainingArguments defines all training hyperparameters: batch size, epochs, learning rate, checkpointing, logging, etc.
training_args = TrainingArguments(
output_dir="./results", # Directory to save checkpoints and logs
num_train_epochs=3, # Number of full training epochs
per_device_train_batch_size=4, # Batch size per GPU (adjust based on memory)
gradient_accumulation_steps=4, # Accumulate gradients to simulate larger batches
optim="paged_adamw_8bit", # Memory-efficient optimizer (essential for QLoRA)
save_steps=500, # Save checkpoint every 500 steps
logging_steps=100, # Log metrics every 100 steps
learning_rate=2e-4, # LoRA learning rate (typical: 1e-4 to 3e-4)
weight_decay=0.01, # L2 regularization
fp16=True, # Mixed-precision training (FP16)
bf16=False, # Disabled unless GPU supports BF16 (A100, H100)
max_grad_norm=0.3, # Gradient clipping for stability
warmup_ratio=0.03, # Linear learning rate warmup
lr_scheduler_type="cosine", # Cosine learning rate decay
report_to="wandb", # Report metrics to Weights & Biases (optional)
evaluation_strategy="steps", # Evaluate during training (requires an eval_dataset; see the note at the end)
eval_steps=500, # Evaluate every 500 steps
save_total_limit=2, # Keep only the 2 latest checkpoints
load_best_model_at_end=True, # Load best model at end (by evaluation metric)
metric_for_best_model="eval_loss", # Metric defining "best model"
greater_is_better=False, # Lower loss is better
push_to_hub=False, # Do not upload to Hugging Face Hub (optional)
)
Key Notes:
- per_device_train_batch_size=4 combined with gradient_accumulation_steps=4 gives an effective batch size of 16.
- optim="paged_adamw_8bit" is essential to avoid out-of-memory errors with QLoRA.
- fp16=True speeds up training and reduces memory use. If your GPU supports BF16 (Ampere or newer), use bf16=True and fp16=False, as in the sketch below.
- report_to="wandb" requires a free Weights & Biases account; otherwise, use report_to="none".
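If you are unsure which precision flag applies to your hardware, you can select it programmatically. A minimal sketch, assuming PyTorch and a CUDA GPU; only the two precision arguments are shown, the rest stay as above:
# Prefer BF16 on GPUs that support it (Ampere or newer), otherwise fall back to FP16
import torch
use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
training_args = TrainingArguments(
    output_dir="./results",
    bf16=use_bf16,
    fp16=not use_bf16,
    # ... all other arguments as shown above
)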
The SFTTrainer needs each example rendered as a single formatted text string, either from a dataset text field or from a formatting function. Here we define a formatting function that applies the same Alpaca template as the format_instruction function from Module 6.
from datasets import Dataset
# Assume we have Alpaca-formatted examples
dataset_dict = {
"instruction": [
"Write a short description for a technology product.",
"Summarize the following text in one sentence.",
],
"input": [
"Product: Wireless headphones with noise cancellation. Price: $129.99.",
"Generative AI is transforming industries like education, entertainment, and healthcare by enabling automated creation of high-quality content.",
],
"output": [
"Enjoy your music without distractions with these high-fidelity wireless headphones. With active noise cancellation and up to 30 hours of battery life, they’re ideal for travel, work, or simply relaxing. Just $129.99.",
"Generative AI is revolutionizing key sectors by automating the creation of high-quality content.",
]
}
# Convert to Hugging Face Dataset
dataset = Dataset.from_dict(dataset_dict)
# Apply formatting
def formatting_prompts_func(examples):
instructions = examples["instruction"]
inputs = examples["input"]
outputs = examples["output"]
texts = []
for instruction, input_text, output in zip(instructions, inputs, outputs):
text = f"### Instruction:\n{instruction}\n\n### Input:\n{input_text}\n\n### Response:\n{output}"
texts.append(text)
return texts
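Before wiring the function into the trainer, it can be worth printing one formatted example to verify the prompt template (a quick sanity check, not required for training):
# Sanity check: format the first example and inspect the resulting prompt
print(formatting_prompts_func(dataset[:1])[0])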
# The SFTTrainer will use this function to format examples
trainer = SFTTrainer(
model=model,
args=training_args,
train_dataset=dataset,
formatting_func=formatting_prompts_func, # Function to format prompts
max_seq_length=512, # Maximum sequence length
tokenizer=tokenizer,
packing=False, # Do not pack sequences (better for instruction tuning)
# dataset_text_field is omitted: formatting_func already produces the training text
)
# Start training
trainer.train()
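Once training finishes, you will typically want to persist the adapter. A minimal sketch (the output path is illustrative; with a PEFT/LoRA model, save_model should write only the adapter weights, not the full base model):
# Save the trained LoRA adapter and the tokenizer
trainer.save_model("./results/final-adapter")
tokenizer.save_pretrained("./results/final-adapter")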
Important: The evaluation settings above (evaluation_strategy, eval_steps, load_best_model_at_end, metric_for_best_model) only take effect if an eval_dataset is provided. If your dataset is large, split it into train_dataset and eval_dataset and pass both to the trainer, as sketched below. Here, for simplicity, we train without evaluation; in that case set evaluation_strategy="no" and load_best_model_at_end=False.
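A sketch of that split using Dataset.train_test_split (the 10% evaluation fraction and the seed are arbitrary choices):
# Hold out 10% of the examples for evaluation
split = dataset.train_test_split(test_size=0.1, seed=42)
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=split["train"],
    eval_dataset=split["test"], # enables evaluation_strategy="steps" above
    formatting_func=formatting_prompts_func,
    max_seq_length=512,
    tokenizer=tokenizer,
    packing=False,
)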