For text generation tasks (chat, instruction following, QA), the most common format is the Alpaca format, a JSON object with three fields per example:
{
  "instruction": "Write a short description for a technology product.",
  "input": "Product: Wireless headphones with noise cancellation. Price: $129.99.",
  "output": "Enjoy your music without distractions with these high-fidelity wireless headphones. With active noise cancellation and up to 30 hours of battery life, they’re ideal for travel, work, or simply relaxing. Just $129.99."
}
instruction: The task the model must perform.
input: Additional context or input (optional).
output: The desired response.

This format must be converted into tensors the model can understand. The model’s tokenizer converts the text into token IDs, and a chat template is applied if required (as in Qwen or Llama 3).
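A minimal sketch of applying a chat template, assuming a chat-tuned tokenizer; the checkpoint name below is only an example:

from transformers import AutoTokenizer

# Example checkpoint; any chat-tuned tokenizer (Qwen, Llama 3, ...) works the same way
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

example = {
    "instruction": "Write a short description for a technology product.",
    "input": "Product: Wireless headphones with noise cancellation. Price: $129.99.",
    "output": "Enjoy your music without distractions with these high-fidelity wireless headphones...",
}

messages = [
    {"role": "user", "content": f"{example['instruction']}\n{example['input']}"},
    {"role": "assistant", "content": example["output"]},
]

# Render the conversation with the template stored in the tokenizer config
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)

For base models without a chat template, the Alpaca fields can instead be concatenated into a plain prompt, as the following helper does: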
# Build a single Alpaca-style prompt from the three fields
def format_instruction(example):
    return f"""### Instruction:
{example['instruction']}
### Input:
{example['input']}
### Response:
{example['output']}"""
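Applied to the example above, the helper produces a single prompt string:

# `example` is the dict with the three Alpaca fields shown earlier
print(format_instruction(example))
# ### Instruction:
# Write a short description for a technology product.
# ### Input:
# Product: Wireless headphones with noise cancellation. Price: $129.99.
# ### Response:
# Enjoy your music without distractions with these high-fidelity wireless headphones...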
# Tokenization
def tokenize_function(example):
    text = format_instruction(example)
    tokenized = tokenizer(
        text,
        truncation=True,
        max_length=512,
        padding="max_length",
    )
    # For causal language modeling, labels start as a copy of the input IDs
    tokenized["labels"] = tokenized["input_ids"].copy()
    return tokenized
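As a usage sketch, assuming the examples live in a JSONL file (the file name is illustrative), the function can be mapped over a Hugging Face dataset:

from datasets import load_dataset

# Illustrative file name: one JSON object per line with the three Alpaca fields
dataset = load_dataset("json", data_files="alpaca_data.jsonl", split="train")

tokenized_dataset = dataset.map(
    tokenize_function,
    remove_columns=dataset.column_names,  # keep only input_ids, attention_mask, labels
)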
Important: In instruct models, it's common to mask the prompt tokens (instruction + input) in labels, so the model computes the loss only on the output. This is done by assigning -100 to those tokens (a value ignored by PyTorch’s loss function).
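A minimal sketch of that masking, reusing the tokenizer and Alpaca template from above (the helper name tokenize_with_masking is not from the original; the prompt length is simply counted by tokenizing the prompt part on its own):

def tokenize_with_masking(example):
    # Prompt part: everything up to and including the response header
    prompt = (
        f"### Instruction:\n{example['instruction']}\n"
        f"### Input:\n{example['input']}\n"
        f"### Response:\n"
    )
    full_text = prompt + example["output"]

    tokenized = tokenizer(
        full_text,
        truncation=True,
        max_length=512,
        padding="max_length",
    )

    # Number of prompt tokens (simplification: ignores truncation edge cases)
    prompt_len = len(tokenizer(prompt)["input_ids"])

    labels = tokenized["input_ids"].copy()
    # -100 is the ignore_index of PyTorch's cross-entropy loss,
    # so the prompt tokens contribute nothing to the training loss
    labels[:prompt_len] = [-100] * prompt_len
    tokenized["labels"] = labels
    return tokenized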