📚 Module 10: Saving, Loading, and Merging LoRA/QLoRA Adapters

10.1 Why Not Save the Full Model?

One of PEFT’s greatest benefits is that you only need to save the adapter parameters (LoRA), not the full base model. This has enormous practical implications:

  • A LoRA adapter for a 7B model typically occupies only tens of megabytes (under 10 MB for small ranks), compared to 14+ GB for the full model in FP16 (see the estimate after this list).
  • You can train multiple adapters for different tasks and load them onto the same base model as needed.
  • It facilitates sharing, versioning, and storing specialized models.
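
A rough back-of-the-envelope estimate of the adapter size (all numbers below are illustrative: hidden size 4096, 32 layers, rank 8, LoRA applied only to the attention q_proj and v_proj projections):

# Rough LoRA adapter size estimate (illustrative numbers, not tied to a specific model)
hidden_size = 4096   # model dimension of a typical 7B model
num_layers = 32      # transformer blocks
num_targets = 2      # LoRA on q_proj and v_proj only
r = 8                # LoRA rank

# Each targeted module adds two matrices: A (r x hidden_size) and B (hidden_size x r)
params_per_module = 2 * r * hidden_size
total_params = num_layers * num_targets * params_per_module

size_mb = total_params * 2 / 1024**2  # 2 bytes per parameter in FP16
print(f"{total_params:,} LoRA parameters ≈ {size_mb:.0f} MB")  # ~4.2M parameters ≈ 8 MB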

10.2 Saving the Trained Adapter

After training, the LoRA adapter is saved as additional weights. The base model remains untouched.

# Save the LoRA adapter
model.save_pretrained("./lora_adapter")

# Save tokenizer (if modified, though rare)
tokenizer.save_pretrained("./lora_adapter")

This creates a directory ./lora_adapter with files like:

  • adapter_config.json — LoRA configuration (r, alpha, target_modules, etc.)
  • adapter_model.safetensors (or adapter_model.bin in older PEFT versions) — LoRA weights (the A and B matrices)
  • README.md (optional) — Metadata
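
You can inspect the saved configuration without loading any weights; a minimal sketch, assuming the adapter was saved to ./lora_adapter as above:

from peft import PeftConfig

# Read adapter_config.json from the adapter directory
config = PeftConfig.from_pretrained("./lora_adapter")
print(config.base_model_name_or_path)  # base model the adapter expects
print(config)                          # r, lora_alpha, target_modules, etc.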

Important: The base model is not saved here. You must retain access to the original base model (e.g., from Hugging Face Hub) to later load the adapter.

10.3 Loading a LoRA Adapter Onto a Base Model

To use the trained adapter:

  1. Load the base model (with or without quantization).
  2. Load the LoRA adapter onto it.

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

# Quantization config (optional for efficient inference)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load base model
model_name = "Qwen/Qwen2.5-0.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Load LoRA adapter
model = PeftModel.from_pretrained(model, "./lora_adapter")

# Model now has specialized behavior
model.eval()  # Set to evaluation mode for inference
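
If you have trained several adapters (as mentioned in 10.1), you can keep them on the same base model and switch between them. A minimal sketch, assuming a second adapter was saved to a hypothetical ./lora_adapter_summarization directory:

# Load an additional adapter under a name of your choice
model.load_adapter("./lora_adapter_summarization", adapter_name="summarization")

# Route inference through the summarization adapter
model.set_adapter("summarization")
# ... run inference ...

# Switch back to the first adapter (named "default" by PeftModel.from_pretrained)
model.set_adapter("default")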

10.4 Merging the Adapter with the Base Model (merge_and_unload)

Loading adapters dynamically is flexible, but for production deployment or faster inference it is often useful to merge the LoRA weights into the base model. The result is a single, specialized model that requires no PEFT infrastructure at inference time.

# Merge LoRA adapter with base model
model = model.merge_and_unload()

# Now the model is a complete model with updated weights
# Save as a standard Hugging Face model
model.save_pretrained("./merged_model")
tokenizer.save_pretrained("./merged_model")
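
As an optional sanity check, you can confirm that no LoRA parameters remain after merging:

# After merge_and_unload(), the A and B matrices are folded into the base weights,
# so no parameter names should contain "lora_" anymore
leftover = [name for name, _ in model.named_parameters() if "lora_" in name]
assert not leftover, f"Unexpected LoRA parameters still present: {leftover}"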

Warning:

  • Once merged, you cannot reload another adapter without reloading the original base model.
  • The merged model occupies the same disk space as the original base model (~1 GB for Qwen2.5-0.5B in FP16).
  • Merging combines the LoRA weights with full-precision (FP16/BF16) base weights. If the base model was loaded in 4-bit, the usual approach is to reload it in FP16/BF16 and re-apply the adapter before merging, which requires more memory (see below).

Merging with a Quantized Model:

If training used QLoRA (4-bit base model), a common and reliable route is to reload the base model in full precision, re-apply the saved adapter, and then merge:

# Reload the base model in BF16 (requires enough memory for the full-precision weights)
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

# Re-apply the saved LoRA adapter to the full-precision base model
model = PeftModel.from_pretrained(base_model, "./lora_adapter")

# Merge and save
model = model.merge_and_unload()
model.save_pretrained("./merged_model_full_precision")

10.5 Loading the Merged Model for Inference

Once merged and saved, the model behaves like any standard Hugging Face model:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "./merged_model",
    device_map="auto",
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained("./merged_model", trust_remote_code=True)

# Ready for inference without PEFT!
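
For example, generation works exactly as with any other Transformers model (a minimal sketch; the prompt and generation settings are illustrative):

# Simple generation with the merged model
prompt = "Explain what a LoRA adapter is in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=100, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))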