📚 Module 4: QLoRA — High-Performance Quantized Fine-Tuning

4.1 Introduction to QLoRA

QLoRA (Quantized Low-Rank Adaptation) is a natural and revolutionary extension of LoRA, introduced in the paper “QLoRA: Efficient Finetuning of Quantized LLMs” (Dettmers et al., University of Washington, 2023). While LoRA reduces the number of trainable parameters, QLoRA goes one step further: it also reduces the numerical precision of the frozen base-model weights, so that large models fit on low-memory GPUs while maintaining performance nearly on par with full fine-tuning.

QLoRA enables fine-tuning of models with roughly 65–70 billion parameters on a single 48GB GPU, and 30–40B models on 24GB GPUs. Moreover, 7B–13B models can be fine-tuned on 16GB GPUs, such as those available in the free tier of Google Colab.

4.2 What Is Quantization?

Quantization is a compression technique that reduces the precision of the numbers representing a neural network's weights. Instead of storing each weight as a 32-bit (FP32) or 16-bit (FP16/BF16) floating-point number, weights are stored as 8-bit integers (INT8) or even 4-bit values (INT4).

For example:

  • FP32: 32 bits per weight → 4 bytes
  • INT4: 4 bits per weight → 0.5 bytes

This means a 7B-parameter model shrinks from ~28 GB in FP32 to just ~3.5 GB in INT4 — an 8x reduction.
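The arithmetic behind these figures is easy to reproduce. Below is a back-of-the-envelope sketch in Python that counts the weights only; activations, optimizer state, and framework overhead are ignored.

```python
# Back-of-the-envelope memory footprint of the raw weights at different precisions.
# Weights only: activations, optimizer state, and runtime overhead are not counted.

def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Gigabytes occupied by the weights alone."""
    return n_params * bits_per_weight / 8 / 1e9

n_params = 7e9  # a 7B-parameter model
for name, bits in [("FP32", 32), ("FP16/BF16", 16), ("INT8", 8), ("INT4/NF4", 4)]:
    print(f"{name:>9}: {weight_memory_gb(n_params, bits):5.1f} GB")
# FP32 ≈ 28.0 GB and INT4 ≈ 3.5 GB, i.e. the 8x reduction mentioned above.
```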

4.3 Types of Quantization in QLoRA

QLoRA does not use arbitrary quantization. It employs advanced techniques to minimize precision loss:

a) NormalFloat (NF4)

QLoRA introduces a novel data type: 4-bit NormalFloat (NF4). Unlike standard INT4, whose quantization levels are uniformly spaced, NF4 places its levels according to a normal distribution, which is approximately how neural network weights are distributed. This gives a more faithful representation of values near zero, where most of the weight mass lies.
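In the Hugging Face ecosystem, NF4 is selected through bitsandbytes via a BitsAndBytesConfig. A minimal sketch follows; the checkpoint name is only an example.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load a base model with its frozen weights quantized to 4-bit NF4.
# "meta-llama/Llama-2-7b-hf" is an example checkpoint, not a requirement.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4 instead of plain FP4
    bnb_4bit_compute_dtype=torch.bfloat16, # dtype used when weights are dequantized
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
```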

b) Double Quantization

QLoRA applies quantization at two levels:

  1. Quantizes model weights to 4 bits.
  2. Also quantizes scaling constants (used to reverse quantization during computation) to 8 bits.

This saves approximately 0.375 bits per parameter, which works out to roughly 300 MB of additional savings on a 7B model.
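In the same BitsAndBytesConfig, double quantization is a single flag. A hedged sketch, together with the rough arithmetic behind the savings:

```python
import torch
from transformers import BitsAndBytesConfig

# Same configuration as before, now with double quantization enabled.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,        # also quantize the scaling constants to 8 bits
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Rough arithmetic: one FP32 constant per 64-weight block costs 32/64 = 0.5 bits/param;
# an 8-bit constant costs 8/64 = 0.125 bits/param, saving about 0.375 bits/param.
print(0.375 * 7e9 / 8 / 1e9, "GB saved on a 7B model")  # ≈ 0.33 GB
```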

c) Paged Optimizers

QLoRA uses “paged” optimizers, which rely on NVIDIA unified memory to move optimizer states between GPU and CPU RAM automatically during memory spikes. This avoids Out-of-Memory (OOM) errors when processing long sequences or large batches, and is especially useful in memory-constrained environments.
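With the Hugging Face Trainer, a paged optimizer can be selected by name through TrainingArguments. The sketch below uses placeholder hyperparameters; only the optim choice is the point here.

```python
from transformers import TrainingArguments

# Placeholder output directory and hyperparameters; `optim` selects the paged optimizer.
training_args = TrainingArguments(
    output_dir="qlora-out",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    num_train_epochs=1,
    optim="paged_adamw_32bit",  # paged AdamW from bitsandbytes ("paged_adamw_8bit" also exists)
)
```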

4.4 How Does QLoRA Work in Practice?

QLoRA combines three key components:

  1. 4-bit (NF4) quantization of the base model: the base weights are frozen, stored in 4-bit form, and never updated.
  2. LoRA applied on top of the quantized weights: the LoRA matrices (A and B) remain in higher precision (BF16/FP16) and are the only parameters that receive gradient updates.
  3. Mixed-precision computation: during the forward and backward passes, the 4-bit weights are dequantized to BF16 on the fly for each matrix multiplication, and the dequantized copy is discarded afterwards, so the stored weights stay in 4 bits (a toy sketch of this round trip follows the list).
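
To make the dequantize-for-compute step concrete, here is a toy absmax quantize/dequantize round trip. This is an illustrative sketch, not the actual NF4 kernel used by bitsandbytes.

```python
import torch

# Toy symmetric absmax quantization of one 64-weight block to 4-bit integers,
# followed by dequantization to BF16 (the form actually used in the matmuls).
w = torch.randn(64)                                    # one block, as in bitsandbytes
scale = w.abs().max() / 7                              # map the block into roughly [-7, 7]
q = torch.clamp(torch.round(w / scale), -7, 7).to(torch.int8)   # what gets stored
w_dequant = (q.float() * scale).to(torch.bfloat16)              # what the matmul sees
print("max reconstruction error:", (w - w_dequant.float()).abs().max().item())
```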

This approach ensures:

  • Minimal GPU memory usage.
  • Sufficient computational precision to avoid performance degradation.
  • Only the LoRA parameters are trained, preserving parameter efficiency (the end-to-end sketch below puts the pieces together).
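
Putting the three components together in the Hugging Face stack (transformers + peft + bitsandbytes) looks roughly like the sketch below. The model name, LoRA rank, and target modules are illustrative choices, not prescriptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "meta-llama/Llama-2-7b-hf"  # example checkpoint

# 1) Frozen base model in 4-bit NF4 with double quantization.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto"
)

# Casts/freezes layers appropriately for k-bit training (e.g., keeps norms in FP32).
model = prepare_model_for_kbit_training(model)

# 2) Trainable LoRA adapters in higher precision on top of the quantized weights.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # a typical choice for LLaMA-style models
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA A/B matrices are trainable
```

From here, the wrapped model can be passed to the Trainer together with the paged-optimizer TrainingArguments shown earlier.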

4.5 Technical Requirements and Limitations

Requirements (a quick environment check is sketched after this list):

  • NVIDIA GPU with CUDA support.
  • bitsandbytes ≥ 0.41.0 (library implementing 4-bit quantization).
  • transformers ≥ 4.30.0.
  • accelerate and peft.
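
A quick, hedged way to confirm the environment is in place is to check CUDA availability and the installed package versions from Python:

```python
import torch
from importlib.metadata import version

# Simple environment check mirroring the requirements listed above.
print("CUDA available:", torch.cuda.is_available())
for pkg in ("bitsandbytes", "transformers", "accelerate", "peft"):
    try:
        print(f"{pkg}: {version(pkg)}")
    except Exception:
        print(f"{pkg}: not installed")
```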

Limitations:

  • Works best with decoder-only architectures such as GPT, Llama, Mistral, and Qwen. Encoder-decoder models such as T5 or BART can also be quantized with bitsandbytes, but support in standard fine-tuning workflows is less mature.
  • Not all models are supported: the model's implementation must be compatible with bitsandbytes and loadable through AutoModelForCausalLM (or an equivalent Auto class).
  • Minor computational overhead: on-the-fly dequantization adds some latency, but this is generally acceptable given the memory savings.
  • Not suitable for training from scratch. QLoRA is designed exclusively for fine-tuning.