🛠️ Part 5: Practical Project — Apply quantization and measure gains

Project Objective

Take a small language model (e.g., distilbert-base-uncased or TinyLlama-1.1B), apply 8-bit and 4-bit quantization, and compare:

  • Model size before and after.
  • Average inference time.
  • Accuracy on a text classification task.
  • Memory usage during inference.

Tools to use

  • Hugging Face Transformers → to load model and tokenizer.
  • Hugging Face Optimum → to export to ONNX and apply hardware-specific quantization.
  • ONNX Runtime → to efficiently run quantized models.
  • TensorFlow Lite (optional) → if exploring compression for mobile.
  • Jupyter Notebook or Google Colab → for development and visualization.

Detailed Steps

Step 1: Installation and setup

pip install transformers optimum[onnxruntime] torch onnx onnxruntime
pip install bitsandbytes accelerate datasets   # bitsandbytes/accelerate for Step 6 (4-bit), datasets for the accuracy check

Step 2: Load the original model

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# DistilBERT fine-tuned on SST-2 (binary sentiment classification)
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
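
A quick sanity check before any measurements confirms that the model and tokenizer load correctly; the example sentence is arbitrary:

import torch

# Classify one sentence with the original FP32 model
inputs = tokenizer("This movie was surprisingly good!", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
predicted_id = logits.argmax(dim=-1).item()
print(model.config.id2label[predicted_id])   # should print POSITIVE for this sentence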

Step 3: Measure original size and performance

  • Save the model to disk and measure size in MB.
  • Perform 100 inferences and measure average time.
  • Measure memory usage with torch.cuda.memory_allocated() (or process memory when running on CPU), as in the sketch below.
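
A minimal sketch of these measurements, assuming the model and tokenizer from Step 2 are still in memory; the helper names measure_size_mb and measure_latency_ms are invented for this example, and the VRAM line is only meaningful if the model actually sits on a GPU:

import os, time, torch

def measure_size_mb(path):
    # Sum of all file sizes under the saved model directory, in MB
    total = sum(
        os.path.getsize(os.path.join(root, f))
        for root, _, files in os.walk(path) for f in files
    )
    return total / 1e6

def measure_latency_ms(model, tokenizer, n_runs=100):
    # Average wall-clock time of n_runs forward passes on a single example
    inputs = tokenizer("This movie was surprisingly good!", return_tensors="pt")
    with torch.no_grad():
        model(**inputs)                        # warm-up run
        start = time.perf_counter()
        for _ in range(n_runs):
            model(**inputs)
    return (time.perf_counter() - start) / n_runs * 1000

model.save_pretrained("./fp32_model")
print("Size (MB):", measure_size_mb("./fp32_model"))
print("Latency (ms):", measure_latency_ms(model, tokenizer))
if torch.cuda.is_available():
    print("VRAM (MB):", torch.cuda.memory_allocated() / 1e6)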

Step 4: Apply dynamic 8-bit quantization with Optimum

from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# First export the model to ONNX
# (older Optimum versions use from_transformers=True instead of export=True)
model_ort = ORTModelForSequenceClassification.from_pretrained(model_name, export=True)
model_ort.save_pretrained("./onnx_model")

# Configure dynamic INT8 quantization (AVX512-VNNI here; pick the config that
# matches your CPU, e.g. AutoQuantizationConfig.avx2 on older machines)
dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer = ORTQuantizer.from_pretrained("./onnx_model")
quantizer.quantize(save_dir="./quantized_model", quantization_config=dqconfig)
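
To verify the size reduction at this point, you can compare the two ONNX files directly; model_quantized.onnx is the quantizer's default output name, so adjust the path if your Optimum version writes a different file:

import os

fp32_mb = os.path.getsize("./onnx_model/model.onnx") / 1e6
int8_mb = os.path.getsize("./quantized_model/model_quantized.onnx") / 1e6
print(f"ONNX FP32: {fp32_mb:.1f} MB  ->  ONNX INT8: {int8_mb:.1f} MB")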

Step 5: Load quantized model and measure

from optimum.onnxruntime import ORTModelForSequenceClassification

# The quantizer saves the weights as model_quantized.onnx by default; pass
# file_name explicitly if your Optimum version does not pick it up automatically.
model_quant = ORTModelForSequenceClassification.from_pretrained(
    "./quantized_model", file_name="model_quantized.onnx"
)
# Repeat size, time, and memory measurements
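
A minimal repeat of the latency measurement against the quantized ONNX model (same arbitrary sentence as before):

import time

# Same measurement as in Step 3, now against the INT8 ONNX model
inputs = tokenizer("This movie was surprisingly good!", return_tensors="pt")
model_quant(**inputs)                          # warm-up run
start = time.perf_counter()
for _ in range(100):
    model_quant(**inputs)
print("INT8 latency (ms):", (time.perf_counter() - start) / 100 * 1000)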

Step 6: Apply 4-bit quantization (using bitsandbytes if possible)

import torch
from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig

# NF4 4-bit quantization; bitsandbytes typically requires a CUDA-capable GPU
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
)

model_4bit = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto"
)
# Measure size, time, memory, and accuracy again
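
A sketch of the 4-bit measurements, assuming the model ended up on a GPU; get_memory_footprint() reports the in-memory size of the weights, and unlike Step 3 the inputs must be moved to the model's device:

import time, torch

# In-memory footprint of the 4-bit weights, in MB
print("NF4 footprint (MB):", model_4bit.get_memory_footprint() / 1e6)

# Latency: move the inputs to the same device as the model
inputs = tokenizer("This movie was surprisingly good!", return_tensors="pt").to(model_4bit.device)
with torch.no_grad():
    model_4bit(**inputs)                       # warm-up run
    start = time.perf_counter()
    for _ in range(100):
        model_4bit(**inputs)
print("NF4 latency (ms):", (time.perf_counter() - start) / 100 * 1000)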

Step 7: Compare results in a table

Model            Size (MB)   Latency (ms)   VRAM Usage (MB)   Accuracy (%)
Original FP32    267         45             1024              91.2
Quantized INT8   67          28             256               90.8
Quantized NF4    34          35*            128               90.1

(* 4-bit quantization may be slower on some hardware due to decompression overhead)
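
To fill in the Accuracy column, one option is to score each model on a slice of the SST-2 validation set; a minimal sketch using the datasets library, shown here for the FP32 model and repeatable for the quantized versions (the 500-example slice is an arbitrary choice to keep it fast):

import torch
from datasets import load_dataset

def accuracy_on_sst2(model, tokenizer, n_examples=500):
    # Fraction of correct predictions on a slice of the SST-2 validation split
    data = load_dataset("glue", "sst2", split=f"validation[:{n_examples}]")
    correct = 0
    for example in data:
        inputs = tokenizer(example["sentence"], return_tensors="pt", truncation=True)
        with torch.no_grad():
            pred = model(**inputs).logits.argmax(dim=-1).item()
        correct += int(pred == example["label"])
    return 100 * correct / len(data)

print("FP32 accuracy (%):", accuracy_on_sst2(model, tokenizer))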

Step 8: Conclusions and discussion

  • Which technique offers the best trade-off for this case?
  • Is the accuracy loss acceptable?
  • In which scenarios would you use each version?
