⚖️ Part 4: Trade-offs — Balancing size, accuracy, and speed
No compression technique is magical. There’s always a trade-off: you gain in one aspect, but lose (or must compensate) in another.
The compression triangle
       Accuracy
          /\
         /  \
        /    \
       /      \
  Size ———— Speed
- Reduce size (via pruning or quantization) → may reduce accuracy, but improves speed and memory.
- Improve speed (via structured pruning or lower-precision quantization) → may come at the cost of some accuracy.
- Maintain accuracy → may require more size or longer inference time.
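The size/accuracy leg of the triangle can be made concrete with a toy example: mapping 32-bit floats to 8-bit integers cuts storage 4× but introduces rounding error. Below is a minimal pure-Python sketch, not a real framework API; the `quantize`/`dequantize` helpers are illustrative names (frameworks like PyTorch or TensorFlow Lite do this per-tensor or per-channel).

```python
# Toy symmetric INT8 quantization: illustrates the size-vs-accuracy trade-off.

def quantize(weights, num_bits=8):
    """Map float weights to signed integers using a single scale factor."""
    qmax = 2 ** (num_bits - 1) - 1          # 127 for INT8
    scale = max(abs(w) for w in weights) / qmax
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the integers."""
    return [v * scale for v in q]

weights = [0.91, -0.42, 0.003, -0.77, 0.25]
q, scale = quantize(weights)
restored = dequantize(q, scale)

# Storage drops from 4 bytes (float32) to 1 byte (int8) per weight: 4x smaller.
# The price is a small reconstruction error, bounded by half the scale step:
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(f"max quantization error: {max_err:.5f}")
```

Shrinking `num_bits` makes the model smaller and the error larger; that slider is the triangle in one line of code.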
How to choose the right technique?
It depends on the use case:
- For mobile or IoT: prioritize size and speed → quantization + structured pruning.
- For high-concurrency servers: prioritize speed and efficiency → INT8 quantization + distillation.
- For critical tasks where accuracy is vital: use QAT + distillation, avoid aggressive pruning.
- For rapid prototyping: use PTQ + dynamic quantization.
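The decision rules above can be encoded as a simple lookup table. This is only a sketch: the scenario keys are illustrative labels, not a standard taxonomy, and real projects usually combine techniques.

```python
# Sketch: mapping deployment scenarios to the techniques suggested above.
RECOMMENDATIONS = {
    "mobile_iot":        ["quantization", "structured pruning"],
    "high_concurrency":  ["INT8 quantization", "distillation"],
    "accuracy_critical": ["QAT", "distillation"],   # avoid aggressive pruning
    "prototyping":       ["PTQ", "dynamic quantization"],
}

def recommend(use_case: str) -> list[str]:
    """Return the suggested compression techniques for a named scenario."""
    # Default to PTQ: it is the cheapest technique to try first.
    return RECOMMENDATIONS.get(use_case, ["PTQ"])

print(recommend("mobile_iot"))
```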
Metrics to evaluate success
It’s not enough to say “the model is smaller.” You must measure:
- Model size (MB or GB).
- Inference latency (ms per prediction).
- Throughput (predictions per second).
- RAM/VRAM memory usage.
- Accuracy/Recall/F1 on the validation set.
- Energy consumption (if measurable).
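Latency and throughput can be collected together with a small benchmarking harness. The sketch below assumes only a `predict` callable, which stands in for a real model's forward pass; it uses the standard library rather than any framework-specific profiler.

```python
import statistics
import time

def predict(x):
    # Stand-in for a real model's forward pass (assumption for this sketch).
    return sum(x)

def benchmark(fn, sample, warmup=10, runs=100):
    """Measure per-call latency (ms) and throughput (calls/s)."""
    for _ in range(warmup):              # warm-up: exclude one-time setup costs
        fn(sample)
    latencies = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn(sample)
        latencies.append((time.perf_counter() - t0) * 1000.0)
    total_s = sum(latencies) / 1000.0
    return {
        "p50_ms": statistics.median(latencies),
        "p95_ms": sorted(latencies)[int(0.95 * runs)],
        "throughput": runs / total_s,
    }

stats = benchmark(predict, list(range(1000)))
print(stats)
```

Reporting p50 and p95 rather than a single average matters: tail latency is usually what breaks a latency budget in production.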
Tools for measurement
- `torchinfo` or `tensorflow_model_analysis` for size and parameter counts.
- `time` or `torch.utils.benchmark` for latency.
- `nvidia-smi` or `htop` for memory usage.
- Hugging Face `evaluate` for accuracy metrics.
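For model size, a framework-agnostic check is simply the serialized file's size on disk. The sketch below uses `pickle` as a stand-in for a real saver (in practice you would use `torch.save` or the framework's own export); the `fake_model` dict is an illustrative placeholder, not a real model.

```python
import os
import pickle
import tempfile

def model_size_mb(model_obj) -> float:
    """Serialize an object and report its on-disk size in MB."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        pickle.dump(model_obj, f)
        path = f.name
    size = os.path.getsize(path) / (1024 * 1024)
    os.remove(path)
    return size

# A dict of "weights" stands in for a real model object in this sketch.
fake_model = {"layer1": [0.0] * 10_000, "layer2": [0.0] * 5_000}
print(f"{model_size_mb(fake_model):.2f} MB")
```

Measuring before and after compression with the same saver keeps the comparison honest: serialization overhead cancels out.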