⚖️ Part 4: Trade-offs — Balancing size, accuracy, and speed
No compression technique is magical. There’s always a trade-off: you gain in one aspect, but lose (or must compensate) in another.
The compression triangle
       Accuracy
          /\
         /  \
        /    \
       /      \
  Size ———— Speed
- Reduce size (via pruning or quantization) → may reduce accuracy, but improves speed and memory.
- Improve speed (via structured pruning or lower-precision quantization) → may come at the cost of some accuracy.
- Maintain accuracy → may require more size or longer inference time.
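The size/accuracy leg of the triangle can be made concrete with a toy example: mapping 32-bit floats to 8-bit integers cuts storage 4× but introduces rounding error. Below is a minimal pure-Python sketch, not a real framework API; the `quantize`/`dequantize` helpers are illustrative names (frameworks like PyTorch or TensorFlow Lite do this per-tensor or per-channel).

```python
# Toy symmetric INT8 quantization: illustrates the size-vs-accuracy trade-off.

def quantize(weights, num_bits=8):
    """Map float weights to signed integers using a single scale factor."""
    qmax = 2 ** (num_bits - 1) - 1          # 127 for INT8
    scale = max(abs(w) for w in weights) / qmax
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the integers."""
    return [v * scale for v in q]

weights = [0.91, -0.42, 0.003, -0.77, 0.25]
q, scale = quantize(weights)
restored = dequantize(q, scale)

# Storage drops from 4 bytes (float32) to 1 byte (int8) per weight: 4x smaller.
# The price is a small reconstruction error, bounded by half the scale step:
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(f"max quantization error: {max_err:.5f}")
```

Shrinking `num_bits` makes the model smaller and the error larger; that slider is the triangle in one line of code.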
How to choose the right technique?
It depends on the use case:
- For mobile or IoT: prioritize size and speed → quantization + structured pruning.
- For high-concurrency servers: prioritize speed and efficiency → INT8 quantization + distillation.
- For critical tasks where accuracy is vital: use QAT + distillation, avoid aggressive pruning.
- For rapid prototyping: use PTQ + dynamic quantization.
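The decision rules above can be encoded as a simple lookup table. This is only a sketch: the scenario keys are illustrative labels, not a standard taxonomy, and real projects usually combine techniques.

```python
# Sketch: mapping deployment scenarios to the techniques suggested above.
RECOMMENDATIONS = {
    "mobile_iot":        ["quantization", "structured pruning"],
    "high_concurrency":  ["INT8 quantization", "distillation"],
    "accuracy_critical": ["QAT", "distillation"],   # avoid aggressive pruning
    "prototyping":       ["PTQ", "dynamic quantization"],
}

def recommend(use_case: str) -> list[str]:
    """Return the suggested compression techniques for a named scenario."""
    # Default to PTQ: it is the cheapest technique to try first.
    return RECOMMENDATIONS.get(use_case, ["PTQ"])

print(recommend("mobile_iot"))
```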
Metrics to evaluate success
It’s not enough to say “the model is smaller.” You must measure:
- Model size (MB or GB).
- Inference latency (ms per prediction).
- Throughput (predictions per second).
- RAM/VRAM memory usage.
- Accuracy/Recall/F1 on the validation set.
- Energy consumption (if measurable).
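Latency and throughput can be collected together with a small benchmarking harness. The sketch below assumes only a `predict` callable, which stands in for a real model's forward pass; it uses the standard library rather than any framework-specific profiler.

```python
import statistics
import time

def predict(x):
    # Stand-in for a real model's forward pass (assumption for this sketch).
    return sum(x)

def benchmark(fn, sample, warmup=10, runs=100):
    """Measure per-call latency (ms) and throughput (calls/s)."""
    for _ in range(warmup):              # warm-up: exclude one-time setup costs
        fn(sample)
    latencies = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn(sample)
        latencies.append((time.perf_counter() - t0) * 1000.0)
    total_s = sum(latencies) / 1000.0
    return {
        "p50_ms": statistics.median(latencies),
        "p95_ms": sorted(latencies)[int(0.95 * runs)],
        "throughput": runs / total_s,
    }

stats = benchmark(predict, list(range(1000)))
print(stats)
```

Reporting p50 and p95 rather than a single average matters: tail latency is usually what breaks a latency budget in production.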
Tools for measurement
- `torchinfo` or `tensorflow_model_analysis` for size and parameter counts.
- `time` or `torch.utils.benchmark` for latency.
- `nvidia-smi` or `htop` for memory usage.
- Hugging Face `evaluate` for accuracy metrics.
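For model size, a framework-agnostic check is simply the serialized file's size on disk. The sketch below uses `pickle` as a stand-in for a real saver (in practice you would use `torch.save` or the framework's own export); the `fake_model` dict is an illustrative placeholder, not a real model.

```python
import os
import pickle
import tempfile

def model_size_mb(model_obj) -> float:
    """Serialize an object and report its on-disk size in MB."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        pickle.dump(model_obj, f)
        path = f.name
    size = os.path.getsize(path) / (1024 * 1024)
    os.remove(path)
    return size

# A dict of "weights" stands in for a real model object in this sketch.
fake_model = {"layer1": [0.0] * 10_000, "layer2": [0.0] * 5_000}
print(f"{model_size_mb(fake_model):.2f} MB")
```

Measuring before and after compression with the same saver keeps the comparison honest: serialization overhead cancels out.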