Knowledge Nugget

How can I optimize an ML model's architecture after training is done without negatively impacting the quality of the model's output?
person Author: Process Fellows
An ML model is typically considered a black box, so this question may sound impossible. Let's look at the kinds of techniques that can be considered and used:
  • Pruning (weight thinning)
    • Idea: Removal of unimportant neurons or connections in the network (some weights have very little influence on the final result).
    • Various techniques (e.g. magnitude-based pruning) can be used to eliminate weights or entire neurons.
    • Test: Retraining or evaluation is used to check whether the prediction performance is maintained.
  • Quantization (reduction of the precision of the parameters)
    • Idea: Reducing the precision of the model parameters (e.g. instead of 32-bit float values for weights, reduce to 16-bit or even 8-bit).
    • Modern frameworks and hardware (e.g. TensorFlow Lite or dedicated AI accelerators) support such quantized models efficiently.
    • Test: The accuracy of the quantized model is measured in comparison to the original model.
  • Knowledge distillation (knowledge transfer from large to small)
    • Idea: Knowledge from a large, already trained model (“teacher”) is transferred to a smaller model (“student”).
    • The smaller model is trained to imitate the outputs (e.g. the soft probability distributions) of the larger model.
    • As a result, the smaller model can often make similar predictions, even if it has fewer parameters.
    • Test: Evaluation of the predictions of the “student” model compared to the “teacher” model.
  Tip: If TensorFlow or PyTorch is used, there are libraries such as the TensorFlow Model Optimization Toolkit or Torch-Pruning that automate many of these techniques.
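The pruning idea above can be sketched in a few lines with PyTorch's built-in pruning utilities. The tiny network below is a hypothetical stand-in for a real trained model; magnitude-based (L1) pruning zeroes out the half of each layer's weights with the smallest absolute value:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Hypothetical small network as a stand-in for a real trained model
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

# Magnitude-based pruning: zero out the 50% of weights with the
# smallest absolute value in each Linear layer
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")  # make the pruning permanent

# Check the resulting sparsity: about half of the weights are now exactly zero
linears = [m for m in model.modules() if isinstance(m, nn.Linear)]
total = sum(m.weight.numel() for m in linears)
zeros = sum((m.weight == 0).sum().item() for m in linears)
print(f"sparsity: {zeros / total:.2f}")
```

In practice this is followed by the "test" step from the list: re-evaluating (and often fine-tuning) the pruned model to confirm that prediction performance is maintained.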
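The quantization idea can be illustrated without any framework support. The following is a minimal sketch of symmetric per-tensor int8 quantization; the helper names and the random weight matrix are illustrative assumptions, and real toolchains (TensorFlow Lite, PyTorch quantization) do this per layer with calibration:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization of float32 weights to int8."""
    scale = np.abs(w).max() / 127.0  # map the range [-max, max] onto [-127, 127]
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Reconstruct approximate float32 weights from the int8 representation."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)  # stand-in for a weight matrix

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# Storage drops from 4 bytes to 1 byte per weight; the rounding error
# per weight is bounded by half the quantization step (scale / 2)
print("max abs error:", float(np.abs(w - w_hat).max()))
```

The accuracy test from the list then compares the model's predictions with `w_hat` against those with the original `w`.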
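The distillation step can be sketched as follows in PyTorch. The teacher and student architectures, the temperature, and the random training inputs are all illustrative assumptions; the core idea is training the student to match the teacher's softened output distribution via a KL-divergence loss:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical models: a larger "teacher" and a smaller "student"
teacher = nn.Sequential(nn.Linear(16, 128), nn.ReLU(), nn.Linear(128, 4))
student = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

T = 2.0  # temperature: softens the teacher's output distribution
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
x = torch.randn(256, 16)  # stand-in for (possibly unlabeled) training inputs

losses = []
for _ in range(100):
    with torch.no_grad():
        soft_targets = F.softmax(teacher(x) / T, dim=1)
    log_probs = F.log_softmax(student(x) / T, dim=1)
    # KL divergence between student and teacher output distributions,
    # scaled by T^2 as is conventional for distillation losses
    loss = F.kl_div(log_probs, soft_targets, reduction="batchmean") * (T * T)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"distillation loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

In a real setup the distillation loss is usually combined with the ordinary task loss on labeled data, and the final "test" step compares the student's predictions against the teacher's on a held-out set.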
Mapped to these items:
  • Automotive SPICE 4.0
    • MLE.3.BP3 Create and optimize ML model.