Which test data sets should be used for ML testing?

Question

Accepted Answer

The ML test data set is used for the final testing of both the trained ML model and the deployed ML model. The ML test data set must not be used for training! This also means that no significant changes or optimizations to the model may be based on results from the ML test data set: with every such optimization, information about the data set finds its way into the model, which leads to overfitting to the data set used.
If the test fails and the ML model has to be optimized, it must be ensured that the ML test data set remains a reliable basis for demonstrating compliance with the ML requirements. A change to the ML test data set may therefore be necessary.
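
One way to support the separation of training and test data is an automated check that no test sample also appears in the training data. The following is a minimal sketch, assuming samples are available as NumPy arrays; the helper names `sample_fingerprint` and `find_leaked_samples` are hypothetical, and the check only detects exact duplicates, not near-duplicates.

```python
import hashlib

import numpy as np


def sample_fingerprint(sample: np.ndarray) -> str:
    """Return a content hash of one sample (hypothetical helper)."""
    arr = np.ascontiguousarray(sample)
    digest = hashlib.sha256()
    # Include dtype and shape so that equal bytes with a different layout differ.
    digest.update(str(arr.dtype).encode())
    digest.update(str(arr.shape).encode())
    digest.update(arr.tobytes())
    return digest.hexdigest()


def find_leaked_samples(train_set, test_set) -> list[int]:
    """Return indices of test samples that also occur in the training set."""
    train_hashes = {sample_fingerprint(s) for s in train_set}
    return [i for i, s in enumerate(test_set)
            if sample_fingerprint(s) in train_hashes]


# Toy data: the second test sample duplicates a training sample.
train = np.arange(12, dtype=np.float32).reshape(4, 3)
test = np.array([[100, 101, 102], [3, 4, 5]], dtype=np.float32)
print(find_leaked_samples(train, test))  # [1]
```

An empty result does not prove independence of the two data sets, but a non-empty one is a clear sign that the test data set is no longer reliable.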

Regression tests:

Aim: to ensure that the deployed model (i.e. transferred to the target hardware) delivers the same results as the original model (on the training platform).
Causes for deviations: numerical differences caused by hardware-specific implementations (e.g. floating-point arithmetic, quantization).
How is it done?
- The same test data set is used for testing the trained model and the deployed model.
- The outputs are compared: any deviations are analyzed to determine whether they are within an acceptable tolerance. Example: if the prediction error is below 1%, the deployed model is considered “stable” (the acceptance criteria are to be defined as part of MLE.1!)
- Remark: In safety-critical applications (e.g. autonomous driving, medical technology), even small differences may be unacceptable.
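
The output comparison described above can be sketched as follows. This is an illustration under assumptions, not a prescribed method: it interprets the 1% example as a relative deviation per output value, the function name `compare_outputs` is hypothetical, and the arrays stand in for outputs recorded from the trained model and from the deployed model on the same test data set.

```python
import numpy as np


def compare_outputs(reference, deployed, rel_tol=0.01):
    """Compare deployed-model outputs against reference outputs.

    Returns (passed, worst_relative_deviation). rel_tol=0.01 mirrors the
    1% example; the real criterion has to come from the ML requirements.
    """
    reference = np.asarray(reference, dtype=np.float64)
    deployed = np.asarray(deployed, dtype=np.float64)
    if reference.shape != deployed.shape:
        raise ValueError("output shapes differ")
    # Avoid division by zero for reference values at or near 0.
    scale = np.maximum(np.abs(reference), 1e-12)
    worst = float((np.abs(deployed - reference) / scale).max())
    return worst < rel_tol, worst


# Stand-in outputs; real values would be recorded on both platforms.
reference_out = np.array([[0.10, 0.90], [0.75, 0.25]])
deployed_ok = np.array([[0.1004, 0.8996], [0.7497, 0.2503]])
deployed_bad = np.array([[0.12, 0.88], [0.75, 0.25]])

passed_ok, worst_ok = compare_outputs(reference_out, deployed_ok)
passed_bad, worst_bad = compare_outputs(reference_out, deployed_bad)
print(passed_ok, passed_bad)
```

Reporting the worst deviation together with the pass/fail flag makes it easier to analyze a failure, and for safety-critical applications the tolerance can be tightened accordingly.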

Additional test data:

Because the neural network now runs in a real environment, hardware-specific aspects need to be tested, e.g.:
- Performance tests: speed, memory consumption, latency times
- Robustness tests: behavior in case of heat, voltage fluctuations, memory errors
- Edge cases / malfunctions: e.g. how does the model behave with incorrect or noisy inputs on the target hardware?
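
The latency part of the performance tests listed above could be measured along the following lines. This is a rough sketch only: `measure_latency` and `dummy_predict` are hypothetical names, the dummy function stands in for the inference call of the deployed model, and on many targets the measurement would be done with the platform's own timing facilities rather than in Python.

```python
import math
import statistics
import time


def measure_latency(predict, inputs, warmup=5):
    """Time `predict` once per input; return latency statistics in ms."""
    if len(inputs) == 0:
        raise ValueError("at least one input is required")
    for x in inputs[:warmup]:
        predict(x)  # warm-up calls are not timed
    timings_ms = []
    for x in inputs:
        start = time.perf_counter()
        predict(x)
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    timings_ms.sort()
    # Nearest-rank 95th percentile.
    p95_index = max(0, math.ceil(0.95 * len(timings_ms)) - 1)
    return {
        "mean_ms": statistics.fmean(timings_ms),
        "p95_ms": timings_ms[p95_index],
        "max_ms": timings_ms[-1],
    }


def dummy_predict(x):
    """Stand-in for the deployed model's inference call."""
    return sum(v * v for v in x)


stats = measure_latency(dummy_predict, [[float(i)] * 1000 for i in range(50)])
print(stats)
```

Reporting a high percentile and the maximum in addition to the mean matters because latency requirements are usually violated by outliers, not by the average case.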
How is it done? Test data sets can be extended by:
- Inputs with higher noise or extreme values
- Live data from the target hardware (e.g. sensor data instead of static test images)
- Performance tests under load
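
Extending a test data set with noisier inputs and extreme values can be sketched as below. The sketch assumes inputs normalized to the range [0, 1] and uses randomly generated arrays as a stand-in for real test inputs; the function names and the noise levels are illustrative assumptions, not values from any requirement.

```python
import numpy as np


def add_gaussian_noise(inputs, sigma, rng):
    """Return a noisy copy of the inputs, clipped to the valid range [0, 1]."""
    noisy = inputs + rng.normal(0.0, sigma, size=inputs.shape)
    return np.clip(noisy, 0.0, 1.0)


def extreme_value_cases(sample_shape):
    """Return two extreme samples: all-minimum and all-maximum input values."""
    return np.stack([np.zeros(sample_shape), np.ones(sample_shape)])


rng = np.random.default_rng(seed=0)
base = rng.uniform(0.0, 1.0, size=(8, 4, 4))  # stand-in for real test inputs

extended = np.concatenate([
    base,                                           # original test inputs
    add_gaussian_noise(base, sigma=0.05, rng=rng),  # mild noise
    add_gaussian_noise(base, sigma=0.20, rng=rng),  # strong noise
    extreme_value_cases(base.shape[1:]),            # extreme values
])
print(extended.shape)  # (26, 4, 4)
```

A fixed seed keeps the extended data set reproducible, so the same inputs can be replayed against the trained model and the deployed model.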

Knowledge Nugget