Author: Process Fellows
The following tools are typical examples of the tooling that supports an ML project.
CVAT — annotation:
CVAT is used to label data, especially images and videos. For example, to train an object detection model, annotators can use CVAT to draw bounding boxes around cars, pedestrians, traffic signs, defects, or other objects.
In the ML pipeline, CVAT answers the question:
“What is shown in the data, and what should the model learn?”
Example:
A camera image contains three cars. In CVAT, an annotator marks each car with a bounding box and the label `car`. The model later learns from these annotations.
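The example above can be pictured as a small data structure. This is an illustrative Python sketch, not CVAT's actual export format: the class name `BoundingBox`, its field names, the file name `frame_0001.jpg`, and the pixel coordinates are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class BoundingBox:
    """One annotated object: a label plus a box in pixel coordinates (hypothetical schema)."""
    label: str
    x_min: int
    y_min: int
    x_max: int
    y_max: int

    def area(self) -> int:
        # Width times height of the box, in pixels.
        return (self.x_max - self.x_min) * (self.y_max - self.y_min)


# Hypothetical annotations for one camera image containing three cars.
annotations = {
    "frame_0001.jpg": [
        BoundingBox("car", 40, 120, 200, 220),
        BoundingBox("car", 260, 110, 400, 210),
        BoundingBox("car", 450, 130, 600, 240),
    ]
}

labels = [box.label for box in annotations["frame_0001.jpg"]]
print(labels)
```

A training pipeline would read pairs like these (image plus labeled boxes) as its ground truth.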
DVC — dataset versioning:
DVC stands for "Data Version Control". It works alongside Git and does for large datasets and machine learning artifacts what Git does for source code.
In machine learning, datasets often change: new images are added, bad labels are corrected, or test data is separated from training data. DVC helps track exactly which dataset version was used for a specific experiment.
DVC answers the question:
“Which version of the data was used to train this model?”
Example:
- Model A was trained on dataset version 1.0.
- Model B was trained on dataset version 1.1 with 500 additional annotated images.
DVC helps reproduce both experiments later.
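The underlying idea can be sketched without DVC itself. DVC identifies tracked data by content hash; the snippet below imitates that idea with the standard library only and is not DVC's API. The function `dataset_fingerprint`, the file names, and the in-memory byte strings standing in for images are assumptions for illustration.

```python
import hashlib


def dataset_fingerprint(files: dict[str, bytes]) -> str:
    """Return a hash that changes whenever any file name or file content changes."""
    digest = hashlib.sha256()
    for name in sorted(files):  # sort so the result does not depend on insertion order
        digest.update(name.encode("utf-8"))
        digest.update(hashlib.sha256(files[name]).digest())
    return digest.hexdigest()


# Hypothetical in-memory stand-ins for image files.
dataset_v1_0 = {"img_001.jpg": b"first image", "img_002.jpg": b"second image"}
dataset_v1_1 = {**dataset_v1_0, "img_003.jpg": b"newly annotated image"}

fp_v1_0 = dataset_fingerprint(dataset_v1_0)
fp_v1_1 = dataset_fingerprint(dataset_v1_1)

# Record which dataset fingerprint each model was trained on.
trained_on = {"model_a": fp_v1_0, "model_b": fp_v1_1}
print(trained_on["model_a"] != trained_on["model_b"])
```

Because the fingerprint depends only on content, rerunning it on the same files later confirms that an experiment is being reproduced on the same data.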
MLflow — metadata and experiment tracking:
MLflow is used to track machine learning experiments. It records information such as parameters, metrics, models, and results.
For example, when training a neural network, MLflow can store:
training accuracy, validation loss, learning rate, batch size, model version, training time, the dataset version used, and the output model file.
MLflow answers the question:
“Which experiment produced which result?”
Example:
- Experiment 1 used learning rate `0.001` and reached 87% accuracy.
- Experiment 2 used learning rate `0.0001` and reached 91% accuracy.
MLflow makes these experiments comparable.
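What "comparable" means here can be shown with a simplified stand-in. In MLflow itself, parameters and metrics are recorded with calls such as `mlflow.log_param` and `mlflow.log_metric`; the sketch below avoids the library so that it runs without any setup. The `Run` class and the run names are hypothetical, and the values are taken from the example above.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """A simplified record of one experiment (hypothetical, not an MLflow class)."""
    name: str
    params: dict = field(default_factory=dict)
    metrics: dict = field(default_factory=dict)


runs = [
    Run("experiment_1", params={"learning_rate": 0.001}, metrics={"accuracy": 0.87}),
    Run("experiment_2", params={"learning_rate": 0.0001}, metrics={"accuracy": 0.91}),
]

# Compare the runs on one metric and pick the best one.
best = max(runs, key=lambda run: run.metrics["accuracy"])
print(best.name, best.params["learning_rate"])
```

Keeping parameters and metrics together per run is what makes it possible to ask which settings produced which result.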
S3 / MinIO — storage:
Amazon S3 is a cloud object storage service, and MinIO is an S3-compatible object store that can be self-hosted. In machine learning, they are often used to store large files such as datasets, images, trained models, annotations, and logs.
S3 and MinIO answer the question:
“Where do we store large ML files reliably?”
Example:
- Raw images are stored in S3 or MinIO.
- DVC tracks which files belong to which dataset version.
- MLflow can store model artifacts and experiment outputs there as well.
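One way to picture this division of labor is a single bucket with one prefix per purpose. The sketch below only builds `s3://` style URIs and does not talk to any storage server. The helper `object_uri`, the bucket name `ml-project`, and the prefixes are assumptions for illustration, not a layout required by any of these tools.

```python
def object_uri(bucket: str, *parts: str) -> str:
    """Join a bucket name and path parts into an s3:// style URI."""
    key = "/".join(part.strip("/") for part in parts)
    return f"s3://{bucket}/{key}"


BUCKET = "ml-project"  # hypothetical bucket name

raw_images_uri = object_uri(BUCKET, "raw", "images")              # raw camera images
dvc_remote_uri = object_uri(BUCKET, "dvc-store")                  # data tracked by DVC
mlflow_artifacts_uri = object_uri(BUCKET, "mlflow", "artifacts")  # models and run outputs

for uri in (raw_images_uri, dvc_remote_uri, mlflow_artifacts_uri):
    print(uri)
```

Separate prefixes keep raw data, versioned datasets, and experiment outputs from overwriting one another while still living in one place.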