Linked Knowledge Nugget: "AI in Automotive Systems: Aligning with ISO/PAS 8800"
Author: Sebastian Keller
How can AI be safely integrated into the vehicles of tomorrow?
ISO/PAS 8800:2024 lays the foundation for managing artificial intelligence in safety-related automotive systems. Join our webinar and learn what the new specification means for managers, project leaders, quality specialists, and engineering teams.
We’ll demystify the key concepts behind AI safety, explain how ISO/PAS 8800 relates to ISO 26262, and show how organizations can prepare for the next generation of system validation and assurance.
You will gain an overview of how to transfer AI development activities into structured frameworks such as Automotive SPICE, define roles such as AI safety manager or data governance lead, and avoid common pitfalls such as uncontrolled data drift.
Reserve your spot today and lead your company's AI safety transformation with confidence.
Webinar recording and slides
# PROCESS PURPOSE
The purpose is to define and align ML data with ML data requirements, maintain the integrity and quality of the ML data, and make them available to affected parties.
# PROCESS OUTCOMES
O1: An ML data management system including an ML data lifecycle is established.
O2: An ML data quality approach is developed, including ML data quality criteria.
O3: Collected ML data are processed for consistency with the ML data requirements.
O4: ML data are verified against the defined ML data quality criteria and updated as needed.
O5: ML data are agreed and communicated to all affected parties.
# BASE PRACTICES
BP1: Establish an ML data management system. (O1)
Establish an ML data management system which supports
- ML data management activities,
- relevant sources of ML data,
- the ML data life cycle including a status model, and
- interfaces to affected parties.
Note 1: Supported ML data management activities may include data collection, labeling/annotation, and structuring.
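The "status model" mentioned in BP1 can be pictured as a small state machine over the ML data lifecycle. A minimal sketch follows; the concrete state names and allowed transitions are illustrative assumptions, not prescribed by Automotive SPICE.

```python
from enum import Enum

class MLDataStatus(Enum):
    """Hypothetical status model for an ML data lifecycle
    (state names are assumptions for illustration)."""
    COLLECTED = "collected"
    LABELED = "labeled"
    VERIFIED = "verified"
    RELEASED = "released"
    REJECTED = "rejected"

# Allowed transitions between lifecycle states
TRANSITIONS = {
    MLDataStatus.COLLECTED: {MLDataStatus.LABELED, MLDataStatus.REJECTED},
    MLDataStatus.LABELED: {MLDataStatus.VERIFIED, MLDataStatus.REJECTED},
    MLDataStatus.VERIFIED: {MLDataStatus.RELEASED, MLDataStatus.REJECTED},
    MLDataStatus.RELEASED: set(),
    MLDataStatus.REJECTED: set(),
}

def can_transition(src: MLDataStatus, dst: MLDataStatus) -> bool:
    """Check whether a status change is permitted by the model."""
    return dst in TRANSITIONS[src]
```

Encoding the transitions explicitly makes it easy for the data management system to reject invalid status changes, e.g., releasing data that was never verified.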
Linked Knowledge Nugget: "Important aspects for collecting data"
Author: Process Fellows
Select data sources
Where does the data come from?
Own databases
External APIs
Web scraping (if permitted)
Public data sets, e.g.
Kaggle: an online platform for data science and machine learning, operated by Google;
UCI ML Repository: one of the oldest collections of publicly accessible machine learning data sets, maintained by the University of California, Irvine
Your own measurements
Tip: Check whether you have access to the data and whether its license permits the intended use.
Ensure data quality: Good data is the key to successful machine learning! Garbage in, garbage out: bad data yields a bad model!
Completeness: Are important values missing? (e.g. missing prices in sales data)
Consistency: Do units and formats match? (e.g. date in the same format)
Correctness: Are there obvious errors? (e.g. negative prices)
Representativeness: Does the data cover all possible cases? (Avoidance of bias!)
Balance: Are there unbalanced data classes? (e.g. highly uneven distribution in a classification)
Tip: Clean and normalize data before modeling!
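The quality criteria above (completeness, consistency, correctness) can be turned into simple automated checks. A minimal sketch on a toy sales dataset; the field names `price` and `date` and the expected ISO date format are illustrative assumptions.

```python
import re

# Toy sales records illustrating typical quality defects
records = [
    {"price": 19.99, "date": "2024-01-05"},
    {"price": None,  "date": "2024-01-06"},   # completeness issue
    {"price": -3.50, "date": "06.01.2024"},   # correctness + consistency issue
]

DATE_FMT = re.compile(r"^\d{4}-\d{2}-\d{2}$")  # expected ISO date format

def check_record(rec):
    """Return a list of quality findings for one record."""
    findings = []
    if rec["price"] is None:
        findings.append("completeness: missing price")
    elif rec["price"] < 0:
        findings.append("correctness: negative price")
    if not DATE_FMT.match(rec["date"]):
        findings.append("consistency: date not in ISO format")
    return findings

# Map record index to its findings, keeping only records with issues
issues = {i: check_record(r) for i, r in enumerate(records) if check_record(r)}
```

Running such checks before modeling surfaces exactly the defects listed above instead of letting them silently degrade the model.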
“Normalize” means:
Pre-processing step: numerical values are transformed into a specific range (e.g. between 0 and 1 or -1 and 1).
Aim is to eliminate scaling differences between features and thus improve the performance of ML algorithms.
Example: min-max scaling (but there are other approaches too)
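The min-max scaling mentioned above can be sketched in a few lines: each value is mapped linearly into a target range, eliminating the scaling differences between features.

```python
def min_max_scale(values, lo=0.0, hi=1.0):
    """Min-max scaling: map values linearly into [lo, hi]."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:                 # constant feature: no spread to scale
        return [lo for _ in values]
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]

prices = [10.0, 20.0, 40.0]
scaled = min_max_scale(prices)       # smallest -> 0.0, largest -> 1.0
```

In practice a library implementation (e.g., scikit-learn's `MinMaxScaler`) would be used, but the arithmetic is exactly this.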
Determine amount of data:
How much data do you need?
Supervised Learning: The more complex the model, the more data is required.
Unsupervised Learning: Clustering can often work with less data.
Deep Learning: Often requires a lot of data (e.g. millions of images for image recognition).
Tip: If data is scarce, use “data augmentation” (e.g. rotate images, vary texts).
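The rotation-based image augmentation mentioned in the tip can be illustrated without any imaging library by treating an image as a list of pixel rows. This is a toy sketch of the idea, not a production augmentation pipeline.

```python
def rotate_90(image):
    """Rotate a 2D 'image' (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def augment(image):
    """Yield the original plus three successively rotated copies,
    turning one labeled sample into four."""
    variants = [image]
    for _ in range(3):
        variants.append(rotate_90(variants[-1]))
    return variants

img = [[1, 2],
       [3, 4]]
augmented = augment(img)   # 4 training samples from 1
```

Real augmentation libraries additionally flip, crop, or add noise, but the principle is the same: derive new, label-preserving variants from scarce data.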
Data format & structure:
What structure does the data have?
Tabular data: (CSV, Excel, SQL databases) → good for classic ML models
Text data: (chat histories, articles) → NLP models (Natural Language Processing)
Image and video data: (JPG, PNG, MP4) → Convolutional Neural Networks (CNNs)
Time series: (sensor values, share prices) → Recurrent Neural Networks
Tip: Standardized formats facilitate processing!
Organize data storage & access
Databases: SQL (structured) vs. NoSQL (flexible, e.g. MongoDB for JSON data)
Cloud storage: AWS S3, Google Drive, Azure
Local: CSVs, pickle files for fast processing
Tip: Version data to be able to track changes!
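One lightweight way to version data, as the tip suggests, is to fingerprint each dataset snapshot with a content hash and store that hash alongside every model run. A minimal sketch; dedicated tools (e.g., DVC) provide this at scale.

```python
import hashlib
import json

def dataset_fingerprint(records):
    """Derive a deterministic content hash for a dataset snapshot.
    Any change to the data produces a new fingerprint, making
    changes traceable across training runs."""
    canonical = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()

v1 = dataset_fingerprint([{"x": 1}, {"x": 2}])
v2 = dataset_fingerprint([{"x": 1}, {"x": 3}])   # changed data -> new version id
```

Serializing with `sort_keys=True` keeps the hash stable regardless of dictionary insertion order.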
Consider data protection & ethics:
Anonymization & pseudonymization: (e.g. for personal data)
Compliance with GDPR/DSGVO: If user data is processed
Avoid bias: Is the data fairly distributed or does it favor certain groups?
Tip: If sensitive data is used, check whether it can be aggregated or anonymized.
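Pseudonymization, as mentioned above, can be sketched with a keyed hash: the direct identifier is replaced, but whoever holds the key could still link records, which is why pseudonymized data (unlike fully anonymized data) remains personal data under GDPR. The field names and key handling here are illustrative assumptions.

```python
import hashlib
import hmac

# Assumption: in a real system this key is managed outside the dataset
SECRET_KEY = b"replace-with-a-managed-secret"

def pseudonymize(user_id: str) -> str:
    """Replace a direct identifier with a keyed hash (HMAC-SHA256).
    Deterministic, so the same user always maps to the same pseudonym."""
    return hmac.new(SECRET_KEY, user_id.encode("utf-8"), hashlib.sha256).hexdigest()

record = {"user": "alice@example.com", "clicks": 7}
safe_record = {"user": pseudonymize(record["user"]), "clicks": record["clicks"]}
```

Using an HMAC rather than a plain hash prevents an attacker without the key from reversing pseudonyms via dictionary attacks on known e-mail addresses.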
Conclusion:
Data is the foundation of machine learning
Bad data → bad models!
A good data acquisition workflow saves a lot of time and significantly improves prediction quality.
BP2: Develop an ML data quality approach. (O2)
Develop an approach to ensure that the quality of ML data is analyzed based on defined ML data quality criteria, and that activities are performed to support the avoidance of biases in the data.
Note 2: Examples of ML data quality criteria are relevant data sources, reliability and consistency of labelling, and completeness against ML data requirements.
Note 3: The ML data management system should support the quality criteria and activities of the ML data quality approach.
Note 4: Biases to avoid may include sampling bias (e.g., gender, age) and feedback loop bias.
Note 5: For the creation of ML data sets, see MLE.3.BP2 and MLE.4.BP2.
Author: Process Fellows
Examples of self-reinforcing predictions (and their potential solution):
Personalized advertising: An algorithm shows a user more products from a certain category because they have clicked on it once. As a result, other potentially relevant products are no longer suggested.
Crime prediction: A model for combating crime prioritizes areas with a high police presence. More police officers in this area report more crimes, which leads the model to believe that even more checks are needed there.
Social media & filter bubbles: Recommendation algorithms preferentially show content that users already like, resulting in them only seeing biased perspectives.
Solutions to mitigate feedback loop bias
Random exploration: Incorporate random recommendations or decisions to consider new data sources.
External data sources: Incorporate not only past model decisions, but also independent data sources into training.
Fairness checks: Perform bias measurements to identify whether certain groups or information are systematically disadvantaged.
Model refresh: Regular retraining with new, more diverse data to reduce bias.
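The "random exploration" mitigation listed above is often implemented as an epsilon-greedy policy: mostly exploit the model's top-ranked pick, but with a small probability recommend a random candidate so the training data does not collapse onto past model decisions. A minimal sketch; the function and parameter names are illustrative.

```python
import random

def recommend(ranked_items, candidate_pool, epsilon=0.1, rng=random):
    """Epsilon-greedy recommendation.
    With probability epsilon, explore a random candidate (injecting
    fresh data into the feedback loop); otherwise exploit the
    model's top-ranked item."""
    if rng.random() < epsilon:
        return rng.choice(candidate_pool)   # explore
    return ranked_items[0]                  # exploit
```

Even a small epsilon (e.g., 5-10%) keeps a steady stream of model-independent observations flowing into the training data, which is what breaks the self-reinforcing loop.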
Linked Knowledge Nugget: "Important aspects for collecting data" (see BP1 above)
BP3: Collect ML data. (O3)
Relevant sources of raw data are identified and continuously monitored for changes. The raw data are collected according to the ML data requirements.
Note 6: The identification and collection of ML data might be an organizational responsibility.
Note 7: Continuous monitoring should include the ODD (operational design domain) and may lead to changes of the ML requirements.
Linked Knowledge Nugget: "Important aspects for collecting data" (see BP1 above)
Linked Knowledge Nugget: "What is a 'hybrid dataset'?"
Author: Process Fellows
A hybrid dataset is a dataset comprising both real-world data elements (e.g., created from sensors) and synthetically generated data elements (i.e., no real-world data).
BP4: Process ML data. (O3)
The raw data are processed (annotated, analyzed, and structured) according to the ML data requirements.
BP5: Assure the quality of ML data. (O4)
Perform the activities according to the ML data quality approach to ensure that the ML data meet the defined ML data quality criteria.
Note 8: These activities may include sample-based reviews or statistical methods.
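A sample-based review, as Note 8 mentions, can be as simple as drawing a reproducible random sample for manual label inspection and estimating the error rate from it. A minimal sketch; the `label_ok` field and sample size are illustrative assumptions.

```python
import random

def sample_for_review(dataset, sample_size=50, seed=42):
    """Draw a reproducible random sample for manual label review.
    Fixing the seed makes the review sample auditable."""
    rng = random.Random(seed)
    return rng.sample(dataset, min(sample_size, len(dataset)))

def estimated_error_rate(reviewed):
    """Share of reviewed samples flagged as bad: a simple statistical
    estimate of label quality for the whole dataset."""
    bad = sum(1 for item in reviewed if item["label_ok"] is False)
    return bad / len(reviewed)
```

Comparing the estimated error rate against the defined ML data quality criteria then decides whether the data set passes or must be reworked.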
Linked Knowledge Nugget: "Important aspects for collecting data" (see BP1 above)
BP6: Communicate agreed processed ML data. (O5)
Inform all affected parties about the agreed processed ML data and provide the data to them.
# OUTPUT INFORMATION ITEMS
13-52 Communication evidence (O5)
All forms of interpersonal communication such as
e-mails, also automatically generated ones
tool-supported workflows
meetings, verbal or documented via meeting minutes (e.g., daily standups)
podcast
blog
videos
forum
live chat
wikis
photo protocol
Used by these processes:
ACQ.4 Supplier Monitoring
HWE.1 Hardware Requirements Analysis
HWE.2 Hardware Design
HWE.3 Verification against Hardware Design
HWE.4 Verification against Hardware Requirements
MAN.3 Project Management
MLE.1 Machine Learning Requirements Analysis
MLE.2 Machine Learning Architecture
MLE.3 Machine Learning Training
MLE.4 Machine Learning Model Testing
PIM.3 Process Improvement
REU.2 Management of Products for Reuse
SUP.1 Quality Assurance
SUP.11 Machine Learning Data Management
SWE.1 Software Requirements Analysis
SWE.2 Software Architectural Design
SWE.3 Software Detailed Design and Unit Construction
SWE.4 Software Unit Verification
SWE.5 Software Component Verification and Integration Verification
SWE.6 Software Verification
SYS.1 Requirements Elicitation
SYS.2 System Requirements Analysis
SYS.3 System Architectural Design
SYS.4 System Integration and Integration Verification
SYS.5 System Verification
VAL.1 Validation
Used by these process attributes:
PA2.1 Performance Management
03-53 ML data (O3, O4)
A datum to be used for machine learning. (In Automotive SPICE, machine learning (ML) describes the ability of software to learn from specific training data and to apply this knowledge to other, similar tasks.) The datum has to be attributed with metadata, e.g., a unique ID and data characteristics. Examples:
Visual data such as a photo or video (a video could also be considered a sequence of photos, depending on the intended use)
Audio recording
Sensor data
Data created by an algorithm
Data might be processed to create additional data, e.g., processing could add noise, change colors, or merge pictures.
Used by these processes:
SUP.11 Machine Learning Data Management
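The metadata attribution required for an ML datum (unique ID plus data characteristics) can be sketched as a small record type. The field names here are illustrative assumptions, not prescribed by the information item.

```python
from dataclasses import dataclass, field
import uuid

@dataclass
class MLDatum:
    """An ML data element with the metadata attributes described above
    (field names are illustrative)."""
    payload: bytes                      # e.g., photo, audio, sensor reading
    kind: str                           # e.g., "image", "audio", "sensor"
    uid: str = field(default_factory=lambda: str(uuid.uuid4()))
    characteristics: dict = field(default_factory=dict)

datum = MLDatum(payload=b"\x00\x01", kind="sensor",
                characteristics={"sample_rate_hz": 100})
```

Generating the unique ID in a `default_factory` guarantees every datum is individually addressable, which is what makes status tracking and versioning in the data management system possible.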
16-52 ML data management system (O1)
The ML data management system is part of the configuration management system (see 16-03) and
- supports data management activities like data collection, description, ingestion, exploration, profiling, labeling/annotation, selection, structuring, and cleansing,
- provides the data for different purposes, e.g., training and testing, and
- supports the relevant sources of ML data.
Used by these processes:
SUP.11 Machine Learning Data Management
19-50 ML data quality approach (O2)
The ML data quality approach
- defines quality criteria (see 18-07), e.g., the relevant data sources, reliability and consistency of labelling, and completeness against ML data requirements,
- describes analysis activities for the data, and
- describes activities to ensure the quality of the data in order to avoid issues, e.g., data bias or bad labeling.