Linked Knowledge Nugget: "AI in Automotive Systems: Aligning with ISO/PAS 8800"
Author: Sebastian Keller
How can AI be safely integrated into the vehicles of tomorrow?
ISO/PAS 8800:2024 lays the foundation for managing artificial intelligence in safety-related automotive systems. Join our webinar and learn what the new specification means for managers, project leaders, quality specialists, and engineering teams.
We’ll demystify the key concepts behind AI safety, explain how ISO/PAS 8800 relates to ISO 26262, and show how organizations can prepare for the next generation of system validation and assurance.
You will gain an overview of how to transfer AI development activities into structured frameworks such as Automotive SPICE, define roles such as AI safety manager or data governance lead, and avoid common pitfalls such as uncontrolled data drift.
Reserve your spot today and lead your company's AI safety transformation with confidence.
Webinar recording and slides
# PROCESS PURPOSE
The purpose is to define and align ML data with ML data requirements, maintain the integrity and quality of the ML data, and make them available to affected parties.
# PROCESS OUTCOMES
O1: An ML data management system including an ML data lifecycle is established.
O2: An ML data quality approach is developed, including ML data quality criteria.
O3: Collected ML data are processed for consistency with the ML data requirements.
O4: ML data are verified against the defined ML data quality criteria and updated as needed.
O5: ML data are agreed and communicated to all affected parties.
# BASE PRACTICES
BP1: Establish an ML data management system. (O1)
Establish an ML data management system which supports
- ML data management activities,
- relevant sources of ML data,
- the ML data life cycle including a status model, and
- interfaces to affected parties.
Note 1: Supported ML data management activities may include data collection, labeling/annotation, and structuring.
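The "status model" mentioned in BP1 can be pictured as a small state machine over the ML data lifecycle. A minimal sketch follows; the concrete state names and allowed transitions are illustrative assumptions, not prescribed by Automotive SPICE.

```python
from enum import Enum

class MLDataStatus(Enum):
    """Hypothetical status model for an ML data lifecycle
    (state names are assumptions for illustration)."""
    COLLECTED = "collected"
    LABELED = "labeled"
    VERIFIED = "verified"
    RELEASED = "released"
    REJECTED = "rejected"

# Allowed transitions between lifecycle states
TRANSITIONS = {
    MLDataStatus.COLLECTED: {MLDataStatus.LABELED, MLDataStatus.REJECTED},
    MLDataStatus.LABELED: {MLDataStatus.VERIFIED, MLDataStatus.REJECTED},
    MLDataStatus.VERIFIED: {MLDataStatus.RELEASED, MLDataStatus.REJECTED},
    MLDataStatus.RELEASED: set(),
    MLDataStatus.REJECTED: set(),
}

def can_transition(src: MLDataStatus, dst: MLDataStatus) -> bool:
    """Check whether a status change is permitted by the model."""
    return dst in TRANSITIONS[src]
```

Encoding the transitions explicitly makes it easy for the data management system to reject invalid status changes, e.g., releasing data that was never verified.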
Linked Knowledge Nugget: "Important aspects for collecting data"
Author: Process Fellows
Select data sources
Where does the data come from?
Own databases
External APIs
Web scraping (if permitted)
Public data sets, e.g.
Kaggle: an online platform for data science and machine learning, operated by Google;
UCI ML Repository: one of the oldest collections of publicly accessible machine learning data sets, maintained by the University of California, Irvine
Your own measurements
Tip: Check whether you have access to the data and whether its license permits the intended use.
Ensure data quality: Good data is the key to successful machine learning! Garbage in, garbage out: bad data yields a bad model!
Completeness: Are important values missing? (e.g. missing prices in sales data)
Consistency: Do units and formats match? (e.g. date in the same format)
Correctness: Are there obvious errors? (e.g. negative prices)
Representativeness: Does the data cover all possible cases? (Avoidance of bias!)
Balance: Are there unbalanced data classes? (e.g. highly uneven distribution in a classification)
Tip: Clean and normalize data before modeling!
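The quality criteria above (completeness, consistency, correctness) can be turned into simple automated checks. A minimal sketch on a toy sales dataset; the field names `price` and `date` and the expected ISO date format are illustrative assumptions.

```python
import re

# Toy sales records illustrating typical quality defects
records = [
    {"price": 19.99, "date": "2024-01-05"},
    {"price": None,  "date": "2024-01-06"},   # completeness issue
    {"price": -3.50, "date": "06.01.2024"},   # correctness + consistency issue
]

DATE_FMT = re.compile(r"^\d{4}-\d{2}-\d{2}$")  # expected ISO date format

def check_record(rec):
    """Return a list of quality findings for one record."""
    findings = []
    if rec["price"] is None:
        findings.append("completeness: missing price")
    elif rec["price"] < 0:
        findings.append("correctness: negative price")
    if not DATE_FMT.match(rec["date"]):
        findings.append("consistency: date not in ISO format")
    return findings

# Map record index to its findings, keeping only records with issues
issues = {i: check_record(r) for i, r in enumerate(records) if check_record(r)}
```

Running such checks before modeling surfaces exactly the defects listed above instead of letting them silently degrade the model.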
“Normalize” means:
Pre-processing step: numerical values are transformed into a specific range (e.g. between 0 and 1 or -1 and 1).
Aim is to eliminate scaling differences between features and thus improve the performance of ML algorithms.
Example: min-max scaling (but there are other approaches too)
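The min-max scaling mentioned above can be sketched in a few lines: each value is mapped linearly into a target range, eliminating the scaling differences between features.

```python
def min_max_scale(values, lo=0.0, hi=1.0):
    """Min-max scaling: map values linearly into [lo, hi]."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:                 # constant feature: no spread to scale
        return [lo for _ in values]
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]

prices = [10.0, 20.0, 40.0]
scaled = min_max_scale(prices)       # smallest -> 0.0, largest -> 1.0
```

In practice a library implementation (e.g., scikit-learn's `MinMaxScaler`) would be used, but the arithmetic is exactly this.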
Determine amount of data:
How much data do you need?
Supervised Learning: The more complex the model, the more data is required.
Unsupervised Learning: Clustering can often work with less data.
Deep Learning: Often requires a lot of data (e.g. millions of images for image recognition).
Tip: If data is scarce, use “data augmentation” (e.g. rotate images, vary texts).
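The rotation-based image augmentation mentioned in the tip can be illustrated without any imaging library by treating an image as a list of pixel rows. This is a toy sketch of the idea, not a production augmentation pipeline.

```python
def rotate_90(image):
    """Rotate a 2D 'image' (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def augment(image):
    """Yield the original plus three successively rotated copies,
    turning one labeled sample into four."""
    variants = [image]
    for _ in range(3):
        variants.append(rotate_90(variants[-1]))
    return variants

img = [[1, 2],
       [3, 4]]
augmented = augment(img)   # 4 training samples from 1
```

Real augmentation libraries additionally flip, crop, or add noise, but the principle is the same: derive new, label-preserving variants from scarce data.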
Data format & structure:
What structure does the data have?
Tabular data: (CSV, Excel, SQL databases) → good for classic ML models
Text data: (chat histories, articles) → NLP models (Natural Language Processing)
Image and video data: (JPG, PNG, MP4) → Convolutional Neural Networks (CNNs)
Time series: (sensor values, share prices) → Recurrent Neural Networks
Tip: Standardized formats facilitate processing!
Organize data storage & access
Databases: SQL (structured) vs. NoSQL (flexible, e.g. MongoDB for JSON data)
Cloud storage: AWS S3, Google Drive, Azure
Local: CSVs, pickle files for fast processing
Tip: Version data to be able to track changes!
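One lightweight way to version data, as the tip suggests, is to fingerprint each dataset snapshot with a content hash and store that hash alongside every model run. A minimal sketch; dedicated tools (e.g., DVC) provide this at scale.

```python
import hashlib
import json

def dataset_fingerprint(records):
    """Derive a deterministic content hash for a dataset snapshot.
    Any change to the data produces a new fingerprint, making
    changes traceable across training runs."""
    canonical = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()

v1 = dataset_fingerprint([{"x": 1}, {"x": 2}])
v2 = dataset_fingerprint([{"x": 1}, {"x": 3}])   # changed data -> new version id
```

Serializing with `sort_keys=True` keeps the hash stable regardless of dictionary insertion order.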
Consider data protection & ethics:
Anonymization & pseudonymization: (e.g. for personal data)
Compliance with GDPR/DSGVO: If user data is processed
Avoid bias: Is the data fairly distributed or does it favor certain groups?
Tip: If sensitive data is used, check whether it can be aggregated or anonymized.
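Pseudonymization, as mentioned above, can be sketched with a keyed hash: the direct identifier is replaced, but whoever holds the key could still link records, which is why pseudonymized data (unlike fully anonymized data) remains personal data under GDPR. The field names and key handling here are illustrative assumptions.

```python
import hashlib
import hmac

# Assumption: in a real system this key is managed outside the dataset
SECRET_KEY = b"replace-with-a-managed-secret"

def pseudonymize(user_id: str) -> str:
    """Replace a direct identifier with a keyed hash (HMAC-SHA256).
    Deterministic, so the same user always maps to the same pseudonym."""
    return hmac.new(SECRET_KEY, user_id.encode("utf-8"), hashlib.sha256).hexdigest()

record = {"user": "alice@example.com", "clicks": 7}
safe_record = {"user": pseudonymize(record["user"]), "clicks": record["clicks"]}
```

Using an HMAC rather than a plain hash prevents an attacker without the key from reversing pseudonyms via dictionary attacks on known e-mail addresses.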
Conclusion:
Data is the foundation of machine learning
Bad data → bad models!
A good data acquisition workflow saves a lot of time and significantly improves prediction quality.
BP2: Develop an ML data quality approach. (O2)
Develop an approach to ensure that the quality of ML data is analyzed based on defined ML data quality criteria, and that activities are performed to support the avoidance of biases in the data.
Note 2: Examples of ML data quality criteria are relevant data sources, reliability and consistency of labelling, and completeness against ML data requirements.
Note 3: The ML data management system should support the quality criteria and activities of the ML data quality approach.
Note 4: Biases to avoid may include sampling bias (e.g., gender, age) and feedback loop bias.
Note 5: For the creation of ML data sets, see MLE.3.BP2 and MLE.4.BP2.
Author: Process Fellows
Examples of self-reinforcing predictions (and their potential solution):
Personalized advertising: An algorithm shows a user more products from a certain category because they have clicked on it once. As a result, other potentially relevant products are no longer suggested.
Crime prediction: A model for combating crime prioritizes areas with a high police presence. More police officers in this area report more crimes, which leads the model to believe that even more checks are needed there.
Social media & filter bubbles: Recommendation algorithms preferentially show content that users already like, resulting in them only seeing biased perspectives.
Solutions to mitigate feedback loop bias
Random exploration: Incorporate random recommendations or decisions to consider new data sources.
External data sources: Incorporate not only past model decisions, but also independent data sources into training.
Fairness checks: Perform bias measurements to identify whether certain groups or information are systematically disadvantaged.
Model refresh: Regular retraining with new, more diverse data to reduce bias.
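The "random exploration" mitigation listed above is often implemented as an epsilon-greedy policy: mostly exploit the model's top-ranked pick, but with a small probability recommend a random candidate so the training data does not collapse onto past model decisions. A minimal sketch; the function and parameter names are illustrative.

```python
import random

def recommend(ranked_items, candidate_pool, epsilon=0.1, rng=random):
    """Epsilon-greedy recommendation.
    With probability epsilon, explore a random candidate (injecting
    fresh data into the feedback loop); otherwise exploit the
    model's top-ranked item."""
    if rng.random() < epsilon:
        return rng.choice(candidate_pool)   # explore
    return ranked_items[0]                  # exploit
```

Even a small epsilon (e.g., 5-10%) keeps a steady stream of model-independent observations flowing into the training data, which is what breaks the self-reinforcing loop.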
Linked Knowledge Nugget: "Important aspects for collecting data" (see BP1 above)
BP3: Collect ML data. (O3)
Relevant sources of raw data are identified and continuously monitored for changes. The raw data are collected according to the ML data requirements.
Note 6: The identification and collection of ML data might be an organizational responsibility.
Note 7: Continuous monitoring should include the ODD (operational design domain) and may lead to changes of the ML requirements.
Linked Knowledge Nugget: "Important aspects for collecting data" (see BP1 above)
Linked Knowledge Nugget: "What is a 'hybrid dataset'?"
Author: Process Fellows
A hybrid dataset is a dataset comprising both real-world data elements (e.g., created from sensors) and synthetically generated data elements (i.e., no real-world data).
BP4: Process ML data. (O3)
The raw data are processed (annotated, analyzed, and structured) according to the ML data requirements.
BP5: Assure the quality of ML data. (O4)
Perform the activities according to the ML data quality approach to ensure that the ML data meet the defined ML data quality criteria.
Note 8: These activities may include sample-based reviews or statistical methods.
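A sample-based review, as Note 8 mentions, can be as simple as drawing a reproducible random sample for manual label inspection and estimating the error rate from it. A minimal sketch; the `label_ok` field and sample size are illustrative assumptions.

```python
import random

def sample_for_review(dataset, sample_size=50, seed=42):
    """Draw a reproducible random sample for manual label review.
    Fixing the seed makes the review sample auditable."""
    rng = random.Random(seed)
    return rng.sample(dataset, min(sample_size, len(dataset)))

def estimated_error_rate(reviewed):
    """Share of reviewed samples flagged as bad: a simple statistical
    estimate of label quality for the whole dataset."""
    bad = sum(1 for item in reviewed if item["label_ok"] is False)
    return bad / len(reviewed)
```

Comparing the estimated error rate against the defined ML data quality criteria then decides whether the data set passes or must be reworked.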
Linked Knowledge Nugget: "Important aspects for collecting data" (see BP1 above)
BP6: Communicate agreed processed ML data. (O5)
Inform all affected parties about the agreed processed ML data and provide the data to them.
# OUTPUT INFORMATION ITEMS
13-52 Communication evidence (O5)
All forms of interpersonal communication such as
e-mails, also automatically generated ones
tool-supported workflows
meetings, verbal or documented via meeting minutes (e.g., daily standups)
podcast
blog
videos
forum
live chat
wikis
photo protocol
Used by these processes:
ACQ.4 Supplier Monitoring
HWE.1 Hardware Requirements Analysis
HWE.2 Hardware Design
HWE.3 Verification against Hardware Design
HWE.4 Verification against Hardware Requirements
MAN.3 Project Management
MLE.1 Machine Learning Requirements Analysis
MLE.2 Machine Learning Architecture
MLE.3 Machine Learning Training
MLE.4 Machine Learning Model Testing
PIM.3 Process Improvement
REU.2 Management of Products for Reuse
SUP.1 Quality Assurance
SUP.11 Machine Learning Data Management
SWE.1 Software Requirements Analysis
SWE.2 Software Architectural Design
SWE.3 Software Detailed Design and Unit Construction
SWE.4 Software Unit Verification
SWE.5 Software Component Verification and Integration Verification
SWE.6 Software Verification
SYS.1 Requirements Elicitation
SYS.2 System Requirements Analysis
SYS.3 System Architectural Design
SYS.4 System Integration and Integration Verification
SYS.5 System Verification
VAL.1 Validation
Used by these process attributes:
PA2.1 Performance Management
03-53 ML data (O3, O4)
A datum to be used for machine learning. (In Automotive SPICE, machine learning (ML) describes the ability of software to learn from specific training data and to apply this knowledge to other, similar tasks.) The datum has to be attributed with metadata, e.g., a unique ID and data characteristics. Examples:
Visual data such as a photo or video (a video could also be considered a sequence of photos, depending on the intended use)
Audio recording
Sensor data
Data created by an algorithm
Data might be processed to create additional data, e.g., processing could add noise, change colors, or merge pictures.
Used by these processes:
SUP.11 Machine Learning Data Management
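The metadata attribution required for an ML datum (unique ID plus data characteristics) can be sketched as a small record type. The field names here are illustrative assumptions, not prescribed by the information item.

```python
from dataclasses import dataclass, field
import uuid

@dataclass
class MLDatum:
    """An ML data element with the metadata attributes described above
    (field names are illustrative)."""
    payload: bytes                      # e.g., photo, audio, sensor reading
    kind: str                           # e.g., "image", "audio", "sensor"
    uid: str = field(default_factory=lambda: str(uuid.uuid4()))
    characteristics: dict = field(default_factory=dict)

datum = MLDatum(payload=b"\x00\x01", kind="sensor",
                characteristics={"sample_rate_hz": 100})
```

Generating the unique ID in a `default_factory` guarantees every datum is individually addressable, which is what makes status tracking and versioning in the data management system possible.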
16-52 ML data management system (O1)
The ML data management system is part of the configuration management system (see 16-03) and
- supports data management activities like data collection, description, ingestion, exploration, profiling, labeling/annotation, selection, structuring, and cleansing,
- provides the data for different purposes, e.g., training and testing, and
- supports the relevant sources of ML data.
Used by these processes:
SUP.11 Machine Learning Data Management
19-50 ML data quality approach (O2)
The ML data quality approach
- defines quality criteria (see 18-07), e.g., the relevant data sources, reliability and consistency of labelling, and completeness against ML data requirements,
- describes analysis activities for the data, and
- describes activities to ensure the quality of the data in order to avoid issues, e.g., data bias or bad labeling.