Knowledge Nugget

Important aspects for collecting data
Author: Process Fellows
  1. Select data sources
    • Where does the data come from?
      • Own databases
      • External APIs
      • Web scraping (if permitted)
      • Public data sets, e.g. Kaggle (an online platform for data science and machine learning, owned by Google) or the UCI ML Repository (one of the oldest collections of publicly accessible data sets for machine learning, hosted by the University of California, Irvine)
      • Your own measurements
    • Tip: Check that you actually have access to the data and that its license permits your intended use.
  2. Ensure data quality: Good data is the key to successful machine learning! Garbage in, garbage out: poor data yields a poor model.
    • Completeness: Are important values missing? (e.g. missing prices in sales data)
    • Consistency: Do units and formats match? (e.g. date in the same format)
    • Correctness: Are there obvious errors? (e.g. negative prices)
    • Representativeness: Does the data cover all relevant cases? (This helps avoid bias.)
    • Balance: Are there unbalanced data classes? (e.g. highly uneven distribution in a classification)
    • Tip: Clean and normalize data before modeling! Normalization is a pre-processing step that transforms numerical values into a specific range (e.g. 0 to 1, or -1 to 1). The aim is to eliminate scale differences between features and thus improve the performance of ML algorithms. One example is min-max scaling, x' = (x - min) / (max - min); other approaches exist, such as standardization to zero mean and unit variance.
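
The checks and the normalization step from section 2 can be sketched in a few lines of plain Python. This is a minimal illustration, not a complete data quality approach: the records, the field names `price` and `label`, and the helper names `quality_report` and `min_max_scale` are assumptions made for this example.

```python
from collections import Counter

# Hypothetical sales records; None marks a missing value.
records = [
    {"price": 10.0, "label": "A"},
    {"price": None, "label": "A"},
    {"price": -5.0, "label": "A"},
    {"price": 30.0, "label": "B"},
]

def quality_report(rows):
    """Count missing prices, negative prices, and samples per class."""
    missing = sum(1 for r in rows if r["price"] is None)
    negative = sum(1 for r in rows if r["price"] is not None and r["price"] < 0)
    class_counts = Counter(r["label"] for r in rows)
    return {"missing": missing, "negative": negative, "class_counts": dict(class_counts)}

def min_max_scale(values):
    """Map values linearly onto [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

report = quality_report(records)

# Keep only rows that pass the completeness and correctness checks.
clean_prices = [r["price"] for r in records if r["price"] is not None and r["price"] >= 0]
scaled = min_max_scale(clean_prices)
```

Here `report` covers completeness (missing values), correctness (negative prices), and balance (samples per class). In a real project, the scaling bounds would typically be computed on the training data only and then reused for validation and test data.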
  3. Determine amount of data: How much data do you need?
    • Supervised Learning: The more complex the model, the more data is required.
    • Unsupervised Learning: Clustering can often work with less data.
    • Deep Learning: Often requires a lot of data (e.g. millions of images for image recognition).
    • Tip: If data is scarce, use “data augmentation” (e.g. rotate images, vary texts).
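
As a rough illustration of data augmentation for images, the sketch below creates rotated and mirrored copies of a tiny image stored as a nested list of pixel values. The helper names `rotate_90`, `flip_horizontal`, and `augment` are assumptions made for this example; real projects usually rely on an image library instead.

```python
def rotate_90(image):
    """Rotate a 2D grid of pixel values 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def flip_horizontal(image):
    """Mirror a 2D grid left to right."""
    return [row[::-1] for row in image]

def augment(image):
    """Return the original plus three rotations and one mirrored copy."""
    variants = [image]
    for _ in range(3):
        variants.append(rotate_90(variants[-1]))
    variants.append(flip_horizontal(image))
    return variants

image = [[1, 2],
         [3, 4]]
augmented = augment(image)  # five variants from one labeled sample
```

Augmentation only helps if the label stays valid after the transformation. For example, a rotated handwritten "6" may look like a "9", so rotations would be a poor choice for digit recognition.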
  4. Data format & structure: What structure does the data have?
    • Tabular data: (CSV, Excel, SQL databases) → good for classic ML models
    • Text data: (chat histories, articles) → NLP models (Natural Language Processing)
    • Image and video data: (JPG, PNG, MP4) → Convolutional Neural Networks (CNNs)
    • Time series: (sensor values, share prices) → Recurrent Neural Networks (RNNs)
    • Tip: Standardized formats facilitate processing!
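
To show what standardizing a format can look like for tabular data, the following sketch reads a small CSV export and converts mixed date notations to ISO 8601. The CSV content, the column names, and the list of accepted input formats in `DATE_FORMATS` are assumptions made for this example.

```python
import csv
import io
from datetime import datetime

# Hypothetical export in which the same date appears in three notations.
RAW_CSV = """date,price
2024-01-31,10.5
31.01.2024,12.0
01/31/2024,9.75
"""

# Assumed input formats; extend this tuple to match the real sources.
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y")

def to_iso_date(text):
    """Parse a date in one of the known formats and return it as YYYY-MM-DD."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")

# Convert every row to one date format and a numeric price.
rows = [
    {"date": to_iso_date(r["date"]), "price": float(r["price"])}
    for r in csv.DictReader(io.StringIO(RAW_CSV))
]
```

Rejecting unknown formats with an error, rather than guessing, keeps ambiguous entries (is 01/02/2024 January or February?) from silently entering the data set.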
  5. Organize data storage & access
    • Databases: SQL (structured) vs. NoSQL (flexible, e.g. MongoDB for JSON data)
    • Cloud storage: AWS S3, Google Drive, Azure
    • Local: CSV files, or pickle files for fast loading in Python
    • Tip: Version data to be able to track changes!
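
One simple way to make changes to a data set traceable is to record a content hash for each version. The sketch below is a minimal illustration of that idea, not a replacement for a dedicated data versioning tool; the function name `dataset_fingerprint`, the sample rows, and the `version_log` structure are assumptions made for this example.

```python
import hashlib
import json

def dataset_fingerprint(rows):
    """Return a SHA-256 hex digest of the rows in a canonical JSON encoding."""
    # sort_keys makes the digest independent of dictionary key order.
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

v1 = [{"id": 1, "price": 10.0}, {"id": 2, "price": 30.0}]
v2 = [{"id": 1, "price": 10.0}, {"id": 2, "price": 35.0}]  # one value changed

# Hypothetical version log: maps a version name to its fingerprint.
version_log = {
    "v1": dataset_fingerprint(v1),
    "v2": dataset_fingerprint(v2),
}
```

Storing the fingerprint together with each trained model makes it possible to check later which exact data version the model was trained on.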
  6. Consider data protection & ethics
    • Anonymization & pseudonymization: (e.g. for personal data)
    • Compliance with GDPR (German: DSGVO): required whenever personal user data is processed
    • Avoid bias: Is the data fairly distributed or does it favor certain groups?
    • Tip: If sensitive data is used, check whether it can be aggregated or anonymized.
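
Pseudonymization can be sketched with a keyed hash: each identifier is replaced by a stable token that cannot be reversed without the secret key. The function name `pseudonymize`, the placeholder key, and the sample records are assumptions made for this example. Note that this is pseudonymization, not anonymization, because anyone holding the key can re-link tokens to known identifiers.

```python
import hashlib
import hmac

def pseudonymize(identifier, secret_key):
    """Replace an identifier with a keyed SHA-256 token (HMAC)."""
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()

# Placeholder only; a real key belongs in a secret store, not in source code.
SECRET_KEY = b"replace-with-a-secret-from-a-key-store"

# Hypothetical purchase records containing personal data.
records = [
    {"email": "alice@example.com", "purchase": 20.0},
    {"email": "bob@example.com", "purchase": 5.0},
    {"email": "alice@example.com", "purchase": 7.5},
]

# Drop the e-mail address and keep only the token.
pseudonymized = [
    {"user": pseudonymize(r["email"], SECRET_KEY), "purchase": r["purchase"]}
    for r in records
]
```

The same person always receives the same token, so per-user patterns remain usable for training while the e-mail address itself no longer appears in the data set.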

    Conclusion:

    • Data is the foundation of machine learning
    • Bad data → bad models!
    • A good data acquisition workflow saves a lot of time and significantly improves prediction quality.
Mapped to these items:
  • Automotive SPICE 4.0
    • SUP.11.BP1 Establish an ML data management system.
    • SUP.11.BP2 Develop an ML data quality approach.
    • SUP.11.BP3 Collect ML data.
    • SUP.11.BP5 Assure quality of ML data.