Important aspects for collecting data

Question

Accepted Answer

Select data sources
- Where does the data come from?
  - Own databases
  - External APIs
  - Web scraping (if permitted)
  - Public data sets, e.g.:
    - Kaggle: an online platform for data science and machine learning, owned by Google
    - UCI Machine Learning Repository: one of the oldest collections of publicly accessible machine learning data sets, maintained by the University of California, Irvine
  - Your own measurements
- Tip: Check whether you actually have access to the data and whether its license permits your intended use.
Ensure data quality: Good data is the key to successful machine learning! Garbage in, garbage out: poor data produces a poor model.
- Completeness: Are important values missing? (e.g. missing prices in sales data)
- Consistency: Do units and formats match? (e.g. date in the same format)
- Correctness: Are there obvious errors? (e.g. negative prices)
- Representativeness: Does the data cover all relevant cases? (Avoid sampling bias!)
- Balance: Are the classes unevenly represented? (e.g. one class far more frequent than the others in a classification task)
- Tip: Clean and normalize data before modeling! Normalization is a pre-processing step that transforms numerical values into a fixed range (e.g. 0 to 1 or -1 to 1). The aim is to remove scale differences between features, which improves the performance of many ML algorithms. Example: min-max scaling, x' = (x - min) / (max - min). Other approaches exist too, e.g. standardization (z-score).
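The checks and the min-max scaling above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the function names (`quality_report`, `min_max_scale`), the field names (`price`, `label`) and the sample rows are made up for this example and are not taken from any specific library or data set.

```python
from collections import Counter


def quality_report(rows, required, label_key, non_negative=("price",)):
    """Count missing values, negative values, and class frequencies."""
    missing = Counter()
    negative = Counter()
    for row in rows:
        # Completeness: a required field that is absent or None counts as missing.
        for key in required:
            if row.get(key) is None:
                missing[key] += 1
        # Correctness: fields such as prices must not be negative.
        for key in non_negative:
            value = row.get(key)
            if value is not None and value < 0:
                negative[key] += 1
    # Balance: how often does each class occur?
    class_counts = Counter(row[label_key] for row in rows)
    return {
        "missing": dict(missing),
        "negative": dict(negative),
        "class_counts": dict(class_counts),
    }


def min_max_scale(values, low=0.0, high=1.0):
    """Rescale values linearly so that min -> low and max -> high."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # Constant feature: there is no spread to rescale, map everything to `low`.
        return [low for _ in values]
    span = v_max - v_min
    return [low + (v - v_min) * (high - low) / span for v in values]


# Hypothetical sales records, invented for this example.
rows = [
    {"price": 10.0, "label": "sold"},
    {"price": None, "label": "sold"},
    {"price": -5.0, "label": "sold"},
    {"price": 30.0, "label": "returned"},
    {"price": 20.0, "label": "sold"},
]

report = quality_report(rows, required=["price"], label_key="label")

# Clean first (drop missing and negative prices), then normalize.
clean_prices = [r["price"] for r in rows if r["price"] is not None and r["price"] >= 0]
scaled = min_max_scale(clean_prices)
```

Whether a faulty row is dropped, as here, or repaired (e.g. by imputing a value) depends on the use case. In a real project the scaling parameters (min and max) would normally be computed on the training data only and then reused for validation and test data.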
Determine amount of data: How much data do you need?
- Supervised Learning: The more complex the model (more parameters, more features), the more labeled data is typically required.
- Unsupervised Learning: Methods such as clustering need no labels and can often work with less data.
- Deep Learning: Often requires a lot of data (e.g. millions of images for image recognition).
- Tip: If data is scarce, use “data augmentation”: create additional training examples by modifying existing ones (e.g. rotate or flip images, paraphrase texts).
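A minimal sketch of image augmentation, assuming a grayscale image stored as a nested list of pixel values. The helper names are hypothetical; real projects usually rely on an image library for this.

```python
def flip_horizontal(image):
    """Mirror the image left to right by reversing every row."""
    return [row[::-1] for row in image]


def rotate_90_clockwise(image):
    """Rotate the image by 90 degrees clockwise."""
    # Reversing the row order and then transposing yields a clockwise rotation.
    return [list(column) for column in zip(*image[::-1])]


def augment(image):
    """Return the original image plus two modified copies."""
    return [image, flip_horizontal(image), rotate_90_clockwise(image)]


# Tiny 2x3 "image" with made-up pixel values.
image = [
    [1, 2, 3],
    [4, 5, 6],
]
variants = augment(image)  # three training examples instead of one
```

Augmentation only helps if the modification keeps the label valid: a rotated cat is still a cat, but a rotated "6" may be read as a "9".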
Data format & structure: What structure does the data have?
- Tabular data: (CSV, Excel, SQL databases) → good for classic ML models
- Text data: (chat histories, articles) → NLP models (Natural Language Processing)
- Image and video data: (JPG, PNG, MP4) → Convolutional Neural Networks (CNNs)
- Time series: (sensor values, share prices) → Recurrent Neural Networks (RNNs)
- Tip: Standardized formats facilitate processing!
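As an example of standardizing a format, the following sketch converts dates written in different styles into ISO 8601 (`YYYY-MM-DD`). The list of accepted input formats is an assumption for illustration. In particular, `03/05/2024` is read as month/day here, which must be checked against the actual data source.

```python
from datetime import datetime

# Assumed input formats, tried in this order (adapt to the real data).
KNOWN_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y")


def to_iso_date(text):
    """Parse a date string in one of the known formats and return YYYY-MM-DD."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # this format did not match, try the next one
    raise ValueError(f"unrecognized date format: {text!r}")


raw_dates = ["2024-03-05", "05.03.2024", "03/05/2024"]
iso_dates = [to_iso_date(d) for d in raw_dates]
```

Raising an error for unknown formats, instead of guessing, keeps silently wrong dates out of the data set.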
Organize data storage & access
- Databases: SQL (structured) vs. NoSQL (flexible, e.g. MongoDB for JSON data)
- Cloud storage: Amazon S3, Google Drive, Azure Blob Storage
- Local: CSV files, or pickle files for fast loading in Python (only load pickle files from trusted sources)
- Tip: Version data to be able to track changes!
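One lightweight way to track changes is to record a content hash for every data file. The sketch below is an assumption about how this could look and does not replace a dedicated data versioning tool. The function names are hypothetical.

```python
import hashlib


def file_fingerprint(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks to save memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def has_changed(path, recorded_fingerprint):
    """Check whether the file content differs from a previously recorded fingerprint."""
    return file_fingerprint(path) != recorded_fingerprint
```

Storing the fingerprint next to each trained model makes it possible to tell later exactly which version of the data the model was trained on.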
Consider data protection & ethics
- Anonymization & pseudonymization: (e.g. for personal data)
- Compliance with the GDPR (General Data Protection Regulation; DSGVO in German): required if personal data of users is processed
- Avoid bias: Is the data fairly distributed or does it favor certain groups?
- Tip: If sensitive data is used, check whether it can be aggregated or anonymized.
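Pseudonymization can be sketched with a keyed hash (HMAC-SHA256): the same identifier always maps to the same token, so records can still be linked, but the original value is not stored. The function names, field names and the key below are illustrative assumptions. A real key must be generated securely and stored separately from the data.

```python
import hashlib
import hmac


def pseudonymize(identifier, secret_key):
    """Replace an identifier with a keyed hash (HMAC-SHA256, hex encoded)."""
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def pseudonymize_records(records, key_field, secret_key):
    """Return copies of the records with the given field pseudonymized."""
    return [
        {**record, key_field: pseudonymize(record[key_field], secret_key)}
        for record in records
    ]


# Example key and records, invented for illustration only.
SECRET_KEY = b"replace-with-a-securely-stored-key"
records = [
    {"email": "alice@example.com", "purchase": 19.99},
    {"email": "bob@example.com", "purchase": 5.00},
    {"email": "alice@example.com", "purchase": 7.50},
]
safe_records = pseudonymize_records(records, "email", SECRET_KEY)
```

Pseudonymized data still counts as personal data under the GDPR, because whoever holds the key can re-identify people. Only data that can no longer be linked to a person is anonymized.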
Conclusion
- Data is the foundation of machine learning
- Bad data → bad models!
- A good data acquisition workflow saves a lot of time and can significantly improve prediction quality.

Knowledge Nugget