
Chapter 8: Data Collection & Management

🎓 Learning Objectives

  • Understand data collection strategies
  • Learn data preprocessing techniques
  • Master data organization and storage
  • Understand data versioning and documentation
  • Learn about data ethics and privacy

Data in Research

Data is the foundation of ML research. Quality data is essential for:

  • Valid experimental results
  • Reproducible research
  • Generalizable findings
  • Credible publications

Data Importance

"Garbage in, garbage out" - Poor data leads to poor results, regardless of model quality.

Data Collection

Sources of Data

1. Public Datasets:
  • Academic datasets (ImageNet, COCO, etc.)
  • Government data
  • Open data initiatives
  • Competition datasets

2. Collected Data:
  • Surveys
  • Experiments
  • Observations
  • Simulations

3. Synthetic Data:
  • Generated data
  • Augmented data
  • Simulated environments

Dataset Selection

  • Use standard benchmarks for comparison
  • Check data quality and size
  • Verify licensing and usage rights
  • Consider data diversity

Dataset Evaluation

Quality Checks:
  • Completeness: Missing values?
  • Accuracy: Correct labels?
  • Consistency: Format issues?
  • Relevance: Appropriate for problem?
  • Size: Sufficient for conclusions?
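
A few of these checks can be automated before any modeling starts. The sketch below is one possible version, assuming the dataset is a list of dictionaries; the function name `quality_report`, the field names, and the sample records are made up for illustration.

```python
from collections import Counter

def quality_report(records, required_fields, allowed_labels, label_field="label"):
    """Summarize basic quality problems in a list of dict records."""
    missing = Counter()          # field -> number of records with no value
    bad_labels = 0               # labels outside the allowed set
    seen, duplicates = set(), 0  # exact-duplicate detection
    for rec in records:
        for field in required_fields:
            if rec.get(field) in (None, ""):
                missing[field] += 1
        label = rec.get(label_field)
        if label not in (None, "") and label not in allowed_labels:
            bad_labels += 1
        key = tuple(sorted(rec.items()))  # hashable fingerprint of the record
        if key in seen:
            duplicates += 1
        seen.add(key)
    return {
        "n_records": len(records),
        "missing": dict(missing),
        "bad_labels": bad_labels,
        "duplicates": duplicates,
    }

# Illustrative records only; not taken from a real dataset.
records = [
    {"id": 1, "text": "good movie", "label": "pos"},
    {"id": 2, "text": "", "label": "neg"},            # missing text
    {"id": 3, "text": "fine", "label": "positive"},   # inconsistent label
    {"id": 1, "text": "good movie", "label": "pos"},  # exact duplicate
]
report = quality_report(records, ["id", "text", "label"], {"pos", "neg"})
print(report)
```

Relevance and sufficient size still need human judgment; a report like this only covers the mechanical checks.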

Data Quality

Always inspect data before use. Quality issues can invalidate results.

Data Preprocessing

Common Preprocessing Steps

1. Cleaning:
  • Remove duplicates
  • Handle missing values
  • Fix errors
  • Remove outliers (if appropriate)

2. Normalization:
  • Standardization (z-score)
  • Min-max scaling
  • Unit normalization

3. Transformation:
  • Feature engineering
  • Encoding (one-hot, label)
  • Dimensionality reduction

4. Splitting:
  • Train/validation/test splits
  • Stratification
  • Time-based splits
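
To make the normalization and encoding steps concrete, here is a minimal sketch using only the Python standard library. The function names are illustrative, and the code assumes non-constant inputs (a constant feature would cause a division by zero). In a real project the statistics would be computed on the training split and reused for the other splits.

```python
import math
import statistics

def standardize(values):
    """Z-score: subtract the mean, divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def unit_normalize(vector):
    """Divide a vector by its Euclidean (L2) norm so it has length 1."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]

def one_hot(labels):
    """Encode each label as a 0/1 vector over the sorted set of classes."""
    classes = sorted(set(labels))
    return [[1 if label == c else 0 for c in classes] for label in labels]

print(min_max_scale([2, 4, 6]))
print(unit_normalize([3, 4]))
print(one_hot(["cat", "dog", "cat"]))
```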

Preprocessing Order

  • Split data FIRST
  • Then fit preprocessing on the training split only
  • Apply the fitted transform to the train, validation, and test splits

Preprocessing Best Practices

1. Split Before Preprocessing:

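# Correct order: split first, then fit the scaler on the training split only.
# Sketch using scikit-learn; `data` is assumed to be a 2-D feature array.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

train, rest = train_test_split(data, test_size=0.3, random_state=42)
val, test = train_test_split(rest, test_size=0.5, random_state=42)
scaler = StandardScaler().fit(train)  # Fit on train only
train_scaled = scaler.transform(train)
val_scaled = scaler.transform(val)
test_scaled = scaler.transform(test)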

Data Leakage

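Fitting preprocessing on the full dataset leaks statistics from the validation and test sets into training, which makes evaluation results look better than they are. Always fit on the training set only.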
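
The difference is easy to see on a toy example. The sketch below uses only the standard library; `ZScoreScaler` is a hypothetical stand-in for a library scaler, and the data is synthetic.

```python
import random
import statistics

class ZScoreScaler:
    """Minimal stand-in for a fitted scaler (hypothetical helper, not a library class)."""

    def fit(self, values):
        self.mean = statistics.fmean(values)
        self.std = statistics.pstdev(values)
        return self

    def transform(self, values):
        return [(v - self.mean) / self.std for v in values]

# Synthetic 1-D feature drawn from two clusters, then shuffled.
rng = random.Random(0)
data = [rng.gauss(0, 1) for _ in range(80)] + [rng.gauss(5, 1) for _ in range(20)]
rng.shuffle(data)
train, val, test = data[:70], data[70:85], data[85:]

scaler = ZScoreScaler().fit(train)      # statistics come from train only
test_scaled = scaler.transform(test)    # test is transformed, never fitted

leaky = ZScoreScaler().fit(data)        # WRONG: val/test rows shape the mean and std
print(f"train-only mean={scaler.mean:.3f}, full-data mean={leaky.mean:.3f}")
```

The two means generally differ; the leaky scaler has already seen the rows that are supposed to be held out.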

2. Document Preprocessing:
  • Record all transformations
  • Save preprocessing code
  • Document parameters
  • Version preprocessing pipeline

3. Reproducibility:
  • Set random seeds
  • Save preprocessing scripts
  • Document versions
  • Use version control
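
As a small illustration of seeding, the sketch below uses a local `random.Random` generator so that a shuffle can be repeated exactly. The helper name and the seed value are arbitrary choices for this example. Libraries such as NumPy and PyTorch keep their own random state and need to be seeded separately.

```python
import random

def shuffled_indices(n, seed):
    """Return a reproducible permutation of range(n) for a given seed."""
    rng = random.Random(seed)  # local generator; global random state is untouched
    indices = list(range(n))
    rng.shuffle(indices)
    return indices

SEED = 42  # arbitrary value; record it alongside the results
first = shuffled_indices(10, SEED)
second = shuffled_indices(10, SEED)
print(first == second)  # the same seed gives the same order
```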

Data Organization

Directory Structure

Recommended Structure:

data/
├── raw/              # Original, unprocessed data
├── processed/        # Preprocessed data
├── splits/           # Train/val/test splits
├── metadata/         # Data documentation
└── scripts/          # Preprocessing scripts
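
A layout like this can be created by a short script so every project starts from the same structure. This is a minimal sketch; the function name `create_data_layout` is made up for the example.

```python
from pathlib import Path

SUBDIRS = ["raw", "processed", "splits", "metadata", "scripts"]

def create_data_layout(root):
    """Create the data/ tree under `root`; existing directories are left untouched."""
    data_dir = Path(root) / "data"
    for name in SUBDIRS:
        (data_dir / name).mkdir(parents=True, exist_ok=True)
    return data_dir

# Usage: create_data_layout(".") creates ./data/raw, ./data/processed, and so on.
```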

Organization

  • Keep raw data separate
  • Version processed data
  • Document everything
  • Use consistent naming

Naming Conventions

Best Practices:
  • Descriptive names
  • Include version numbers
  • Consistent format
  • Avoid spaces/special chars

Examples:
  • coco_train_2017_v1.0.parquet
  • imagenet_val_processed_v2.1.h5
  • custom_dataset_v1.0_raw.zip
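
A naming convention is easier to keep when it is checked automatically. The pattern below is one possible encoding of the rules above (lowercase letters, digits, and underscores, a `_v<major>.<minor>` version, an optional suffix, and a file extension). It is an assumption made for illustration, not a standard.

```python
import re

# name parts, then _v1.0 (or _v1.0.0), optional suffix such as _raw, then extension
NAME_PATTERN = re.compile(r"[a-z0-9_]+_v\d+(?:\.\d+)+(?:_[a-z0-9]+)*\.[a-z0-9]+")

def is_valid_dataset_name(filename):
    """Return True if the file name follows the assumed naming convention."""
    return NAME_PATTERN.fullmatch(filename) is not None

for name in ["coco_train_2017_v1.0.parquet", "My Data Final.csv"]:
    print(name, is_valid_dataset_name(name))
```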

Data Storage

Storage Formats

Text Data:
  • CSV (small datasets)
  • JSON (structured data)
  • Parquet (efficient, large datasets)

Image Data:
  • Individual files (JPG, PNG)
  • HDF5 (efficient, large datasets)
  • TFRecord (TensorFlow)
  • LMDB (fast access)

Tabular Data:
  • CSV (small)
  • Parquet (large, efficient)
  • HDF5 (hierarchical)

Format Choice

Choose format based on:
  • Data size
  • Access patterns
  • Tool compatibility
  • Efficiency needs

Storage Best Practices

1. Version Control:
  • Use DVC (Data Version Control)
  • Git LFS for smaller datasets (large files kept alongside code in Git)
  • Cloud storage with versioning

2. Backup:
  • Multiple locations
  • Regular backups
  • Test restoration

3. Access Control:
  • Permissions management
  • Secure storage
  • Audit logs
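
Testing a backup usually comes down to confirming that the copy is byte-identical to the original. The following sketch compares SHA-256 checksums after copying a file; the function names and the single-file scope are simplifying assumptions.

```python
import hashlib
import shutil
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def backup_and_verify(src, backup_dir):
    """Copy `src` into `backup_dir` and confirm the copy matches the original."""
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    dest = backup_dir / Path(src).name
    shutil.copy2(src, dest)
    if sha256_of(src) != sha256_of(dest):
        raise IOError(f"checksum mismatch for {dest}")
    return dest
```

Storing the checksum next to the backup also allows a later restoration test without access to the original file.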

Data Versioning

Why Version Data?

Benefits:
  • Reproducibility
  • Track changes
  • Rollback if needed
  • Collaboration

Versioning Tools

DVC (Data Version Control):
  • Git-like for data
  • Efficient storage
  • Reproducible pipelines
  • Cloud integration

Git LFS:
  • Large file support
  • Git integration
  • Version tracking

Data Versioning

Use DVC for data versioning. It's designed for ML workflows.

Versioning Strategy

Version Naming:
  • Semantic versioning (v1.0.0)
  • Date-based (2024-01-15)
  • Descriptive (v1.0.0-cleaned)

Version Documentation:
  • Changelog
  • What changed
  • Why changed
  • Who changed
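
One lightweight way to keep this documentation consistent is a structured changelog entry per dataset version. The sketch below is an assumption about how such a record could look; the class, field names, and example values are illustrative.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class ChangelogEntry:
    version: str  # e.g. "v1.0.0"
    date: str     # ISO date, e.g. "2024-01-15"
    author: str   # who changed it
    what: str     # what changed
    why: str      # why it changed

def append_entry(changelog, entry):
    """Return a new changelog list with `entry` added; versions must be unique."""
    if any(e["version"] == entry.version for e in changelog):
        raise ValueError(f"version {entry.version} already documented")
    return changelog + [asdict(entry)]

# Example values are placeholders.
changelog = append_entry(
    [],
    ChangelogEntry("v1.0.0", "2024-01-15", "alice", "Initial release", "First cleaned export"),
)
print(json.dumps(changelog, indent=2))
```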

Data Documentation

What to Document

Dataset Information:
  • Source and origin
  • Collection method
  • Size and statistics
  • License and usage

Preprocessing:
  • Steps taken
  • Parameters used
  • Code/scripts
  • Versions

Splits:
  • Split strategy
  • Split ratios
  • Stratification info
  • Random seeds
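
Split documentation is most reliable when it is produced by the same code that creates the split. Below is a minimal sketch of a seeded, stratified split that also builds a manifest recording the strategy, ratios, and seed. The function name, the 70/15/15 ratios, and the manifest keys are assumptions chosen for the example.

```python
import json
import random
from collections import defaultdict

def stratified_split(labels, ratios=(0.7, 0.15, 0.15), seed=42):
    """Split indices into train/val/test so each class keeps roughly the given ratios."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for label in sorted(by_class):  # sorted for a deterministic class order
        idxs = by_class[label]
        rng.shuffle(idxs)
        n_train = round(len(idxs) * ratios[0])
        n_val = round(len(idxs) * ratios[1])
        splits["train"] += idxs[:n_train]
        splits["val"] += idxs[n_train:n_train + n_val]
        splits["test"] += idxs[n_train + n_val:]  # remainder goes to test
    return splits

labels = ["a"] * 20 + ["b"] * 20  # toy labels
splits = stratified_split(labels)
manifest = {
    "strategy": "stratified",
    "ratios": [0.7, 0.15, 0.15],
    "seed": 42,
    "splits": splits,
}
print(json.dumps({k: v for k, v in manifest.items() if k != "splits"}))
```

Writing the manifest to the metadata directory lets others recreate or audit the exact split later.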

Documentation

Good documentation enables reproducibility and collaboration.

Documentation Format

README.md for Dataset:

# Dataset Name

## Overview
Brief description

## Source
Where data came from

## Statistics
- Size: X samples
- Classes: Y
- Format: ...

## Preprocessing
Steps taken, parameters

## Splits
Train/val/test ratios, strategy

## Usage
How to load and use

## License
Usage rights

Data Ethics and Privacy

Ethical Considerations

1. Consent:
  • Informed consent
  • Clear purpose
  • Right to withdraw

2. Privacy:
  • Anonymization
  • Data minimization
  • Secure storage
  • Access control

3. Bias:
  • Check for bias
  • Document limitations
  • Fair representation
  • Mitigation strategies
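
A simple first check for representation is to compare how often each group appears in the dataset with the share you expect. The sketch below flags groups whose observed share deviates from a reference share by more than a tolerance. The reference shares and the 5% tolerance are assumptions that depend on the study, and this check does not replace a proper fairness analysis.

```python
from collections import Counter

def representation_gaps(groups, reference, tolerance=0.05):
    """Return groups whose observed share differs from the reference by more than tolerance."""
    counts = Counter(groups)
    total = len(groups)
    gaps = {}
    for group, expected in reference.items():
        observed = counts.get(group, 0) / total
        if abs(observed - expected) > tolerance:
            gaps[group] = {"observed": observed, "expected": expected}
    return gaps

# Toy example: group "b" is underrepresented relative to an assumed 50/50 reference.
groups = ["a"] * 80 + ["b"] * 20
print(representation_gaps(groups, {"a": 0.5, "b": 0.5}))
```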

Ethics

Research ethics are critical. Always consider:
  • Privacy and consent
  • Bias and fairness
  • Data usage rights
  • Potential harm

Privacy Protection

Techniques:
  • Anonymization: Remove identifiers
  • Pseudonymization: Replace identifiers
  • Differential Privacy: Add noise
  • Federated Learning: Keep data local
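
Two of these techniques fit in a few lines. The sketch below shows pseudonymization with a keyed hash (HMAC-SHA256) and a noisy count using the Laplace mechanism, a common building block of differential privacy. The function names, the key, and the example values are illustrative, and a production system should rely on a vetted privacy library rather than this sketch.

```python
import hashlib
import hmac
import random

def pseudonymize(identifier, secret_key):
    """Replace an identifier with a keyed hash; the same input maps to the same token."""
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def private_count(values, predicate, epsilon, rng):
    """Noisy count; a counting query has sensitivity 1, so the noise scale is 1/epsilon."""
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1 / epsilon, rng)

key = b"replace-with-a-real-secret"  # placeholder; keep the real key out of the dataset
token = pseudonymize("participant-001", key)

rng = random.Random(0)  # seeded here only to make the example repeatable
ages = [23, 35, 41, 52]
noisy = private_count(ages, lambda age: age >= 40, epsilon=1.0, rng=rng)
print(token[:12], round(noisy, 2))
```

Smaller values of `epsilon` add more noise and give stronger privacy at the cost of accuracy. Pseudonymized data can still be re-identified by anyone holding the key, so it is not the same as anonymization.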

Privacy

Protect participant privacy. Follow regulations (GDPR, etc.).

Data Quality Assurance

Quality Checks

1. Validation:
  • Schema validation
  • Range checks
  • Format validation
  • Completeness checks

2. Statistics:
  • Distribution analysis
  • Outlier detection
  • Missing value analysis
  • Correlation analysis

3. Visualization:
  • Data distributions
  • Sample inspection
  • Quality plots
  • Error analysis
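
The validation and outlier checks can be sketched with the standard library alone. The schema, field names, and thresholds below are made up for illustration; dedicated tools such as Great Expectations cover the same ground with far more features.

```python
import statistics

# Hypothetical schema: expected type plus optional min/max per field.
SCHEMA = {
    "age": {"type": int, "min": 0, "max": 120},
    "income": {"type": float, "min": 0.0},
    "country": {"type": str},
}

def validate_record(record, schema):
    """Return a list of human-readable problems; an empty list means the record passes."""
    errors = []
    for field, rule in schema.items():
        if field not in record or record[field] is None:
            errors.append(f"{field}: missing")
            continue
        value = record[field]
        if not isinstance(value, rule["type"]):
            errors.append(f"{field}: expected {rule['type'].__name__}")
            continue
        if "min" in rule and value < rule["min"]:
            errors.append(f"{field}: below {rule['min']}")
        if "max" in rule and value > rule["max"]:
            errors.append(f"{field}: above {rule['max']}")
    return errors

def iqr_outliers(values, k=1.5):
    """Flag values more than k * IQR outside the first and third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]

print(validate_record({"age": 130, "income": 10.0, "country": "DE"}, SCHEMA))
print(iqr_outliers([1, 2, 3, 4, 5, 6, 7, 8, 100]))
```

Flagged outliers deserve inspection before removal, since they may be valid but rare observations.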

Quality Checks

Always validate data quality. Automated checks catch issues early.

Data Management Tools

Data Versioning

DVC:
  • Data version control
  • Pipeline management
  • Cloud storage integration

Pachyderm:
  • Data versioning
  • Pipeline automation
  • Reproducibility

Data Storage

Cloud Storage:
  • AWS S3
  • Google Cloud Storage
  • Azure Blob Storage

Local Storage:
  • Network attached storage
  • Local drives
  • External drives

Data Processing

Pandas:
  • Data manipulation
  • Analysis
  • Cleaning

Dask:
  • Large dataset processing
  • Parallel computing
  • Distributed processing

Resources

📚 Data Management
  1. Data Version Control (DVC) - DVC documentation
  2. Data Management Guide - DataONE best practices
  3. Research Data Management - UK Data Service
🛠️ Tools
  1. DVC - Data version control
  2. Pandas - Data manipulation
  3. Dask - Large data processing
  4. Great Expectations - Data validation
📊 Datasets
  1. Papers With Code Datasets - Dataset collection
  2. Kaggle Datasets - Community datasets
  3. UCI ML Repository - Classic datasets

Key Takeaways:
  • Data quality is critical for research validity
  • Split data before preprocessing to avoid leakage
  • Organize data systematically with clear structure
  • Version data for reproducibility
  • Document data thoroughly
  • Consider ethics and privacy
  • Validate data quality before use