Chapter 8: Data Collection & Management
🎓 Learning Objectives
- Understand data collection strategies
- Learn data preprocessing techniques
- Master data organization and storage
- Understand data versioning and documentation
- Learn about data ethics and privacy
Data in Research
Data is the foundation of ML research. Quality data is essential for:
- Valid experimental results
- Reproducible research
- Generalizable findings
- Credible publications
Data Importance
"Garbage in, garbage out" - Poor data leads to poor results, regardless of model quality.
Data Collection
Sources of Data
1. Public Datasets:
   - Academic datasets (ImageNet, COCO, etc.)
   - Government data
   - Open data initiatives
   - Competition datasets
2. Collected Data:
   - Surveys
   - Experiments
   - Observations
   - Simulations
3. Synthetic Data:
   - Generated data
   - Augmented data
   - Simulated environments
Dataset Selection
- Use standard benchmarks for comparison
- Check data quality and size
- Verify licensing and usage rights
- Consider data diversity
Dataset Evaluation
Quality Checks:
- Completeness: Missing values?
- Accuracy: Correct labels?
- Consistency: Format issues?
- Relevance: Appropriate for problem?
- Size: Sufficient for conclusions?
Data Quality
Always inspect data before use. Quality issues can invalidate results.
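A first inspection pass can be sketched as below. This is a minimal illustration, assuming the dataset is a list of dictionaries with hypothetical `text` and `label` fields; the function name `quality_report` is also made up for this example. It counts missing values per field, exact duplicates, and labels per class.

```python
from collections import Counter

def quality_report(records, fields, label_field="label"):
    """Summarize missing values, exact duplicates, and class counts."""
    missing = Counter()               # field name -> records with no value
    seen, duplicates = set(), 0
    for rec in records:
        for name in fields:
            if rec.get(name) in (None, ""):
                missing[name] += 1
        key = tuple(rec.get(name) for name in fields)
        if key in seen:               # identical values in every checked field
            duplicates += 1
        seen.add(key)
    label_counts = Counter(rec.get(label_field) for rec in records)
    return {
        "n_records": len(records),
        "missing": dict(missing),
        "duplicates": duplicates,
        "label_counts": dict(label_counts),
    }

# Toy records, invented for illustration only.
records = [
    {"text": "good movie", "label": "pos"},
    {"text": "", "label": "neg"},             # missing text
    {"text": "fine", "label": "pos"},
    {"text": "good movie", "label": "pos"},   # exact duplicate of the first record
]
report = quality_report(records, fields=["text", "label"])
print(report)
```

A report like this does not prove that labels are accurate or that the data is relevant; those checks still need manual sample inspection.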
Data Preprocessing
Common Preprocessing Steps
1. Cleaning:
   - Remove duplicates
   - Handle missing values
   - Fix errors
   - Remove outliers (if appropriate)
2. Normalization:
   - Standardization (z-score)
   - Min-max scaling
   - Unit normalization
3. Transformation:
   - Feature engineering
   - Encoding (one-hot, label)
   - Dimensionality reduction
4. Splitting:
   - Train/validation/test splits
   - Stratification
   - Time-based splits
Preprocessing Order
- Split the data FIRST
- Fit each preprocessing step on the training split only
- Apply the fitted transformation to the training, validation, and test splits
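The splitting step itself can be sketched as a stratified split with a fixed seed. This is a minimal standard-library illustration; the function name `stratified_split`, the 70/15/15 ratios, and the toy labels are assumptions made for the example, not a prescribed setup.

```python
import random
from collections import defaultdict

def stratified_split(labels, ratios=(0.7, 0.15, 0.15), seed=42):
    """Return (train, val, test) index lists that keep per-class proportions."""
    rng = random.Random(seed)             # local RNG so the split is reproducible
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, val, test = [], [], []
    for label in sorted(by_class):        # sorted for a deterministic class order
        idxs = by_class[label]
        rng.shuffle(idxs)
        n_train = round(len(idxs) * ratios[0])
        n_val = round(len(idxs) * ratios[1])
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]    # whatever is left goes to test
    return train, val, test

labels = ["cat"] * 80 + ["dog"] * 20      # imbalanced toy labels
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

Saving the returned index lists together with the seed makes the split recoverable later, independent of the code that produced it.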
Preprocessing Best Practices
1. Split Before Preprocessing:
# Correct order
train, val, test = split_data(data)
scaler = fit_scaler(train) # Fit on train only
train_scaled = scaler.transform(train)
val_scaled = scaler.transform(val)
test_scaled = scaler.transform(test)
Data Leakage
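Fitting preprocessing on the full dataset leaks statistics from the validation and test sets into training, which makes evaluation results look better than they are. Always fit on the training set only.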
2. Document Preprocessing:
   - Record all transformations
   - Save preprocessing code
   - Document parameters
   - Version preprocessing pipeline
3. Reproducibility:
   - Set random seeds
   - Save preprocessing scripts
   - Document versions
   - Use version control
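The split-before-preprocessing rule from item 1 can be made concrete with a small runnable sketch. The `ZScoreScaler` class below is a hypothetical stand-in for a real scaler, and the data is synthetic; both are assumptions for illustration only.

```python
import random
import statistics

class ZScoreScaler:
    """Minimal z-score scaler for one numeric feature (illustrative only)."""

    def fit(self, values):
        self.mean = statistics.fmean(values)
        self.std = statistics.pstdev(values) or 1.0   # guard against zero spread
        return self

    def transform(self, values):
        return [(v - self.mean) / self.std for v in values]

rng = random.Random(0)                                 # fixed seed for repeatability
data = [rng.gauss(50.0, 10.0) for _ in range(1000)]    # synthetic placeholder feature

# 1) Split first.
rng.shuffle(data)
train, val, test = data[:700], data[700:850], data[850:]

# 2) Fit on the training split only, then 3) reuse those parameters everywhere.
scaler = ZScoreScaler().fit(train)
train_scaled = scaler.transform(train)
val_scaled = scaler.transform(val)
test_scaled = scaler.transform(test)

print(statistics.fmean(train_scaled))   # approximately 0 by construction
print(statistics.fmean(val_scaled))     # close to 0, but generally not exactly 0
```

The validation and test means are not exactly zero because their statistics were never used for fitting, which is the behavior a leakage-free pipeline should show.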
Data Organization
Directory Structure
Recommended Structure:
data/
├── raw/ # Original, unprocessed data
├── processed/ # Preprocessed data
├── splits/ # Train/val/test splits
├── metadata/ # Data documentation
└── scripts/ # Preprocessing scripts
Organization
- Keep raw data separate
- Version processed data
- Document everything
- Use consistent naming
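The recommended layout can be created programmatically so that every project starts from the same skeleton. This is a small sketch using `pathlib`; the helper name `create_data_layout` is hypothetical, and the demo runs in a temporary directory.

```python
import tempfile
from pathlib import Path

SUBDIRS = ["raw", "processed", "splits", "metadata", "scripts"]

def create_data_layout(root):
    """Create the data/ skeleton under `root` and return the data directory."""
    data_dir = Path(root) / "data"
    for name in SUBDIRS:
        # exist_ok=True makes the call safe to repeat on an existing project
        (data_dir / name).mkdir(parents=True, exist_ok=True)
    return data_dir

with tempfile.TemporaryDirectory() as tmp:
    data_dir = create_data_layout(tmp)
    created = sorted(p.name for p in data_dir.iterdir())
    print(created)
```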
Naming Conventions
Best Practices:
- Descriptive names
- Include version numbers
- Consistent format
- Avoid spaces/special chars
Examples:
- coco_train_2017_v1.0.parquet
- imagenet_val_processed_v2.1.h5
- custom_dataset_v1.0_raw.zip
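A naming convention is easier to keep when it is checked automatically. The sketch below assumes one possible convention inferred from the examples above (lowercase words joined by underscores, a `v<major>.<minor>` version, an optional suffix, and an extension); the exact pattern is an assumption, not a standard.

```python
import re

# Assumed convention: <name>_v<major>.<minor>[_<suffix>].<extension>
NAME_PATTERN = re.compile(
    r"[a-z0-9]+(?:_[a-z0-9]+)*"   # descriptive name, lowercase, underscores
    r"_v\d+\.\d+"                 # version number such as v1.0
    r"(?:_[a-z0-9]+)?"            # optional suffix such as _raw
    r"\.[a-z0-9]+"                # file extension
)

def is_valid_dataset_name(filename):
    """True if the whole filename follows the assumed convention."""
    return NAME_PATTERN.fullmatch(filename) is not None

for name in [
    "coco_train_2017_v1.0.parquet",
    "imagenet_val_processed_v2.1.h5",
    "custom_dataset_v1.0_raw.zip",
    "My Dataset final.csv",       # spaces, capitals, and no version
]:
    print(name, is_valid_dataset_name(name))
```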
Data Storage
Storage Formats
Text Data:
- CSV (small datasets)
- JSON (structured data)
- Parquet (efficient, large datasets)
Image Data:
- Individual files (JPG, PNG)
- HDF5 (efficient, large datasets)
- TFRecord (TensorFlow)
- LMDB (fast access)
Tabular Data:
- CSV (small)
- Parquet (large, efficient)
- HDF5 (hierarchical)
Format Choice
Choose format based on:
- Data size
- Access patterns
- Tool compatibility
- Efficiency needs
Storage Best Practices
1. Version Control:
   - Use DVC (Data Version Control)
   - Git LFS for moderately sized files that should live alongside the code
   - Cloud storage with versioning
2. Backup:
   - Multiple locations
   - Regular backups
   - Test restoration
3. Access Control:
   - Permissions management
   - Secure storage
   - Audit logs
Data Versioning
Why Version Data?
Benefits:
- Reproducibility
- Track changes
- Rollback if needed
- Collaboration
Versioning Tools
DVC (Data Version Control):
- Git-like for data
- Efficient storage
- Reproducible pipelines
- Cloud integration
Git LFS:
- Large file support
- Git integration
- Version tracking
Data Versioning
Use DVC for data versioning. It's designed for ML workflows.
Versioning Strategy
Version Naming:
- Semantic versioning (v1.0.0)
- Date-based (2024-01-15)
- Descriptive (v1.0.0-cleaned)
Version Documentation:
- Changelog
- What changed
- Why changed
- Who changed
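The idea behind a versioned dataset with a changelog can be illustrated with a checksum manifest. This is a hand-rolled sketch of content hashing in general, not how DVC or Git LFS are implemented; the function names and the manifest fields (`version`, `note`, `author`) are assumptions chosen to mirror the list above.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(data_dir, version, note, author):
    """Map every file under data_dir to its SHA-256 plus changelog fields."""
    data_dir = Path(data_dir)
    files = {
        p.relative_to(data_dir).as_posix(): sha256_of(p)
        for p in sorted(data_dir.rglob("*")) if p.is_file()
    }
    return {"version": version, "note": note, "author": author, "files": files}

def changed_files(old, new):
    """Files that were added, removed, or modified between two manifests."""
    names = set(old["files"]) | set(new["files"])
    return sorted(n for n in names if old["files"].get(n) != new["files"].get(n))

with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "train.csv").write_text("x,y\n1,2\n")
    v1 = build_manifest(tmp, "v1.0.0", "initial import", "alice")
    (Path(tmp) / "train.csv").write_text("x,y\n1,2\n3,4\n")
    v2 = build_manifest(tmp, "v1.1.0", "added one row", "alice")
    diff = changed_files(v1, v2)

print(json.dumps(v2, indent=2))
print(diff)
```

Committing such a manifest next to the code records what changed, why, and by whom, even when the data files themselves live outside the repository.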
Data Documentation
What to Document
Dataset Information:
- Source and origin
- Collection method
- Size and statistics
- License and usage
Preprocessing:
- Steps taken
- Parameters used
- Code/scripts
- Versions
Splits:
- Split strategy
- Split ratios
- Stratification info
- Random seeds
Documentation
Good documentation enables reproducibility and collaboration.
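The items above can also be stored in machine-readable form next to the data. The sketch below uses a dataclass serialized to JSON; the class name `DatasetRecord`, its fields, and every value are placeholders invented for the example.

```python
import json
from dataclasses import asdict, dataclass, field

@dataclass
class DatasetRecord:
    """Machine-readable counterpart of a dataset README (hypothetical schema)."""
    name: str
    source: str
    collection_method: str
    license: str
    n_samples: int
    preprocessing_steps: list = field(default_factory=list)
    split_strategy: str = "random"
    split_ratios: tuple = (0.8, 0.1, 0.1)
    random_seed: int = 42

record = DatasetRecord(                 # all values below are placeholders
    name="example_dataset",
    source="internal survey",
    collection_method="online questionnaire",
    license="CC BY 4.0",
    n_samples=1000,
    preprocessing_steps=["drop duplicates", "z-score numeric columns"],
    split_strategy="stratified",
)
record_json = json.dumps(asdict(record), indent=2)
print(record_json)
```

A file like this could sit in the `metadata/` directory and be checked by scripts, while the README remains the human-readable description.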
Documentation Format
README.md for Dataset:
# Dataset Name
## Overview
Brief description
## Source
Where data came from
## Statistics
- Size: X samples
- Classes: Y
- Format: ...
## Preprocessing
Steps taken, parameters
## Splits
Train/val/test ratios, strategy
## Usage
How to load and use
## License
Usage rights
Data Ethics and Privacy
Ethical Considerations
1. Consent:
   - Informed consent
   - Clear purpose
   - Right to withdraw
2. Privacy:
   - Anonymization
   - Data minimization
   - Secure storage
   - Access control
3. Bias:
   - Check for bias
   - Document limitations
   - Fair representation
   - Mitigation strategies
Ethics
Research ethics are critical. Always consider:
- Privacy and consent
- Bias and fairness
- Data usage rights
- Potential harm
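A coarse first check for representation bias is to count how each group appears in the data and how labels are distributed within each group. The sketch below assumes records with hypothetical `group` and `label` fields and toy values; it is a starting point for inspection, not a fairness audit.

```python
from collections import Counter

def group_shares(records, group_field):
    """Fraction of records that belong to each group."""
    counts = Counter(rec[group_field] for rec in records)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}

def positive_rate_by_group(records, group_field, label_field, positive):
    """Share of positive labels within each group."""
    totals, positives = Counter(), Counter()
    for rec in records:
        totals[rec[group_field]] += 1
        positives[rec[group_field]] += int(rec[label_field] == positive)
    return {group: positives[group] / totals[group] for group in totals}

# Toy records, invented for illustration only.
records = (
    [{"group": "A", "label": 1}] * 3
    + [{"group": "A", "label": 0}] * 3
    + [{"group": "B", "label": 0}] * 2
)
shares = group_shares(records, "group")
rates = positive_rate_by_group(records, "group", "label", positive=1)
print(shares)
print(rates)
```

Large gaps in either table are worth documenting as limitations, even when they cannot be fixed by collecting more data.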
Privacy Protection
Techniques:
- Anonymization: Remove direct identifiers
- Pseudonymization: Replace identifiers with pseudonyms
- Differential Privacy: Add calibrated noise to released results
- Federated Learning: Keep raw data local and share only model updates
Privacy
Protect participant privacy. Follow regulations (GDPR, etc.).
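Two of these techniques can be sketched with the standard library. Pseudonymization is shown as a keyed hash (HMAC-SHA256), and differential privacy as the Laplace mechanism for a simple counting query. The key, identifiers, counts, and epsilon value are placeholders, and Python's `random` module is not a cryptographically secure noise source, so this is an illustration rather than a production implementation.

```python
import hashlib
import hmac
import random

def pseudonymize(identifier, secret_key):
    """Replace an identifier with a keyed hash (HMAC-SHA256).

    The same identifier and key always give the same pseudonym, so records
    stay linkable, but the mapping cannot be recomputed without the key.
    """
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_count(true_count, epsilon, rng):
    """Laplace mechanism for a counting query, which has sensitivity 1."""
    return true_count + laplace_noise(1.0 / epsilon, rng)

key = b"replace-with-a-secret-key"     # placeholder; keep real keys out of code
p1 = pseudonymize("alice@example.org", key)
p2 = pseudonymize("alice@example.org", key)
p3 = pseudonymize("bob@example.org", key)
print(p1 == p2, p1 == p3)

rng = random.Random(0)                 # seeded only to make the demo repeatable
released = noisy_count(120, epsilon=0.5, rng=rng)
print(released)
```

Pseudonymized data is still personal data as long as the key exists, so the key needs the same access control as the original identifiers.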
Data Quality Assurance
Quality Checks
1. Validation:
   - Schema validation
   - Range checks
   - Format validation
   - Completeness checks
2. Statistics:
   - Distribution analysis
   - Outlier detection
   - Missing value analysis
   - Correlation analysis
3. Visualization:
   - Data distributions
   - Sample inspection
   - Quality plots
   - Error analysis
Quality Checks
Always validate data quality. Automated checks catch issues early.
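An automated check of the validation kind can be as small as a rule table applied to every record. The sketch below is a hand-written illustration with a hypothetical schema format (`type`, `min`, `max`, `allowed`); dedicated validation tools cover far more cases.

```python
def validate_record(record, schema):
    """Return a list of problems for one record; an empty list means valid."""
    errors = []
    for name, rules in schema.items():
        if record.get(name) is None:                       # completeness check
            errors.append(f"{name}: missing")
            continue
        value = record[name]
        if not isinstance(value, rules["type"]):           # schema/type check
            errors.append(f"{name}: expected {rules['type'].__name__}")
            continue
        if "min" in rules and value < rules["min"]:        # range checks
            errors.append(f"{name}: below minimum {rules['min']}")
        if "max" in rules and value > rules["max"]:
            errors.append(f"{name}: above maximum {rules['max']}")
        if "allowed" in rules and value not in rules["allowed"]:
            errors.append(f"{name}: unexpected value {value!r}")
    return errors

SCHEMA = {  # hypothetical schema, for illustration only
    "age": {"type": int, "min": 0, "max": 120},
    "label": {"type": str, "allowed": {"pos", "neg"}},
}

rows = [
    {"age": 34, "label": "pos"},
    {"age": -5, "label": "neg"},       # out of range
    {"age": 51, "label": "maybe"},     # label not in the allowed set
    {"label": "pos"},                  # age missing
]
problems = {}
for i, row in enumerate(rows):
    errs = validate_record(row, SCHEMA)
    if errs:
        problems[i] = errs
print(problems)
```

Running a check like this whenever new data arrives turns silent quality regressions into explicit, reviewable failures.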
Data Management Tools
Data Versioning
DVC:
- Data version control
- Pipeline management
- Cloud storage integration
Pachyderm:
- Data versioning
- Pipeline automation
- Reproducibility
Data Storage
Cloud Storage:
- AWS S3
- Google Cloud Storage
- Azure Blob Storage
Local Storage:
- Network attached storage
- Local drives
- External drives
Data Processing
Pandas:
- Data manipulation
- Analysis
- Cleaning
Dask:
- Large dataset processing
- Parallel computing
- Distributed processing
Resources
📚 Data Management
- Data Version Control (DVC) - DVC documentation
- Data Management Guide - DataONE best practices
- Research Data Management - UK Data Service
🛠️ Tools
- DVC - Data version control
- Pandas - Data manipulation
- Dask - Large data processing
- Great Expectations - Data validation
📊 Datasets
- Papers With Code Datasets - Dataset collection
- Kaggle Datasets - Community datasets
- UCI ML Repository - Classic datasets
Next Steps
- Chapter 9: Reproducing Research Papers - Code reproduction
- Chapter 10: Research Tools & Platforms - Essential tools
Key Takeaways:
- Data quality is critical for research validity
- Split data before preprocessing to avoid leakage
- Organize data systematically with clear structure
- Version data for reproducibility
- Document data thoroughly
- Consider ethics and privacy
- Validate data quality before use