Apache Airflow ETL Pipeline
data_engineering
Python
architecture
strict_senior
Production-ready ETL pipeline with error handling, retry logic, and data validation.
By emily_r
12/8/2025
Prompt
Apache Airflow ETL Pipeline
Create a production-ready ETL pipeline using Apache Airflow.
Pipeline Stages
1. Extract
- Data source: [Source: API/Database/S3]
- Handle pagination for APIs
- Incremental extraction of only new or changed records since the last run (see the extract sketch below)
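A minimal sketch of a paginated, incremental extract is shown below; the endpoint, the updated_since parameter, and the "results" response key are illustrative assumptions rather than a specific API.

```python
import requests

def extract_records(api_url: str, updated_since: str, page_size: int = 500) -> list:
    """Pull every page of records modified since the last successful run."""
    records, page = [], 1
    while True:
        resp = requests.get(
            api_url,
            params={"updated_since": updated_since, "page": page, "per_page": page_size},
            timeout=30,
        )
        resp.raise_for_status()        # surface HTTP errors so Airflow can retry the task
        batch = resp.json().get("results", [])
        if not batch:
            break                      # an empty page means we have reached the end
        records.extend(batch)
        page += 1
    return records
```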
2. Transform
- Use pandas for data manipulation:
  - Data cleaning (null handling, duplicates)
  - Aggregations and calculations
  - Joining multiple datasets
- Data validation and quality checks (see the transform sketch below)
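A pandas transform along these lines might look like the sketch below; the orders/customers frames and their column names are assumed purely for illustration.

```python
import pandas as pd

def transform(orders: pd.DataFrame, customers: pd.DataFrame) -> pd.DataFrame:
    """Clean, join, and aggregate the raw extracts (column names are illustrative)."""
    # Cleaning: drop exact duplicates, rows missing the business key, and null amounts
    orders = (
        orders.drop_duplicates()
        .dropna(subset=["order_id"])
        .assign(amount=lambda df: df["amount"].fillna(0.0))
    )

    # Join orders with the customer dimension (many orders to one customer)
    enriched = orders.merge(customers, on="customer_id", how="left", validate="m:1")

    # Aggregate to daily revenue and order counts per customer
    daily = (
        enriched.groupby(["customer_id", "order_date"], as_index=False)
        .agg(total_amount=("amount", "sum"), order_count=("order_id", "count"))
    )

    # Lightweight sanity check before handing off to the load step
    if (daily["total_amount"] < 0).any():
        raise ValueError("negative daily revenue detected")
    return daily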
3. Load
- Destination: [Destination: Data Warehouse/Database]
- Batch inserts for performance
- Upsert logic for updates (insert-or-update; see the load sketch below)
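One possible load sketch, assuming a Postgres-compatible warehouse reached through psycopg2; the schema, table, and conflict key are illustrative assumptions.

```python
import psycopg2
from psycopg2.extras import execute_values

UPSERT_SQL = """
    INSERT INTO analytics.daily_revenue (customer_id, order_date, total_amount, order_count)
    VALUES %s
    ON CONFLICT (customer_id, order_date)
    DO UPDATE SET total_amount = EXCLUDED.total_amount,
                  order_count  = EXCLUDED.order_count
"""

def load(rows, dsn: str, batch_size: int = 1000) -> None:
    """Batch upsert into the warehouse; the connection context manager
    commits on success and rolls back on error."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            execute_values(cur, UPSERT_SQL, rows, page_size=batch_size)
```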
DAG Configuration
Task Dependencies
- Clear dependency graph
- Use the >> operator for readable dependency chains (see the DAG skeleton below)
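A minimal DAG skeleton wiring the three stages with >> might look like this; the DAG id, schedule, and placeholder callables are assumptions standing in for the sketches above.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder callables standing in for the extract/transform/load sketches above.
def _extract(): ...
def _transform(): ...
def _load(): ...

with DAG(
    dag_id="example_etl",                    # illustrative name
    start_date=datetime(2025, 1, 1),
    schedule="@daily",                       # Airflow 2.4+; older versions use schedule_interval
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=_extract)
    transform = PythonOperator(task_id="transform", python_callable=_transform)
    load = PythonOperator(task_id="load", python_callable=_load)

    # >> reads left to right: extract, then transform, then load
    extract >> transform >> load
```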
Retry Logic
- Exponential backoff strategy
- Maximum retry attempts
- Configurable retry delay (see the default_args example below)
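These requirements map onto Airflow's built-in retry parameters; the values below are illustrative defaults, not recommendations.

```python
from datetime import timedelta

# Shared retry settings applied to every task via default_args.
default_args = {
    "retries": 3,                              # maximum retry attempts per task
    "retry_delay": timedelta(minutes=2),       # initial delay before the first retry
    "retry_exponential_backoff": True,         # doubles the delay on each subsequent attempt
    "max_retry_delay": timedelta(minutes=30),  # cap so the backoff never exceeds 30 minutes
}
# Pass as DAG(..., default_args=default_args); individual operators can still override.
```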
SLA Monitoring
- Set task SLAs
- Monitor pipeline execution time (SLA callback sketch below)
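A sketch of a per-task SLA plus a DAG-level sla_miss_callback; the one-hour threshold and the print-based alert are placeholders to be replaced with real monitoring.

```python
from datetime import timedelta

def sla_miss_alert(dag, task_list, blocking_task_list, slas, blocking_tis):
    """sla_miss_callback signature used by Airflow; hook real alerting in here."""
    print(f"SLA missed in {dag.dag_id}: {task_list}")

default_args = {"sla": timedelta(hours=1)}     # per-task SLA; can also be set per operator

# Attach at the DAG level:
# DAG(..., default_args=default_args, sla_miss_callback=sla_miss_alert)
```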
Alerts
- Email notifications on failure
- Custom alert callbacks
- Slack/PagerDuty integration (see the failure callback sketch below)
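A sketch combining email notification with a custom on_failure_callback that posts to a webhook; the webhook URL and recipient address are placeholders.

```python
import requests

def notify_failure(context):
    """on_failure_callback: push failure details to a chat/incident webhook."""
    ti = context["task_instance"]
    message = (
        f"Task {ti.task_id} in DAG {ti.dag_id} failed on try {ti.try_number}. "
        f"Log: {ti.log_url}"
    )
    # placeholder endpoint; swap in the real Slack/PagerDuty integration
    requests.post("https://hooks.example.com/placeholder", json={"text": message}, timeout=10)

default_args = {
    "email_on_failure": True,
    "email": ["data-alerts@example.com"],      # illustrative recipient
    "on_failure_callback": notify_failure,     # fires once per failed task instance
}
```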
TaskGroups
- Organize related tasks
- Improve DAG visualization
- Logical grouping of operations (TaskGroup example below)
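A small example of grouping related tasks with TaskGroup; the DAG id and task ids are illustrative.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.empty import EmptyOperator   # Airflow 2.3+; DummyOperator on older versions
from airflow.utils.task_group import TaskGroup

with DAG(dag_id="grouped_etl", start_date=datetime(2025, 1, 1), schedule=None) as dag:
    start = EmptyOperator(task_id="start")

    # Related validation steps collapse into a single node in the Graph view.
    with TaskGroup(group_id="quality_checks") as quality_checks:
        row_count = EmptyOperator(task_id="row_count")
        schema = EmptyOperator(task_id="schema")
        row_count >> schema

    start >> quality_checks
```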
Data Quality Checks
Validation Tasks
- Row count validation - Ensure expected data volume
- Schema validation - Verify column names and types
- Null checks - Critical fields not null
- Data range checks - Values within expected ranges (see the validation sketch below)
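The checks above could be implemented as a single validation task along these lines; the column names, minimum row count, and value range are assumptions.

```python
import logging
import pandas as pd

log = logging.getLogger(__name__)

def validate(df: pd.DataFrame, min_rows: int = 1) -> None:
    """Raise on any failed check so the task fails and retries/alerts kick in."""
    # Row count: ensure the expected data volume arrived
    if len(df) < min_rows:
        raise ValueError(f"expected at least {min_rows} rows, got {len(df)}")

    # Schema: required columns must be present
    required = {"customer_id", "order_date", "total_amount"}
    missing = required - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {missing}")

    # Nulls: critical fields must be populated
    if df["customer_id"].isna().any():
        raise ValueError("null customer_id values found")

    # Range: values within expected bounds
    if not df["total_amount"].between(0, 1_000_000).all():
        raise ValueError("total_amount outside expected range")

    log.info("quality checks passed: %d rows", len(df))
```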
Best Practices
Idempotency
- Safe to re-run tasks
- Deterministic outcomes
- Handle duplicates gracefully (see the idempotent load sketch below)
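One common way to keep the load idempotent is a delete-then-insert scoped to the run's logical date, sketched below; the table name and connection object are assumptions.

```python
from psycopg2.extras import execute_values

def load_partition(df, conn, ds: str) -> None:
    """Delete-then-insert for one logical date: re-running the same run
    replaces that day's rows instead of duplicating them."""
    rows = list(df.itertuples(index=False, name=None))
    with conn.cursor() as cur:
        # Remove anything a previous attempt wrote for this logical date
        cur.execute("DELETE FROM analytics.daily_revenue WHERE order_date = %s", (ds,))
        # Re-insert the freshly computed rows for the same window
        execute_values(cur, "INSERT INTO analytics.daily_revenue VALUES %s", rows)
    conn.commit()
```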
Logging
- Comprehensive logging at each stage
- Log data quality metrics
- Audit trail for debugging (logging example below)
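A small example of stage-level logging that ends up in the Airflow task log; the choice of metrics is illustrative.

```python
import logging

log = logging.getLogger(__name__)   # messages appear in the Airflow task log

def clean_with_audit(df):
    """Log row counts before and after cleaning so each run leaves an audit trail
    (df is assumed to be a pandas DataFrame)."""
    log.info("input: %d rows, %d columns", len(df), df.shape[1])
    cleaned = df.drop_duplicates()
    log.info("dropped %d duplicate rows", len(df) - len(cleaned))
    return cleaned
```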
Requirements
- Complete DAG implementation
- Production-ready error handling
- Performance optimized
- Well-documented code
Tags
airflow
etl
data-pipeline
orchestration
Tested Models
gpt-4
claude-3-opus