Apache Airflow ETL Pipeline

data_engineering
Python
architecture
strict_senior

Production-ready ETL pipeline with error handling, retry logic, and data validation.

12/8/2025

Prompt

Apache Airflow ETL Pipeline

Create a production-ready ETL pipeline using Apache Airflow.

Pipeline Stages

1. Extract

  • Data source: [Source: API/Database/S3]
  • Handle pagination for APIs
  • Incremental extraction: pull only records new or changed since the last successful run (sketched below)
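
A minimal sketch of the extract stage, assuming a cursor-paginated REST API with an `updated_since` filter; the endpoint contract, parameter names, and response shape are illustrative assumptions, not a real API:

```python
import requests

def extract_records(api_url: str, last_run_ts: str) -> list[dict]:
    """Pull every page of records changed since the last successful run.

    The `updated_since` filter and cursor-based pagination are assumptions
    about the source API; swap in the real endpoint's contract.
    """
    records: list[dict] = []
    cursor = None
    while True:
        params = {"updated_since": last_run_ts, "page_size": 500}
        if cursor:
            params["cursor"] = cursor
        resp = requests.get(api_url, params=params, timeout=30)
        resp.raise_for_status()  # surface HTTP errors so Airflow can retry
        payload = resp.json()
        records.extend(payload["results"])
        cursor = payload.get("next_cursor")
        if not cursor:  # last page reached
            return records
```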

2. Transform

  • Use pandas for data manipulation:
    • Data cleaning (null handling, duplicates)
    • Aggregations and calculations
    • Joining multiple datasets
  • Data validation and quality checks (a pandas sketch of this stage follows the list)
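
A pandas sketch of the transform stage; the column names (`order_id`, `customer_id`, `amount`, `order_date`) and the reference dataset are hypothetical stand-ins:

```python
import pandas as pd

def transform(raw: list[dict], customers: pd.DataFrame) -> pd.DataFrame:
    """Clean, join, and aggregate extracted records into a daily summary."""
    df = pd.DataFrame(raw)
    df = df.drop_duplicates(subset="order_id")        # dedupe on the business key
    df = df.dropna(subset=["customer_id", "amount"])  # drop rows missing critical fields
    df["amount"] = df["amount"].astype(float)
    df = df.merge(customers, on="customer_id", how="left")  # join reference data
    return (
        df.groupby(["customer_id", "order_date"], as_index=False)
          .agg(total_amount=("amount", "sum"),
               order_count=("order_id", "count"))
    )
```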

3. Load

  • Destination: [Destination: Data Warehouse/Database]
  • Batch inserts for performance
  • Upsert logic so re-delivered records update existing rows instead of creating duplicates (sketched below)
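
A batched upsert sketch, assuming a PostgreSQL destination reached via psycopg2; the `daily_orders` table, its columns, and the conflict key are placeholders:

```python
import psycopg2
from psycopg2.extras import execute_values

UPSERT_SQL = """
    INSERT INTO daily_orders (customer_id, order_date, total_amount, order_count)
    VALUES %s
    ON CONFLICT (customer_id, order_date)
    DO UPDATE SET total_amount = EXCLUDED.total_amount,
                  order_count  = EXCLUDED.order_count
"""

def load(rows: list[tuple], dsn: str, batch_size: int = 1000) -> None:
    """Insert in batches; conflicting keys update in place, so reruns are safe."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            execute_values(cur, UPSERT_SQL, rows, page_size=batch_size)
        conn.commit()
```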

DAG Configuration

Task Dependencies

  • Clear dependency graph
  • Use the >> operator for readability (see the skeleton below)
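
A minimal DAG skeleton showing the dependency graph; the `dag_id` and the no-op callables are placeholders for the real stage functions:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(
    dag_id="orders_etl",                 # placeholder name
    start_date=datetime(2025, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=lambda: None)
    transform = PythonOperator(task_id="transform", python_callable=lambda: None)
    validate = PythonOperator(task_id="validate", python_callable=lambda: None)
    load = PythonOperator(task_id="load", python_callable=lambda: None)

    # Dependencies read left to right, matching the pipeline stages
    extract >> transform >> validate >> load
```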

Retry Logic

  • Exponential backoff strategy
  • Maximum retry attempts
  • Retry delay configuration (example below)
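
All three knobs are standard Airflow operator arguments, set once in `default_args`; the specific values here are illustrative:

```python
from datetime import timedelta

default_args = {
    "retries": 3,                              # maximum retry attempts
    "retry_delay": timedelta(minutes=2),       # delay before the first retry
    "retry_exponential_backoff": True,         # 2 min, 4 min, 8 min, ...
    "max_retry_delay": timedelta(minutes=30),  # cap on the backoff
}
# Pass to the DAG so every task inherits them: DAG(..., default_args=default_args)
```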

SLA Monitoring

  • Set task SLAs
  • Monitor pipeline execution time (see the sketch below)
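
In Airflow 2.x, SLAs are set per task and misses are reported through a DAG-level callback; the alert body here is a stub to be wired to real alerting:

```python
from datetime import timedelta

def sla_miss_alert(dag, task_list, blocking_task_list, slas, blocking_tis):
    """Called by the scheduler when tasks miss their SLA."""
    print(f"SLA missed in {dag.dag_id}: {task_list}")  # stub; send a real alert here

# On the DAG:    DAG(..., sla_miss_callback=sla_miss_alert)
# On each task:  PythonOperator(..., sla=timedelta(hours=1), ...)
```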

Alerts

  • Email notifications on failure
  • Custom alert callbacks
  • Slack/PagerDuty integration (a Slack webhook sketch follows)
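
A failure callback posting to a Slack incoming webhook; the webhook URL and email address are placeholders, and a provider operator (e.g. from `apache-airflow-providers-slack`) could replace the raw `requests` call:

```python
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."  # placeholder

def notify_failure(context):
    """on_failure_callback: Airflow passes the task context on failure."""
    ti = context["task_instance"]
    text = (
        f"Task {ti.task_id} in DAG {ti.dag_id} failed "
        f"(run {context['run_id']}). Logs: {ti.log_url}"
    )
    requests.post(SLACK_WEBHOOK_URL, json={"text": text}, timeout=10)

default_args = {
    "email_on_failure": True,
    "email": ["data-alerts@example.com"],   # placeholder address
    "on_failure_callback": notify_failure,
}
```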

TaskGroups

  • Organize related tasks
  • Improve DAG visualization
  • Logical grouping of operations (example below)
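
A TaskGroup sketch; the task names mirror the validation checks below, and the no-op callables are stand-ins:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.task_group import TaskGroup

with DAG("orders_etl_grouped", start_date=datetime(2025, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    extract = PythonOperator(task_id="extract", python_callable=lambda: None)
    load = PythonOperator(task_id="load", python_callable=lambda: None)

    with TaskGroup(group_id="quality_checks") as quality_checks:
        # Sibling tasks run in parallel and collapse into one node in the Graph view
        for name in ("row_count", "schema_check", "null_check"):
            PythonOperator(task_id=name, python_callable=lambda: None)

    extract >> quality_checks >> load
```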

Data Quality Checks

Validation Tasks

  1. Row count validation - Ensure expected data volume
  2. Schema validation - Verify column names and types
  3. Null checks - Critical fields must not be null
  4. Data range checks - Values within expected ranges (all four are sketched below)
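
One way to implement all four checks in a single task, using the hypothetical schema from the transform sketch; a dedicated framework such as Great Expectations is a common alternative to hand-rolled checks:

```python
import pandas as pd

EXPECTED_SCHEMA = {
    "customer_id": "int64",
    "order_date": "object",
    "total_amount": "float64",
    "order_count": "int64",
}

def validate(df: pd.DataFrame, min_rows: int = 1) -> None:
    """Raise on any violation so the task fails and triggers retries/alerts."""
    if len(df) < min_rows:                                    # 1. row count
        raise ValueError(f"Row count {len(df)} < expected minimum {min_rows}")
    actual = {col: str(dtype) for col, dtype in df.dtypes.items()}
    if actual != EXPECTED_SCHEMA:                             # 2. schema
        raise ValueError(f"Schema drift: {actual} != {EXPECTED_SCHEMA}")
    if df["customer_id"].isna().any():                        # 3. nulls
        raise ValueError("Null customer_id values found")
    if (df["total_amount"] < 0).any():                        # 4. range
        raise ValueError("Negative total_amount values found")
```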

Best Practices

Idempotency

  • Safe to re-run tasks
  • Deterministic outcomes
  • Handle duplicates gracefully (see the idempotent-load sketch below)
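
An idempotent load pattern keyed on Airflow's templated logical date (`{{ ds }}`): deleting the partition before inserting means a re-run replaces rather than duplicates data. It assumes a psycopg2-style connection and the hypothetical `daily_orders` table from the load sketch:

```python
import pandas as pd

def load_partition(df: pd.DataFrame, conn, ds: str) -> None:
    """Replace exactly one logical-date partition; safe to re-run."""
    with conn.cursor() as cur:
        cur.execute("DELETE FROM daily_orders WHERE order_date = %s", (ds,))
        cur.executemany(
            "INSERT INTO daily_orders "
            "(customer_id, order_date, total_amount, order_count) "
            "VALUES (%s, %s, %s, %s)",
            list(df.itertuples(index=False, name=None)),
        )
    conn.commit()
```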

Logging

  • Comprehensive logging at each stage
  • Log data quality metrics
  • Audit trail for debugging (example below)
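
Inside a task, the standard `airflow.task` logger writes to the per-attempt log shown in the UI; the metrics logged here are illustrative:

```python
import logging

import pandas as pd

log = logging.getLogger("airflow.task")  # routed to the task's log in the UI

def transform_with_metrics(raw: list[dict]) -> pd.DataFrame:
    log.info("Received %d raw records", len(raw))
    df = pd.DataFrame(raw).drop_duplicates()
    log.info("After dedup: %d rows x %d columns", *df.shape)
    log.info("Null ratio per column: %s", df.isna().mean().round(3).to_dict())
    return df
```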

Requirements

  • Complete DAG implementation
  • Production-ready error handling
  • Performance-optimized
  • Well-documented code

Tags

airflow
etl
data-pipeline
orchestration

Tested Models

gpt-4
claude-3-opus
