Build PySpark ETL Pipeline
data_engineering
Python
architecture
strict_senior
Create a production-ready PySpark ETL pipeline with data quality checks and optimizations.
By ethan_w
12/8/2025
Prompt
Build a production-ready Apache Spark ETL pipeline for [PIPELINE_NAME] with the following specifications:
Requirements
- Pipeline name: [PIPELINE_NAME]
- Source data locations: [SOURCE_1], [SOURCE_2], [SOURCE_3]
- Source formats: [CSV/Parquet/JDBC/JSON]
- Target destination: [S3/Delta Lake/Database/HDFS]
- Partition keys: [KEY_1], [KEY_2]
- Primary deduplication columns: [DEDUP_COLUMNS]
- Required transformations: [LIST_TRANSFORMATIONS]
- Aggregation metrics: [METRICS_TO_CALCULATE]
- Window operations needed: [YES/NO]
- Join operations: [DESCRIBE_JOINS]
- Spark config requirements: [MEMORY/CORES/ADAPTIVE_QUERY]
Deliverables
Generate complete PySpark code with:
1. SparkSession configuration - with adaptive query execution, memory settings, and app name
2. Extract functions - read from [SOURCE_FORMATS] at [SOURCE_LOCATIONS]
3. Transform functions, including:
   - Data cleaning: drop duplicates on [DEDUP_COLUMNS], handle nulls
   - Type casting: convert columns to proper types (IntegerType, TimestampType, etc.)
   - Feature engineering: extract temporal features from timestamps
   - Aggregations: group by [PARTITION_KEYS] and calculate [METRICS]
   - Window functions: partition by [WINDOW_PARTITION] and calculate running totals/rankings
   - Joins: join datasets on [JOIN_KEYS] with [INNER/LEFT/RIGHT] strategy
4. Data quality check function - validate row counts, nulls, duplicates, and schema
5. Load functions - write to [TARGET] partitioned by [PARTITION_KEYS]
6. Optimization strategies:
   - DataFrame caching for reused data
   - Repartitioning for parallelism
   - Broadcast joins for small lookup tables
   - Coalesce for output file consolidation
7. Main pipeline orchestration function
Include error handling, logging, and performance optimizations throughout.
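For reference, the sketches below illustrate the kind of output this prompt targets. First, a minimal SparkSession setup with adaptive query execution enabled; the app name, memory, and shuffle settings are illustrative placeholders rather than recommended values.

```python
from pyspark.sql import SparkSession

# Minimal sketch: SparkSession with adaptive query execution enabled.
# App name, memory, and shuffle settings are placeholders to tune per cluster.
spark = (
    SparkSession.builder
    .appName("example_etl_pipeline")                               # substitute [PIPELINE_NAME]
    .config("spark.sql.adaptive.enabled", "true")                  # adaptive query execution
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    .config("spark.executor.memory", "4g")                         # placeholder memory setting
    .config("spark.sql.shuffle.partitions", "200")                 # adjust to data volume
    .getOrCreate()
)
```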
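An extract-function sketch for two of the listed source formats (CSV and Parquet); the paths and read options are assumptions standing in for [SOURCE_LOCATIONS].

```python
from pyspark.sql import DataFrame, SparkSession

def extract_csv(spark: SparkSession, path: str) -> DataFrame:
    # Header and schema inference are illustrative defaults; supply an explicit
    # schema in production to avoid a second pass over the data.
    return spark.read.option("header", "true").option("inferSchema", "true").csv(path)

def extract_parquet(spark: SparkSession, path: str) -> DataFrame:
    # The path stands in for one of [SOURCE_1]/[SOURCE_2]/[SOURCE_3].
    return spark.read.parquet(path)
```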
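A sketch of the cleaning, type-casting, and temporal feature-engineering transforms; the column names (order_id, amount, event_ts) are hypothetical stand-ins for [DEDUP_COLUMNS] and the source schema.

```python
from pyspark.sql import DataFrame
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType, TimestampType

def clean_and_enrich(df: DataFrame) -> DataFrame:
    # Deduplicate, fill nulls, cast types, and derive temporal features.
    # Column names are hypothetical placeholders.
    return (
        df.dropDuplicates(["order_id"])                            # [DEDUP_COLUMNS]
          .na.fill({"amount": 0})                                  # example null-handling policy
          .withColumn("amount", F.col("amount").cast(IntegerType()))
          .withColumn("event_ts", F.col("event_ts").cast(TimestampType()))
          .withColumn("event_date", F.to_date("event_ts"))         # temporal features
          .withColumn("event_hour", F.hour("event_ts"))
          .withColumn("day_of_week", F.dayofweek("event_ts"))
    )
```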
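A sketch of the window-function and broadcast-join deliverables, again with hypothetical column names (customer_id, product_id).

```python
from pyspark.sql import DataFrame
from pyspark.sql import functions as F
from pyspark.sql.window import Window

def add_running_totals(df: DataFrame) -> DataFrame:
    # Running total per customer, ordered by event time; partition and order
    # columns stand in for [WINDOW_PARTITION].
    running_window = (
        Window.partitionBy("customer_id")
              .orderBy("event_ts")
              .rowsBetween(Window.unboundedPreceding, Window.currentRow)
    )
    ranking_window = Window.partitionBy("customer_id").orderBy(F.desc("amount"))
    return (
        df.withColumn("running_total", F.sum("amount").over(running_window))
          .withColumn("amount_rank", F.row_number().over(ranking_window))
    )

def join_with_lookup(facts: DataFrame, lookup: DataFrame) -> DataFrame:
    # Broadcast the small lookup table so the join avoids a full shuffle.
    return facts.join(F.broadcast(lookup), on="product_id", how="left")
```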
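One possible shape for the data quality check function; the specific checks (row count, null keys, duplicate keys) and the logging are illustrative, not exhaustive.

```python
import logging
from functools import reduce
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

logger = logging.getLogger("etl_pipeline")

def run_quality_checks(df: DataFrame, key_columns: list, min_rows: int = 1) -> bool:
    # Validate row count, null keys, and duplicate keys; returns True when all pass.
    row_count = df.count()
    if row_count < min_rows:
        logger.error("Row count %d is below threshold %d", row_count, min_rows)
        return False

    any_key_null = reduce(lambda a, b: a | b, [F.col(c).isNull() for c in key_columns])
    null_key_rows = df.filter(any_key_null).count()
    duplicate_rows = row_count - df.dropDuplicates(key_columns).count()

    logger.info("rows=%d null_keys=%d duplicates=%d", row_count, null_key_rows, duplicate_rows)
    return null_key_rows == 0 and duplicate_rows == 0
```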
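Finally, a sketch of the load and orchestration pieces. It reuses the spark session, logger, extract_parquet, clean_and_enrich, and run_quality_checks names from the sketches above; the S3 paths, Parquet format, overwrite mode, and coalesce factor are all assumptions to replace with [TARGET] and [PARTITION_KEYS].

```python
from pyspark.sql import DataFrame

def load_to_target(df: DataFrame, target_path: str, partition_keys: list) -> None:
    # Coalesce to limit small output files, then write partitioned Parquet.
    (
        df.coalesce(16)
          .write
          .mode("overwrite")
          .partitionBy(*partition_keys)
          .parquet(target_path)
    )

def run_pipeline() -> None:
    # Orchestrate extract -> transform -> quality check -> load with basic
    # error handling and logging; paths and keys are hypothetical.
    logger.info("Starting pipeline")
    try:
        raw = extract_parquet(spark, "s3a://example-bucket/raw/orders/")
        cleaned = clean_and_enrich(raw).cache()        # cached: reused by checks and load
        if not run_quality_checks(cleaned, key_columns=["order_id"]):
            raise ValueError("Data quality checks failed")
        load_to_target(cleaned, "s3a://example-bucket/curated/orders/", ["event_date"])
        cleaned.unpersist()
        logger.info("Pipeline finished successfully")
    except Exception:
        logger.exception("Pipeline failed")
        raise
```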
Tags
spark
pyspark
etl
big-data
Tested Models
gpt-4-turbo
claude-3-opus