Build PySpark ETL Pipeline
data_engineering
Python
architecture
strict_senior
Create a production-ready PySpark ETL pipeline with data quality checks and optimizations.
By ethan_w
12/8/2025
Prompt
Build a production-ready Apache Spark ETL pipeline for [PIPELINE_NAME] with the following specifications:
Requirements
- Pipeline name: [PIPELINE_NAME]
- Source data locations: [SOURCE_1], [SOURCE_2], [SOURCE_3]
- Source formats: [CSV/Parquet/JDBC/JSON]
- Target destination: [S3/Delta Lake/Database/HDFS]
- Partition keys: [KEY_1], [KEY_2]
- Primary deduplication columns: [DEDUP_COLUMNS]
- Required transformations: [LIST_TRANSFORMATIONS]
- Aggregation metrics: [METRICS_TO_CALCULATE]
- Window operations needed: [YES/NO]
- Join operations: [DESCRIBE_JOINS]
- Spark config requirements: [MEMORY/CORES/ADAPTIVE_QUERY]
Deliverables
Generate complete PySpark code with:
1. SparkSession configuration - with adaptive query execution, memory settings, and app name
2. Extract functions - read from [SOURCE_FORMATS] at [SOURCE_LOCATIONS]
3. Transform functions, including:
   - Data cleaning: drop duplicates on [DEDUP_COLUMNS], handle nulls
   - Type casting: convert columns to proper types (IntegerType, TimestampType, etc.)
   - Feature engineering: extract temporal features from timestamps
   - Aggregations: group by [PARTITION_KEYS] and calculate [METRICS]
   - Window functions: partition by [WINDOW_PARTITION] and calculate running totals/rankings
   - Joins: join datasets on [JOIN_KEYS] with [INNER/LEFT/RIGHT] strategy
4. Data quality check function - validate row counts, nulls, duplicates, and schema
5. Load functions - write to [TARGET] partitioned by [PARTITION_KEYS]
6. Optimization strategies:
   - DataFrame caching for reused data
   - Repartitioning for parallelism
   - Broadcast joins for small lookup tables
   - Coalesce for output file consolidation
7. Main pipeline orchestration function
Include error handling, logging, and performance optimizations throughout.
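For reference, the sketches below illustrate the kind of output this prompt targets. First, a minimal SparkSession setup with adaptive query execution enabled; the app name, memory, and shuffle settings are illustrative placeholders rather than recommended values.

```python
from pyspark.sql import SparkSession

# Minimal sketch: SparkSession with adaptive query execution enabled.
# App name, memory, and shuffle settings are placeholders to tune per cluster.
spark = (
    SparkSession.builder
    .appName("example_etl_pipeline")                               # substitute [PIPELINE_NAME]
    .config("spark.sql.adaptive.enabled", "true")                  # adaptive query execution
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    .config("spark.executor.memory", "4g")                         # placeholder memory setting
    .config("spark.sql.shuffle.partitions", "200")                 # adjust to data volume
    .getOrCreate()
)
```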
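An extract-function sketch for two of the listed source formats (CSV and Parquet); the paths and read options are assumptions standing in for [SOURCE_LOCATIONS].

```python
from pyspark.sql import DataFrame, SparkSession

def extract_csv(spark: SparkSession, path: str) -> DataFrame:
    # Header and schema inference are illustrative defaults; supply an explicit
    # schema in production to avoid a second pass over the data.
    return spark.read.option("header", "true").option("inferSchema", "true").csv(path)

def extract_parquet(spark: SparkSession, path: str) -> DataFrame:
    # The path stands in for one of [SOURCE_1]/[SOURCE_2]/[SOURCE_3].
    return spark.read.parquet(path)
```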
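A sketch of the cleaning, type-casting, and temporal feature-engineering transforms; the column names (order_id, amount, event_ts) are hypothetical stand-ins for [DEDUP_COLUMNS] and the source schema.

```python
from pyspark.sql import DataFrame
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType, TimestampType

def clean_and_enrich(df: DataFrame) -> DataFrame:
    # Deduplicate, fill nulls, cast types, and derive temporal features.
    # Column names are hypothetical placeholders.
    return (
        df.dropDuplicates(["order_id"])                            # [DEDUP_COLUMNS]
          .na.fill({"amount": 0})                                  # example null-handling policy
          .withColumn("amount", F.col("amount").cast(IntegerType()))
          .withColumn("event_ts", F.col("event_ts").cast(TimestampType()))
          .withColumn("event_date", F.to_date("event_ts"))         # temporal features
          .withColumn("event_hour", F.hour("event_ts"))
          .withColumn("day_of_week", F.dayofweek("event_ts"))
    )
```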
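A sketch of the window-function and broadcast-join deliverables, again with hypothetical column names (customer_id, product_id).

```python
from pyspark.sql import DataFrame
from pyspark.sql import functions as F
from pyspark.sql.window import Window

def add_running_totals(df: DataFrame) -> DataFrame:
    # Running total per customer, ordered by event time; partition and order
    # columns stand in for [WINDOW_PARTITION].
    running_window = (
        Window.partitionBy("customer_id")
              .orderBy("event_ts")
              .rowsBetween(Window.unboundedPreceding, Window.currentRow)
    )
    ranking_window = Window.partitionBy("customer_id").orderBy(F.desc("amount"))
    return (
        df.withColumn("running_total", F.sum("amount").over(running_window))
          .withColumn("amount_rank", F.row_number().over(ranking_window))
    )

def join_with_lookup(facts: DataFrame, lookup: DataFrame) -> DataFrame:
    # Broadcast the small lookup table so the join avoids a full shuffle.
    return facts.join(F.broadcast(lookup), on="product_id", how="left")
```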
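One possible shape for the data quality check function; the specific checks (row count, null keys, duplicate keys) and the logging are illustrative, not exhaustive.

```python
import logging
from functools import reduce
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

logger = logging.getLogger("etl_pipeline")

def run_quality_checks(df: DataFrame, key_columns: list, min_rows: int = 1) -> bool:
    # Validate row count, null keys, and duplicate keys; returns True when all pass.
    row_count = df.count()
    if row_count < min_rows:
        logger.error("Row count %d is below threshold %d", row_count, min_rows)
        return False

    any_key_null = reduce(lambda a, b: a | b, [F.col(c).isNull() for c in key_columns])
    null_key_rows = df.filter(any_key_null).count()
    duplicate_rows = row_count - df.dropDuplicates(key_columns).count()

    logger.info("rows=%d null_keys=%d duplicates=%d", row_count, null_key_rows, duplicate_rows)
    return null_key_rows == 0 and duplicate_rows == 0
```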
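Finally, a sketch of the load and orchestration pieces. It reuses the spark session, logger, extract_parquet, clean_and_enrich, and run_quality_checks names from the sketches above; the S3 paths, Parquet format, overwrite mode, and coalesce factor are all assumptions to replace with [TARGET] and [PARTITION_KEYS].

```python
from pyspark.sql import DataFrame

def load_to_target(df: DataFrame, target_path: str, partition_keys: list) -> None:
    # Coalesce to limit small output files, then write partitioned Parquet.
    (
        df.coalesce(16)
          .write
          .mode("overwrite")
          .partitionBy(*partition_keys)
          .parquet(target_path)
    )

def run_pipeline() -> None:
    # Orchestrate extract -> transform -> quality check -> load with basic
    # error handling and logging; paths and keys are hypothetical.
    logger.info("Starting pipeline")
    try:
        raw = extract_parquet(spark, "s3a://example-bucket/raw/orders/")
        cleaned = clean_and_enrich(raw).cache()        # cached: reused by checks and load
        if not run_quality_checks(cleaned, key_columns=["order_id"]):
            raise ValueError("Data quality checks failed")
        load_to_target(cleaned, "s3a://example-bucket/curated/orders/", ["event_date"])
        cleaned.unpersist()
        logger.info("Pipeline finished successfully")
    except Exception:
        logger.exception("Pipeline failed")
        raise
```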
Tags
spark
pyspark
etl
big-data
Tested Models
gpt-4-turbo
claude-3-opus