Build PySpark ETL Pipeline

data_engineering
Python
architecture
strict_senior

Create a production-ready PySpark ETL pipeline with data quality checks and optimizations.

12/8/2025

Prompt

Build a production-ready Apache Spark ETL pipeline for [PIPELINE_NAME] with the following specifications:

Requirements

  • Pipeline name: [PIPELINE_NAME]
  • Source data locations: [SOURCE_1], [SOURCE_2], [SOURCE_3]
  • Source formats: [CSV/Parquet/JDBC/JSON]
  • Target destination: [S3/Delta Lake/Database/HDFS]
  • Partition keys: [KEY_1], [KEY_2]
  • Primary deduplication columns: [DEDUP_COLUMNS]
  • Required transformations: [LIST_TRANSFORMATIONS]
  • Aggregation metrics: [METRICS_TO_CALCULATE]
  • Window operations needed: [YES/NO]
  • Join operations: [DESCRIBE_JOINS]
  • Spark config requirements: [MEMORY/CORES/ADAPTIVE_QUERY]

Deliverables

Generate complete PySpark code with:

1. SparkSession configuration - with adaptive query execution, memory settings, and app name
2. Extract functions - read from [SOURCE_FORMATS] at [SOURCE_LOCATIONS]
3. Transform functions, including:

  • Data cleaning: drop duplicates on [DEDUP_COLUMNS], handle nulls
  • Type casting: convert columns to proper types (IntegerType, TimestampType, etc.)
  • Feature engineering: extract temporal features from timestamps
  • Aggregations: group by [PARTITION_KEYS] and calculate [METRICS]
  • Window functions: partition by [WINDOW_PARTITION] and calculate running totals/rankings
  • Joins: join datasets on [JOIN_KEYS] with [INNER/LEFT/RIGHT] strategy

4. Data quality check function - validate row counts, nulls, duplicates, schema
5. Load functions - write to [TARGET] partitioned by [PARTITION_KEYS]
6. Optimization strategies:
  • DataFrame caching for reused data
  • Repartitioning for parallelism
  • Broadcast joins for small lookup tables
  • Coalesce for output file consolidation

7. Main pipeline orchestration function (illustrative sketches of these deliverables follow below)

Include error handling, logging, and performance optimizations throughout.
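
Illustrative Sketches

The sketches below show one possible shape for the deliverables above; they are not a definitive implementation. All application names, memory sizes, paths, and column names (for example order_id, event_ts, amount, region) are assumptions standing in for the bracketed placeholders. First, a minimal SparkSession factory with adaptive query execution, plus two extract helpers:

```python
# Illustrative sketch only: app name, memory settings, and source formats are
# placeholder assumptions standing in for [PIPELINE_NAME] and [SOURCE_FORMATS].
from pyspark.sql import SparkSession, DataFrame


def build_spark_session(app_name: str = "example_etl_pipeline") -> SparkSession:
    """Create a SparkSession with adaptive query execution and basic memory settings."""
    return (
        SparkSession.builder
        .appName(app_name)
        .config("spark.sql.adaptive.enabled", "true")
        .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
        .config("spark.sql.shuffle.partitions", "200")
        .config("spark.executor.memory", "4g")  # assumed sizing; tune per cluster
        .config("spark.driver.memory", "2g")
        .getOrCreate()
    )


def extract_csv(spark: SparkSession, path: str) -> DataFrame:
    """Read a CSV source with header and inferred schema (use an explicit schema in production)."""
    return spark.read.option("header", "true").option("inferSchema", "true").csv(path)


def extract_parquet(spark: SparkSession, path: str) -> DataFrame:
    """Read a Parquet source."""
    return spark.read.parquet(path)
```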
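A sketch of the transform layer under the same assumed columns: deduplication and null handling, explicit type casting, temporal feature extraction, a group-by aggregation, and window functions for running totals and rankings:

```python
# Illustrative sketch only: order_id, event_ts, amount, and region are assumed
# stand-ins for [DEDUP_COLUMNS], [PARTITION_KEYS], and [METRICS].
from pyspark.sql import DataFrame, Window
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType, TimestampType


def clean(df: DataFrame) -> DataFrame:
    """Drop duplicates on the assumed key column and handle nulls."""
    return (
        df.dropDuplicates(["order_id"])
          .na.drop(subset=["order_id", "event_ts"])
          .na.fill({"amount": 0.0})
    )


def cast_types(df: DataFrame) -> DataFrame:
    """Cast columns to proper types."""
    return (
        df.withColumn("order_id", F.col("order_id").cast(IntegerType()))
          .withColumn("event_ts", F.col("event_ts").cast(TimestampType()))
    )


def add_temporal_features(df: DataFrame) -> DataFrame:
    """Extract temporal features from the event timestamp."""
    return (
        df.withColumn("event_date", F.to_date("event_ts"))
          .withColumn("event_hour", F.hour("event_ts"))
          .withColumn("event_dow", F.dayofweek("event_ts"))
    )


def aggregate_metrics(df: DataFrame) -> DataFrame:
    """Group by the assumed partition keys and compute example metrics."""
    return df.groupBy("region", "event_date").agg(
        F.count("*").alias("order_count"),
        F.sum("amount").alias("total_amount"),
        F.avg("amount").alias("avg_amount"),
    )


def add_window_features(df: DataFrame) -> DataFrame:
    """Running total and ranking per region, ordered by event time."""
    w = Window.partitionBy("region").orderBy("event_ts")
    return (
        df.withColumn("running_total", F.sum("amount").over(w))
          .withColumn("order_rank", F.row_number().over(w))
    )
```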
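A sketch of the remaining deliverables: a broadcast join against a small lookup table, a basic data quality report, and a partitioned Parquet writer that coalesces output files. The join key, partition columns, and file-count target are assumptions:

```python
# Illustrative sketch only: the join key, target layout, and coalesce factor are
# assumed placeholders for [JOIN_KEYS], [TARGET], and [PARTITION_KEYS].
from pyspark.sql import DataFrame
from pyspark.sql import functions as F


def join_with_lookup(facts: DataFrame, lookup: DataFrame) -> DataFrame:
    """Left join onto a small lookup table, broadcast to avoid a shuffle."""
    return facts.join(F.broadcast(lookup), on="region", how="left")


def run_quality_checks(df: DataFrame, key_cols: list) -> dict:
    """Validate row count, nulls on key columns, duplicate keys, and capture the schema."""
    total = df.count()
    null_counts = {c: df.filter(F.col(c).isNull()).count() for c in key_cols}
    duplicate_keys = total - df.dropDuplicates(key_cols).count()
    return {
        "row_count": total,
        "null_counts": null_counts,
        "duplicate_keys": duplicate_keys,
        "schema": df.schema.simpleString(),
    }


def load_partitioned_parquet(df: DataFrame, target_path: str) -> None:
    """Write partitioned output; coalesce keeps the number of output files modest."""
    (
        df.coalesce(8)  # assumed file-count target
          .write.mode("overwrite")
          .partitionBy("region", "event_date")
          .parquet(target_path)
    )
```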
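Finally, a sketch of the orchestration function tying the pieces together with logging, caching of the reused DataFrame, a hard failure on a failed quality check, and error handling around the whole run. It assumes the helper functions from the sketches above and placeholder source and target paths:

```python
# Illustrative sketch only: wiring assumes the helpers sketched above.
import logging
import sys

logger = logging.getLogger("example_etl_pipeline")
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")


def run_pipeline(source_path: str, lookup_path: str, target_path: str) -> None:
    spark = build_spark_session()
    try:
        logger.info("Extracting sources")
        raw = extract_csv(spark, source_path)
        lookup = extract_parquet(spark, lookup_path)

        logger.info("Transforming")
        df = add_window_features(add_temporal_features(cast_types(clean(raw))))
        df = join_with_lookup(df, lookup).cache()  # cached because it is reused below

        logger.info("Running data quality checks")
        report = run_quality_checks(df, ["order_id"])
        logger.info("Quality report: %s", report)
        if report["duplicate_keys"] > 0:
            raise ValueError("Duplicate keys found after deduplication step")

        logger.info("Loading row-level data and daily aggregates")
        load_partitioned_parquet(df, target_path)
        load_partitioned_parquet(aggregate_metrics(df), target_path + "/aggregates")
        logger.info("Pipeline finished successfully")
    except Exception:
        logger.exception("Pipeline failed")
        raise
    finally:
        spark.stop()


if __name__ == "__main__":
    run_pipeline(sys.argv[1], sys.argv[2], sys.argv[3])
```

Splitting extract, transform, quality checks, and load into small functions keeps each step testable on its own and leaves the orchestrator responsible only for wiring, logging, error handling, and the SparkSession lifecycle.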

Tags

spark
pyspark
etl
big-data

Tested Models

gpt-4-turbo
claude-3-opus
