Understanding Spark Workflow and Optimization Techniques
Spark is a powerful data processing engine that simplifies big data workflows. Below is a detailed breakdown of Spark's workflow, data sources, write mechanisms, partitioning, bucketing, and execution configuration.
Spark Workflow
- Create a DataFrame: Load data from a data source.
- Perform Transformations: Apply various transformations to process the data.
- Write to Target: Save the transformed data to a destination.
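As a minimal sketch of this flow (the paths and column names below are hypothetical; in a Databricks notebook, spark is already defined):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("workflow-demo").getOrCreate()

# 1. Create a DataFrame from a data source (hypothetical CSV path).
orders = spark.read.csv("dbfs:/FileStore/orders.csv", header=True, inferSchema=True)

# 2. Perform transformations.
totals = (orders
          .filter(F.col("amount") > 100)
          .groupBy("state")
          .agg(F.sum("amount").alias("total_amount")))

# 3. Write the result to the target.
totals.write.mode("overwrite").parquet("dbfs:/FileStore/output/state_totals")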
Data Sources
- Internal: Data stored in Spark-compatible systems like HDFS, Amazon S3, Azure Data Lake Storage Gen2, or Google Cloud Storage.
- External: Requires configuration to connect, e.g., Oracle Database, MySQL, Cassandra, MongoDB.
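For an external source such as MySQL, the connection has to be configured explicitly. A minimal sketch using Spark's JDBC reader (host, database, table, and credentials are placeholders, and the MySQL JDBC driver must be available on the cluster):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-demo").getOrCreate()

# Read a table from an external MySQL database over JDBC (placeholder values).
mysql_orders = (spark.read.format("jdbc")
                .option("url", "jdbc:mysql://db-host:3306/sales_db")
                .option("dbtable", "orders")
                .option("user", "spark_user")
                .option("password", "<password>")
                .option("driver", "com.mysql.cj.jdbc.Driver")
                .load())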
Approaches for Processing External Data
- DataFrame Writer API: Specifies the folder path where the results are stored.
- File Formats:
  - CSV: Not ideal; limited efficiency.
  - Parquet: Best optimized and most compatible.
  - JSON: Bulky, since keys are repeated with every record.
  - ORC: Good optimization.
  - AVRO: Requires additional setup on the cluster.
Note: The choice of file format significantly affects storage and computation cost.
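Reusing the totals DataFrame from the workflow sketch above, the same result can be written in any of these formats (paths are hypothetical); Parquet is usually the most economical choice for analytics:
totals.write.mode("overwrite").parquet("dbfs:/FileStore/output/totals_parquet")          # columnar, compressed
totals.write.mode("overwrite").format("orc").save("dbfs:/FileStore/output/totals_orc")   # columnar, well optimized
totals.write.mode("overwrite").format("json").save("dbfs:/FileStore/output/totals_json") # bulky: keys repeated per record
totals.write.mode("overwrite").format("csv").option("header", True).save("dbfs:/FileStore/output/totals_csv")  # plain text, least efficient
# Avro additionally requires the external spark-avro package to be available on the cluster.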
- Write Modes in Spark:
  - overwrite: Replaces existing data.
  - ignore: Skips the write if the target folder already exists.
  - append: Adds new files to the existing data.
  - errorIfExists (the default): Throws an error if the target folder already exists.
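A small usage sketch, again with the totals DataFrame and a hypothetical output path:
# Add new files to the existing output; swap in "overwrite", "ignore",
# or "errorIfExists" as needed.
totals.write.mode("append").parquet("dbfs:/FileStore/output/state_totals")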
Partitioning in Spark
Partitioning optimizes data queries by structuring data subsets to avoid full scans.
- Apply partitionBy to low-cardinality columns that are frequently used for filtering.
- Example:
df.write.partitionBy("state", "city").parquet("path_to_save")
- Partition Pruning: When a query filters on a partition column, only the relevant partition folders are scanned instead of the full dataset.
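A sketch of pruning in action, reading back the partitioned output written above (the filter values are hypothetical):
# Filtering on the partition columns lets Spark scan only the matching
# state=.../city=... folders instead of the whole dataset.
ca_orders = (spark.read.parquet("path_to_save")
             .filter("state = 'CA' AND city = 'San Diego'"))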
Bucketing in Spark
Ideal for high-cardinality columns, bucketing divides data into buckets for better query optimization and join performance.
- Use bucketBy for bucketing:
df.write.bucketBy(4, "customer_id").saveAsTable("bucketed_table")
- Key Points:
- Bucketing creates files, not folders.
- Partitioning and bucketing can be combined (bucket files are created inside each partition folder, as shown below), but buckets cannot be further partitioned.
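A minimal sketch of combining the two, assuming a DataFrame df that has both columns: partition folders are created for state, with bucket files inside each folder (bucketBy requires saveAsTable):
(df.write
   .partitionBy("state")
   .bucketBy(4, "customer_id")
   .sortBy("customer_id")
   .saveAsTable("partitioned_bucketed_table"))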
Databricks for Visualization
Steps to visualize Spark UI in Databricks:
- Sign up for Databricks Community Edition.
- Create a cluster and open a notebook.
- Enable DBFS in Workspace settings.
- Execute code and view Spark UI via the cluster dropdown.
Internals of Spark Execution
- Executor configuration (common rule of thumb): 5 CPU cores and about 21 GB of RAM per executor.
- Example: A 10.1 GB file is split into roughly 81 partitions at the default 128 MB partition size (10.1 GB ≈ 10,342 MB; 10,342 ÷ 128 ≈ 81). With 4 executors, the resulting tasks are distributed across the cores and processed in stages to optimize performance.
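A sketch for inspecting the split size and partition count (the file path is hypothetical; in a Databricks notebook, spark is already defined):
# Default maximum bytes per file partition (134217728 bytes = 128 MB).
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))

big_df = spark.read.csv("dbfs:/FileStore/big_file_10gb.csv", header=True)
print(big_df.rdd.getNumPartitions())   # ~81 partitions for a ~10.1 GB input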
Key Transformations
- distinct(): A wide transformation; it requires a shuffle (implemented as an aggregation over all columns) to deduplicate rows.
Example: A 1.1 GB dataset is read as 9 partitions (1.1 GB ≈ 1,126 MB; 1,126 ÷ 128 ≈ 9), and the job runs in multiple stages separated by the shuffle.
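A sketch of distinct() at work (the file path is hypothetical); the Spark UI shows the job broken into stages at the shuffle boundary:
customers = spark.read.csv("dbfs:/FileStore/customers_1gb.csv", header=True)

# distinct() is a wide transformation: rows are shuffled so that identical
# rows land in the same partition and duplicates can be dropped.
unique_customers = customers.distinct()
unique_customers.write.mode("overwrite").parquet("dbfs:/FileStore/output/unique_customers")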
Dynamic vs. Static Resource Allocation
- Dynamic Allocation (enabled by default on Databricks clusters): Spark requests executors as work queues up and releases them when idle, giving efficient resource utilization.
- Static Allocation: A fixed number of executors, CPU cores, and memory is configured up front, which risks idle resources.
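A sketch of setting either mode through Spark configuration (the values are illustrative; executor settings must be in place before the SparkContext starts, e.g., at cluster creation or spark-submit time):
from pyspark.sql import SparkSession

# Static allocation: fix the number of executors, cores, and memory up front.
spark = (SparkSession.builder
         .appName("alloc-demo")
         .config("spark.executor.instances", "4")
         .config("spark.executor.cores", "5")
         .config("spark.executor.memory", "21g")
         .getOrCreate())

# Dynamic allocation (alternative): let Spark scale executors with the workload.
#   spark.dynamicAllocation.enabled=true
#   spark.dynamicAllocation.minExecutors=2
#   spark.dynamicAllocation.maxExecutors=10
#   spark.dynamicAllocation.shuffleTracking.enabled=true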
spark-submit Utility
The spark-submit utility is used to submit packaged application code to a cluster, typically for scheduled or automated execution.
- Develop and upload the code.
- Use the following command:
spark-submit --class <main_class> --master <master_url> <application_jar>
- Monitor jobs in Spark UI.
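For example, a complete invocation with hypothetical values filled into the placeholders above:
spark-submit --class com.example.SalesJob --master yarn --deploy-mode cluster --executor-cores 5 --executor-memory 21g sales-job.jar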
Labels: Spark workflow, Spark DataFrame, Spark optimization, partitionBy, bucketBy, write modes, Spark execution, dynamic allocation, static allocation, spark-submit, Databricks visualization.