Understanding Spark Workflow and Optimization Techniques
Spark is a powerful data processing engine that simplifies big data workflows. Below is a detailed breakdown of Spark's workflow, data sources, write mechanisms, partitioning, bucketing, and execution configuration.
Spark Workflow
- Create a DataFrame: Load data from a data source.
- Perform Transformations: Apply various transformations to process the data.
- Write to Target: Save the transformed data to a destination.
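As a minimal sketch of this flow (the paths and column names below are hypothetical; in a Databricks notebook, spark is already defined):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("workflow-demo").getOrCreate()

# 1. Create a DataFrame from a data source (hypothetical CSV path).
orders = spark.read.csv("dbfs:/FileStore/orders.csv", header=True, inferSchema=True)

# 2. Perform transformations.
totals = (orders
          .filter(F.col("amount") > 100)
          .groupBy("state")
          .agg(F.sum("amount").alias("total_amount")))

# 3. Write the result to the target.
totals.write.mode("overwrite").parquet("dbfs:/FileStore/output/state_totals")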
Data Sources
- Internal: Data stored in Spark-compatible systems like HDFS, Amazon S3, Azure Data Lake Storage Gen2, or Google Cloud Storage.
- External: Requires configuration to connect, e.g., Oracle Database, MySQL, Cassandra, MongoDB.
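For an external source such as MySQL, the connection has to be configured explicitly. A minimal sketch using Spark's JDBC reader (host, database, table, and credentials are placeholders, and the MySQL JDBC driver must be available on the cluster):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-demo").getOrCreate()

# Read a table from an external MySQL database over JDBC (placeholder values).
mysql_orders = (spark.read.format("jdbc")
                .option("url", "jdbc:mysql://db-host:3306/sales_db")
                .option("dbtable", "orders")
                .option("user", "spark_user")
                .option("password", "<password>")
                .option("driver", "com.mysql.cj.jdbc.Driver")
                .load())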
Approaches for Processing External Data
- DataFrame Writer API: Specifies the folder path where the results are stored.
- File Formats:
  - CSV: Not ideal; limited efficiency.
  - Parquet: Best optimized and most compatible.
  - JSON: Bulky, since keys are repeated with every record.
  - ORC: Good optimization.
  - AVRO: Requires additional setup on the cluster.
Note: The choice of file format significantly affects storage and computation cost.
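Reusing the totals DataFrame from the workflow sketch above, the same result can be written in any of these formats (paths are hypothetical); Parquet is usually the most economical choice for analytics:
totals.write.mode("overwrite").parquet("dbfs:/FileStore/output/totals_parquet")          # columnar, compressed
totals.write.mode("overwrite").format("orc").save("dbfs:/FileStore/output/totals_orc")   # columnar, well optimized
totals.write.mode("overwrite").format("json").save("dbfs:/FileStore/output/totals_json") # bulky: keys repeated per record
totals.write.mode("overwrite").format("csv").option("header", True).save("dbfs:/FileStore/output/totals_csv")  # plain text, least efficient
# Avro additionally requires the external spark-avro package to be available on the cluster.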
- Write Modes in Spark:
  - overwrite: Replaces existing data.
  - ignore: Skips the write if the target folder already exists.
  - append: Adds new files to the existing data.
  - errorIfExists (the default): Throws an error if the target folder already exists.
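A small usage sketch, again with the totals DataFrame and a hypothetical output path:
# Add new files to the existing output; swap in "overwrite", "ignore",
# or "errorIfExists" as needed.
totals.write.mode("append").parquet("dbfs:/FileStore/output/state_totals")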
Partitioning in Spark
Partitioning optimizes data queries by structuring data subsets to avoid full scans.
- Apply partitionBy to low-cardinality columns that are frequently used for filtering.
- Example:
df.write.partitionBy("state", "city").parquet("path_to_save")
- Partition Pruning: When a query filters on a partition column, only the relevant partition folders are scanned instead of the full dataset.
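A sketch of pruning in action, reading back the partitioned output written above (the filter values are hypothetical):
# Filtering on the partition columns lets Spark scan only the matching
# state=.../city=... folders instead of the whole dataset.
ca_orders = (spark.read.parquet("path_to_save")
             .filter("state = 'CA' AND city = 'San Diego'"))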
Bucketing in Spark
Ideal for high-cardinality columns, bucketing divides data into buckets for better query optimization and join performance.
- Use bucketBy for bucketing:
df.write.bucketBy(4, "customer_id").saveAsTable("bucketed_table")
- Key Points:
- Bucketing creates files, not folders.
- Partitioning and bucketing can be combined (bucket files are created inside each partition folder, as shown below), but buckets cannot be further partitioned.
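A minimal sketch of combining the two, assuming a DataFrame df that has both columns: partition folders are created for state, with bucket files inside each folder (bucketBy requires saveAsTable):
(df.write
   .partitionBy("state")
   .bucketBy(4, "customer_id")
   .sortBy("customer_id")
   .saveAsTable("partitioned_bucketed_table"))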
Databricks for Visualization
Steps to visualize Spark UI in Databricks:
- Sign up for Databricks Community Edition.
- Create a cluster and open a notebook.
- Enable DBFS in Workspace settings.
- Execute code and view Spark UI via the cluster dropdown.
Internals of Spark Execution
- Executor configuration (common rule of thumb): 5 CPU cores and about 21 GB of RAM per executor.
- Example: A 10.1 GB file is split into roughly 81 partitions at the default 128 MB partition size (10.1 GB ≈ 10,342 MB; 10,342 ÷ 128 ≈ 81). With 4 executors, the resulting tasks are distributed across the cores and processed in stages to optimize performance.
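A sketch for inspecting the split size and partition count (the file path is hypothetical; in a Databricks notebook, spark is already defined):
# Default maximum bytes per file partition (134217728 bytes = 128 MB).
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))

big_df = spark.read.csv("dbfs:/FileStore/big_file_10gb.csv", header=True)
print(big_df.rdd.getNumPartitions())   # ~81 partitions for a ~10.1 GB input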
Key Transformations
- distinct(): A wide transformation; it requires a shuffle (implemented as an aggregation over all columns) to deduplicate rows.
Example: A 1.1 GB dataset is read as 9 partitions (1.1 GB ≈ 1,126 MB; 1,126 ÷ 128 ≈ 9), and the job runs in multiple stages separated by the shuffle.
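A sketch of distinct() at work (the file path is hypothetical); the Spark UI shows the job broken into stages at the shuffle boundary:
customers = spark.read.csv("dbfs:/FileStore/customers_1gb.csv", header=True)

# distinct() is a wide transformation: rows are shuffled so that identical
# rows land in the same partition and duplicates can be dropped.
unique_customers = customers.distinct()
unique_customers.write.mode("overwrite").parquet("dbfs:/FileStore/output/unique_customers")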
Dynamic vs. Static Resource Allocation
- Dynamic Allocation (enabled by default on Databricks clusters): Spark requests executors as work queues up and releases them when idle, giving efficient resource utilization.
- Static Allocation: A fixed number of executors, CPU cores, and memory is configured up front, which risks idle resources.
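A sketch of setting either mode through Spark configuration (the values are illustrative; executor settings must be in place before the SparkContext starts, e.g., at cluster creation or spark-submit time):
from pyspark.sql import SparkSession

# Static allocation: fix the number of executors, cores, and memory up front.
spark = (SparkSession.builder
         .appName("alloc-demo")
         .config("spark.executor.instances", "4")
         .config("spark.executor.cores", "5")
         .config("spark.executor.memory", "21g")
         .getOrCreate())

# Dynamic allocation (alternative): let Spark scale executors with the workload.
#   spark.dynamicAllocation.enabled=true
#   spark.dynamicAllocation.minExecutors=2
#   spark.dynamicAllocation.maxExecutors=10
#   spark.dynamicAllocation.shuffleTracking.enabled=true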
spark-submit Utility
The spark-submit utility is used to submit packaged application code to a cluster, typically for scheduled or automated execution.
- Develop and upload the code.
- Use the following command:
spark-submit --class <main_class> --master <master_url> <application_jar>
- Monitor jobs in Spark UI.
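For example, a complete invocation with hypothetical values filled into the placeholders above:
spark-submit --class com.example.SalesJob --master yarn --deploy-mode cluster --executor-cores 5 --executor-memory 21g sales-job.jar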
Labels: Spark workflow, Spark DataFrame, Spark optimization, partitionBy, bucketBy, write modes, Spark execution, dynamic allocation, static allocation, spark-submit, Databricks visualization.