9. Spark Internal Workflow

 Understanding Spark Workflow and Optimization Techniques

Spark is a powerful data processing engine that simplifies big data workflows. Below is a detailed breakdown of Spark's workflow, data sources, write mechanisms, partitioning, bucketing, and execution configuration.


Spark Workflow

  1. Create a DataFrame: Load data from a data source.
  2. Perform Transformations: Apply various transformations to process the data.
  3. Write to Target: Save the transformed data to a destination.
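
A minimal end-to-end sketch of these three steps in PySpark; the input/output paths and column names below are placeholders:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("workflow-demo").getOrCreate()

    # 1. Create a DataFrame from a data source (path is a placeholder)
    df = spark.read.option("header", "true").option("inferSchema", "true").csv("input_path")

    # 2. Perform transformations (column names are placeholders)
    result = df.filter(F.col("amount") > 100).groupBy("state").count()

    # 3. Write the transformed data to the target in Parquet format
    result.write.mode("overwrite").parquet("output_path")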

Data Sources

  • Internal: Data stored in file or object storage that Spark reads natively, such as HDFS, Amazon S3, Azure Data Lake Storage Gen2, or Google Cloud Storage.
  • External: Systems that need connector or JDBC configuration before Spark can reach them, e.g., Oracle Database, MySQL, Cassandra, MongoDB (a JDBC read sketch follows below).
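
A minimal sketch of reading one such external source (MySQL) through Spark's JDBC data source; the host, database, table name, and credentials are placeholders, and the matching JDBC driver JAR must be available to the cluster:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("jdbc-read-demo").getOrCreate()

    # Read a table from an external MySQL database (all connection details are placeholders)
    orders_df = (
        spark.read.format("jdbc")
        .option("url", "jdbc:mysql://db-host:3306/sales_db")
        .option("dbtable", "orders")
        .option("user", "spark_user")
        .option("password", "secret")
        .option("driver", "com.mysql.cj.jdbc.Driver")
        .load()
    )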

Approaches for Processing External Data

  1. DataFrame Writer API: Writes results to a specified folder path in one of several file formats.
    File Formats:

    • CSV: Row-based plain text with no embedded schema or compression; not ideal for large analytical workloads.
    • Parquet: Columnar and compressed; the best-optimized and most widely compatible choice for Spark.
    • JSON: Bulky, because every record repeats its keys.
    • ORC: Columnar format with good compression and optimization.
    • Avro: Row-based binary format; needs the external spark-avro package configured on the cluster.

    Note: File formats significantly affect storage and computation cost.

  2. Write Modes in Spark (a combined writer sketch follows this list):

    • overwrite: Replaces any existing data at the target path.
    • ignore: Skips the write if the target already exists.
    • append: Adds new files alongside the existing data.
    • errorIfExists (default): Throws an error if the target already exists.
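
A sketch tying the DataFrame Writer API, a file format, and a write mode together; df and the output path are placeholders:

    # Write df as Parquet, replacing any existing output at the path
    df.write.format("parquet").mode("overwrite").save("path_to_output_folder")

    # Equivalent shorthand using the parquet() convenience method
    df.write.mode("overwrite").parquet("path_to_output_folder")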

Partitioning in Spark

Partitioning speeds up queries by organizing data into subsets (one folder per partition value), so Spark can skip irrelevant data instead of performing a full scan.

  • Apply partitionBy to low-cardinality columns frequently used for filtering.
  • Example:
    df.write.partitionBy("state", "city").parquet("path_to_save")
    
  • Partition Pruning: Only the partitions that match a filter on the partition columns are scanned (see the sketch below).
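
A sketch of partition pruning when reading the partitioned output back; the path and filter values are placeholders:

    # Filtering on the partition columns lets Spark scan only the matching
    # state=.../city=... folders instead of the whole dataset
    df = spark.read.parquet("path_to_save")
    filtered = df.filter((df.state == "CA") & (df.city == "San Jose"))
    filtered.explain()  # the physical plan lists PartitionFilters on state and city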

Bucketing in Spark

Ideal for high-cardinality columns, bucketing divides data into buckets for better query optimization and join performance.

  • Use bucketBy for bucketing:
    df.write.bucketBy(4, "customer_id").saveAsTable("bucketed_table")
    
  • Key Points:
    • Bucketing creates files, not folders (partitioning is what creates a folder per value).
    • Bucketed writes must go through saveAsTable; a plain path-based write does not preserve bucket metadata.
    • Partitioning and bucketing can be combined, with bucket files written inside each partition folder, but not the other way around (see the sketch below).
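
A sketch of combining partitioning and bucketing; the table and column names are placeholders:

    # Partition by a low-cardinality column, then bucket a high-cardinality column;
    # each state=... folder will contain 4 bucket files
    (
        df.write
        .partitionBy("state")
        .bucketBy(4, "customer_id")
        .sortBy("customer_id")
        .saveAsTable("sales_partitioned_bucketed")
    )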

Databricks for Visualization

Steps to visualize Spark UI in Databricks:

  1. Sign up for Databricks Community Edition.
  2. Create a cluster and open a notebook.
  3. Enable DBFS in Workspace settings.
  4. Execute code (for example, the snippet below) and view the Spark UI via the cluster dropdown.
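
A small snippet that generates jobs and stages to inspect in the Spark UI (in a Databricks notebook the spark session is predefined):

    # A tiny job whose shuffle shows up as separate stages in the Spark UI
    df = spark.range(0, 10_000_000)
    df.groupBy((df.id % 10).alias("bucket")).count().show()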

Internals of Spark Execution

  • Executor Configurations: a common sizing rule of thumb is 5 CPU cores and about 21 GB RAM per executor.
  • Example: with the default ~128 MB input partition size, a 10.1 GB file is split into roughly 81 partitions (10.1 GB ÷ 128 MB ≈ 81). The resulting tasks are distributed across the 4 executors and processed in stages to optimize performance (see the sketch below).
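
A sketch of checking the numbers behind that example; the file path is a placeholder, and 128 MB is the default value of spark.sql.files.maxPartitionBytes:

    # Default maximum bytes packed into one input partition (~128 MB)
    print(spark.conf.get("spark.sql.files.maxPartitionBytes"))

    # A ~10.1 GB input at ~128 MB per partition yields about 81 partitions
    df = spark.read.parquet("path_to_large_file")
    print(df.rdd.getNumPartitions())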

Key Transformations

  • distinct(): A wide transformation that requires a shuffle and an aggregation across partitions.
    Example: a 1.1 GB dataset is read into about 9 partitions (1.1 GB ÷ 128 MB ≈ 9), and the job runs as multiple stages separated by the shuffle (see the sketch below).
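
A sketch of the shuffle that distinct() introduces; the number of post-shuffle partitions is governed by spark.sql.shuffle.partitions (200 by default, possibly coalesced when adaptive execution is on):

    deduped = df.distinct()

    print(df.rdd.getNumPartitions())       # input partitions (e.g., ~9 for a 1.1 GB file)
    print(deduped.rdd.getNumPartitions())  # post-shuffle partitions (200 by default)
    deduped.explain()                      # plan shows HashAggregate + Exchange (the shuffle)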

Dynamic vs. Static Resource Allocation

  • Dynamic Allocation (default on managed platforms such as Databricks): Spark requests and releases executors based on the workload, keeping resource utilization efficient.
  • Static Allocation: A fixed number of executors, CPU cores, and memory is reserved up front, which risks idle resources; configure these values explicitly (see the sketch below).
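
A sketch of setting each mode through SparkSession configuration; the values are illustrative and can equally be passed as --conf flags to spark-submit:

    from pyspark.sql import SparkSession

    # Dynamic allocation: executors scale up and down with the workload
    spark = (
        SparkSession.builder.appName("dynamic-alloc-demo")
        .config("spark.dynamicAllocation.enabled", "true")
        .config("spark.dynamicAllocation.minExecutors", "2")
        .config("spark.dynamicAllocation.maxExecutors", "10")
        .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
        .getOrCreate()
    )

    # Static allocation: request a fixed set of resources up front, e.g.
    # .config("spark.dynamicAllocation.enabled", "false")
    # .config("spark.executor.instances", "4")
    # .config("spark.executor.cores", "5")
    # .config("spark.executor.memory", "21g")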

spark-submit Utility

Used to submit packaged application code to a cluster, for example for scheduled execution.

  1. Develop and upload the code.
  2. Use the following command:
    spark-submit --class <main_class> --master <master_url> <application_jar>
    
  3. Monitor jobs in Spark UI.

Labels: Spark workflow, Spark DataFrame, Spark optimization, partitionBy, bucketBy, write modes, Spark execution, dynamic allocation, static allocation, spark-submit, Databricks visualization.
