This document provides a detailed explanation of Spark's caching and persistence mechanisms, Spark UI, and Resource Manager (YARN). Here are the key points:
Spark UI and Resource Manager (YARN)
- Spark UI: Displays details of currently running jobs, stages, and tasks. It also shows the number of executors allocated and their details.
- History Server: Stores detailed information about completed Spark applications, but only after the application terminates (for example, via `spark.stop()`).
- Resource Manager (YARN): The central component for resource management when Spark runs on YARN, ensuring resources are allocated properly. You can access the Resource Manager UI through a proxy setup, using tools like Cygwin on Windows or the Terminal on Mac.
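As a hedged illustration of the History Server point, the sketch below enables event logging so a finished application shows up there; the app name and log directory are placeholder assumptions, not values from this document.

```python
from pyspark.sql import SparkSession

# Minimal sketch: write event logs that a History Server can read later.
spark = (
    SparkSession.builder
    .appName("history-server-demo")                      # hypothetical name
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.dir", "hdfs:///spark-logs")  # hypothetical path
    .getOrCreate()
)

# ... run jobs here ...

spark.stop()  # after this, the application appears in the History Server
```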
Caching and Persisting Data
- Caching: In Spark, caching is used to store data in memory to speed up repeated access to it. For example, if a DataFrame is used multiple times, caching helps by avoiding the need to recompute it from scratch each time.
- RDDs: `cache()` stores data in memory only by default (MEMORY_ONLY).
- DataFrames: `cache()` stores data in memory by default and spills to disk if it doesn't fit (MEMORY_AND_DISK).
- Why Cache: Improves performance by reducing the need to recompute or reload data multiple times.
- Lazy Cache: The cache operation doesn't happen immediately; it occurs only when an action (like `.count()`) is triggered.
- Unpersist: Removes cached data from memory or disk.
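A minimal, hedged sketch of lazy caching and unpersisting; the DataFrame name and input path are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

orders_df = spark.read.parquet("/data/orders")   # hypothetical input path

orders_df.cache()         # lazy: nothing is cached yet
orders_df.count()         # the first action materializes the cache
orders_df.where("order_status = 'CLOSED'").count()  # reuses the cached data

orders_df.unpersist()     # remove the cached data when no longer needed
```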
Storage Levels
- Cache vs Persist: `cache()` always uses the default storage level, while `persist()` lets you specify the storage level explicitly, such as `MEMORY_AND_DISK` or `DISK_ONLY`.
- Persist Storage Levels:
- MEMORY_ONLY: Stores the data in memory.
- MEMORY_AND_DISK: Stores the data in memory and spills partitions that don't fit to disk.
- DISK_ONLY: Stores the data only on disk.
- Off-Heap Storage: Data is stored outside of the JVM heap memory.
- Unpersisting: Clears cached data, which can be done for specific tables or all cached data.
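A minimal, hedged sketch of persisting with an explicit storage level and then unpersisting; names and paths are placeholders:

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("persist-demo").getOrCreate()

orders_df = spark.read.parquet("/data/orders")    # hypothetical path

# Keep partitions in memory and spill whatever doesn't fit to disk.
orders_df.persist(StorageLevel.MEMORY_AND_DISK)
orders_df.count()                                 # an action materializes the persisted data

orders_df.unpersist()         # clear this specific DataFrame
spark.catalog.clearCache()    # or clear all cached tables/views in the session
```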
Optimizations and Best Practices
- Predicate Pushdown: Applies filters earlier in the query execution, reducing the amount of data Spark needs to process (see the sketch after this list).
- Locality Levels: Data locality in Spark affects performance. `NODE_LOCAL` means the data is on the same node (for example, on local disk or in another executor), while `PROCESS_LOCAL` means the data is already in the same executor's memory.
- Dynamic Allocation: The number of executors is dynamically adjusted based on job requirements, improving resource utilization and speeding up job execution (a configuration sketch follows this list).
External Spark Tables and Data Formats
- External Tables: You can create Spark tables whose data is managed outside of Spark (for example, in HDFS or a database). Cached data for an external table needs a manual refresh with `spark.sql("REFRESH TABLE <table-name>")` when the underlying data changes (see the sketch after this list).
- File Formats:
- Row-based (CSV, Avro): Scans all data, even if only a few columns are needed.
- Columnar (Parquet, ORC): Only scans the necessary columns, offering better performance and space optimization.
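A hedged sketch of the external-table refresh mentioned above; the table name, schema, and location are placeholders, and an existing SparkSession `spark` is assumed.

```python
# Hypothetical external table backed by Parquet files managed outside Spark.
spark.sql("""
    CREATE TABLE IF NOT EXISTS orders_ext (
        order_id INT,
        order_status STRING
    )
    USING parquet
    LOCATION 'hdfs:///warehouse/orders_ext'
""")

spark.sql("CACHE TABLE orders_ext")     # cache the table for repeated queries

# If another process rewrites the files at the LOCATION, refresh the table so
# Spark drops stale cached data/metadata and re-reads the new files.
spark.sql("REFRESH TABLE orders_ext")
```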
Example Use-Cases:
- Caching:
- Use `orders_df.cache()` for frequently used data.
- Use `orders_df.unpersist()` to remove cached data when no longer needed.
- Persisting:
- Example: `orders_df.persist(StorageLevel.MEMORY_AND_DISK)` for data that doesn't fit in memory but needs to be reused.
This document covers important concepts for improving Spark application performance, data caching, and managing memory effectively for large datasets.