This document provides a detailed explanation of Spark's caching and persistence mechanisms, Spark UI, and Resource Manager (YARN). Here are the key points:
Spark UI and Resource Manager (YARN)
- Spark UI: Displays details of currently running jobs, stages, and tasks. It also shows the number of executors allocated and their details.
- History Server: Stores detailed information about completed Spark applications, but only after the application terminates (for example, via `spark.stop()`).
- Resource Manager (YARN): The central component for resource management when Spark runs on YARN, ensuring resources are allocated properly. You can access the Resource Manager UI through a proxy setup, using tools like Cygwin on Windows or the Terminal on Mac.
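As a hedged illustration of the History Server point, the sketch below enables event logging so a finished application shows up there; the app name and log directory are placeholder assumptions, not values from this document.

```python
from pyspark.sql import SparkSession

# Minimal sketch: write event logs that a History Server can read later.
spark = (
    SparkSession.builder
    .appName("history-server-demo")                      # hypothetical name
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.dir", "hdfs:///spark-logs")  # hypothetical path
    .getOrCreate()
)

# ... run jobs here ...

spark.stop()  # after this, the application appears in the History Server
```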
Caching and Persisting Data
- Caching: In Spark, caching is used to store data in memory to speed up repeated access to it. For example, if a DataFrame is used multiple times, caching helps by avoiding the need to recompute it from scratch each time.
- RDDs: `cache()` stores data in memory only by default (MEMORY_ONLY).
- DataFrames: `cache()` stores data in memory by default and spills to disk if it doesn't fit (MEMORY_AND_DISK).
- Why Cache: Improves performance by reducing the need to recompute or reload data multiple times.
- Lazy Cache: The cache operation doesn't happen immediately; it occurs only when an action (like `.count()`) is triggered.
- Unpersist: Removes cached data from memory or disk.
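A minimal, hedged sketch of lazy caching and unpersisting; the DataFrame name and input path are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

orders_df = spark.read.parquet("/data/orders")   # hypothetical input path

orders_df.cache()         # lazy: nothing is cached yet
orders_df.count()         # the first action materializes the cache
orders_df.where("order_status = 'CLOSED'").count()  # reuses the cached data

orders_df.unpersist()     # remove the cached data when no longer needed
```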
Storage Levels
- Cache vs Persist: `cache()` always uses the default storage level, while `persist()` lets you specify the storage level explicitly, such as `MEMORY_AND_DISK` or `DISK_ONLY`.
- Persist Storage Levels:
- MEMORY_ONLY: Stores the data in memory.
- MEMORY_AND_DISK: Stores the data in memory and spills partitions that don't fit to disk.
- DISK_ONLY: Stores the data only on disk.
- Off-Heap Storage: Data is stored outside of the JVM heap memory.
- Unpersisting: Clears cached data, which can be done for specific tables or all cached data.
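A minimal, hedged sketch of persisting with an explicit storage level and then unpersisting; names and paths are placeholders:

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("persist-demo").getOrCreate()

orders_df = spark.read.parquet("/data/orders")    # hypothetical path

# Keep partitions in memory and spill whatever doesn't fit to disk.
orders_df.persist(StorageLevel.MEMORY_AND_DISK)
orders_df.count()                                 # an action materializes the persisted data

orders_df.unpersist()         # clear this specific DataFrame
spark.catalog.clearCache()    # or clear all cached tables/views in the session
```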
Optimizations and Best Practices
- Predicate Pushdown: Applies filters earlier in the query execution, reducing the amount of data Spark needs to process (see the sketch after this list).
- Locality Levels: Data locality in Spark affects performance. `NODE_LOCAL` means the data is on the same node (for example, on local disk or in another executor), while `PROCESS_LOCAL` means the data is already in the same executor's memory.
- Dynamic Allocation: The number of executors is dynamically adjusted based on job requirements, improving resource utilization and speeding up job execution (a configuration sketch follows this list).
External Spark Tables and Data Formats
- External Tables: You can create Spark tables whose data is managed outside of Spark (for example, in HDFS or a database). Cached data for an external table needs a manual refresh with `spark.sql("REFRESH TABLE <table-name>")` when the underlying data changes (see the sketch after this list).
- File Formats:
- Row-based (CSV, Avro): Scans all data, even if only a few columns are needed.
- Columnar (Parquet, ORC): Only scans the necessary columns, offering better performance and space optimization.
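A hedged sketch of the external-table refresh mentioned above; the table name, schema, and location are placeholders, and an existing SparkSession `spark` is assumed.

```python
# Hypothetical external table backed by Parquet files managed outside Spark.
spark.sql("""
    CREATE TABLE IF NOT EXISTS orders_ext (
        order_id INT,
        order_status STRING
    )
    USING parquet
    LOCATION 'hdfs:///warehouse/orders_ext'
""")

spark.sql("CACHE TABLE orders_ext")     # cache the table for repeated queries

# If another process rewrites the files at the LOCATION, refresh the table so
# Spark drops stale cached data/metadata and re-reads the new files.
spark.sql("REFRESH TABLE orders_ext")
```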
Example Use-Cases:
- Caching:
- Use `orders_df.cache()` for frequently used data.
- Use `orders_df.unpersist()` to remove cached data when no longer needed.
- Persisting:
- Example: `orders_df.persist(StorageLevel.MEMORY_AND_DISK)` for data that doesn't fit in memory but needs to be reused.
This document covers important concepts for improving Spark application performance, data caching, and managing memory effectively for large datasets.