
Improving Spark Performance with Memory Management

In the world of big data processing, Apache Spark has emerged as a powerful tool due to its speed, ease of use, and advanced analytics capabilities. Memory management is one of the key factors that most strongly affects Spark’s performance: effective memory management lets Spark programs run smoothly, avoid frequent out-of-memory errors, and make optimal use of cluster resources.

Spark’s Storage Memory within JVM Memory Layout

Understanding both On-Heap and Off-Heap memory management is important for efficient JVM memory utilization in Spark executors.

On-Heap memory management (In-Heap memory):

  1. Objects are allocated on the JVM Heap.
  2. Memory is managed by the garbage collector (GC).
  3. Garbage collection is responsible for reclaiming unused memory.
  4. Objects in the JVM Heap are subject to GC overhead.

Off-Heap memory management (External memory):

  1. Objects are allocated in memory outside the JVM.
  2. Memory allocation is done through serialization and managed by the application.
  3. Off-heap memory is not bound by GC.
  4. Developers have more control over managing Off-Heap memory.
  5. Off-heap memory can help reduce GC pauses and improve performance for certain use cases.
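As an illustration of how off-heap memory is turned on in practice, the sketch below configures a PySpark session using Spark's standard off-heap settings (the app name and the 2g size are illustrative examples, not recommendations):

```python
from pyspark.sql import SparkSession

# Minimal sketch: enable off-heap memory for storage and execution.
# The size value here is illustrative only.
spark = (
    SparkSession.builder
    .appName("offheap-demo")
    .config("spark.memory.offHeap.enabled", "true")  # allocate outside the JVM heap
    .config("spark.memory.offHeap.size", "2g")       # total off-heap pool
    .getOrCreate()
)
```

Because this memory lives outside the JVM heap, it is not scanned by the garbage collector, which is where the reduced GC-pause benefit comes from.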

Here is a breakdown of the regions in an executor’s JVM memory layout:

  • Executor Memory: Executor memory is the memory allocated for running tasks in a distributed computing framework like Apache Spark. It is the memory available to each executor (a process or thread) to perform computation and store intermediate data during data processing tasks, and it is used for computation in shuffles, joins, aggregations, and other operations.
  • Younger Generation: In the JVM’s memory management, the younger generation refers to a region of memory where newly created objects are allocated. It is divided into two areas: Eden space and Survivor space.
    • Eden Space: This is the initial allocation area within the younger generation where objects are created. When Eden space becomes full, a minor garbage collection is triggered, and some objects are either moved to the Survivor spaces or promoted to the old generation.
    • Survivor Spaces: There are typically two survivor spaces, often referred to as Survivor 0 and Survivor 1. These survivor spaces act as a temporary holding area for objects that survive multiple garbage collection cycles.
  • Old Generation: The old generation, also known as the tenured generation, is the area of memory where long-lived objects are allocated. Objects that survive multiple minor garbage collections are eventually promoted to the old generation. Major garbage collections, which are more expensive than minor collections, are performed in the old generation.
    • Permanent Generation / Metaspace: Before Java 8, the permanent generation was a part of the JVM’s memory structure. It stored metadata about classes, methods, and other runtime information. However, starting from Java 8, the permanent generation was replaced by Metaspace, which is a native memory space outside of the JVM heap. Metaspace dynamically allocates memory to store class metadata, and it is more flexible and avoids the limitations of fixed-size permanent generation.
  • Off-Heap space
    • Direct Byte Buffer: Direct Byte Buffer is a specific type of buffer that allows direct access to a region of memory outside of the JVM heap. It is used for efficient I/O operations and can be more performant in certain scenarios compared to traditional heap-based buffers.

In general, object read and write speed follows this order:

on-heap > off-heap > disk

Spark Memory Architecture: Execution, Storage, User, and Reserved Memory

Execution Memory

  • Used for intermediate data needed during computation tasks such as shuffles, joins, and aggregations.
  • Execution memory is allocated from the JVM heap and is crucial for the smooth execution of Spark jobs.

Storage Memory

  • Used for caching and propagating data across the cluster. This includes RDD or DataFrame caching.
  • Storage memory can reside within the JVM heap or off-heap memory, helping to reduce JVM garbage collection overhead and improving performance.
  • When execution memory is not fully utilized, the storage memory can use the available memory, and vice versa.

User Memory

  • User memory is the part of the memory pool that holds internal objects and data structures used in RDD transformations, aggregations, and operations like mapPartitions.
  • It stores hash tables and other data structures needed for user-specific computations and transformations.
  • Optimizing your code for memory usage improves both efficiency and performance.
  • Caching frequently accessed RDDs or DataFrames improves performance.
  • Persisting RDDs or DataFrames with an appropriate storage level (e.g., MEMORY_AND_DISK) minimizes recomputation and makes better use of user memory.
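The caching advice above can be sketched in PySpark as follows (the DataFrame built with range() is a stand-in for real, frequently accessed data):

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("persist-demo").getOrCreate()

df = spark.range(1_000_000)  # stand-in for a real, frequently accessed DataFrame

# MEMORY_AND_DISK keeps what fits in memory and spills the rest to disk,
# so partitions are not recomputed when memory runs short.
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()      # an action materializes the cache
df.unpersist()  # release storage memory when no longer needed
```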

Reserved Memory

  • Reserved memory is a specific part of the memory pool set aside to ensure the smooth operation of the JVM and system processes.
  • It helps keep the system stable by preventing out-of-memory errors and ensuring the JVM has enough resources for important tasks like garbage collection.
  • Set memory settings properly and regularly check memory usage to keep the system stable. Adjust as needed.
  • Don’t allocate more memory than the system can handle; understand your application’s memory needs and plan accordingly.

Spark Memory Managers

Spark provides two types of memory managers:

  • Static Memory Manager (Static Memory Management), and
  • Unified Memory Manager (Unified memory management)

Starting from Spark 1.6.0, the Unified Memory Manager became the default memory manager for Spark, while the Static Memory Manager was deprecated due to its limited flexibility.

Both memory managers allocate a portion of the Java Heap for processing Spark applications, while the remaining memory is reserved for Java class references and metadata usage.

Note: There will only be one Memory Manager per JVM.

Static Memory Manager (SMM)

From Spark 1.0

  • Static memory management mode in Spark uses predetermined configurations and rules for memory allocation.
  • Memory sizes for execution, storage, and shuffle memory are predetermined and static.
  • This mode does not adapt to dynamic memory requirements or workload characteristics.
  • It was used in older versions of Spark.
  • Static memory management mode may not be optimal for complex or memory-intensive applications.

The terms “static memory management” and “legacy memory management” are often used interchangeably in the context of Spark. Both terms refer to the older memory management model used in earlier versions of Spark.

In Spark 1.6+, Static Memory Management can be enabled via the spark.memory.useLegacyMode=true parameter.

  • spark.memory.useLegacyMode (default: false): Enables the legacy memory management mode in Spark, dividing the heap into fixed-size regions.
  • spark.shuffle.memoryFraction (default: 0.2): Fraction of heap memory used for aggregation and cogroup operations during shuffles.
  • spark.storage.memoryFraction (default: 0.6): Fraction of heap memory allocated for Spark’s in-memory cache.
  • spark.storage.unrollFraction (default: 0.2): Fraction of storage memory used for unrolling blocks in memory; allocated dynamically by dropping existing blocks when space is insufficient.

Unified Memory Manager (UMM)

  • Since Spark 1.6.0, the Static Memory Manager has been replaced with a new memory manager for dynamic memory allocation.
  • The new memory manager introduces a Unified memory container that is shared by the storage and execution components.
  • The memory manager dynamically allocates memory between storage and execution based on their respective needs.
  • When execution memory is not fully utilized, the storage memory can use the available memory, and vice versa.
  • The acquireMemory() function is used to adjust the memory allocation. It expands one memory pool while shrinking the other to accommodate changing memory requirements.

Key UMM configuration parameters:

  • spark.memory.useLegacyMode (default: false): Enables the legacy memory management mode in Spark, dividing the heap into fixed-size regions.
  • spark.memory.fraction (default: 0.6): Fraction of JVM heap space used for execution and storage.
  • spark.memory.storageFraction (default: 0.5): Fraction of the unified memory used for storage (caching).
  • spark.memory.offHeap.enabled (default: false): If true, Spark will attempt to use off-heap memory for storage and execution.
  • spark.memory.offHeap.size (default: 0): Total amount of memory for off-heap storage and execution, in bytes (e.g., 2g).
  • spark.storage.unrollMemoryThreshold (default: 1m): Minimum memory threshold for unrolling blocks in Spark’s storage memory.
  • spark.executor.memory (default: 1g): Amount of memory to use for each executor process (e.g., 4g).
  • spark.executor.pyspark.memory (default: not set): Amount of memory to use for PySpark (e.g., 4g).
  • Borrowed storage memory can be evicted at any time to free up space.
  • In the initial design, borrowed execution memory does not support eviction due to implementation complexities.
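To make the fractions concrete, here is a small pure-Python sketch of how the Unified Memory Manager divides an executor's heap under the default settings. It assumes the roughly 300 MB reservation Spark makes internally for system use; exact values can vary across versions, so treat the numbers as illustrative.

```python
RESERVED_MEMORY_MB = 300  # fixed reservation Spark keeps for internals (assumed)

def unified_memory_breakdown(executor_memory_mb,
                             memory_fraction=0.6,      # spark.memory.fraction
                             storage_fraction=0.5):    # spark.memory.storageFraction
    """Return (unified, storage, execution, user) sizes in MB."""
    usable = executor_memory_mb - RESERVED_MEMORY_MB
    unified = usable * memory_fraction    # shared by storage + execution
    storage = unified * storage_fraction  # cached RDDs/DataFrames
    execution = unified - storage         # shuffles, joins, aggregations
    user = usable - unified               # user data structures
    return unified, storage, execution, user

# Example: a 4 GB executor (4096 MB)
unified, storage, execution, user = unified_memory_breakdown(4096)
print(f"unified={unified:.0f} MB, storage={storage:.0f} MB, "
      f"execution={execution:.0f} MB, user={user:.0f} MB")
```

The storage/execution split is only the initial boundary; at runtime either pool can borrow from the other as described above.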

Comparing Memory Management Techniques: Static vs Unified Memory Management

Static Memory Management

  • Advantages: simple and predictable memory allocation; well-suited for simple and small-scale applications.
  • Disadvantages: does not adapt to dynamic memory needs or workload changes; inefficient memory utilization for complex or large-scale applications; limited performance optimization capabilities.

Unified Memory Management

  • Advantages: dynamically adjusts memory allocation based on workload needs; optimizes memory utilization and reduces memory-related issues; adapts to changing memory requirements and workload characteristics.
  • Disadvantages: complexity and overhead in managing dynamic memory allocation.

Configuring a PySpark application to use the legacy (static) memory management mode

  • The code enables the legacy memory management mode in a Spark application.
  • It is useful for maintaining backward compatibility or working with legacy code.
  • However, it is not recommended for new applications or performance optimization.
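A minimal sketch of such a configuration (applicable to Spark 1.6–2.x; the spark.memory.useLegacyMode option was removed in Spark 3.0, and the app name is illustrative):

```python
from pyspark.sql import SparkSession

# Sketch: switch a Spark 1.6-2.x application back to the static
# (legacy) memory manager. The fraction values shown are the defaults.
spark = (
    SparkSession.builder
    .appName("legacy-memory-demo")
    .config("spark.memory.useLegacyMode", "true")
    .config("spark.shuffle.memoryFraction", "0.2")  # shuffle/aggregation region
    .config("spark.storage.memoryFraction", "0.6")  # cache region
    .getOrCreate()
)
```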

Configuring a PySpark application to use the Unified Memory Manager

  • The default and recommended memory management mode in Spark is unified memory management.
  • Unified memory management offers better flexibility and efficient memory allocation.
  • It dynamically adjusts memory based on workload characteristics.
  • Unified memory management improves performance and reduces memory-related issues.
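Because unified memory management is the default, no mode switch is needed; the sketch below simply spells out the key knobs explicitly (the app name and sizes are illustrative):

```python
from pyspark.sql import SparkSession

# Sketch: unified memory management is the default; these settings
# make the relevant parameters explicit rather than change the mode.
spark = (
    SparkSession.builder
    .appName("unified-memory-demo")
    .config("spark.executor.memory", "4g")          # per-executor heap (illustrative)
    .config("spark.memory.fraction", "0.6")         # share for execution + storage
    .config("spark.memory.storageFraction", "0.5")  # storage half of that share
    .getOrCreate()
)
```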

Conclusion

Efficient memory management is key to getting the best performance from your Spark applications. Properly configuring Spark’s memory settings helps avoid common issues like memory overflow and slow performance. Understanding how Spark uses memory helps you allocate resources more effectively, ensuring your applications run smoothly and efficiently.

For more details, Diggibyte Technologies Pvt Ltd has all the experts you need. Contact us today to embed intelligence into your organization.

Author: Marripudi Haritha Sumanjali
