Freeing Compute Caches from Serialization and Garbage Collection in Managed Big Data Analytics

11/20/2021

Managed analytics frameworks (e.g., Spark) cache intermediate results in memory (on-heap) or on storage devices (off-heap) to avoid costly recomputations, especially in graph processing. As datasets grow, on-heap caching requires more memory for long-lived objects, resulting in high garbage collection (GC) overhead. Off-heap caching, on the other hand, moves cached objects to the storage device, which reduces GC overhead but incurs the cost of serialization and deserialization (S/D). In this work, we propose TeraHeap, a novel approach for providing large analytics caches. TeraHeap uses two heaps within the JVM: (1) a garbage-collected heap for ordinary Spark objects and (2) a large heap memory-mapped over fast storage devices for cached objects. TeraHeap eliminates both S/D and GC over cached data without imposing any language restrictions. We implement TeraHeap in Oracle's Java runtime (OpenJDK-1.8). We use five popular, memory-intensive graph analytics workloads to understand S/D and GC overheads and to evaluate TeraHeap. TeraHeap improves total execution time compared to state-of-the-art Apache Spark configurations by up to 72 and non-volatile memory, respectively. Furthermore, TeraHeap requires 8x less DRAM capacity to provide performance comparable to or higher than native Spark. This paper opens up emerging memory and storage devices for practical use in scalable analytics caching.
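
The trade-off between S/D-based off-heap caching and a memory-mapped cache can be illustrated with a small sketch. This is not TeraHeap's mechanism, which operates inside the JVM on ordinary Java objects; it is a Python analogy under simplifying assumptions (fixed-width integer records, a temporary file standing in for the fast storage device). All function names here are hypothetical and chosen for illustration only.

```python
import mmap
import os
import pickle
import struct
import tempfile

# Assumption: each cached value is one fixed-width 64-bit signed integer.
RECORD = struct.Struct("<q")


def write_serialized(values, path):
    """Off-heap style cache: the whole collection is serialized to storage."""
    with open(path, "wb") as f:
        pickle.dump(values, f)


def read_serialized(path, index):
    """Reading one element still deserializes (rebuilds) the entire collection."""
    with open(path, "rb") as f:
        return pickle.load(f)[index]


def write_mapped(values, path):
    """Store values in a flat binary layout that can be read in place."""
    with open(path, "wb") as f:
        for v in values:
            f.write(RECORD.pack(v))


def read_mapped(path, index):
    """Map the file and decode only the requested record; no full rebuild."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return RECORD.unpack_from(mapped, index * RECORD.size)[0]


def demo():
    values = list(range(1000))
    with tempfile.TemporaryDirectory() as d:
        serialized_path = os.path.join(d, "cache.pkl")
        mapped_path = os.path.join(d, "cache.bin")
        write_serialized(values, serialized_path)
        write_mapped(values, mapped_path)
        return read_serialized(serialized_path, 42), read_mapped(mapped_path, 42)
```

Both paths return the same value, but the serialized path pays to reconstruct every object on each read, while the mapped path touches only the bytes it needs and leaves paging to the operating system.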
