DeepAI AI Chat
Log In Sign Up

Low-Latency, High-Throughput Garbage Collection (Extended Version)

by   Wenyu Zhao, et al.

Production garbage collectors make substantial compromises in pursuit of reduced pause times. They require far more CPU cycles and memory than prior simpler collectors. concurrent copying collectors (C4, ZGC, and Shenandoah) suffer from the following design limitations. 1) Concurrent copying. They only reclaim memory by copying, which is inherently expensive with high memory bandwidth demands. Concurrent copying also requires expensive read and write barriers. 2) Scalability. They depend on tracing, which in the limit and in practice does not scale. 3) Immediacy. They do not reclaim older objects promptly, incurring high memory overheads. We present LXR, which takes a very different approach to optimizing responsiveness and throughput by minimizing concurrent collection work and overheads. 1) LXR reclaims most memory without any copying by using the Immix heap structure. It then combats fragmentation with limited judicious stop-the-world copying. 2) LXR uses reference counting to achieve both scalability and immediacy, promptly reclaiming young and old objects. It uses concurrent tracing as needed for identifying cyclic garbage. 3) To minimize pause times while allowing judicious copying of mature objects, LXR introduces remembered sets for reference counting and concurrent decrement processing. 4) LXR introduces a novel low-overhead write barrier that combines coalescing reference counting, concurrent tracing, and remembered set maintenance. The result is a collector with excellent responsiveness and throughput. On the widely-used Lucene search engine with a generously sized heap, LXR has 6x higher throughput while delivering 30x lower 99.9 percentile tail latency than the popular Shenandoah production collector in its default configuration.


page 1

page 2

page 3

page 4


Hermes: a Fast, Fault-Tolerant and Linearizable Replication Protocol

Today's datacenter applications are underpinned by datastores that are r...

Sherman: A Write-Optimized Distributed B+Tree Index on Disaggregated Memory

Memory disaggregation architecture physically separates CPU and memory i...

High-throughput Pairwise Alignment with the Wavefront Algorithm using Processing-in-Memory

We show that the wavefront algorithm can achieve higher pairwise read al...

Turning Manual Concurrent Memory Reclamation into Automatic Reference Counting

Safe memory reclamation (SMR) schemes are an essential tool for lock-fre...

Concurrent Reference Counting and Resource Management in Wait-free Constant Time

A common problem when implementing concurrent programs is efficiently pr...

Concurrency and Privacy with Payment-Channel Networks

Permissionless blockchains protocols such as Bitcoin are inherently limi...

A Non-blocking Buddy System for Scalable Memory Allocation on Multi-core Machines

Common implementations of core memory allocation components, like the Li...