Stable and Consistent Membership at Scale with Rapid

03/09/2018
by Lalith Suresh, et al.

We present the design and evaluation of Rapid, a distributed membership service. At Rapid's core is a scheme for multi-process cut detection (CD) that revolves around two key insights: (i) it suspects a failure of a process only after alerts arrive from multiple sources, and (ii) when a group of processes experiences problems, it detects failures of the entire group, rather than reaching a conclusion about each process individually. Implementing these insights translates into a simple membership algorithm with low communication overhead. We present evidence that our strategy suffices to drive unanimous detection almost-everywhere, even when complex network conditions arise, such as one-way reachability problems, firewall misconfigurations, and high packet loss. Furthermore, we present both empirical evidence and analyses that prove that the almost-everywhere detection happens with high probability. To complete the design, Rapid contains a leaderless consensus protocol that converts multi-process cut detections into a view-change decision. The resulting membership service works in both fully decentralized and logically centralized modes. We present an evaluation of Rapid in moderately scalable cloud settings. Rapid bootstraps 2000-node clusters 2-5.8x faster than prevailing tools such as Memberlist and ZooKeeper, remains stable in the face of complex failure scenarios, and provides strong consistency guarantees. It is easy to integrate Rapid into existing distributed applications, of which we demonstrate two.
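
The two insights behind cut detection can be illustrated with a small sketch. This is not Rapid's implementation: the class name `CutDetector` and the thresholds `low` and `high` are hypothetical, chosen only to show one way that "alerts from multiple sources" and "detect the entire group at once" might fit together. A subject is treated as suspected once `low` distinct observers have alerted about it, and a cut is proposed only when every suspected subject has reached `high` distinct alerts.

```python
from collections import defaultdict


class CutDetector:
    """Toy multi-source cut detector (illustrative only, not Rapid's code)."""

    def __init__(self, low: int, high: int):
        if not 1 <= low <= high:
            raise ValueError("need 1 <= low <= high")
        self.low = low    # hypothetical: distinct alerts before a subject is suspected
        self.high = high  # hypothetical: distinct alerts before a suspicion is stable
        self.reports = defaultdict(set)  # subject -> distinct observers that alerted

    def on_alert(self, observer: str, subject: str) -> frozenset:
        """Record one alert; return the proposed cut, or an empty set if none is ready."""
        self.reports[subject].add(observer)  # repeated alerts from one observer count once
        counts = {s: len(obs) for s, obs in self.reports.items()}
        stable = {s for s, c in counts.items() if c >= self.high}
        unstable = {s for s, c in counts.items() if self.low <= c < self.high}
        # Propose only when some subject is stable and none is still in between,
        # so a group of failing processes is reported together.
        if stable and not unstable:
            return frozenset(stable)
        return frozenset()


detector = CutDetector(low=2, high=3)
for observer, subject in [("a", "x"), ("b", "x"), ("a", "y"), ("b", "y"), ("c", "x")]:
    detector.on_alert(observer, subject)   # nothing proposed yet: "y" is still unstable
cut = detector.on_alert("c", "y")          # both "x" and "y" are now stable
print(sorted(cut))
```

In this sketch a single alert never triggers a proposal, and the proposal for `x` is held back until `y` also stabilizes, so both are reported in one cut.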
