FLAC: A Robust Failure-Aware Atomic Commit Protocol for Distributed Transactions

02/09/2023
by   Hexiang Pan, et al.
0

In distributed transaction processing, atomic commit protocol (ACP) is used to ensure database consistency. With the use of commodity compute nodes and networks, failures such as system crashes and network partitioning are common. It is therefore important for ACP to dynamically adapt to the operating condition for efficiency while ensuring the consistency of the database. Existing ACPs often assume stable operating conditions, hence, they are either non-generalizable to different environments or slow in practice. In this paper, we propose a novel and practical ACP, called Failure-Aware Atomic Commit (FLAC). In essence, FLAC includes three protocols, which are specifically designed for three different environments: (i) no failure occurs, (ii) participant nodes might crash but there is no delayed connection, or (iii) both crashed nodes and delayed connection can occur. It models these environments as the failure-free, crash-failure, and network-failure robustness levels. During its operation, FLAC can monitor if any failure occurs and dynamically switch to operate the most suitable protocol, using a robustness level state machine, whose parameters are fine-tuned by reinforcement learning. Consequently, it improves both the response time and throughput, and effectively handles nodes distributed across the Internet where crash and network failures might occur. We implement FLAC in a distributed transactional key-value storage system based on Google Percolator and evaluate its performance with both a micro benchmark and a macro benchmark of real workload. The results show that FLAC achieves up to 2.22x throughput improvement and 2.82x latency speedup, compared to existing ACPs for high-contention workloads.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
10/22/2018

RCanopus: Making Canopus Resilient to Failures and Byzantine Faults

Distributed consensus is a key enabler for many distributed systems incl...
research
05/08/2019

Atomic Commitment Across Blockchains

The recent adoption of blockchain technologies and open permissionless n...
research
02/18/2020

Failout: Achieving Failure-Resilient Inference in Distributed Neural Networks

When a neural network is partitioned and distributed across physical nod...
research
04/05/2018

Scaling Out Acid Applications with Operation Partitioning

OLTP applications with high workloads that cannot be served by a single ...
research
02/19/2021

Cornus: One-Phase Commit for Cloud Databases with Storage Disaggregation

Two-phase commit (2PC) has been widely used in distributed databases to ...
research
07/21/2023

Transactional Indexes on (RDMA or CXL-based) Disaggregated Memory with Repairable Transaction

The failure atomic and isolated execution of clients operations is a def...
research
02/04/2020

Providing Insights for Queries affected by Failures and Stragglers

Interactive time responses are a crucial requirement for users analyzing...

Please sign up or login with your details

Forgot password? Click here to reset