NetChain: Scale-Free Sub-RTT Coordination (Extended Version)

02/22/2018
by Xin Jin, et al.

Coordination services are a fundamental building block of modern cloud systems, providing critical functionalities like configuration management and distributed locking. The major challenge is to achieve low latency and high throughput while providing strong consistency and fault-tolerance. Traditional server-based solutions require multiple round-trip times (RTTs) to process a query. This paper presents NetChain, a new approach that provides scale-free sub-RTT coordination in datacenters. NetChain exploits recent advances in programmable switches to store data and process queries entirely in the network data plane. This eliminates the query processing at coordination servers and cuts the end-to-end latency to as little as half of an RTT: clients only experience processing delay from their own software stack plus network delay, which in a datacenter setting is typically much smaller. We design new protocols and algorithms based on chain replication to guarantee strong consistency and to efficiently handle switch failures. We implement a prototype with four Barefoot Tofino switches and four commodity servers. Evaluation results show that compared to traditional server-based solutions like ZooKeeper, our prototype provides orders of magnitude higher throughput and lower latency, and handles failures gracefully.
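
The abstract builds on chain replication, so a small illustration of that general technique may help. The sketch below is textbook chain replication in plain Python, not NetChain's data-plane protocol: writes enter at the head and propagate node by node to the tail, reads are served by the tail, and a failed node is simply skipped. The names `ChainNode`, `Chain`, `write`, `read`, and `fail` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ChainNode:
    """One replica in the chain (assumption: an in-memory stand-in for a switch)."""
    name: str
    store: Dict[str, str] = field(default_factory=dict)
    alive: bool = True


class Chain:
    """Generic chain replication over an ordered list of nodes (illustrative only)."""

    def __init__(self, names: List[str]) -> None:
        self.nodes = [ChainNode(n) for n in names]

    def _live(self) -> List[ChainNode]:
        # Live nodes in chain order; the first is the head, the last is the tail.
        live = [n for n in self.nodes if n.alive]
        if not live:
            raise RuntimeError("no live nodes left in the chain")
        return live

    def write(self, key: str, value: str) -> str:
        # Apply the write at the head, then at each successor in order.
        live = self._live()
        for node in live:
            node.store[key] = value
        # The tail is the last node to apply the write, so it sends the reply.
        return live[-1].name

    def read(self, key: str) -> Optional[str]:
        # Reads go to the tail, which only holds fully propagated writes.
        return self._live()[-1].store.get(key)

    def fail(self, name: str) -> None:
        # Simplified failover: mark the node dead so the chain skips it.
        for node in self.nodes:
            if node.name == name:
                node.alive = False
```

Because the tail applies a write only after every earlier node has applied it, a read at the tail never observes a value that is missing upstream. That ordering is the property the strong-consistency claim rests on in chain replication generally. How NetChain realizes it inside switches, and how it repairs a chain after a switch failure, is described in the full paper rather than here.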
