Can 100 Machines Agree?

11/18/2019 ∙ by Rachid Guerraoui, et al.

Agreement protocols have typically been deployed at small scale, e.g., using three to five machines. This is because these protocols seem to suffer from a sharp performance decay: as the size of a deployment, i.e., its degree of replication, increases, protocol performance greatly decreases. There is not much experimental evidence for this decay in practice, however, notably for larger system sizes, e.g., beyond a handful of machines. In this paper we execute agreement protocols on up to 100 machines and observe their performance decay. We consider well-known agreement protocols that are part of mature systems, such as Apache ZooKeeper, etcd, and BFT-Smart, as well as a chain and a novel ring-based agreement protocol which we implement ourselves. We provide empirical evidence that current agreement protocols execute gracefully on 100 machines. We observe that throughput decay is initially sharp (consistent with previous observations); but intriguingly, as each system grows beyond a few tens of replicas, the decay dampens. For chain- and ring-based replication, this decay is slower than for the other systems. The positive takeaway from our evaluation is that mature agreement protocol implementations can sustain out-of-the-box 300 to 500 requests per second when executing on 100 replicas on a wide-area public cloud platform. Chain- and ring-based replication can reach between 4K and 11K requests per second (up to 20x improvement), depending on the fault assumptions.







1 Introduction

An agreement (or consensus) protocol [dwork88dls, lam98paxos] allows nodes in a distributed system to agree on every action they take, and hence maintain their state consistently. Agreement protocols are essential for providing strong consistency, or linearizability [HerlihyW90], in distributed systems, and such protocols are able to withstand even arbitrary (i.e., Byzantine) failures [castro2002, lamp82byzantine].

According to distributed systems folklore, agreement protocols are too expensive if deployed on more than a handful of machines [bezerra2016strong, cowling2006hq, Kermarrec2000, hu10zookeeper]. Indeed, there is a tradeoff between performance and fault-tolerance, and some performance decay is inherent in strongly-consistent systems [abd05quorumupdate]. When growing in size, such a system can tolerate more faults—but performance does not scale accordingly. This is because the system replicas (more precisely, a quorum of them) must agree on every operation. Hence, as the number of machines increases, the cost for agreement increases.

A few workarounds exist to deal with this performance decay. First, some systems ensure strong consistency for only a small, critical subset of their state (e.g., configuration), while the rest of the system has a scalable design under weaker guarantees [adya2010centrifuge, gh03gfs, qi13espresso]. The critical part of the system builds on mature agreement protocols such as ZooKeeper [hu10zookeeper], etcd [etcd, ongaro14search], Consul [mishra1993consul], or Boxwood [maccormick2004boxwood].

A second workaround is sharding [glenden11scatter, ab13sharding]. In this case, the service state is broken down into disjoint shards, each shard running as a separate agreement cluster [bezerra2016strong, co13spanner]. Additional mechanisms for cross-shard coordination, such as 2PC [ba11megastore], ensure that the whole system stays consistent.

Yet a third workaround consists in abandoning strong consistency, eschewing agreement protocols [bre12cap, Vogels09]. Avoiding agreement protocols is sometimes possible for certain data types, e.g., CRDTs [agu11das, du18beat, gup16nonconsensus, sh11crdt]. But for solving certain problems, notably general state machine replication (SMR), databases, or smart contracts, an agreement building block is necessary [co13spanner, guerra19cnc, sou18byzantine].

Briefly, the purpose of these workarounds is to avoid executing agreement at larger scale. Consequently, agreement protocols have almost never been deployed, in practice, on more than a few machines, typically three to five [co13spanner]. Today there is not much empirical evidence of their throughput decay. For instance, we do not know how ZooKeeper or PBFT [castro2002] perform with, say, 100 machines. In fact, anecdotal evidence suggests that agreement protocols often do not work at all beyond a few machines [clement09making, gueta2018sbft, sou18byzantine].

This question—of performance decay for agreement protocols—is not only of academic interest. For example, agreement protocols are important in decentralized services: they stand at the heart of distributed ledger applications in permissioned environments [hyperledger, sou18byzantine]. For these applications, SMR protocols are expected to run on at least a few tens of machines [Croman16, vuko15quest]. Another example is in a sharded design: under the Byzantine fault-tolerance model, shards cannot be too small, otherwise an adversary can easily compromise the whole system. In this case, it is critical for each shard to comprise tens or hundreds of machines [koko17omniledger, Kermarrec2000]. But agreement protocols suffer from the "they do not scale" stigma and the lack of experiments around their performance decay.

In this chapter we address the void in the literature by deploying and observing how the size of a system (executing agreement) impacts its performance. We focus on SMR systems, since agreement protocols most commonly appear in such systems. Our primary goal is to obtain empirical evidence of how SMR performance decays in practice at larger sizes, and hopefully allay some of the skepticism around the ability of such systems to execute across tens or hundreds of replicas (i.e., machines). We deploy and evaluate five SMR systems on up to 100 replicas and report on our results.

The first three systems we study are well-known SMR implementations: ZooKeeper [hu10zookeeper] and etcd [etcd], which are crash fault-tolerant (CFT), and BFT-Smart [bessani2014state], which is Byzantine fault-tolerant (BFT). Consistent with previous observations [bezerra2016strong, cowling2006hq, hu10zookeeper], we observe that their throughput decays sharply at small scale. The interesting part is that this sharp decay does not persist: overall, their throughput follows an inversely proportional decay rate, and the decay dampens as the systems grow larger, e.g., beyond a few tens of replicas.

ZooKeeper, etcd, and BFT-Smart execute most efficiently, i.e., obtain their best performance, when deployed at their smallest size: 3 replicas (4 for BFT-Smart). Throughput then declines steeply over the first few tens of replicas, falling to a small fraction of its best value, and keeps declining, more gently, up to 100 replicas. In absolute numbers, these systems sustain 300 to 500 rps (requests per second) at 100 replicas on modest hardware in a public cloud platform, with average latencies of a few seconds at most, even for BFT.

These three systems are hardened SMR implementations, and we choose them for their maturity. We complement the performance observations with a stability study: briefly, we seek to understand whether these systems are capable of functioning despite faults at large scale. More precisely, we inject a fault in the primary (i.e., leader) replica and evaluate each system's ability to recover. We find that ZooKeeper recovers excellently (in a few seconds), indicating that this system can perform predictably at scale, for instance to implement a replicated system across hundreds of nodes. The other two systems are slower to recover, or have difficulties doing so, at 100 replicas.

The fourth system we investigate is our own prototype of chain replication [van04chain]. This system is throughput-optimized, so it helps us delineate the ideal case, namely a throughput upper bound. When growing to 100 replicas, its throughput decays very slowly relative to the other systems. If we place replicas carefully on the wide-area network so as to minimize chain traversal time, the chain exhibits latencies of a few seconds at most.

It can be misleading, however, to praise chain replication as the ideal agreement protocol. Our chain prototype does not suffer from performance decay as severely as the others, indeed, but only in graceful executions (i.e., failure-free, with a synchronous network). In non-graceful runs, its throughput drops to zero. The protocol sacrifices availability, because it must reconfigure (pausing execution) to remove any faulty replica [van04chain]. It relies on a synchronous model with fail-stop faults, a strong assumption. Worse still, the chain is only as efficient as its weakest link, so a single straggler can drag down the performance of the whole system.

The fifth system, which we design ourselves, employs a ring overlay (a generalization of a chain). In contrast to our chain prototype, this system does not pause execution for reconfiguration, maintaining availability despite asynchrony or faults.

Unlike prior solutions, our ring system relies neither on reconfiguration [van04chain, van12byzantine] nor on a classic broadcast mode [aublin15next700, knezevic12high] for masking faults or asynchrony; doing so would incur downtime and hurt performance. Instead, we take the following approach: each replica keeps fallback (i.e., redundant) connections to other replicas. When faulty replicas prevent (or slow down) progress, a correct replica can activate its fallback path(s) to restore progress and maintain availability. The goal of this simple mechanism is to selectively bypass faults or stragglers on the ring topology, preserving good throughput.

To the best of our knowledge, this is the first ring-based system to preserve its topology (and availability) despite active faults. In a 100-node deployment, the ring sustains several thousand requests per second when there are no faults, and its throughput degrades gracefully, rather than collapsing, when replicas manifest malicious behavior. Since the chain and ring systems are research prototypes, we do not evaluate their stability, which we leave for future work.

To summarize, in this paper we investigate how performance decays in agreement protocols as their size increases. We deploy five SMR systems at sizes of 100 or more replicas in a geo-replicated network. We observe that throughput indeed decays in these systems as a function of system size, but that this decay dampens. Our experiments with chain- and ring-based replication show that there are ways to alleviate throughput decay in SMR, informing future designs.

In the rest of this paper, we provide some background, including on the SMR systems under study (Section 2), and then discuss the methodology of our empirical evaluation (Section 3). Next, we present the results of our evaluation of the performance decay of five SMR systems (Section 4). We take a rather unconventional approach: we first present the evaluation of our ring system (Section 4), and only then present its design (Section 5). We also discuss related work (Section 6), and then conclude (Section 7).

2 Background & Motivation

Consensus protocols often employ a layered design. In Paxos terminology, for instance, there is a distinction between proposers, acceptors, and learners (or observers) [van15vive]. Proposers and acceptors effectively handle the agreement protocol, while learners handle the execution of operations (i.e., client requests).

We are interested in the agreement protocol. This is the one typically encountering scalability issues. As mentioned earlier, agreement should ideally execute on tens or hundreds of replicas in certain applications, e.g., for decentralized services, or to ensure that shards are resilient against a Byzantine adversary [Croman16, koko17omniledger, Kermarrec2000, vuko15quest].

When increasing the size of a system executing agreement, some performance degradation is unavoidable. This is inherent to replicated systems that ensure strong consistency, because a higher degree of replication (i.e., fault-tolerance) entails a bigger overhead to agree on each request. But how does performance decay: in a linear manner? Does the decay worsen, or does it lessen, as system size increases? Both throughput and latency are vital measures of performance, and it is well known that the two are at odds with each other in SMR systems [guerraoui16icg, lamp03lowerbounds]. In this paper we focus on throughput, but our study also covers latency results.

We focus on systems with deterministic guarantees. There is a growing body of work proposing agreement protocols for very large systems. Most of this work, however, offers probabilistic guarantees, e.g., systems designed for cryptocurrencies [eyal16bitcoinng, gilad2017algorand, koko17omniledger], or group communication [Kermarrec2000]. These probabilistic solutions often employ a committee-based design, where a certain subset of nodes executes agreement on behalf of the whole system. The protocol which this subset of nodes typically executes is, in fact, a deterministic agreement algorithm. This can be seen most clearly, for instance, in distributed ledger systems such as ByzCoin [kogi16byzcoin] or Bitcoin-NG [eyal16bitcoinng], where the chosen committee runs a PBFT-like algorithm [castro2002].

System                    ZooKeeper (ZAB)   etcd (Raft)       BFT-Smart         Chain             Ring (Section 5)
Synchrony assumptions     partial sync.     partial sync.     partial sync.     sync.             partial sync.
Fault model; system size  crash; 2f+1       crash; 2f+1       Byzantine; 3f+1   crash; f+1        Byzantine; 3f+1
Communication pattern     leader-centric    leader-centric    leader-centric    chain             ring
Msgs. at bottleneck node  O(N)              O(N)              O(N)              O(1)              O(1)
One-way message delays    O(1)              O(1)              O(1)              O(N)              O(N)
Table 1: Overview of the SMR systems in our study (ZAB [junq11zab], Raft [ongaro2014consensus], BFT-Smart [bessani2014state], chain replication [van04chain]; the ring protocol is described in Section 5). The BFT delay count assumes the tentative execution optimization [castro2002, sousa15wheat]. The delays listed are the worst case; depending on which replica receives the client request, the message delay can be lower (as we explain later, in Section 5).

2.1 SMR Systems in Our Study

Our study covers five representative SMR protocols. We give an overview of these systems in Table 1, highlighting some of their essential differences. As can be seen, these systems cover two synchrony models (partially synchronous and synchronous), two fault models (crash and Byzantine), and three classes of communication patterns (leader-centric, chain, and ring topology).

In terms of message complexity at the bottleneck node, in the first three protocols the leader does two or three rounds of broadcast. In our chain and ring systems the load is equally distributed across all replicas: in the chain, each replica sends exactly one message per request, whereas each ring replica processes a small, constant number of messages per request. Finally, the message-delays row includes communication to and from the client. To sum up, these five systems cover a wide range of design choices. We now discuss each system in more detail.
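The contrast in bottleneck load can be made concrete with a toy per-request message count. This is an illustrative sketch, not a measurement; the constants (two broadcast rounds for leader-centric protocols, a fanout of two for the ring) are assumptions matching the patterns above.

```python
# Toy per-request message count at the bottleneck replica, for a system of
# N replicas. Constants are illustrative assumptions, not measured values.
def bottleneck_msgs(n, pattern):
    if pattern == "leader":   # leader broadcasts ~2 rounds to N-1 followers
        return 2 * (n - 1)
    if pattern == "chain":    # every replica forwards exactly one message
        return 1
    if pattern == "ring":     # small constant fanout, independent of N
        return 2
    raise ValueError(pattern)

for p in ("leader", "chain", "ring"):
    print(p, bottleneck_msgs(100, p))   # leader grows with N; chain/ring do not
```

At 100 replicas the leader handles hundreds of messages per request, while chain and ring replicas handle a constant amount, which is the intuition behind their slower throughput decay.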


ZooKeeper. This system is based on ZAB, an atomic broadcast algorithm [junq11zab], and is implemented in Java. We study ZooKeeper rather than ZAB directly, as ZAB is tightly integrated inside ZooKeeper.


etcd. This system is implemented in Go and is based on the Raft consensus algorithm [ongaro14search]. ZAB and Raft share many design points [ongaro2014consensus]. A central feature of both is a leader replica which guides the agreement protocol through a series of broadcast-based rounds. As our experiments will show, these two systems experience very similar throughput decay (Section 4).

ZooKeeper and etcd are widely used in production and are actively maintained. They have found adoption both in cluster and multi-datacenter (WAN) environments [etcduse, an2015wide, bec11leader, fournierzkacross]. CFT SMR protocols are also relevant to implementing decentralized trust, e.g., in a blockchain. For instance, a version of Hyperledger Fabric uses Apache Kafka as a CFT consensus algorithm [kafkaHL]. These two CFT protocols are interesting in their own right, not necessarily in blockchains, and can also indicate how SMR performs in certain variants of the Byzantine fault model (e.g., XFT model [liu15xft]).


BFT-Smart. The third system we study, implemented in Java, provides BFT guarantees [bessani2014state]. BFT-Smart is actively used and has been maintained by a team of developers for over eight years, being a default choice for prototyping research algorithms in several groups [bftshome, liu15xft, Visigoth]. It is patterned after the seminal PBFT consensus algorithm of Castro and Liskov [castro2002].

The three systems described so far employ a leader-centric design [bi12spaxos, ongaro2014consensus], i.e., they rely on the leader replica to carry most of the burden of the agreement protocol. Specifically, the leader not only establishes a total order across operations, but also disseminates (via broadcast) those operations to all replicas. This design simplifies the SMR algorithm [ongaro2014consensus]. The disadvantage is that the leader replica (its CPU or bandwidth) typically becomes the bottleneck [junq11zab].

We choose these three systems for their maturity: they are production-ready (ZooKeeper and etcd) or seasoned implementations (BFT-Smart). In addition to their performance, we also study the stability of these three systems, i.e., executions where the leader replica crashes. Prototypes, like the next two systems we consider, may deliver better performance, but often do so by omitting vital production-relevant features that can hamper performance.

We wrote the fourth and fifth SMR systems ourselves: a prototype of chain replication [van04chain], and a new ring-based replication protocol with BFT guarantees. Both are written in Go. Chain replication, and in particular its ring-based variants, are provably throughput-optimal in uniform networks [guerraoui10ring, jalili17ring] (as we discuss later, in Section 4 and Section 6, the WAN testbed we use is not uniform, yet these systems show very good performance). In contrast to leader-centric protocols, these systems avoid the bottleneck at the leader because they balance the burden of disseminating operations across multiple system replicas.


Chain prototype. We use this system as a baseline, to obtain an ideal upper bound (what other SMR systems could aim for), both in terms of absolute throughput and throughput decay. We faithfully implement the common case, with pipelining and batching, to optimize performance [ongaro2014consensus, santos2012tuning].

Our chain prototype works in a fail-stop model, i.e., it assumes synchrony in order to mask crash faults [van04chain], unlike the other four systems we study in this paper. Solutions for making chain- or ring-based systems fault-tolerant include an external reconfiguration module or a special recovery mode [aublin15next700, ab13sharding, knezevic12high, van12byzantine]. In such solutions, even simple crashes put the system in a degraded mode, possibly for extended periods of time.


Ring prototype. This system represents our effort to design a BFT protocol optimized for throughput, one that withstands sub-optimal conditions (e.g., faults, asynchrony) and hence offers improved availability. We briefly describe it below, and dedicate a separate section to the full details (Section 5).

Our ring prototype is a ring-based agreement protocol assuming partial synchrony. When a fault occurs, it is masked by temporarily increasing the fanout at a particular replica. This is in contrast to prior ring or chain designs, which resort to reconfiguration or recovery.

By default, every replica has a fanout of 1, i.e., it forwards everything it receives to its immediate successor on the ring. In the worst case, up to f consecutive replicas on the ring can be faulty. In this case, the predecessor of all these nodes (a correct replica) increases its fanout to f+1, bypassing all faults. This way, the successor of these faulty nodes still receives all updates propagating on the ring, and progress is not interrupted. System throughput deteriorates when this happens, but not as badly as in broadcast-based solutions, where one replica, the leader, has a fanout on the order of N [castro2002, hu10zookeeper].
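The fanout-bypass rule can be sketched as follows. This is our own illustration of the mechanism, not the prototype's code; the function names and ring representation are assumptions.

```python
# Sketch of the ring fanout-bypass rule: a correct replica forwards to its
# immediate successor (fanout 1); if the next k successors are faulty, it
# forwards to all k of them plus the first correct one (fanout k + 1), so
# progress on the ring is never interrupted. Illustrative only.

def forwarding_targets(ring, sender, faulty):
    """Successors `sender` forwards to, so that the first correct
    successor is guaranteed to receive the update."""
    n = len(ring)
    start = ring.index(sender)
    targets = []
    for step in range(1, n):
        node = ring[(start + step) % n]
        targets.append(node)
        if node not in faulty:
            break  # reached the first correct successor; stop expanding
    return targets

ring = ["r0", "r1", "r2", "r3", "r4"]
print(forwarding_targets(ring, "r0", faulty=set()))         # fanout 1: ['r1']
print(forwarding_targets(ring, "r0", faulty={"r1", "r2"}))  # fanout 3: ['r1', 'r2', 'r3']
```

Note that the faulty successors are still included among the targets: they may merely be slow rather than crashed, and excluding them would require a perfect failure detector.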

3 Methodology

We now discuss the testbed for our study (Section 3.1), details of the write-only workload (Section 3.2), as well as the workload suite we use to conduct experiments (Section 3.3).

3.1 Testbeds

We use the SoftLayer public cloud platform, spanning multiple datacenters, as our testbed [softlayerDCs]. We use modestly provisioned virtual machines (VMs), with few (virtual) CPU cores and little RAM, to gain insight into SMR scalability on commodity hardware. Each client and each replica executes in a separate VM. This separation avoids unnecessary noise in our results, which would arise if client and/or replica processes contended for local resources.


The bandwidth available between different nodes, whether clients or replicas, is not a bottleneck: every VM has ample provisioned bandwidth. Latencies in SoftLayer range from under 10 ms to almost 200 ms RTT, depending on distance; we consider nine regions of North, Central, and South America. We use ping to measure the inter-regional latencies (which are symmetric), and present our results in Table 2.

        MON   TOR   DAL   SEA   SJC   HOU   MEX   SAO
WDC      15    22    31    56    60    39    56   115
MON             9    38    61    64    43    63   123
TOR                  30    53    56    37    55   124
DAL                        40    36     8    25   143
SEA                              18    48    65   174
SJC                                    44    56   195
HOU                                          30   136
MEX                                               167
Table 2: Inter-regional RTT (msec) in SoftLayer. The nine regions are: Washington (WDC), Montreal (MON), Toronto (TOR), Dallas (DAL), Seattle (SEA), San Jose (SJC), Houston (HOU), Mexico (MEX), and Sao Paulo (SAO).
Node placement.

As Table 2 illustrates, there is a large disparity in cross-regional latencies. Consequently, replica and client placement across regions can impact performance. By default, we always place all clients in Washington. Spreading clients randomly has no benefit, and would introduce unnecessary variability in the results. For ZooKeeper, etcd, and BFT-Smart, we place replicas randomly across the nine regions.

Replica placement is particularly important for chain- or ring-based systems. In our chain system, for instance, a client request propagates from replica to replica, starting at the head of the chain until it reaches the tail, and the tail responds to clients. If we distributed replicas randomly, requests would pass back and forth between regions, accumulating latencies on the order of seconds or worse; random replica placement would be unreasonable. Instead, successive nodes in the chain should be clustered in the same region, and jumps across regions should be minimized, i.e., the latency for traversing the chain should be minimal (intuitively, this corresponds to solving the traveling salesman problem).

We take a simple approach to replica placement for the chain and ring systems. We start with Washington, and then traverse the continent from East to West and North to South, i.e., counter-clockwise, as follows: (1) Washington, (2) Montreal, (3) Toronto, (4) Dallas, (5) Seattle, (6) San Jose, (7) Houston, (8) Mexico, and (9) Sao Paulo. Each region hosts a random number of replicas. While this is not the optimal placement method, it is simple and yields surprisingly good results (Section 4.1). More complex alternatives to our heuristic exist, typically based on integer programming [wu2013spanstore] or iterative optimization algorithms [agarwal2010volley].
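As a rough sanity check of this heuristic, one can compute the one-way chain-traversal latency implied by the RTTs in Table 2 for the counter-clockwise tour. The snippet below is an illustrative sketch, not the placement tooling used in the study; it approximates one-way latency as RTT/2 and considers one region hop per chain link.

```python
# Inter-regional RTTs (msec) from Table 2; latencies are symmetric.
RTT = {
    ("WDC","MON"):15, ("WDC","TOR"):22, ("WDC","DAL"):31, ("WDC","SEA"):56,
    ("WDC","SJC"):60, ("WDC","HOU"):39, ("WDC","MEX"):56, ("WDC","SAO"):115,
    ("MON","TOR"):9,  ("MON","DAL"):38, ("MON","SEA"):61, ("MON","SJC"):64,
    ("MON","HOU"):43, ("MON","MEX"):63, ("MON","SAO"):123,
    ("TOR","DAL"):30, ("TOR","SEA"):53, ("TOR","SJC"):56, ("TOR","HOU"):37,
    ("TOR","MEX"):55, ("TOR","SAO"):124,
    ("DAL","SEA"):40, ("DAL","SJC"):36, ("DAL","HOU"):8,  ("DAL","MEX"):25,
    ("DAL","SAO"):143,
    ("SEA","SJC"):18, ("SEA","HOU"):48, ("SEA","MEX"):65, ("SEA","SAO"):174,
    ("SJC","HOU"):44, ("SJC","MEX"):56, ("SJC","SAO"):195,
    ("HOU","MEX"):30, ("HOU","SAO"):136,
    ("MEX","SAO"):167,
}

def one_way(a, b):
    """Approximate one-way latency between two regions as RTT / 2."""
    return RTT.get((a, b), RTT.get((b, a), 0)) / 2

def traversal(order):
    """One-way latency to traverse a chain whose regions appear in `order`."""
    return sum(one_way(a, b) for a, b in zip(order, order[1:]))

heuristic = ["WDC","MON","TOR","DAL","SEA","SJC","HOU","MEX","SAO"]
print(traversal(heuristic), "ms")  # prints 176.5 ms
```

Even this crude estimate shows the tour keeps cross-region hops cheap everywhere except the unavoidable jump to Sao Paulo, which dominates the traversal time.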

To conclude, this multi-datacenter deployment mimics a real-world global deployment, where each datacenter is effectively a city (with intra-city latencies being negligible). This is a standard experimental setup [gilad2017algorand, rocket18avalanche, gueta2018sbft]. Note that placing clients in Washington does not give any advantage to the chain or ring systems, since each client request has to traverse the whole chain (or ring).

Operating system and software.

All machines in our study run Ubuntu x64. For Apache ZooKeeper, we use the packaged release [zkpkg]. We install etcd directly from its repository, and BFT-Smart from its released package [bftspkg].

3.2 Workload Characteristics

The goal of our workload is to stress the central part of the systems under study: their underlying agreement protocol, which is typically the bottleneck. In practice, this protocol is also in charge of replicating the request data (i.e., blocks of transactions) to all replicas. Clients produce a workload consisting of write-only requests, i.e., we avoid read-only optimizations such as leases [cha07pml]. Each request has a fixed size, inspired by a Bitcoin workload [Croman16].

Local Handling of Requests.

Requests are opaque values which replicas do not interpret, consisting of a constant string. The execution step consists of simply storing the data of each request in memory. Recall that our main goal is to find how system size influences throughput decay. For this reason, it is desirable to exclude features that are independent of system size, which typically would incur a fixed overhead (or amortization, in the case of batching). Such features (e.g., persistence layer, execution) are application-dependent, are often embarrassingly parallel, and moreover optimizations at this level [kapri12eve] are orthogonal to our study.

For ZooKeeper and etcd, we mount a tmpfs filesystem and configure these systems to write requests to this device. In BFT-Smart we handle requests via a callback, which appends every request to an in-memory linked list. Our chain and ring prototypes simply log each request to an in-memory slice (i.e., a growable array).


Batching. We do not optimize the batching configuration, because different system sizes require different batch sizes to maximize throughput [miller2016honey]. Moreover, batching is often an implementation detail hidden from users; for example, etcd hardcodes its batch size [etcdbatching]. Similarly, batching in ZooKeeper is not configurable: this system processes requests individually, and it is unclear whether batching is handled entirely at the underlying network layer (Netty). In BFT-Smart we use the default batch size. In our chain and ring systems we are more conservative, allowing only a bounded number of requests per batch, since these systems are already throughput-optimized in their dissemination scheme.

It is well known that batching affects the absolute throughput of a system. We are primarily interested, however, in the throughput decay function of SMR systems, not in maximizing absolute throughput numbers. Prior work has shown that batching does not affect throughput decay, e.g., in BFT SMR systems [cowling2006hq]. For this reason, we expect the throughput decay of each system to evolve independently of batch size.

3.3 Workload Suite

Our workload suite has two parts. We use (1) a workload generator that creates requests and handles the communication of each client with the service. We also use (2) a set of scripts to coordinate the workload across all clients and control the service-side (e.g., restart service between subsequent experiments). The workload generator differs for every SMR system, since each system has a different API. The coordinating side is a common part which we reuse.

The main components of the workload generator are a client-side library, which abstracts over the target system, and a thread pool, e.g., using the multiprocessing.Pool package in Python, or java.util.concurrent.Executors in Java. The thread pool instantiates parallel workload generators from a given client node. For each of ZooKeeper, etcd, and our chain and ring prototypes, we implement the workload generator as a Python script. BFT-Smart is bundled with a Java client-side library; accordingly, for this system, we write the workload generator in Java.

As mentioned earlier, all client nodes are placed in the same region: Washington. Each client VM runs the workload generator instantiated with a predefined number of threads; depending on the target system and its size, we tune the number of threads per client to saturate the system and reach peak throughput. Besides the number of threads, the workload generator accepts a few other parameters, notably: the IP and port of a system replica (the leader in ZooKeeper, etcd, or BFT-Smart; the head of the chain; or a random ring replica); the experiment duration; and the size of each request.
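The generator's structure can be sketched as follows. This is a minimal, hedged sketch rather than the actual script: `send_request` is a stand-in assumption (the real generators speak each system's client API), and we use the thread-based variant of the multiprocessing pool mentioned above so the example is self-contained.

```python
# Sketch of a closed-loop workload generator: a pool of client threads
# issues fixed-size requests back-to-back for a given duration, then the
# per-thread completion counts are summed into aggregate throughput.
import time
from multiprocessing.pool import ThreadPool  # thread-based multiprocessing.Pool

REQUEST = b"x" * 512  # opaque fixed-size payload; the size is illustrative

def send_request(payload):
    """Stand-in for a blocking call to the target replica's API."""
    time.sleep(0.001)  # simulate network round-trip plus agreement latency
    return True

def client_thread(duration_s):
    """Closed loop: issue requests until the deadline, count completions."""
    done, deadline = 0, time.time() + duration_s
    while time.time() < deadline:
        if send_request(REQUEST):
            done += 1
    return done

def run_workload(threads=4, duration_s=1.0):
    with ThreadPool(threads) as pool:
        counts = pool.map(client_thread, [duration_s] * threads)
    return sum(counts)  # total completed requests across all threads

total = run_workload(threads=4, duration_s=0.5)
print(total / 0.5, "rps")
```

The coordination scripts would launch one such generator per client VM (via ssh and GNU parallel) and sum the per-client totals, mirroring how we compute aggregate throughput in Section 4.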

It is important that clients synchronize their actions. For instance, client threads should coordinate to start simultaneously. Also, we restart and clean-up the service of each system after each experiment, and we also gather statistics and logs. The scripts for achieving this are common among all systems. We use GNU parallel [tange2011gnu] and ssh to coordinate the actions of all the clients. To control the service-side, we use the Upstart infrastructure [upstart].

4 Performance Decay Study

We now present our observations on the performance decay of five SMR protocols. We break this section in two parts: performance (Section 4.1) and stability results (Section 4.2).

As mentioned before, we use multiple client machines, each running several workload threads. Upon connecting to a system replica, clients allow a short respite to ensure all connections establish correctly; then all client threads begin the same workload. Each execution runs for a fixed duration (excluding a warm-up and cool-down time of 15 seconds each), and each point in the performance results is the average of several executions. Stability experiments use longer executions.

Figure 1: Throughput decay for five SMR systems on a public wide-area cloud platform. For enhanced visibility, we use separate graphs for each system. We also indicate actual throughput values alongside data points. Notice the different axes.

4.1 Performance of SMR at 100+ Replicas

We first discuss throughput (Section 4.1.1), and then latency (Section 4.1.2).

4.1.1 Throughput

We report the peak throughput, i.e., throughput when the system begins to saturate, before latencies surge, in requests per second (rps). We compute this as the sum of the throughput across all clients. We also plot the standard deviation, though it is often negligible and not visible in the plots.

Figure 1 presents our results. For readability, we also indicate throughput values on most data points.

ZooKeeper, etcd, and BFT-Smart.

For ZooKeeper and etcd, which are CFT, we start from a minimum of 3 replicas and then grow each system in fixed increments until we reach 100 replicas. For BFT-Smart, we start from 4 replicas, the minimum configuration which offers fault-tolerance in a BFT system, and grow in fixed increments up to 100. Since we use different increments and start from different system sizes, the x-axes in Figure 1 differ slightly across these systems. In terms of fault thresholds, f = (N-1)/2 for the CFT systems and f = (N-1)/3 for BFT-Smart (rounding down).
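The fault thresholds above follow from the standard resilience bounds (N >= 2f+1 replicas to tolerate f crash faults, N >= 3f+1 for f Byzantine faults), which can be expressed directly:

```python
# Standard SMR resilience bounds: a CFT system of N replicas tolerates
# f = (N-1)//2 crash faults; a BFT system tolerates f = (N-1)//3
# Byzantine faults.
def max_faults(n, byzantine=False):
    return (n - 1) // 3 if byzantine else (n - 1) // 2

print(max_faults(3))                   # CFT minimum: tolerates 1 crash
print(max_faults(4, byzantine=True))   # BFT minimum: tolerates 1 Byzantine fault
print(max_faults(100), max_faults(100, byzantine=True))  # prints 49 33
```

This is why the smallest meaningful deployments are 3 replicas for ZooKeeper and etcd but 4 for BFT-Smart, and why at 100 replicas the CFT systems tolerate substantially more faults than the BFT one.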

Intuitively, there are good reasons to believe that running agreement with more than a few replicas should be avoided. Every additional replica participates in the consensus algorithm, inflating its complexity; this complexity is quadratic in the number of replicas for BFT systems [cowling2006hq, Kermarrec2000], reinforcing the belief that SMR deployments should always stay in their small "comfort zone."

Prior findings on throughput decay are scarce and limited to small scale; briefly, they suggest that leader-centric protocols decay very sharply [bessani2014state, cowling2006hq, hu10zookeeper], and consequently should be avoided at larger sizes. For instance, throughput in a LAN deployment of ZooKeeper decays substantially with each increase in system size [hu10zookeeper, §5.1]. The same findings are echoed in experiments with PBFT on a LAN [abd05quorumupdate, §4.2] and on the wide-area network [miller2016honey, §5.2]. Under a sustained sharp decay, throughput would drop to zero before reaching a few tens of replicas.

Our findings complement these earlier observations at smaller scale. Consistent with previous findings, we observe in Figure 1 that the sharpest decline in throughput occurs at small system sizes. What happens, however, is that the decay trend tapers off: from a few tens of replicas onward, throughput decays more and more gracefully.

Upon closer inspection, throughput decays at a non-linear rate, roughly inversely proportional to the system size n. Analytically, the slopes for leader-centric systems approximate a C/n function. These findings confirm an intuitive understanding of these protocols. As discussed, ZooKeeper, etcd, and BFT-Smart are leader-centric protocols. The leader has a certain (fixed) processing capacity, say C, equally divided among the other system nodes, yielding a C/n decay rate. These systems are bound by the capacity of the leader—typically CPU or bandwidth. In our case, WAN links are not saturated; these systems are bound by leader CPU, since we use relatively small VMs [hu10zookeeper, jalil14practical].
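To illustrate, a toy model (our own sketch, not code from the paper) of a leader whose fixed capacity C is split among its n − 1 followers reproduces this shape: the decay is steep at small n and dampens as n grows.

```python
def leader_bound_throughput(n: int, capacity: float) -> float:
    """Throughput of a hypothetical leader-centric SMR system with n replicas:
    the leader's fixed capacity is divided among the n - 1 other nodes."""
    assert n >= 2
    return capacity / (n - 1)

def decay_step(n: int, capacity: float) -> float:
    """Throughput lost when growing the system from n to n + 1 replicas."""
    return leader_bound_throughput(n, capacity) - leader_bound_throughput(n + 1, capacity)
```

In this model, growing from 3 to 4 replicas costs far more throughput than growing from 99 to 100, matching the dampening we observe.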

Throughput decay dampens as we grow each system because every additional replica incurs an amount of work that depends on the current system size: adding a replica to a small system is more costly than adding one to a large system, given that there are some fixed processing overheads at the primary which get amortized with system size. We also note that in a larger system there are more tasks (such as broadcasts) executing in parallel. Finally, and perhaps most importantly, throughput saturates at higher latencies when the systems are larger (see Section 4.1.2), since the processing pipeline depends on more replicas; i.e., as each system grows, there is a tendency to trade latency for throughput.

We observe that absolute throughput numbers at replicas are in the same ballpark for these three SMR systems, ranging from to rps. As a side note, this is the current peak theoretical throughput of Bitcoin, suggesting that we can use SMR effectively in mid-sized blockchains, e.g., replicas. If we extrapolate from our observation on the decay rate, it follows that these systems can match Bitcoin peak throughput at about replicas. This is an interesting observation, as the Bitcoin network has about the same size (circa late 2016 [Croman16]). We interpret this as a simple coincidence, however.

Interestingly, BFT-Smart almost matches the performance of its CFT counterparts. This suggests that BFT SMR protocols scale roughly as well as CFT protocols, despite the typically quadratic communication complexity of BFT. All three systems were surprisingly stable when running at scale, showing consistent results; consequently, the standard-deviation bars are often imperceptible in our plots.


We deploy using the node placement heuristic presented earlier (Section 3.1). The throughput evolution results for this system are in the fourth row of Figure 1; please note the y-axis range. As expected, preserves its throughput very well with increasing system size [aublin15next700]. So long as the system stays uniform—i.e., without reducing replica computing performance or bandwidth—the throughput degrades very slowly and at a linear rate. At , the system sustains of the throughput it can deliver when . This is not entirely surprising, as adding replicas to does not increase the load on any single node (to a first approximation), including the leader (i.e., the head replica).

To conclude, chain replication maintains throughput exceptionally well. This system, however, sacrifices availability in the face of faults or asynchrony. Additionally, the chain is only as efficient as its weakest link. Indeed, in our experiments we repeatedly encountered cases of zero throughput. Most often this happened due to a replica crashing, but in a few cases also due to a misconfigured successor, which left the tail node unreachable.


This ring-based system can maintain availability despite faulty (or straggler) nodes. Accordingly, we have two measurements: (1) a common case showing the throughput during well-behaved executions, and (2) a sub-optimal (faulty) case when faults manifest. Since tolerates up to one-fifth faulty replicas [mar06fab], we set . The faulty replicas occupy successive positions on the ring topology, and collaboratively they aim to create a bottleneck in the system. They do so by activating fallback paths on the ring structure. Concretely, they request from the same correct replica—the target—to accept traffic from each of them and pass that traffic forward on the ring. The target is the successor of the last faulty node. To make matters worse, the target is also the leader replica (we call this the sequencer, described in Section 5.1). We note that faulty replicas could instead stop propagating updates altogether; this, however, has a lesser impact on throughput, as the target node would then not need to process the messages from all faulty nodes. This scenario is among the worst that can happen in terms of throughput degradation to , barring a full-fledged DDoS attack or a crashed leader (in the latter case, progress would halt in any leader-based SMR algorithm).

The results are in the fifth row of Figure 1. The throughput decays similarly in the good and faulty cases; in absolute numbers there is a difference of rps at every system size. Another way to look at it is that faulty nodes cause on average a loss in throughput. degrades less gracefully than , as it includes additional BFT mechanisms (Section 5.1). Nevertheless, the decay in follows a linear rate, and comes within of throughput (which we regard as ideal, assuming fault-free executions). In contrast to leader-centric solutions, the chain- and ring-replication systems avoid the bottleneck at the leader and, as expected, exhibit less throughput decay, having more efficient dissemination overlays.

4.1.2 Latency

For latency, we report on the average latency at peak throughput, as observed by one of the clients. Since all clients reside in Washington and connect to the same replica, they experience similar latencies. The exception to this is in , where clients connect to random replicas of the system; to be fair, we present the latency of a client connecting to a replica in Washington. The results for all five systems are in Figure 2. At small scale, CFT systems (ZooKeeper and etcd) exhibit latencies on the order of tens of milliseconds. In contrast, BFT-Smart entails an additional phase (round-trip) in the agreement protocol for every request, which translates into higher latencies; batching, however, helps compensate for this additional phase in terms of throughput.

Figure 2: Average latencies (at peak throughput) for scale-out experiments with SMR systems.

We remark on the high latency of . This is to be expected, since this system trades latency for throughput, but it is also amplified by an implementation detail. Specifically, each client runs an HTTP server, waiting for replies from the tail (a distant replica in São Paulo). The server is based on the CherryPy framework (written in Python, and unoptimized). At low load, the latency in is similar to . But in clients create a larger volume of requests to saturate the system while also running the HTTP server, elevating the load and latency on each client. (In fact, in an earlier version of , clients were the bottleneck.)

The average latency across all SMR systems does not surpass seconds. We only include the latency for the good case of , but even in the faulty case, latency does not exceed s. In terms of 99th percentiles, the worst cases are s for BFT-Smart (), and s for the faulty-case of ().

4.2 Stability experiments

Our goal here is to evaluate whether mature SMR systems recover efficiently from a serious fault when running at large scale. Concretely, we crash the leader replica (triggering the leader-change protocol) and measure the impact this has on throughput. We cover ZooKeeper, etcd, and BFT-Smart; the other systems ( and ) have no recovery implemented.

Figure 3: Stability experiments for ZooKeeper and etcd (), as well as for BFT-Smart ().

When the leader in an SMR system crashes, this kicks off an election algorithm to choose a new leader. We study the stability of this algorithm—whether it works at scale and how much time it requires. These tests are about the code maturity of these systems rather than their algorithmic advantages. Our results are in Figure 3, describing two runs: (1) a healthy case, and (2) a case where we crash the leader (i.e., primary). During the first s clients simply wait, and then they start their workload; we crash the leader s later. The point where we crash the leader is obvious, around s, as the throughput drops instantly to zero.

ZooKeeper consistently has the fastest recovery. The throughput reaches its peak within just a few seconds after the leader crashes, consistent with earlier findings at smaller scale [shraer12dynamic]. This system has an optimization so that the new leader is a replica with the most up-to-date state, which partly accounts for the fast recovery [ongaro2014consensus]. In etcd, recovery is slower: it can take up to seconds for throughput to return to its peak. The election mechanism in etcd is similar to that of ZooKeeper [ongaro2014consensus], and the heartbeat parameters of these two systems are similar as well ( and seconds, respectively). The difference in stability between ZooKeeper and etcd can also stem from an engineering aspect, as the former has a more mature codebase (started in 2007) than the latter (2013).

For BFT-Smart, we allowed the system up to seconds to recover but throughput remained . We also tried with smaller sizes, and found that is the largest size where BFT-Smart manages to recover, after roughly . Finally, we remark that we were able to reproduce all these behaviors across multiple (at least ) runs, so these are not outlying cases.

5 Protocol

An interesting design aspect of is how the ring overlay masks faults. We achieve this by keeping redundant paths on this overlay, thus ensuring availability despite faults or asynchrony. Also of interest is the agreement protocol of , which is the FaB consensus algorithm [mar06fab] adapted to a ring overlay.

We choose to pattern the agreement protocol of after the FaB [mar06fab] 2-step BFT consensus algorithm due to the interesting tradeoff it offers. FaB reduces the latency of BFT agreement from three steps to two. This is appealing in our ring topology, because each step is a complete traversal of the ring. Fewer traversals mean higher throughput, lower latency, and a simpler protocol than 3-step ones [castro2002]. This benefit comes at the expense of resilience: the system needs larger quorums to tolerate faults. assumes n ≥ 5f+1 replicas, whereas optimal BFT systems tolerate one-third faults, i.e., n ≥ 3f+1 [castro2002]. As in prior solutions, our agreement protocol relies on the existence of a sequencer (i.e., a leader) assigning sequence numbers to operations.
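The resilience tradeoff reduces to simple arithmetic (our own sketch; the bounds n ≥ 5f+1 and n ≥ 3f+1 come from FaB and PBFT, respectively):

```python
def max_faults_two_step(n: int) -> int:
    """Byzantine faults tolerated by 2-step (FaB-style) agreement: n >= 5f + 1."""
    return (n - 1) // 5

def max_faults_three_step(n: int) -> int:
    """Byzantine faults tolerated by optimal 3-step protocols (PBFT): n >= 3f + 1."""
    return (n - 1) // 3
```

At n = 100, for example, two-step agreement tolerates 19 faults whereas a three-step protocol tolerates 33: the price of the saved ring traversal.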

Figure 4: Overview of in a system of replicas. Clients interact with the system via a thin API. Underlying this API, there is a reliable broadcast scheme executing along a ring overlay network. In this overlay, each replica has a successor and predecessor. There is a specific replica which carries the role of a sequencer (replica here).

Figure 4 shows an overview of when . The replicas, labeled , are organized on a ring. Note that each replica in this overlay has a certain successor and predecessor replica. One of the replicas (node ) is the sequencer. By default, broadcast messages disseminate through the system from one replica to its immediate successor. Additionally, fallback (redundant) paths exist, to ensure availability despite asynchrony or faults.

The interface of is log-based: Each replica exposes a simple API which allows clients to read from a totally-ordered log and to Append new entries to this log. Under this API layer, all replicas implement a reliable broadcast primitive providing high throughput and availability. We discuss the Append operation first (Section 5.1), then the reliable broadcast layer (Section 5.2), followed by the reconfiguration sub-protocol (Section 5.3) and correctness arguments (Section 5.4).

5.1 Append Operation

To add an entry e to the totally-ordered log, clients invoke Append(e) at any replica . The operation proceeds in two logical phases:

  1. Data—Replica broadcasts entry e to all correct replicas using RingBroadcast of the underlying broadcast layer.

  2. Agreement—A BFT agreement protocol executes. The sequencer proposes a sequence number (i.e., a log position) for entry e, and correct replicas confirm this proposal. After executing the agreement phase for this entry, replica i notifies the client that the operation succeeded.

Listing 1 shows the implementation of the Append operation. First, replica broadcasts a message; this corresponds to the first logical phase of the Append operation. We say that this broadcast message is of type data and has payload e. As this message disseminates throughout the system, each correct replica triggers the RingDeliver callback (shown in Listing 1) to deliver the data message.

The delivery callback always provides two arguments: (1) an identifier id, and (2) the actual message. The underlying broadcast layer assigns the id, which uniquely identifies the associated message. We discuss identifiers in further detail later; suffice it to say that an identifier is a pair denoting the replica which sent the corresponding message plus a logical timestamp for that replica (Section 5.2). The actual message, in this case, is a data message with payload entry e. Upon delivery of any data message, each correct replica stores this entry in a pending set, indexed by the assigned id (see Listing 1).

The agreement phase starts when the sequencer replica delivers the data message with e. After saving e in the pending set, the sequencer also proposes a sequence number for e by broadcasting a message of its own (Listing 1). It would be wasteful (in terms of bandwidth) to include the whole entry in this agreement message; instead, the sequencer simply pairs the entry id with a monotonically increasing sequence number sn (called nextSeqNr in Listing 1). The hash in the agreement message is computed on the concatenation of the assigned sequence number, the entry id, and the entry e itself.

Listing 1: A high-level algorithm describing the Append operation.

Each replica delivers the agreement message of the sequencer through the same RingDeliver callback of the broadcast layer. After any replica delivers the agreement message, it then broadcasts its own agreement message with the same triplet (sn, id, hash), as shown in Listing 1. Prior to doing so, the replica validates the hash and saves the assigned sequence number and hash in the pending log (Listing 1).

The validation step (see Listing 1) comprises several important checks: that the hash correctly matches the sequence number sn, identifier id, and entry content; that this id has no other proposal for a prior sequence number; that each number sn in an agreement message has a corresponding proposal for that sn from the sequencer replica; and that different agreement messages confirming the same sn originate from distinct replicas.

We say that a replica commits on log entry e after it gathers sufficient matching confirmations for that entry. The entry then becomes stable at that replica (Listing 1). By the reliability of the RingBroadcast dissemination primitive, if the sequencer is correct and proposes a valid sequence number, then the entry eventually becomes stable at all correct replicas. As proved in prior work [mar06fab], this protocol has optimal resilience for two-step BFT agreement, i.e., n = 5f+1 is the smallest system size to tolerate f faults with two-step agreement.

Informally, as the initial agreement message (from the sequencer) propagates on the ring, it produces a snowball effect: each replica delivering this message broadcasts its own agreement message, confirming the proposed sequence number. Figure 5 depicts this intuition. Notice that the data and agreement phases partially overlap (beginning from replica ). Overall, it takes message delays for replica to respond to the client request. In terms of message complexity, each replica processes O(n) messages. In other words, there are n+1 total invocations of RingBroadcast: one with the data message, plus n agreement messages (one per replica).
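The replica-side bookkeeping can be sketched as follows. This is a simplified, single-process illustration of our own (the name ReplicaState is ours), assuming an entry commits after n − f matching confirmations:

```python
from collections import defaultdict

class ReplicaState:
    """Tracks an entry from delivery of its data message to stability."""
    def __init__(self, n: int, f: int):
        self.n, self.f = n, f
        self.pending = {}                 # id -> entry, filled by data messages
        self.confirms = defaultdict(set)  # (sn, id, hash) -> confirming replicas
        self.stable_log = {}              # sn -> entry, the totally-ordered log

    def on_data(self, msg_id, entry):
        """RingDeliver of a data message: park the entry, keyed by its id."""
        self.pending[msg_id] = entry

    def on_agreement(self, sender, sn, msg_id, digest):
        """RingDeliver of an agreement message: count one confirmation."""
        self.confirms[(sn, msg_id, digest)].add(sender)
        if (len(self.confirms[(sn, msg_id, digest)]) >= self.n - self.f
                and msg_id in self.pending):
            self.stable_log[sn] = self.pending[msg_id]  # entry becomes stable
```

With n = 6 and f = 1, for instance, an entry becomes stable only after five matching agreement messages from distinct replicas.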

Figure 5: The unfolding of an Append(e) operation in . The client contacts replica , which broadcasts a data message. Once this message reaches the sequencer (replica ), this replica broadcasts an agreement message. Then all replicas broadcast their agreement message. Entry e becomes stable once a replica gathers at least matching agreement messages for e.

5.2 Reliable Broadcast in a Ring Topology

In a conventional ring-based broadcast, each replica expects its predecessor to forward each message it delivers [guerraoui10ring, jalili17ring]. Messages travel from every replica to that replica’s successor. Intuitively, this scheme is throughput-optimal because it balances the burden of data dissemination across all replicas [guerraoui10ring]; the downside, however, is that asynchrony (or a fault) at any replica can affect availability by impeding progress. To maintain high-throughput broadcast across the ring overlay despite asynchrony or faults, in we strengthen the ring so that each replica connects with f+1 predecessors in total. A replica has one default connection—with its immediate predecessor—and up to f fallback connections—with increasingly distant predecessors. The topology we obtain is essentially an (f+1)-connected graph. This graph ensures connectedness (availability) despite up to f faults.
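A sketch of the resulting overlay (our own construction, under the f+1-predecessor assumption): replica i keeps its immediate predecessor as the default connection, plus f increasingly distant fallbacks.

```python
def predecessors(i: int, n: int, f: int) -> list:
    """The f + 1 predecessors replica i connects to on a ring of n replicas:
    the immediate predecessor first, then increasingly distant fallbacks."""
    return [(i - d) % n for d in range(1, f + 2)]
```

With n = 10 and f = 2, replica 0 connects to replica 9 (default) and to replicas 8 and 7 (fallbacks); removing any two of them still leaves replica 0 reachable.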

In Figure 4 for instance, replica should obtain from replica all messages circulating in the system. If replica disrupts dissemination and drops messages, however, then replica can activate the fallback connection to replica . By default, communication on this fallback path is restricted to brief messages called state vectors, which replica periodically sends directly to replica .

More generally, replica expects a state vector from each of its fallback predecessors, i.e., from replicas . A state vector is a concise representation of all messages delivered by the corresponding fallback replica. If replica notices that its immediate predecessor is omitting messages (and that the state on the fallback replica is steadily growing), then it sends an activate message directly to replica , containing the state vector of replica .

Replica interprets the activate message by sending to replica all messages which it has delivered and which are not part of the received state vector. Thereafter, replica continues sending to any new messages it delivers. In the meantime, replica also continues forwarding messages as usual to its immediate successor, replica . If replica resumes correct behavior and forwards messages to , then replica can send a message to to deactivate the fallback path.

Alternatively, replica can request individual pieces of the state from , e.g., in case replica is selectively withholding messages from . Since every replica has f+1 connections in total, can tolerate up to f faults, regardless of whether these faults are successive on the ring or dispersed across the system. This mechanism based on fallback connections serves strictly to improve availability (i.e., delivery of broadcast messages) and relies on timeouts; it does not affect the safety of the protocol (Section 5.4).

The state vector at some replica is a vector of timestamps with one element per replica. The element at position j denotes the latest message broadcast by replica j which the local replica has delivered. Concretely, each such element is a timestamp, i.e., a logical counter attached to any message which a replica sends upon invoking RingBroadcast. For instance, whenever replica calls RingBroadcast(m), the broadcast layer tags message m with a unique id, in the form of a pair (sender, timestamp), where the timestamp is monotonically increasing and specific to the sender replica. (Just like sequence numbers, timestamps are dense [castro2002]: this prevents replicas from exhausting the space of these numbers and makes communication steps more predictable, which simplifies dealing with faulty behavior [aiyer05bar].) As we explained earlier, the id also plays an important role in the Append operation (Section 5.1).

When replica delivers a message with id from replica j via the RingDeliver callback, it updates its state vector to reflect the message’s timestamp at position j. State vectors are inspired by vector clocks [fid88times, matt88virtual]. The notable difference to vector clocks is that a replica does not increment its own timestamp when delivering or forwarding a message: this timestamp increments only when a replica sends a new message—typically a data or an agreement message—by invoking RingBroadcast. With state vectors, our goal is not to track causality or impose an order [guerraoui10ring], but to ensure that no messages are lost (i.e., reliability).
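A minimal sketch of these rules (our own code, following the semantics just described):

```python
class StateVector:
    """Per-replica vector of timestamps: slot j holds the latest timestamp
    delivered from replica j. Unlike a vector clock, the local slot advances
    only when this replica broadcasts a new message itself."""
    def __init__(self, n: int, me: int):
        self.v = [0] * n
        self.me = me

    def next_id(self):
        """Tag an outgoing RingBroadcast message with a fresh (sender, ts) id."""
        self.v[self.me] += 1
        return (self.me, self.v[self.me])

    def on_deliver(self, sender: int, ts: int) -> None:
        """Record a delivered message; delivering or forwarding does not
        bump our own slot."""
        self.v[sender] = max(self.v[sender], ts)
```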

One corner case that can appear in our protocol is a malicious sender replica attempting to sow confusion using incorrect timestamps. In particular, such a replica can attach the same timestamp to two different messages. Another bad pattern is skipping a step in the timestamp sequence. In practice, communication between any two replicas relies on FIFO links (e.g., TCP), so correct replicas can simply disallow—as a rule—gaps or duplicates in the messages they deliver.
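Over FIFO links, the rule a correct replica enforces is that each sender’s timestamps arrive as the exact sequence 1, 2, 3, …; a sketch of that check (ours):

```python
def accept_next(last_ts: dict, sender: int, ts: int) -> bool:
    """Accept a message from `sender` only if its timestamp is exactly the
    next expected one; reject gaps and duplicates outright."""
    if ts != last_ts.get(sender, 0) + 1:
        return False  # duplicate or skipped timestamp: misbehaving sender
    last_ts[sender] = ts
    return True
```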

FIFO links, however, do not entirely fix the earlier problem. It is possible that some replica delivers one message with timestamp t, while a different replica delivers another message with this same timestamp. Note, however, that this will not cause safety issues. The replicas can only agree on one of the two messages: when the sequencer proposes a sequence number for an id, it also includes a hash of the corresponding message. Even if the sequencer is incorrect and proposes sequence numbers for both messages (a poisonous write [mar06fab]), only one of the two can gather a quorum and become stable. To conclude, timestamps restrict acceptable behavior and—when coupled with the sender replica’s identity—provide a unique identifier for all messages circulating in the system, which helps ensure reliability.

5.3 Reconfiguration

The reconfiguration sub-protocol in ensures liveness when the sequencer replica misbehaves. This protocol is, informally, an adaptation of the FaB [mar06fab] recovery mechanism (which is designed for an individual instance of consensus) to state machine replication.

5.3.1 Preliminaries

Reconfiguration concerns the agreement algorithm in . Neither the data phase (of the Append operation) nor the broadcast layer need to change. We adjust the common-case protocol of Section 5.1 to accommodate reconfiguration as follows:

  1. Replicas no longer agree on just a sequence number for every Append operation. Instead, each instance of the agreement protocol for a given sequence number is tagged with a configuration number, so that replicas now agree on a (sequence number, configuration number) tuple; agreement messages carry this configuration number as well. The concept of proposal numbers in FaB [mar06fab] or that of view numbers in PBFT [castro2002] is analogous to configuration numbers in ; briefly, these serve the purpose of tracking the number of times the sequencer changes.

  2. We modify the hash in agreement messages to also include the configuration number.

  3. Our common-case protocol assumes that, for every sequence number , a replica only ever accepts (i.e., gives its vote for) a single agreement message, namely, the first valid agreement message it delivers for that from the sequencer. To account for configuration numbers, a replica is now allowed to change its mind and accept another agreement message if the sequencer changed (i.e., under a different configuration number).

Configuration numbers in start from . Every time the configuration number increases, the sequencer role changes deterministically, so that the new sequencer is the successor of the previous sequencer, in a round robin manner.
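The rotation rule can be written down directly (a sketch of our own; we assume configuration numbers start at 0 under some initial sequencer):

```python
def sequencer_for(config_nr: int, initial_sequencer: int, n: int) -> int:
    """Deterministic round-robin: each configuration change hands the
    sequencer role to the successor on the ring of n replicas."""
    return (initial_sequencer + config_nr) % n
```

Because every replica computes this locally, no extra agreement is needed on who the next sequencer is.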

We note that the ring-based dissemination algorithm (Section 5.2) that replicas employ during the common case requires no modifications. While reconfiguration executes (described next), however, replicas do not use the ring-based broadcast primitive. If reconfiguration is executing, the current sequencer is faulty and hence the system is making no progress. For this reason, we can temporarily forgo the high-throughput broadcast enabled by the ring topology and adopt instead a conventional all-to-all broadcast scheme, optimizing for latency.

5.3.2 Protocol

A correct replica enters the reconfiguration sub-protocol if any of the following conditions hold: (1) a timer expires at replica because the sequencer replica failed to create new agreement messages or a previous reconfiguration failed to complete in a timely manner, or (2) replica observes other replicas proposing reconfiguration.

To propose a reconfiguration and change the sequencer, replica broadcasts a reconfiguration message. Once it does so, replica starts ignoring any messages concerning configuration number or lower; until reconfiguration completes, it ignores all messages except those of type data, reconfiguration, or new-configuration (as we define them below). Replica also starts a timer to prevent a stalling reconfiguration.

The set included in a reconfiguration message contains agreement messages which replica signed for all the sequence numbers still pending at this replica. In other words, these are proposals for entries which are not part of the stable log at replica , because this replica gathered insufficient confirmations to mark the corresponding entry as stable.

The successor of the faulty (old) sequencer—namely, the replica at position on the ring, where —is set to become the sequencer when reconfiguration completes. This new sequencer waits until it gathers reconfiguration messages from replicas, including itself, and then broadcasts a new-configuration message. Here, represents the set of reconfiguration messages which the new sequencer gathered.

Reconfiguration completes at replica when it delivers the new-configuration message. When this happens, replica starts accepting agreement messages created by the new sequencer (i.e., the replica at position on the ring) and expects these messages to be tagged with configuration number . After reconfiguration completes, the new sequencer redoes the common-case agreement protocol for every individual sequence number found in an agreement message in the set .

For every sequence number appearing in , the new sequencer creates a new agreement message with the configuration number set to . When creating these new agreement messages, for every sequence number there are two cases to consider:

(1) If there exists an id that appears in an agreement message of , such that no other id appears in agreement messages (for the same sequence number) in , then the new sequencer chooses this id to be associated with in the new agreement message. The new sequencer also recomputes the hash as done in the common-case protocol, including the new configuration number . In FaB terminology [mar06fab], we say that the set of reconfiguration messages vouches for this id to be associated with .

(2) Alternatively, it can happen that more than one pair is vouched for by (or no such pair at all). This can occur, for instance, if the previous sequencer was malicious and proposed multiple different ids for the same sequence number. In this case, the new sequencer can choose to associate with any id that was previously proposed for , and it recomputes the hash as done in the common case. For every such new agreement message that the new sequencer creates, the common-case protocol executes as described in Section 5.1, accounting for the configuration number .
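The two cases reduce to a simple selection rule. A rough sketch (our simplification of the vouching logic, with a hypothetical helper name):

```python
def choose_id(vouched_ids: list):
    """Pick the id the new sequencer re-proposes for a pending sequence
    number, given the ids found for it across the reconfiguration set."""
    if not vouched_ids:
        return None                # nothing vouched: nothing to re-propose
    unique = set(vouched_ids)
    if len(unique) == 1:
        return vouched_ids[0]      # case (1): a single vouched id, keep it
    # case (2): the old sequencer equivocated; any previously proposed id
    # is safe to choose, so pick one deterministically.
    return min(unique)
```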

5.4 Correctness Arguments

Safety. provides three safety guarantees: (1) only values (i.e., entry ids) that are proposed by replicas can become stable; (2) per sequence number, only one value can become stable; (3) the same value cannot become stable at two different sequence numbers.

Recall that the agreement algorithm in is patterned after the FaB protocol. Specifically, we rely on the correctness of this protocol, so properties (1) and (2) follow from the safety properties of FaB (CS1 and CS2, respectively) [mar06fab, §5.4].

Property (3) follows from the fact that correct replicas index values by their identifier id (see Listing 1); hence each value is associated with a unique sequence number, and no correct replica will accept or confirm a value that was previously proposed for a different sequence number.

Liveness. Our system ensures the following liveness properties: (1) at any point in time, if there exists a value proposed by a correct replica which is not yet part of the stable log, then eventually some value (possibly a different one) will become stable; (2) if there are only a finite number of proposed values, then all of them eventually become stable.

(1) Assume there is a value (entry id) which is not part of the stable log yet. Let be the first sequence number for which no value was accepted. If the current sequencer is correct, then it will propose a unique value for sequence number (either value , or another one), and the replicas will reach agreement on that value.

If the replicas are not able to eventually reach agreement for sequence number , this means that the current sequencer is faulty, and they will trigger a reconfiguration. The new sequencer might be faulty as well, but after enough reconfigurations (at most ), a correct sequencer is chosen. This new sequencer will then propose a unique value for sequence number , which will be accepted by all correct replicas and appended to the stable log.

(2) We apply the first liveness property as many times as there are proposed values. Together with safety, this ensures that every operation that needs to be ordered eventually becomes stable.

6 Discussion & Related Work

Several strongly-consistent systems, including those based on chain- or ring-replication, have been designed assuming a synchronous system model with uniform replica and network characteristics [aublin15next700, knezevic12high, van04chain]. By using an efficient communication pattern (i.e., load-balancing the agreement algorithm), these systems exhibit the most graceful performance decay. These protocols seem to excel in synchronous models [guerraoui10ring]—and forfeit availability otherwise. In a WAN (or multi-datacenter) setting, uniformity or synchrony is unlikely. We introduce the protocol as a first step toward bridging the gap between good performance decay and good availability.

We remark on two concrete research directions to help further the goal of reconciling performance decay and availability in agreement protocols. First, it is appealing to combine broadcast with ring-based protocols in a single system, in the vein of Abstract [aublin15next700]. Such a system could provide higher BFT resilience, namely tolerate one-third faults, unlike , where resilience is one-fifth. Combining protocols typically yields intricate systems, however, and it is important to address the resulting complexity. Second, for predictable behavior, faulty nodes should be detected and evicted from the system in a timely manner. This is challenging in a Byzantine environment. Past approaches rely on proofs-of-misbehavior (which are useful for limited kinds of faults [kot07zyz]) or on incentives (which tend to be complex [aiyer05bar]), so new solutions or practical assumptions are needed.

Chain- or ring-replication is a well-studied scheme in SMR. Our baseline, , is a straightforward implementation of chain replication in the fail-stop model, shedding light on what is an ideal upper bound of SMR throughput in fault-free executions. In contrast to , previous approaches consider a model that does not tolerate Byzantine faults [amir05spread, and09fawn, jalili17ring], or they degrade to a broadcast algorithm to cope with such faults [aublin15next700, knezevic12high].

The technique of overlapping groups of chain replication on a ring topology, as employed in FAWN [and09fawn], is similar to basic ring-based replication. The insight is the same: instead of absorbing all client operations through a single node (a bottleneck), accept operations at multiple nodes. The most important distinction between FAWN and our ring-based protocol lies in the use of sharding. FAWN shards the application state, with each shard mapping to a chain replication group. As we mentioned earlier, sharding is a common workaround to scale SMR [bezerra2016strong, glenden11scatter, co13spanner]. In our protocols and the other systems we study, the goal is full replication. Even when sharding is employed, our findings remain valuable because they apply to the intra-shard protocol (which is typically an SMR instance).
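To make the baseline concrete, the following is a minimal sketch of fail-stop chain replication: writes enter at the head and propagate down the chain, and are acknowledged only once the tail applies them. All names here are illustrative, not taken from any of the systems discussed.

```python
class ChainReplica:
    """One replica in a fail-stop chain-replication chain."""

    def __init__(self, name):
        self.name = name
        self.store = {}    # replicated key-value state
        self.next = None   # successor in the chain (None at the tail)

    def write(self, key, value):
        # Apply locally, then forward to the successor; the tail's
        # acknowledgement means every replica in the chain has the update.
        self.store[key] = value
        if self.next is not None:
            return self.next.write(key, value)
        return "ack"

    def read(self, key):
        # Reads are served by the tail, which holds only fully
        # propagated (committed) updates.
        return self.store.get(key)


# Build a 3-replica chain: head -> middle -> tail.
head, mid, tail = ChainReplica("r1"), ChainReplica("r2"), ChainReplica("r3")
head.next, mid.next = mid, tail

assert head.write("x", 42) == "ack"
assert tail.read("x") == 42  # the update reached the tail
```

The single entry point (the head) is precisely the bottleneck that FAWN's overlapping chains, and ring-based replication generally, spread across multiple nodes.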

S-Paxos [bi12spaxos] decouples request dissemination from request ordering. This relieves the load at the leader and can increase throughput. We apply the same principle in our ring-based protocol by separating dissemination (ring-based broadcast for high throughput) from agreement (FaB). In contrast to our protocol, S-Paxos does not tolerate Byzantine faults.
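The dissemination/ordering split can be sketched as follows. Replicas first spread the (potentially large) request payloads among themselves; the agreement protocol then orders only compact request identifiers. The interfaces below are our own illustration, not S-Paxos's actual API.

```python
import hashlib


class Replica:
    def __init__(self):
        self.payloads = {}  # dissemination layer: request-id -> payload
        self.log = []       # agreement layer: ordered request ids

    def disseminate(self, payload: bytes) -> str:
        # Spread the full payload; derive a small, collision-resistant id.
        rid = hashlib.sha256(payload).hexdigest()[:16]
        self.payloads[rid] = payload
        return rid

    def order(self, rid: str):
        # Agreement runs over the short id, not the full payload,
        # so the ordering protocol (and its leader) carries less load.
        self.log.append(rid)

    def execute(self):
        # Execute requests in the agreed order.
        return [self.payloads[r] for r in self.log]


r = Replica()
rid = r.disseminate(b"PUT x=1")
r.order(rid)
assert r.execute() == [b"PUT x=1"]
```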

Building upon a conjecture of Lamport [lamp03lowerbounds], FaB [mar06fab] laid the fundamental groundwork for 2-step BFT consensus. The agreement algorithm in our ring-based protocol is a simplified FaB protocol (e.g., we use only one type of protocol message for reaching agreement), but the most important distinction from FaB is that our protocol employs a ring topology in the common case, eschewing the throughput bottleneck at the leader. We believe the FaB agreement algorithm combined with dissemination schemes from the chain- and ring-based families [guerraoui10ring, jalili17ring, van04chain] deserves more attention.

It was recently uncovered that a version of the FaB protocol is flawed: this version suffers from liveness issues, which can arise when a malicious leader engages in a poisonous write [abrah17revisiting]. The problem, however, applies only to the parameterized version of FaB [mar06fab], called PFaB in [abrah17revisiting]. PFaB is not the protocol we use, hence our protocol is not subject to the liveness problems of PFaB.
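For context on the resilience trade-off involved, the well-known replica-count bounds can be captured in a few lines: classic 3-phase BFT (e.g., PBFT) needs n ≥ 3f+1 replicas, 2-step agreement a la FaB needs n ≥ 5f+1, and the parameterized PFaB variant uses n = 3f+2t+1, committing in two steps only when at most t ≤ f replicas are faulty.

```python
def min_replicas_3phase(f: int) -> int:
    # Classic BFT consensus (e.g., PBFT): n >= 3f + 1.
    return 3 * f + 1


def min_replicas_fab(f: int) -> int:
    # FaB's 2-step (fast) consensus: n >= 5f + 1.
    return 5 * f + 1


def min_replicas_pfab(f: int, t: int) -> int:
    # Parameterized FaB: n = 3f + 2t + 1, fast only while at most
    # t replicas are faulty; requires t <= f.
    assert 0 <= t <= f
    return 3 * f + 2 * t + 1


assert min_replicas_3phase(1) == 4
assert min_replicas_fab(1) == 6       # fast agreement costs extra replicas
assert min_replicas_pfab(1, 0) == 4   # t = 0 degenerates to the classic bound
assert min_replicas_pfab(1, 1) == 6   # t = f matches the FaB bound
```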

It is difficult—if not impossible—to be comprehensive in studying agreement protocols, given the abundance of work on this topic. In choosing the five systems in this paper, our goal was to include solutions as diverse as possible in their design while also being broadly applicable. Many interesting agreement protocols exist that build on various assumptions to speed up performance and allay throughput decay, for instance by relying on correct client behavior as in Zyzzyva [kot07zyz], exploiting application semantics as in EPaxos [mora13epaxos], considering restricted data types as in the Q/U protocol [abd05quorumupdate], or leveraging client speculation [weste09latency].

Our stability experiments show that BFT SMR protocols, even ones as mature as BFT-Smart, are not as battle-tested as CFT protocols. Indeed, open-source implementations of BFT SMR are scarce, and the issue of non-fault-tolerant BFT protocols is known [clement09making]. In fact, it is a pleasant surprise for us that BFT-Smart is able to go through reconfiguration (despite faltering at larger system sizes) and to resume request execution after the leader fails.

An important application of Byzantine fault-tolerant SMR is in permissioned distributed ledgers [hyperledger, sou18byzantine]. In such an application, SMR can serve as an essential sub-system ensuring a total order across the entries in the ledger. There is a growing body of work dealing with the problem of scaling consensus for distributed ledgers, which we cover briefly.

SBFT is a recent BFT system showing impressive performance at large system sizes [gueta2018sbft]. To obtain a scalable solution, this system exploits modern cryptographic tools (plus other optimizations), a design dimension we did not explore in this work. Another practical method to scale agreement to a large set of nodes is to elect a small committee (or even a single node) and run the agreement protocol within this committee. This method appears in Algorand [gilad2017algorand], HoneyBadger [miller2016honey], ByzCoin [kogi16byzcoin], and Bitcoin-NG [eyal16bitcoinng], among others. Most research in this area seeks probabilistic guarantees, whereas we consider deterministic solutions. Nevertheless, our findings in this paper apply to most of these protocols, because the committees in these systems typically run a deterministic protocol, e.g., PBFT [castro2002, kogi16byzcoin].
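The committee approach can be sketched in a few lines: rather than running agreement among all n nodes, sample a small committee and run the deterministic protocol (e.g., PBFT) only inside it. The sampling function and seed below are placeholders for the shared randomness these systems actually derive cryptographically (e.g., from VRFs in Algorand).

```python
import random


def elect_committee(nodes, size, seed):
    # `seed` stands in for shared, bias-resistant randomness that all
    # nodes can compute; with the same seed, everyone elects the same
    # committee without further communication.
    rng = random.Random(seed)
    return rng.sample(nodes, size)


nodes = [f"node{i}" for i in range(100)]
committee = elect_committee(nodes, 10, seed=42)

assert len(committee) == 10
assert set(committee) <= set(nodes)
# Agreement cost now depends on the committee size, not on n = 100.
```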

A specific step that is common to all SMR protocols is the processing of protocol messages and client requests at each replica. Several approaches can optimize this step; all are orthogonal to our study and generally apply to any SMR system. Examples include optimistic execution to leverage multi-cores [kapri12eve] and hardware-assisted solutions [poke15dare], e.g., to speed up costly cryptographic computations, offload protocol processing [istva16consbox], or use a trusted module to increase resilience and simplify protocol design [chun07attested].

7 Conclusions

It is commonly believed that throughput in agreement protocols should degrade sharply with system size. The data supporting this belief is scarce, however, and simple extrapolation cannot tell the whole story. Consequently, in this paper we have studied empirically the performance decay of agreement in five SMR systems.

A positive takeaway from our study is that mature SMR implementations (ZooKeeper, etcd, and BFT-Smart) can sustain, out-of-the-box, 300 to 500 rps (requests per second) at 100 replicas. Their throughput decays rapidly at small scale (consistent with previous findings), but this decay dampens at larger sizes. These systems exhibit a non-linear (inversely proportional) decay rate. Throughput decays very slowly (linearly) in chain replication, which sustains on the order of 11K rps at 100 replicas. Our ring-based protocol comes within a small factor of this ideal (chain replication), reaching roughly 4K rps under Byzantine fault assumptions. This suggests that SMR can be effectively employed in mid-sized applications, e.g., up to hundreds of replicas, and that more attention should go to chain- and ring-based topologies for mitigating agreement protocols' performance decay.
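The two decay regimes can be illustrated with a back-of-the-envelope model of our own (not from the paper's measurements): a leader-based protocol is bottlenecked by the leader transmitting each request to n-1 followers, giving roughly C/(n-1) throughput, while chain replication pipelines updates so each replica handles a constant number of messages, giving a slow, roughly linear erosion. The capacity constants below are chosen only to land in the ballpark of the numbers reported above.

```python
def leader_based_tput(n, leader_capacity=30_000):
    # Leader sends each request to n - 1 followers, so throughput
    # decays inversely with system size.
    return leader_capacity / (n - 1)


def chain_tput(n, per_hop_cost=10.0, base=12_000):
    # Each replica forwards to a single successor; throughput erodes
    # slowly (linearly), dominated by pipeline latency.
    return base - per_hop_cost * n


# Inverse decay is sharp at small n, then flattens; linear decay is
# slow throughout.
assert leader_based_tput(5) > 5 * leader_based_tput(25)
assert chain_tput(100) / chain_tput(5) > 0.9
```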