Shallow Overlay Trees Suffice for High-Throughput Consensus

by Hao Tan, et al.
University of Waterloo

All-to-all data transmission is a typical data transmission pattern in blockchain systems. An optimization scheme that provides high-throughput, low-latency data transmission can significantly benefit the performance of those systems. In this work, we consider the problem of optimizing all-to-all data transmission in a wide area network (WAN) using overlay multicast. We prove that, in a congestion-free core network model, shallow broadcast trees with heights up to two are sufficient for all-to-all data transmission to achieve the optimal throughput allowed by the available network resources.




1. Introduction

Starting with Bitcoin’s great success (Bitcoin), blockchain systems have become a critical use case for high-performance consensus protocols. A practical blockchain system can involve hundreds to thousands of geographically distributed nodes. In many blockchain systems, all-to-all is the dominant communication pattern: each participating node has to receive all other nodes’ transactions to achieve complete decentralization in transaction processing. Also, the latency of committing a transaction is determined by when the sender receives acknowledgements from all other nodes or from a majority quorum of nodes. To enhance the performance of blockchain systems, this paper considers the following general problem: given a set of nodes connected by a WAN, with each node having a stream of data to broadcast to all other nodes, how can we maximize the aggregated broadcast throughput while minimizing the latency for each node’s data to reach all other nodes?

Previous work (SteinerTreePacking) has shown that optimizing multicast throughput alone is already an NP-hard problem under the constraints of network topology and link capacities. Efficient solutions usually rely on heuristics or advanced techniques such as network coding (NetworkCoding). However, those solutions are not practical in a WAN environment because the exact physical topology of a WAN is opaque. Instead of assuming a specific network topology for the WAN, we model the WAN as a set of sites connected by a core network with unlimited bandwidth. Every site can send data to and receive data from every other site, bottlenecked only by the edge link bandwidth between each site and the core network. This model is not only simple but also valid, as shown by recent measurements (InternetCongestion).

Moreover, overlay multicast has been proven to be a useful technique to provide high throughput data dissemination in a network without detailed topological information. To our surprise, when modelling the network as a congestion-free core and leveraging overlay multicast, it is possible for all-to-all data transmission to achieve both optimal aggregated throughput and low latency for data to reach all other nodes. The general idea is similar to (SplitStream; Bullet), which split the source stream at each node into multiple partitions and broadcast each partition with different overlay trees. We build on this prior work by proving that the optimal aggregated throughput is always achievable by using broadcast trees with heights up to two.

2. Preliminaries

Network Model. The network topology is a complete graph $G$ with $n$ vertices, one per site. Two functions $U(i)$ and $D(i)$ respectively define the uplink and downlink bandwidth of the edge link from site $i$ to the core network. The data transmission between two sites consumes the sender’s uplink bandwidth and the receiver’s downlink bandwidth. The uplink and the downlink bandwidth of a site are shared by all unicast data transmissions associated with the site. For instance, a sender multicasting to $k$ receivers at the rate of $r$ will consume $k \cdot r$ of the sender’s uplink bandwidth and $r$ of each receiver’s downlink bandwidth.

Rate of client data streams. A client data stream is an infinite sequence of data batches from clients to be broadcast to all other sites in the network. The rate $r_i$ of a client data stream represents the incoming rate of client data at site $i$. For instance, letting $q_i$ be the number of incoming client requests per second at site $i$ and letting $b$ be the size of the data payload of each request, the client data rate at site $i$ is $r_i = q_i \cdot b$. We assume that a site’s client data stream does not consume its downlink bandwidth.

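Sustainable Rates. For an all-to-all data transmission in $G$ with $n$ sites, each site $i$ is associated with a client data stream that must be received by all other sites. Client data streams with rates $r_1, \dots, r_n$ are said to be sustainable if the following three conditions, stated in terms of the uplink bandwidth $U(i)$ and downlink bandwidth $D(i)$ of each site, are all met:
(1) For each $i$, $r_i \le U(i)$;  (2) For each $i$, $\sum_{j \ne i} r_j \le D(i)$;  (3) $(n-1) \sum_{i=1}^{n} r_i \le \sum_{i=1}^{n} U(i)$.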

Intuitively, being sustainable is the minimum requirement for a set of client data streams to be broadcast at their incoming rates. Condition (1) ensures that each site has enough uplink bandwidth to send out its data at least once to other nodes. As each site has to receive from all other peers, condition (2) ensures that the aggregated rate of incoming streams does not exceed a site’s downlink bandwidth. Condition (3) derives from the fact that the client data at each site must be sent at least $n-1$ times. If any of the above conditions is violated, the aggregated throughput of all-to-all data transmission will be less than the sum of the client data rates, $\sum_{i=1}^{n} r_i$.
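The three conditions can be checked mechanically. The following is a minimal sketch, not part of the original presentation: the function name `is_sustainable`, the list-based inputs, and the 0-based site indexing are all assumptions made for illustration. It takes the client data rate, uplink bandwidth, and downlink bandwidth of each site and evaluates conditions (1) to (3) as described above.

```python
from typing import Sequence


def is_sustainable(rates: Sequence[float],
                   uplink: Sequence[float],
                   downlink: Sequence[float]) -> bool:
    """Check the three sustainability conditions (hypothetical helper).

    rates[i], uplink[i], downlink[i] describe site i (0-based here).
    """
    n = len(rates)
    total = sum(rates)

    # (1) each site can send its own stream out at least once
    cond1 = all(rates[i] <= uplink[i] for i in range(n))
    # (2) each site can receive the streams of all other sites
    cond2 = all(total - rates[i] <= downlink[i] for i in range(n))
    # (3) every stream must be sent n - 1 times in aggregate
    cond3 = (n - 1) * total <= sum(uplink)

    return cond1 and cond2 and cond3
```

For three sites with unit rates, uplinks of 2 and downlinks of 2 meet all three conditions with equality in (2) and (3), while lowering any single uplink to 1 violates condition (3).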

Partitioning Scheme. Assume the rate of a client data stream can be split at any granularity. A partitioning scheme of a client data stream $s$ with rate $r$ splits the elements of $s$ into $m$ streams with rates $r^{(1)}, \dots, r^{(m)}$ such that $\sum_{k=1}^{m} r^{(k)} = r$. We refer to each split of the stream as a sub-stream of $s$.

3. Main Results

Theorem 3.1.

For $n$ client data streams with sustainable rates $r_1, \dots, r_n$, there exists a partitioning scheme for each client data stream such that:

  1. Each sub-stream can be broadcast at its rate without violating downlink and uplink bandwidth constraints at any site.

  2. The height of each sub-stream’s broadcast tree is at most 2.

3.1. Proof Sketch

We prove Theorem 3.1 by constructing a partitioning scheme for each client data stream and associating each sub-stream with a broadcast tree of height at most two.

3.1.1. Constructing Sub-streams

Each client stream $s_i$ will be split into $n$ sub-streams $s_{i,1}, \dots, s_{i,n}$ with rates $r_{i,1}, \dots, r_{i,n}$. The data of the special sub-stream $s_{i,i}$ is sent directly from site $i$ to all the remaining sites. The data of sub-stream $s_{i,j}$ for $j \ne i$ is sent from site $i$ to site $j$ first, and then site $j$ will broadcast the data to the rest of the sites. All the broadcast trees defined above have a height bounded by two.
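The overlay trees described above can be written down explicitly. The sketch below is illustrative only: the helper names `broadcast_tree` and `tree_height` are hypothetical, and sites are numbered from 0 rather than 1. It builds the edge list of the tree used by the sub-stream that site `i` routes via site `j` (with `j == i` denoting the direct broadcast) and measures its height.

```python
from typing import Dict, List, Tuple

Edge = Tuple[int, int]  # (sender, receiver)


def broadcast_tree(i: int, j: int, n: int) -> List[Edge]:
    """Edges of the overlay tree for the sub-stream of site i routed via site j."""
    if i == j:
        # Special sub-stream: site i sends directly to every other site.
        return [(i, k) for k in range(n) if k != i]
    # Site i sends to relay j, which forwards to the remaining n - 2 sites.
    return [(i, j)] + [(j, k) for k in range(n) if k not in (i, j)]


def tree_height(edges: List[Edge], root: int) -> int:
    """Number of edges on the longest root-to-leaf path."""
    children: Dict[int, List[int]] = {}
    for sender, receiver in edges:
        children.setdefault(sender, []).append(receiver)

    def height(node: int) -> int:
        if node not in children:
            return 0
        return 1 + max(height(child) for child in children[node])

    return height(root)
```

In a network with four sites, `broadcast_tree(0, 0, 4)` is a star of height one rooted at site 0, and `broadcast_tree(0, 1, 4)` is a tree of height two in which site 1 relays to sites 2 and 3.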

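Input : client data rates $r_1, \dots, r_n$; residual uplink bandwidth $E_1, \dots, E_n$, where $E_j = U(j) - r_j$

Output : sub-stream rates $r_{i,j}$ for $1 \le i, j \le n$, all initially 0
1 for i ← 1 to n do
2       for j ← 1 to n do
3             if $E_j \ge (n-2)\,(r_i - \sum_k r_{i,k})$ then
4                   $r_{i,j} \leftarrow r_i - \sum_k r_{i,k}$;  $E_j \leftarrow E_j - (n-2)\, r_{i,j}$
5            else
6                   $r_{i,j} \leftarrow E_j / (n-2)$;  $E_j \leftarrow 0$
7             if $\sum_k r_{i,k} = r_i$ then
8                   break
Algorithm 1 Sub-stream rate assigning algorithm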

3.1.2. Computing Sub-stream Rates

Algorithm 1 computes the rate of each sub-stream defined in the previous section. For each node $i$, we divide its uplink bandwidth $U(i)$ into two parts: $R_i$ and $E_i$. $R_i = r_i$ represents the reserved uplink bandwidth for $i$ to send out all its data at least once; $E_i = U(i) - r_i$ is the residual uplink bandwidth after excluding the reserved uplink bandwidth. The algorithm iterates over all site pairs $(i, j)$ in lexicographical order to compute the sub-stream rates. In the iteration in which sub-stream rate $r_{i,j}$ is computed, the algorithm greedily allocates as much of $E_j$ as possible to the sub-stream that $i$ routes via $j$, until either $E_j$ is exhausted or the aggregated sub-stream rate $\sum_k r_{i,k}$ reaches $r_i$. According to the overlay trees defined in the previous section, sending this sub-stream consumes $r_{i,j}$ of $R_i$ and $(n-2)\, r_{i,j}$ of $E_j$. This rule also applies to the case $i = j$, where sending the direct sub-stream consumes $(n-2)\, r_{i,i}$ of $E_i$ in addition to $r_{i,i}$ of $R_i$. Note that Algorithm 1 does not aim to compute the optimal partitioning scheme, which would favour one-level overlay trees.
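The greedy allocation described above can be sketched in Python. This is an illustrative reading of the prose, not the authors' implementation: the function name `assign_substream_rates`, the 0-based site indices, the use of `fractions.Fraction` for exact arithmetic, and the error handling are all assumptions. Each relay forwards a sub-stream to the `n - 2` sites other than itself and the source, so one unit of sub-stream rate costs `n - 2` units of the relay's residual uplink bandwidth.

```python
from fractions import Fraction
from typing import List, Sequence


def assign_substream_rates(rates: Sequence[int],
                           uplink: Sequence[int]) -> List[List[Fraction]]:
    """Greedy sub-stream rate assignment (sketch; sites are 0-based).

    Returns sub[i][j], the rate of the sub-stream of site i routed via site j.
    """
    n = len(rates)
    if n < 3:
        raise ValueError("this sketch assumes at least three sites")
    if any(rates[i] > uplink[i] for i in range(n)):
        raise ValueError("a site cannot send its own stream even once")

    fanout = n - 2  # copies a relay sends per unit of sub-stream rate
    # Residual uplink: what is left after reserving rates[j] for own data.
    residual = [Fraction(uplink[j]) - Fraction(rates[j]) for j in range(n)]
    sub = [[Fraction(0)] * n for _ in range(n)]

    for i in range(n):                    # site pairs in lexicographical order
        remaining = Fraction(rates[i])    # part of stream i not yet assigned
        for j in range(n):
            # Take as much of relay j's residual uplink as is still needed.
            share = min(remaining, residual[j] / fanout)
            sub[i][j] = share
            residual[j] -= fanout * share
            remaining -= share
            if remaining == 0:
                break
        if remaining != 0:
            raise ValueError("rates are not sustainable for these uplinks")
    return sub


# Site 0 has all the spare uplink, so every stream is relayed through it.
example = assign_substream_rates([1, 1, 1], [4, 1, 1])
```

Taking the minimum of the remaining rate and the affordable share merges the two branches of the greedy rule into one expression; the resulting rates are the same as handling the two cases separately.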

4. Conclusion and Discussion

According to Theorem 3.1, by leveraging overlay multicast, all-to-all data transmission can achieve the best possible throughput without paying an expensive price in latency. This result provides a theoretical foundation for ruling out deep overlay trees with heights greater than two when optimizing all-to-all data transmission for applications such as blockchains and consensus protocols. Although Theorem 3.1 requires the rates of incoming data to be sustainable, we can resort to a two-phase optimization when dealing with client data rates that are not sustainable. The first phase computes the optimal sustainable rates based on available network resources. The second phase computes the optimal combination of overlay trees that yields the lowest latency given the sustainable rates obtained in the first phase.


Appendix A Pictures

Figure 1. Visualization of algorithm 1: This figure demonstrates an example with three sites with uplink bandwidth , , . Let client data stream rates be , , . After the first iteration, , and . After the second iteration, , and . After the final iteration, , and .
Figure 2. Network Model: An illustration of the network model used by this work
Figure 3. Overlay Tree Examples: This figure illustrates overlay trees constructed using the method described in section 3.1.1. In a network with four sites, the tree on the left is the overlay tree of while the tree on the right is the overlay tree of .

Appendix B Proofs

B.1. Correctness Criteria

The output of Algorithm 1 is a set of sub-stream rates $\{r_{i,j}\}$. A correct output satisfies the following three criteria for all sites $i$:

  1. Valid Partition Constraint: The aggregated rate of all sub-streams of a client data stream equals that client data stream’s rate, which is equivalent to $\sum_{j=1}^{n} r_{i,j} = r_i$ for all $i$ such that $1 \le i \le n$.

  2. Uplink Capacity Constraint: The aggregated rate of all sub-streams sent by site $i$ is less than or equal to its uplink bandwidth $U(i)$.

  3. Downlink Capacity Constraint: The aggregated rate of all sub-streams received by site $i$ is less than or equal to its downlink bandwidth $D(i)$.
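The three criteria can be tested directly on a candidate set of sub-stream rates. The following is a minimal sketch under stated assumptions: the function name `check_output` and the matrix layout `sub[i][j]` (rate of the sub-stream of site `i` routed via site `j`, with 0-based indices) are hypothetical, and the uplink accounting follows the overlay trees of Section 3.1.1, where a site sends each of its own sub-streams once and forwards every sub-stream routed via it to `n - 2` further sites. Exact (integer or rational) rates are assumed so that equality tests are meaningful.

```python
from typing import Sequence, Tuple


def check_output(sub: Sequence[Sequence[int]],
                 rates: Sequence[int],
                 uplink: Sequence[int],
                 downlink: Sequence[int]) -> Tuple[bool, bool, bool]:
    """Return (valid_partition, uplink_ok, downlink_ok) for sub-stream rates."""
    n = len(rates)
    # Aggregated rate of the sub-streams of each site.
    row = [sum(sub[i]) for i in range(n)]

    # 1. Sub-stream rates of site i add up to its client data rate.
    valid_partition = all(row[i] == rates[i] for i in range(n))

    # 2. Uplink used by site k: one copy of each own sub-stream, plus
    #    n - 2 forwarded copies of every sub-stream routed via k.
    uplink_ok = all(
        row[k] + (n - 2) * sum(sub[i][k] for i in range(n)) <= uplink[k]
        for k in range(n)
    )

    # 3. Each site receives every other site's sub-streams exactly once.
    total = sum(row)
    downlink_ok = all(total - row[k] <= downlink[k] for k in range(n))

    return valid_partition, uplink_ok, downlink_ok
```

For three sites with unit rates that all route through site 0, that is `sub = [[1, 0, 0], [1, 0, 0], [1, 0, 0]]`, the check passes with uplinks `[4, 1, 1]` and downlinks `[2, 2, 2]`, and the uplink criterion fails if site 0 only has an uplink of 3.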

B.2. Correctness of Algorithm 1

Let represent the value of at the start of iteration of the outer loop

Proposition B.1.

, if , then at the end of iteration of outer loop.


This proposition can be proved by contradiction. Assume and at the end of iteration of the outer loop. If that is the case, line 1 is never executed; otherwise, becomes 0 at line 1 and the inner loop terminates with . At the start of the inner loop’s last iteration, . Since line 1 is executed at every iteration of the inner loop, we have and at the start of the inner loop’s last iteration. Because , implies , which contradicts the assumption. ∎

Proposition B.2.


This proposition can be proved by induction on .
Base case : . According to condition (3) of sustainable rates, .
Induction step: For an arbitrary number such that , assume the proposition B.2 holds for . We have:


As a result of proposition B.1, at the end of the outer loop’s iteration . Since line 1 is executed at every iteration of the inner loop, we have for all i such that . By summing up all , and . Therefore, by subtracting from both sides of inequality 1, we have . ∎

Proposition B.3.

The output of Algorithm 1 satisfies the Valid Partition Constraint.


This proposition holds as the direct outcome of proposition B.1 and proposition B.2. ∎

Lemma B.4.

for all such that


We prove this lemma by induction on . Base case : By line 1, we have for all i such that , according to condition (1) of sustainable rates, holds for all i such that . Induction step: For an arbitrary number such that , assume . Line 1 and line 1 will guarantee and according to line 1. As a result, . ∎

Proposition B.5.

The output of Algorithm 1 satisfies the Uplink Capacity Constraint.


From lemma B.4, we have throughout the execution of algorithm 1 for all such that . According to the overlay defined in section 3.1.1, sending consumes of and is consumed only by sending ’s sub-streams. By proposition B.3, for all such that . Since equals , sending all of ’s sub-streams will consume exactly the amount of its reserved uplink bandwidth. Because , no uplink bandwidth constraint is violated. ∎

Proposition B.6.

The output of Algorithm 1 satisfies the Downlink Capacity Constraint.


Since every sub-stream is broadcast by an overlay tree, each site receives all other sites’ data exactly once. According to condition (2) of sustainable rates, no downlink bandwidth constraint is violated. ∎