# Latency, Capacity, and Distributed MST

Consider the problem of building a minimum-weight spanning tree for a given graph G. In this paper, we study the cost of distributed MST construction where each edge has a latency and a capacity, along with the weight. Edge latencies capture the delay on the links of the communication network, while capacity captures their throughput (in this case the rate at which messages can be sent). Depending on how the edge latencies relate to the edge weights, we provide several tight bounds on the time required to construct an MST. When there is no correlation between the latencies and the weights, we show that (unlike the sub-linear time algorithms in the standard CONGEST model, on small diameter graphs), the best time complexity that can be achieved is Θ̃(D+n/c), where edges have capacity c and D refers to the latency diameter of the graph. However, if we restrict all edges to have equal latency ℓ and capacity c, we give an algorithm that constructs an MST in Õ(D + √(nℓ/c)) time. Next, we consider the case where latencies are exactly equal to the weights. Here we show that, perhaps surprisingly, the bottleneck parameter in determining the running time of an algorithm is the total weight W of the constructed MST by showing a tight bound of Θ̃(D + √(W/c)). In each case, we provide matching lower bounds.


## 1 Introduction

Construction of a minimum-weight spanning tree (MST) is one of the most fundamental problems in distributed computing, and has been extensively studied (see [7, 2, 1, 25, 8, 15, 23, 5, 4, 12, 11, 19, 6, 16] and references therein).

Much of this existing literature deals with the standard CONGEST model of communication [22], where all edges are identical and in every round nodes can communicate with all their neighbors via O(log n)-sized messages. In contrast, most real-world network connections are not identical, with different links having different latencies (which may depend on distance, congestion, router speed, etc.). Lower latency means faster packet delivery.

Latency is not the only parameter that matters; if you talk to any networking expert, they will also ask about the throughput of a communication link: how fast can you push data? Here, we describe that as the capacity of a link, which we indicate as a fraction in (0, 1]. If a link has capacity 1, then you can send a new packet in every time step (even if the earlier ones have not arrived yet). If a link has capacity 1/10, then you can only send a new packet every 10 rounds.

Notice that determining the best way for two nodes to communicate can be tricky. If they only have one packet to exchange, you want to find the path with minimum latency (though such a path may have more hops). If they have a stream of packets to deliver, then you may want to find a path with high capacity. And if you want to minimize the message complexity, then you might want a path that minimizes the number of hops. Trying to simultaneously optimize these different parameters is a challenge!

In this paper, we study the problem of constructing an MST on graphs having edge latencies and capacity, giving algorithms and lower bounds for a variety of different cases.

Weighted CONGEST Model.  In more detail, the network is modeled as a connected, undirected graph G = (V, E) with n nodes and m edges. Each edge represents a bi-directional synchronous communication channel that has three symmetric attributes associated with it: latency, capacity, and weight. The weight provides the parameter over which we build an MST.

If an edge (u, v) has capacity c, it implies that u can send a new message to v only once in every 1/c rounds, i.e., if u sent a message to v in round r, then it can send the next message to v only in round r + 1/c. For simplicity, we assume that the rate at which data can be sent remains constant throughout the network, i.e., all edges have the same capacity c.

If an edge (u, v) has latency ℓ(u, v), it implies that it requires ℓ(u, v) rounds for a message to be sent from u to v (or vice versa). We assume that each edge’s latency and weight are integers. (If not, they can be scaled and rounded to the nearest integer.) Let ℓ_min be the minimum latency of any edge of the given graph G. We assume that c ≥ 1/ℓ_min. This ensures that, if there are no messages in transit over an edge, a node is allowed to send a message over that edge. (It does not make sense for an edge to be blocked longer than its latency, so an edge of latency ℓ should have capacity at least 1/ℓ.)
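As an illustration of these two rules (our own sketch, not the paper's formalism), a single channel can be modeled as follows; the class name `Edge` and its fields are ours:

```python
from heapq import heappush

class Edge:
    """Sketch of one weighted-CONGEST channel: each delivery takes
    `latency` rounds, and capacity c means a new message may be
    injected only once every 1/c rounds."""

    def __init__(self, latency, capacity):
        # model assumption from the text: c >= 1/l, so an edge is never
        # blocked for longer than its own latency
        assert capacity >= 1.0 / latency
        self.latency = latency
        self.gap = round(1.0 / capacity)  # rounds between consecutive sends
        self.next_free = 0                # earliest round a send is legal
        self.in_flight = []               # (arrival_round, message)

    def send(self, round_no, msg):
        """Inject msg at the earliest legal round >= round_no; return it."""
        start = max(round_no, self.next_free)
        self.next_free = start + self.gap
        heappush(self.in_flight, (start + self.latency, msg))
        return start

# A capacity-1/10 edge admits a new packet every 10 rounds, even while
# earlier packets (latency 30) are still in transit.
e = Edge(latency=30, capacity=0.1)
sends = [e.send(0, i) for i in range(3)]
print(sends)  # [0, 10, 20]
```

Note that the three packets are pipelined: the second leaves in round 10, well before the first arrives in round 30.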

Nodes know the value of n and have unique IDs. Nodes can send O(log n)-bit messages to all their neighbors in a particular round. Nodes also know the latency, capacity, and weight of their adjacent edges; however, they are not aware of the IDs of their neighbors (as in the KT0 model of computation). The latency diameter D of the graph refers to the graph diameter when distances are measured with respect to latencies. Any reference to the diameter means the latency diameter (unless otherwise mentioned, e.g., the hop diameter of the shortest path tree).

Distributed Minimum Spanning Tree (MST) Construction.  Given a connected, edge-weighted undirected graph G with latencies and capacity c, the goal is to determine a set of edges that form a spanning tree of minimum weight. At the end of the distributed MST construction protocol, each node knows its own part of the output, e.g., which of its adjacent edges are in the computed MST.

Results.  In this paper, we introduce the weighted CONGEST model with edge latencies and capacities that closely mimic real-world communication. We study the effects of latency and capacity in determining the time required for constructing an MST. Depending on how the edge latencies relate to the weights, we provide several tight bounds on the time required to construct an MST.

We start by considering the case where there is no correlation between latencies and weights. In the standard CONGEST model, an MST can be constructed in Õ(D_G + √n) time, where D_G refers to the graph diameter without latencies. However, in a network with latencies, we show that sub-linear time MST construction is impossible. Specifically, we give a lower bound of Ω̃(D + n/c) rounds for constructing an MST, where D refers to the latency diameter. Correspondingly, we also give an algorithm that constructs an MST in Õ(D + n/c) rounds and with Õ(m + n²) messages.

A natural special case is where all edges have equal latency ℓ. We give a simultaneously time and message optimal algorithm (derived from [6]) that constructs an MST in Õ(D + √(nℓ/c)) time and with Õ(m) messages. This is faster than the expected Õ(ℓ·(D_G + √n)) bound (achieved by scaling up the edge latencies from 1 to ℓ in the standard CONGEST model), and this speed-up is achieved by exploiting the edge capacity through pipelining of messages.

Next, we consider the case where edge latencies are exactly equal to edge weights. Surprisingly, here the key parameter determining the delay due to congestion is the total weight W of the constructed MST (rather than the total number of nodes in the graph). We show a lower bound of Ω̃(D + √(W/c)), and correspondingly, we also give an algorithm that constructs an MST in Õ(D + √(W/c)) rounds. The algorithm's message complexity additionally depends on the hop diameter of the shortest path tree used for aggregation. Additionally, as part of the lower bound proof, we provide a simulation that relates the running time of an algorithm in this model with that in the standard CONGEST model (cf. Lemma 5).

Challenges.  There are two basic challenges that arise in designing MST algorithms for networks with latencies and capacities. First, there may be many edges that are just too expensive to use, and a node will never even know the identity or status of its neighbors on the other side of these edges. Moreover, it may not be clear in advance which edges are too expensive to use, as that depends on various parameters, e.g., D or W. For example, when a node is trying to find a minimum weight outgoing edge of a component, it may never be able to find out whether a neighbor is in the same connected component. Or as another example, our MST algorithms rely on collecting information on BFS/shortest path trees; yet in constructing the BFS tree, there are some edges that cannot be used. How does a node know when the construction is complete? Throughout our protocols, we must carefully coordinate the exploration of edges to avoid using expensive edges and to compensate for unknown information.

Second, a key insight in existing distributed MST algorithms is balancing the cost between local aggregation (e.g., within connected components or fragments) and global aggregation (e.g., using a BFS or shortest path tree). This balance is no longer as simple to determine, as it depends on various unknown parameters of the graph, e.g., D and W. Our algorithms have to determine, on the fly, the best point to switch between different aggregation methods. Moreover, it is no longer the case that the same tree is good for both minimizing latency and message complexity. This makes the balancing problem even more difficult, if we want to maintain reasonable message complexity. A related problem shows up in the initial construction of the BFS/shortest path tree. In a model with unit-cost edges, there are a variety of strategies for electing a leader and using it to initiate a shortest-path tree (even with good message complexity [14]). However, when links have latencies, this becomes non-trivial, and we rely on a simple randomized strategy.

Summary of our Contributions.  Given an n-node graph with latency diameter D and capacity c, and assuming there is no correlation between the edge latencies and the edge weights, we show that there exists a distributed algorithm that constructs an MST w.h.p. and takes either:

1. Õ(D + √(nℓ/c)) rounds, if all edges have a uniform latency of ℓ, or

2. Õ(D + n/c) rounds, if edge weights and edge latencies can vary arbitrarily.

For the case where edge latencies correspond exactly to the edge weights, we show that there exists a distributed algorithm that creates an MST w.h.p. in:

1. Õ(D + √(W/c)) rounds, where W is the total weight of the MST.

We complement our results by showing lower bounds in terms of the edge capacity and the latency diameter. To show this we provide a simulation that relates the running time of an algorithm in the weighted CONGEST model with that in the standard CONGEST model. We prove that if there is no correlation between the edge latencies and the edge weights, there exist:

1. graphs that require Ω̃(D + n/c) rounds to construct an MST.

If edge latencies exactly correspond to the edge weights, we show that there exist graphs with:

1. diameter D that require Ω̃(D + √(W/c)) rounds to construct an MST.

Prior Work.  The problem of distributed computation of MST was first proposed in the seminal paper of Gallager, Humblet, and Spira [7], which presented a distributed algorithm for MST construction in O(n log n) rounds and with O(m + n log n) messages. The time complexity was further improved in [2] and subsequently to an existentially optimal O(n) by Awerbuch [1]. The existential optimality implies the existence of graphs for which O(n) is the best possible time complexity achievable. These (and many subsequent results, including this paper) are based on a distributed variant of the (sequential) algorithm of Borůvka [17].

In a pioneering work [8], Garay, Kutten, and Peleg showed that the parameter that best describes the cost of constructing an MST is the graph diameter D_G, rather than the total number of nodes n. Here D_G refers to the diameter of a graph (without latencies). For graphs with sub-linear diameter, they gave the first sub-linear distributed MST construction algorithm, requiring O(D_G + n^0.614) rounds. This was further improved to Õ(D_G + √n) rounds and O(m + n^1.5) messages by Kutten and Peleg [15]. Shortly thereafter, Peleg and Rubinovich [23] showed that Ω̃(√n) time is required by any distributed MST construction algorithm, even on networks of small diameter (D_G = O(log n)), establishing the asymptotic near-tight optimality of the algorithm of [15]. Consequently, the same lower bound of Ω̃(√n) was shown for randomized (Monte Carlo) and approximation algorithms as well [24, 5].

The message complexity lower bound of Ω(m) was first established by Awerbuch [1] for deterministic and comparison-based randomized algorithms. In [14], Kutten et al. show that the lower bound holds for any algorithm in the KT0 model of communication, where, at the beginning, a node does not know the IDs of its neighbors. However, for general randomized algorithms, if the nodes are aware of the IDs of their neighbors at the beginning (KT1 model), the message complexity lower bound does not hold. In fact, in [12], King, Kutten, and Thorup give an MST construction algorithm with a message complexity of only Õ(n); however, this comes at the expense of a time complexity of Õ(n). For asynchronous networks, Mashreghi and King [16] give an algorithm that computes an MST using only o(m) messages.

More recently, for the KT0 model, Pandurangan et al. [19] provide a randomized MST construction algorithm with time complexity Õ(D_G + √n) and message complexity Õ(m), which is simultaneously time and message optimal. Elkin [6], Haeupler et al. [10], and Ghaffari and Kuhn [9] have since provided improved deterministic algorithms that achieve the same time and message complexity (with improvements in the logarithmic factors).

There has been some recent work by Sourav, Robinson, and Gilbert [26] on the impact of latencies in distributed algorithms. They looked at the problem of gossip, and developed a notion of weighted conductance that captured the connectivity of a graph with latencies. They used this to analyze the cost of information dissemination in such graphs.

## 2 Uncorrelated Weights and Latencies

In this section, we consider the case when there is no relationship between the edge weights and the latencies, and either can take arbitrary values. We show that, unlike the Θ̃(D_G + √n) tight bound for MST construction on graphs without latencies or capacities (where D_G refers to the diameter without latencies), the best that can be achieved in this case is Θ̃(D + n/c).

### 2.1 Lower Bound

We now present a construction (cf. Figure 1) that shows a lower bound on the time complexity of computing an MST when edge weights are independent of latencies. The construction comprises Θ(n) parallel gadgets, each a series of four vertices with three edges that we refer to as the left, middle, and right edges, respectively. Each of the left and right edges has weight either 1 or 2, chosen uniformly and independently at random, while all the middle edges have weight 1. All the middle edges have very high latencies of Θ(n/c), whereas all other edges have latency 1. All edges uniformly have capacity c. It is clear that, for each gadget, we must include either the left edge or the right edge in the MST, based on which of the two has the higher weight. This is Θ(n) bits of entropy, which means that Ω(n) bits must cross over from right to left or left to right. This would take Ω(n/c) time if we use even one of the middle edges, so a better approach would be to pipeline the information through the low-latency edge at the bottom; but since the capacity is c, it still requires at least Ω(n/c) time for the information to pass through that edge in either direction, even though it has latency 1.
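For intuition only, the gadget family can be generated as follows (a sketch under our reading of the construction; the exact wiring of Figure 1 is not reproduced here, and all names are our own):

```python
import random

def build_gadgets(k, c, seed=0):
    """Generate k four-vertex gadgets as (u, v, weight, latency) tuples.

    Left/right edge weights are drawn independently from {1, 2}; middle
    edges get weight 1 and a very high latency (Theta(n/c) in the proof),
    so the ~k random left-vs-right bits must instead cross the single
    low-latency bottom edge, whose capacity c throttles them to
    Omega(k/c) rounds."""
    random.seed(seed)
    n = 4 * k
    high = int(n / c)  # "very high" middle-edge latency
    edges = []
    for i in range(k):
        a, b, d, e = (f"g{i}_{j}" for j in range(4))
        edges.append((a, b, random.choice((1, 2)), 1))  # left edge
        edges.append((b, d, 1, high))                   # middle edge
        edges.append((d, e, random.choice((1, 2)), 1))  # right edge
    return edges

edges = build_gadgets(8, c=0.5)
```

Each gadget whose left and right weights differ forces one bit of information across the network, which is where the Ω(n/c) term comes from.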

Moreover, it is a simple exercise to show that Ω(D) time is required even when the weight of each edge is its latency. Consider the network that is a ring in which all but two edges have latency 1, and the remaining two edges e₁ and e₂ are positioned diametrically opposite each other and are assigned latencies x and y, respectively, where x and y are random integers from, say, {n, …, 2n}. Recall that the weight of each edge equals its latency. Clearly, any MST algorithm must take Ω(D) time to determine whether e₁ or e₂ must be in the MST. Moreover, the bound holds for all values of D, as long as x and y can be suitably adjusted to be O(D). This implies the following result:

###### Theorem 1.

Any algorithm (deterministic or randomized) for computing the MST of a network in which the edge weights and the latencies are independent of each other requires Ω(D + n/c) time.

### 2.2 Upper Bound

In this section, we provide an Õ(D + n/c)-time algorithm for constructing an MST when there is no correlation between an edge’s latency and its weight. The algorithm is based on the pipeline algorithm [21, 20] for the standard CONGEST model, in which nodes upcast adjacent edge information to the root of a BFS tree. After receiving sufficient information, the root computes the MST and broadcasts the result to all. However, if this aggregation required information regarding all the edges to reach the root, the running time could be Θ(m) due to congestion. The key idea here (and in the pipeline algorithm) is to upcast while filtering non-required edges, such that the amount of information sent over any tree edge is reduced to O(n) edges, thereby reducing the running time.

Notice that, with arbitrary edge latencies, a hop-optimal solution no longer implies a cost-optimal solution. For example, the latency diameter of a BFS tree might be much greater than the latency diameter of the graph, making a BFS tree unsuitable for algorithms requiring optimal time complexity. Therefore, here we use a shortest path tree rather than a BFS tree. However, constructing a shortest path tree deterministically with arbitrary latencies is also non-trivial, especially if we want an algorithm with low message complexity. Additionally, due to the lack of synchrony in message arrivals, another challenge while upcasting is to ensure that each node has all the required information to determine the correct edge to upcast.

MST Algorithm for Arbitrary Weights and Latencies.

The basic outline of the algorithm is as follows. First, a particular node elects itself as the leader. Next, a shortest path tree w.r.t. latency is created with the leader as the root node. Nodes upcast information, starting from the leaf nodes while filtering non-essential information. The root computes the MST and broadcasts it to all the nodes.

Shortest Path Tree Construction and Leader Election.  To determine a shortest path tree rooted at some node, we use a simple randomized flooding mechanism: Initially, each node becomes active with probability Θ((log n)/n), and if it is active, it forms the root of a shortest path tree by entering the exploration phase. Then, each active node broadcasts a join message carrying its ID to its neighbors, who in turn propagate this message to their neighbors, and so on. The tree construction cannot wait to terminate until every edge is explored; instead, a counting mechanism is used to determine when the tree is spanning. Therefore, each root node r sends out a count message (carrying its ID) in round 2^i, for each 1 ≤ i ≤ α, until r exits the exploration phase, where α is an integer chosen large enough that 2^α upper-bounds the exploration time. The count messages propagate through r’s (current) tree until they reach the leaf nodes, who initiate a convergecast back to the root, each contributing a count of 1. When a node receives the convergecast from all of its children, it forwards the accumulated count (plus one for itself) to its parent in the shortest path tree.

Since multiple nodes are likely to become active and start this process, eventually a node u will receive join messages originating from distinct root nodes. In that case, u joins the shortest path tree rooted at the node with the maximal ID. If it has already joined some other shortest path tree rooted at a node r′ with a smaller ID, it simply stops participating in that tree and responds to messages from that tree by sending a disband reply carrying r′’s ID. A disband message propagates all the way to the root r′, who in turn becomes inactive and exits the exploration phase.

If an active node is still in the exploration phase when it receives a count message carrying a count of n, it stops exploring and broadcasts a done message through its tree.
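The counting step can be illustrated with a short sequential sketch (ours, not the paper's pseudocode), where the tree is given as a parent-to-children map:

```python
def convergecast_count(children, v):
    """Return the size of the subtree rooted at v.

    Mirrors the count convergecast: each leaf reports a count of 1, and
    every internal node forwards the accumulated count of its subtree
    (plus one for itself) to its parent."""
    return 1 + sum(convergecast_count(children, u) for u in children.get(v, []))

# Root 'r' keeps re-issuing count messages and stops exploring once the
# returned count equals n, i.e., its tree spans all nodes.
children = {'r': ['a', 'b'], 'a': ['c', 'd'], 'b': ['e']}
print(convergecast_count(children, 'r'))  # 6
```

In the distributed protocol this recursion is unrolled into messages that travel down and back up the tree, so one counting attempt costs time proportional to the tree's latency depth.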

###### Lemma 2.

In the weighted CONGEST model, when edges have arbitrary latencies, there exists an algorithm to elect a leader r and construct a shortest path tree rooted at r in O(D) time, using Õ(m) messages, with high probability.

###### Proof.

The time complexity depends on the time until every node has exited the exploration phase. Observe that the active node r_max with the maximum ID will have integrated all nodes into its shortest path tree within O(D) time and, by the description of the algorithm, once a node joins r_max’s tree, it does not leave it. Moreover, r_max becomes aware that its tree has included all nodes within O(D) additional rounds, due to the count messages.

To see that the message complexity bound holds, observe that, with high probability, there are O(log n) active nodes, and each active node may initiate the construction of a shortest path tree, at a cost of O(m) messages per tree. ∎

To determine the MST, the leaf nodes start by upcasting their adjacent edges in non-decreasing order of weight. Intermediate nodes begin only after having received at least one message from each of their children. From the set of all the edges received until the current round (along with its own adjacent edges), an intermediate node filters and upcasts only the lightest edges that do not create a cycle. Notice that, for any intermediate node, after it receives the first message from all of its children, all subsequent messages (at most n − 1 from each child) arrive in a pipelined manner with an interval of 1/c (as edges have capacity c). Moreover, as the intermediate nodes start upcasting immediately after receiving the first message from each of their children, they also send at most n − 1 messages up in a pipelined fashion, while filtering out the heavier cycle edges. Waiting for at least one message from each child in the shortest path tree, together with the fact that messages are always upcast in non-decreasing order, ensures that in every round (after receiving at least one message from each child), nodes have sufficient data to upcast the lightest edge.

To identify edges that form a cycle, all nodes except the root maintain two edge lists, Q and U. Initially, for a vertex v, Q_v contains all the edges adjacent to v, and U_v is empty. At the time of upcast, v determines the minimum-weight edge in Q_v that does not create a cycle with the edges in U_v and upcasts it to its parent, while moving this edge from Q_v to U_v. Every parent node adds all the messages received from its children to its list Q. Finally, v sends a terminate message to its parent when Q_v is empty. This filtering guarantees that each node upcasts at most n − 1 edges to its parent. As edges have capacity c, this requires at most (n − 1)/c rounds. Considering any path of the shortest path tree from a leaf node to the root, the maximum number of messages that are sent in parallel at any point of time on this path is at most n. Since messages are always upcast in a pipelined fashion, the time complexity is O(D + n/c) rounds. As each node sends at most n − 1 messages, the message complexity is O(n²). Thus, we have shown the following result:

###### Theorem 3.

In the weighted CONGEST model, there exists an algorithm that computes the MST in O(D + n/c) rounds and with Õ(m + n²) messages, w.h.p.

###### Proof.

The algorithm’s correctness follows directly from the cycle property [27, 13]. The filtering rule of the algorithm ensures that any edge sent upward by a node does not close a cycle with the already-sent edges (the edges in list U). Since the edges are upcast in non-decreasing order of weight, and intermediate nodes begin only after receiving at least one message from each of their children, each intermediate node has enough information to send the correct lightest edge, of, say, weight w. This implies that in no later round does that intermediate node receive a message of weight less than w. As such, the only edges filtered are the heaviest cycle edges, which implies that none of the MST edges are ever filtered. The root receives all the MST edges (and possibly additional edges), as required to compute the MST correctly. Termination is guaranteed as each node sends at most n − 1 edges upwards and then a termination message. For a more detailed correctness proof, refer to the proofs in [21] and [18].

The time complexity is determined by the cost of creating the shortest path tree and the cost of doing a pipelined convergecast on this tree. The creation of the shortest path tree requires O(D) time and Õ(m) messages (cf. Lemma 2). The pipelined convergecast is started by the leaf nodes by sending their lightest adjacent edges up. Thereafter, each intermediate node upcasts only after receiving at least one message from each of its children, and this upcast of the lightest edges that do not create a cycle happens in a pipelined fashion. The maximum delay at any intermediate node is due to waiting for the messages from the farthest node in its subtree. From the definition of the latency diameter, this delay is bounded by D. This implies that, in the absence of congestion, the root node would receive all the required information in O(D) time. Secondly, from the filtering (and the cycle property), it is guaranteed that the congestion at any point is not more than n messages. As all edges in a path have capacity c, the delay due to congestion is at most n/c. (Similarly for the root node broadcasting the MST over the shortest path tree.) Therefore, the total time complexity of the weighted pipeline algorithm is O(D + n/c), combining the cost of creating a shortest path tree with the cost of congestion. Furthermore, as each node can send at most n − 1 edges, the message complexity is bounded by Õ(m + n²), including the messages for constructing the shortest path tree. ∎
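For intuition, the per-node cycle filter used in this upcast can be sketched with a union-find structure standing in for the already-sent list (an illustrative reconstruction; the names `Q`, `DSU`, and `next_upcast` are ours):

```python
import heapq

class DSU:
    """Union-find over the endpoints of the edges already upcast."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False  # endpoints already connected: edge closes a cycle
        self.parent[ra] = rb
        return True

def next_upcast(Q, dsu):
    """Pop edges from the min-heap Q of (weight, u, v) triples until one
    survives the cycle filter; return it, or None when Q is exhausted."""
    while Q:
        w, u, v = heapq.heappop(Q)
        if dsu.union(u, v):  # does not close a cycle with sent edges
            return (w, u, v)
    return None

Q = [(1, 'a', 'b'), (2, 'b', 'c'), (3, 'a', 'c'), (4, 'c', 'd')]
heapq.heapify(Q)
dsu, sent = DSU(), []
while (edge := next_upcast(Q, dsu)) is not None:
    sent.append(edge)
print(sent)  # (3, 'a', 'c') is filtered: it would close a cycle
```

Because edges leave the heap in non-decreasing weight order, the edges dropped are exactly the heaviest edges of their cycles, which is the cycle property the correctness proof relies on.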

## 3 Correlated Weights and Latencies

In this section, we consider the weights of the edges to be exactly equal to the edge latencies (i.e., there is a direct correlation between the weights and the latencies). Unlike the case without latencies, where the running time of an algorithm depends on the total number of nodes n (along with the diameter), here we see that, by equating weights with latencies, the running time of any MST construction algorithm becomes dependent on the total weight W of the MST (along with the diameter and the edge capacities). We show a tight bound of Θ̃(D + √(W/c)) on the time required to construct an MST.

### 3.1 Lower Bound

###### Theorem 4.

Any algorithm to compute the MST of a network graph in which the weights correspond to latencies must, in the worst case, take Ω̃(D + √(W/c)) time, where W is the total weight of the MST.

###### Proof.

The Ω(D) lower bound has been addressed in Section 2, and the argument carries over to the current correlated setting. So we focus our efforts on showing a lower bound of Ω̃(√(W/c)). We do this in two steps. We first relate algorithms in our model to algorithms in the classical CONGEST model (cf. Lemma 5). Then, we complete the lower bound argument by applying the lower bound from Das Sarma et al. [24].

###### Lemma 5.

Assume that we are given an algorithm A in the weighted CONGEST model, and that A runs in T rounds on a given graph G with minimum edge latency ℓ_min, maximum edge latency ℓ_max, and capacity c. Then, A can be run in O((T + ℓ_max)/ℓ_min) rounds in the standard CONGEST model with messages of size O(c · ℓ_min · log n) bits.

###### Proof.

We first convert algorithm A into a version A' that is slowed down by an integer factor λ ≥ 2. In the converted algorithm, messages are only sent at times that are integer multiples of λ. If a message is supposed to be sent at time t in algorithm A, we send the message at time λt in A'. Note that if the running time of A is T, the running time of A' is λT.

We first show that, by doing this scaling of the algorithm, each node already knows what messages it sends at time λt in algorithm A' quite a bit before time λt. Consider a node u and some message m that is sent by node u at time t in algorithm A and thus at time λt in algorithm A'. In A, u knows the messages it sends at time t after receiving all the messages that are received by u by time t. Consider a message m' that is received by u from some neighbor v at time t' ≤ t in A. Let ℓ be the latency of the edge (u, v). In A, v sends the message m' at time t' − ℓ. In the slowed down algorithm A', v therefore sends the message m' at the latest at time λ(t' − ℓ). Because the latency of the edge is ℓ, the message m' is thus received by u at the latest at time λ(t' − ℓ) + ℓ ≤ λt − (λ − 1)ℓ. Node u therefore knows all the information required for the messages it sends at time λt in A' already at least (λ − 1)ℓ time units prior to sending the message. Because ℓ ≥ ℓ_min, in A', all nodes thus know which messages to send at a given time at least (λ − 1)ℓ_min rounds prior to that time.

As a next step, we convert A' to an algorithm A'' where nodes can send larger messages, but where all messages are only sent at times that are integer multiples of (λ − 1)ℓ_min. Because messages in A' are known at least (λ − 1)ℓ_min time units prior to being sent, a message that is sent by A' at time t can be sent by A'' at the latest time t'' ≤ t such that t'' is an integer multiple of (λ − 1)ℓ_min. Note that because all messages in A'' are sent at the latest at the time when they are sent by A', all messages are available in A'' when they need to be sent, and the time complexity of A'' is at most the time complexity of A'. The number of messages of A' that have to be combined into a single message of A'' is at most the number of messages that are sent over an edge in an interval of length (λ − 1)ℓ_min by A', and thus in an interval of length (λ − 1)ℓ_min/λ by the original algorithm A. Because the capacity of each edge is at most c, the number of messages that have to be combined into a single message of A'' is thus at most c(λ − 1)ℓ_min/λ + 1 = O(c · ℓ_min) (using the assumption c · ℓ_min ≥ 1).

Because A'' only sends messages at times that are integer multiples of (λ − 1)ℓ_min, the algorithm also works if we increase the latency of each edge to the next integer multiple of (λ − 1)ℓ_min. This might increase the total time complexity by one additive term of O(ℓ_max), because at the end of the algorithm, the nodes might have to receive the last message before computing their outputs. If T'' is the time complexity of A'', the time complexity of this modified A'' is therefore at most T'' + O(ℓ_max). If all the edge latencies are integer multiples of (λ − 1)ℓ_min, the model is exactly equivalent to the original CONGEST model with messages of size O(c · ℓ_min · log n) bits, where time is scaled by a factor of (λ − 1)ℓ_min, and the claim of the lemma thus follows. ∎

From Das Sarma et al. [24], we know that computing the MST in the CONGEST model with bandwidth B bits requires Ω̃(D + √(n/B)) rounds; here B is the bandwidth term, referring to the number of bits that can be sent over an edge per round. Note that their construction uses edge weights 0 and 1, but since the MST is unchanged when all edge weights are offset by 1, their lower bound holds for the case where their edge weights 0 and 1 are changed to 1 and 2, respectively. In this case, the total MST weight is W = Θ(n) and ℓ_min = 1, so applying Lemma 5 with B = O(c · log n) translates this, in turn, to a lower bound of Ω̃(√(W/c)) in the weighted CONGEST model. ∎

### 3.2 Upper Bound

In this section, we provide an Õ(D + √(W/c))-time algorithm for constructing an MST when the latency of each edge matches its weight.

Preliminaries.  We first introduce some notation. Given a graph G, let M denote the (unique) MST of G. A fragment F of M is defined as a connected subgraph of M, that is, F is a rooted subtree of M. The root of the fragment is called the fragment leader. Each fragment is identified by the ID of the fragment leader, and each node knows its fragment’s ID. An edge (u, v) is called an outgoing edge of a fragment if one of its endpoints lies in the fragment and the other does not. The minimum-weight outgoing edge (MOE) of a fragment F is the edge with minimum weight among all outgoing edges of F.
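These definitions can be made concrete with a short sketch (illustrative only; edges are given as weight-first triples so that `min` orders them by weight):

```python
def minimum_outgoing_edge(fragment, edges):
    """Return the minimum-weight outgoing edge (MOE) of a fragment.

    fragment: set of node IDs belonging to the fragment.
    edges: iterable of (weight, u, v) triples for the whole graph.
    An edge is outgoing iff exactly one endpoint lies in the fragment."""
    outgoing = [(w, u, v) for (w, u, v) in edges
                if (u in fragment) != (v in fragment)]
    return min(outgoing, default=None)

edges = [(4, 1, 2), (1, 2, 3), (7, 1, 3), (3, 3, 4)]
print(minimum_outgoing_edge({1, 2}, edges))  # (1, 2, 3)
```

When the fragment already spans all nodes, there is no outgoing edge and the function returns `None`, which is exactly the termination condition of Borůvka-style merging.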

MST Algorithm for Matching Weights and Latencies.  To obtain not only the optimal time complexity but also a reasonable message complexity, we base our algorithm on Elkin's algorithm [6] for graphs without latencies. Our algorithm constructs the MST in a bottom-up fashion by building an initial set of fragments, called base fragments, that satisfy a certain condition, and then uses a shortest path tree (to account for arbitrary edge latencies) instead of the BFS tree in [6] for the subsequent component mergings. Unlike [6], our algorithm does not distinguish between different cases based on the graph diameter and does not guarantee optimal message complexity. To obtain a message-optimal algorithm, a BFS tree could be used, but the algorithm would then no longer be time optimal (see Section 5 for a more detailed discussion).

With arbitrary latencies, if we use the previous approach (of [6]) and focus on building base fragments up to a certain diameter, we can no longer say anything useful regarding the fragment size (with arbitrary latencies, a fragment of a given diameter could consist of only two nodes joined by a single high-latency edge, or alternatively of many nodes joined by low-latency edges) and hence the number of base fragments created. In the worst case, there can be up to base fragments. Another challenge with arbitrary latencies is the choice of the MOE edges over which mergings are allowed. If we do not distinguish between MOE edges of different latencies, the cost of communicating within a fragment may become too high. On the other hand, if the merging criteria are too strict, fragments may not merge regularly enough, requiring a larger number of phases. To achieve optimal time complexity and minimal message complexity, the cost of communicating within a fragment has to be balanced against the cost of the congestion caused by the number of created base fragments. This balance, as can be guessed from the lower bound, depends on the total weight of the MST. Without knowing the value of , nodes would have to determine this balance on the fly in order to decide when to switch to using the shortest path tree.

To get around these issues, firstly, instead of controlling the growth of the fragment diameter directly, we limit the total weight up to which fragments can grow in a particular phase . Additionally, in a particular phase of the matching base build algorithm, we only allow edges of weight (latency) or less to be used for fragment mergings. As such, for simplicity one can view the matching base build algorithm in phase as running on the subgraph of the given graph that consists only of the edges of latency . Finally, we use a guess-and-double technique to determine the balance between the number and the diameter of the base fragments. Initially, we present the algorithm assuming that the nodes know the value of , and later show that even if is not known, the MST can be computed through a guess-and-double strategy.

We call a fragment in phase a blocked fragment if all its adjacent MOE edges (including its own) have latency (weight) greater than , so that it cannot merge with any other fragment in phase . All other fragments, which can still merge, are called non-blocked fragments. As the growth of some fragments is now blocked, another challenge is to regulate the number of base fragments (more base fragments lead to a higher time complexity when accounting for congestion in communicating via the shortest path tree).
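The blocked/non-blocked distinction can be sketched directly from the definitions (a centralized illustration with our own names; `threshold` stands for the phase's weight bound):

```python
def is_blocked(edges, fragment_of, f, threshold):
    """A fragment is blocked in a phase if every one of its outgoing
    edges has weight (= latency) greater than the phase threshold,
    so no merging over it is allowed in this phase."""
    outgoing = [e for e in edges
                if (fragment_of[e[1]] == f) != (fragment_of[e[2]] == f)]
    return all(w > threshold for (w, u, v) in outgoing)
```

Note that a fragment with no outgoing edges at all is vacuously blocked under this check, which matches the intent: it cannot merge in this phase.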

Creating Base Fragments.  The matching base build algorithm runs in phases and begins with each node as a singleton fragment. Thereafter, in every phase, each fragment finds its MOE and some fragments are merged along their MOEs in a controlled and balanced fashion, until base fragments of the required total weight are obtained. Mergings are performed by determining the MOE for each node of a fragment and convergecasting only the lightest edge seen up to the fragment leader. The fragment leader decides the overall MOE for the fragment, and the merging (if any) occurs over this MOE. The guarantee here is twofold: first, the fragments merge sufficiently regularly (i.e., the number of fragments reduces by at least half in each phase), and second, the total number of blocked fragments at the end of the matching base build algorithm is not too large.

Consider to be the set of fragments at the start of the phase. In the first phase, consists of singleton fragments. For the purpose of analysis, we define a fragment graph as follows. For a particular phase , its fragment graph consists of the vertices , where each is a fragment at the start of phase of the algorithm. The edge set of is obtained by contracting the vertices of each fragment to a single vertex in and removing all resulting self-loops of , leaving only the MOE edges in set . Also, let be the set of edges chosen by the algorithm over which fragment mergings happen in phase . Notice that the fragment graph is in fact a rooted tree. Additionally, note that is not explicitly constructed by the algorithm; it is merely a construct used in the analysis. The pseudocode of the matching base build algorithm is shown in Algorithm 1; it uses similar techniques as the controlled-GHS algorithm in [20] (also see [8], [14], and the MST forest construction in [6]).
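The merge step above can be pictured with a centralized Borůvka-style sketch: each fragment selects its MOE and fragments joined by selected MOEs are unioned. This is a sequential stand-in for intuition, not the distributed matching procedure of Algorithm 1, and all names are ours:

```python
def merge_phase(edges, fragment_of):
    """One phase: every fragment picks its MOE, then fragments connected
    by the picked MOEs are merged (union-find with path halving).
    Returns the new node->fragment map and the set of edges used."""
    fragments = set(fragment_of.values())
    chosen = {}
    for f in fragments:
        out = [e for e in edges
               if (fragment_of[e[1]] == f) != (fragment_of[e[2]] == f)]
        if out:                       # a blocked fragment picks nothing
            chosen[f] = min(out)
    parent = {f: f for f in fragments}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for w, u, v in chosen.values():
        a, b = find(fragment_of[u]), find(fragment_of[v])
        if a != b:
            parent[a] = b
    return {v: find(f) for v, f in fragment_of.items()}, set(chosen.values())
```

With distinct edge weights, repeating this phase until a single fragment remains selects exactly the MST edges, and the number of fragments at least halves per phase, mirroring the halving guarantee in the text.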

###### Lemma 6.

At the start of phase , each fragment has a diameter of at most . Specifically, at the end of the matching base build algorithm each fragment has diameter at most .

###### Proof.

We show via induction on the phase number that, at the start of phase , the diameter of each fragment is at most . The base case, i.e., at the start of phase , is trivially true, since , which is greater than , the total weight of a singleton fragment. For the induction hypothesis, assume that the diameter of each fragment at the start of phase is at most . We show that when the phase ends (i.e., at the start of phase ), the diameter of each fragment is at most .

Fragments grow by merging with other fragments over a matching MOE edge of the fragment graph. We know from the description of the algorithm (see Line of Algorithm 1) that at least one of the fragments taking part in the merging has a diameter of at most (since only fragments with weight at most find MOE edges); however, that MOE edge might lead to a fragment with larger diameter (at most ).

Additionally, some other fragments with weight (also the diameter) at most can possibly join with either of these merging fragments of the matching edge, if they did not have any adjacent matching edge (see Line of Algorithm 1).

We see that the resulting diameter of the newly merged fragment at the end of phase is determined by at most fragments, out of which at most one has a diameter of and the other three have weight/diameter of at most and these are joined by MOE edges (of weight at most , and therefore can possibly contribute at most to the diameter of the merged fragment). Therefore, the diameter at the end of phase is at most , for , completing the proof by induction.

Since the matching base build algorithm runs for phases, the weight of each fragment at the end of the algorithm is at most . ∎

###### Lemma 7.

At the start of phase , each non-blocked fragment has a total weight of at least .

###### Proof.

We prove the above lemma via induction on the phase number . For the base case, i.e., at the start of phase , there exist only singleton fragments, which have weight at least . For the induction hypothesis, we assume that the statement is true for phase , i.e., at the start of phase , the total weight of each non-blocked fragment is at least , and show that the statement also holds for phase , i.e., at the start of phase , the total weight of each non-blocked fragment is at least . To show this, consider all the non-blocked fragments in phase ; each such fragment has weight either at least or less than that. For fragments with weight , the lemma is vacuously true. For the second case, where the fragment weight is , we know from the algorithm (see lines and of the matching base build algorithm) that all such fragments merge with at least one more fragment. This other fragment has weight at least (from the induction hypothesis), and therefore the total weight of the resulting fragment at least doubles, i.e., becomes at least , thus proving the lemma. ∎

###### Lemma 8.

The number of fragments remaining at the start of phase is at most . Specifically, at the end of the matching base build algorithm the number of fragments remaining is at most .

###### Proof.

Each remaining fragment at the beginning of phase is either a non-blocked fragment, or a blocked fragment. We know from Lemma 7, at the start of phase , each non-blocked fragment has a total weight at least . Since fragments are disjoint and the total weight of all the fragments is (weight of the MST), this implies that the number of non-blocked fragments at the start of phase , is at most . Additionally, each blocked fragment would have all its adjacent MOE edges of weight (otherwise, it would not be blocked). The maximum possible number of MOE edges of weight that can exist, is at most , since each of the MOE edges would be a part of the MST and the total weight of the MST is . This implies that the number of blocked fragments at the start of phase , is also at most . Therefore, the total number of fragments remaining at the start of phase , is the sum of the non-blocked and the blocked fragments, which is . Thus, after phases, the number of remaining fragments is at most . ∎

###### Lemma 9.

Matching Base Build algorithm outputs at most MST fragments each of diameter at most in rounds and requiring messages.

###### Proof.

Each phase of the matching base build algorithm performs three major functions, namely finding the MOE, convergecasting within the fragment, and merging with an adjacent fragment over the matched MOE edge. For finding the MOE, in each phase, every node checks each of its neighbors (at most in time) in non-decreasing order of the weight of the connecting edge, starting from the last checked edge (from the previous phase). Thus, each node contacts each of its neighbors at most once, except for the last checked node (which takes one message per phase). Hence the total message complexity (over phases) is

$$\sum_{v\in V} 2d(v) \;+\; \sum_{i=1}^{\log\sqrt{W/c}}\,\sum_{v\in V} 1 \;=\; O(m) + n\cdot\tfrac{1}{2}\log(W/c) \;=\; O\!\big(m + n\log(W/c)\big),$$

where refers to the degree of a node.

The fragment leader determines the MOE for a particular phase , by convergecasting over the fragment, which requires at most rounds since the diameter of any fragment is bounded by (by Lemma 6). The fragment graph, being a rooted tree, uses a round deterministic symmetry-breaking algorithm [3, 19] to obtain the required matching edges in the case without latencies. Taking into account the required scale-up in case of the presence of latencies, the symmetry breaking algorithm is simulated by the leaders of neighboring fragments by communicating with each other; since the diameter of each fragment is bounded by and the maximum weight of the MOE edges is also , the time needed to simulate one round of the symmetry breaking algorithm in phase is rounds. Also, as only the MST edges (MOE edges) are used in communication, the total number of messages needed is per round of simulation. Since there are iterations, the total time and message complexity for building the maximal matching is and respectively. Afterwards, adding selected edges into (Line of the matching base build algorithm) can be done with additional message complexity and time complexity in phase . Thus, the overall message complexity of the algorithm is and the overall time complexity is . ∎

Since there are at most base fragments remaining (from Lemma 9), at most MST edges need to be discovered. However, it is at this point that the manner in which fragment mergings occur changes. A shortest path tree is created as shown in Section 2.2, and thereafter the base fragments are progressively merged (using the shortest path tree) in iterations. (If , only singleton fragments remain and the given algorithm reduces to the algorithm of Section 2.2.) Note that the shortest path tree needs to be built only once. For ease of explanation, we call the fragments created by merging base fragments (as well as any fragment created by merging using the shortest path tree) mst-components.

Merging Components using Shortest Path Tree.  Each node determines its MOE (w.r.t. its component), requiring time and messages. (Note that the value of is known through the shortest path tree construction, and a node need not wait for more than time to determine its MOE, as no edge with weight (latency) would be present in the MST.) These MOE edges are upcast to the base fragment leader while filtering only the lightest edge, requiring rounds (the base fragment diameter) and messages (as only the lightest MOE is upcast). Each base fragment leader upcasts the lightest known outgoing edge of the component it belongs to up the shortest path tree, where intermediate nodes wait until they receive at least one message from each of their children and then upcast the lightest edge of each component that they have received or belong to (starting from the component with the lowest id) to their parent. After receiving the first message from each child node, subsequent messages arrive in a pipelined order at intervals of (as edges have capacity ). As a total of at most messages are upcast on , the maximum possible time required is . Correspondingly, the number of messages required is , where is the hop diameter of the shortest path tree . The root of the shortest path tree locally computes the component mergings (by locally simulating the matching base build algorithm) and thereafter informs all the fragment leaders of their updated component ids, which they further downcast to all the nodes, completing an iteration. The guarantee, as earlier, is that the number of components halves in every iteration, requiring a total of at most iterations.
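The pipelining cost used in this accounting can be sketched with a back-of-the-envelope model, assuming unit-size messages and that a new message may enter an edge every 1/c rounds (function names are ours):

```python
def edge_delivery_time(k, latency, capacity):
    """Time for k messages to cross one edge of the given latency when
    a new message can be injected every 1/capacity rounds: the last
    message departs at (k - 1)/capacity and arrives latency rounds later."""
    return latency + (k - 1) / capacity


def upcast_time(path_latencies, k, capacity):
    """Pipelined upcast of k messages along a path to the root: the first
    message pays the full path latency, and the remaining k - 1 messages
    follow at intervals of 1/capacity."""
    return sum(path_latencies) + (k - 1) / capacity
```

This is exactly the shape of the bound in the text: a one-time latency term (the depth of the shortest path tree) plus a congestion term proportional to the number of upcast messages divided by the capacity.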

The overall time and message complexity is determined by the cost of matching base build algorithm along with the cost pipelining over the shortest path tree. The time complexity is and the message complexity is .

Guessing and Doubling.  In the absence of knowledge of the total weight of the MST, we can still run the above algorithm by guessing a value for in each iteration. We begin with an initial guess of ; if the algorithm is successful for the guessed value of , it terminates; otherwise, it doubles the guessed value and continues.

First, a shortest path tree is built. Thereafter, the matching base build algorithm is run with the guessed value of . To check for success, each base fragment leader sends a single bit to the root so that the root can determine the total number of base fragments present; if the number of base fragments is (c.f. Lemma 9) for the current estimate of , this implies that the algorithm successfully guessed the value of . Once the root determines the appropriate value of , it informs all the other nodes to run the actual algorithm. Note that this does not increase the overall time and message complexity by more than a constant factor.
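The guess-and-double loop itself is generic and can be sketched as follows (the predicate `succeeds` stands in for the root's base-fragment-count check; names are ours):

```python
def guess_and_double(succeeds, initial_guess=1):
    """Run the algorithm with estimate W_hat and double the estimate
    until the success check passes; return the successful estimate."""
    w_hat = initial_guess
    while not succeeds(w_hat):
        w_hat *= 2
    return w_hat
```

Since the per-guess cost grows with the estimate, the total cost over all guesses is dominated (up to a constant factor) by the final, successful guess, which is why the overhead stated above is only a constant factor.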

###### Theorem 10.

In the weighted CONGEST model, when edge latencies equal edge weights, there exists an algorithm that computes the MST in rounds using messages w.h.p.

###### Proof.

The correctness of the algorithm follows immediately from the fact that in each phase or iteration, MST fragments merge with one another and the total number of fragments reduces by at least half. This ensures that after the stated number of iterations, only one fragment remains, which is in fact the MST.

The overall time and message complexity is determined by the cost of matching base build algorithm along with the cost of merging the components over the shortest path tree.

The time complexity of the matching base build algorithm is (by Lemma 9). Thereafter, we determine the time complexity of merging all the base fragments using the shortest path tree. Creating the shortest path tree takes time (c.f. Lemma 2). The mst-components are merged using the shortest path tree in iterations. In every iteration, each node first determines its MOE edge in time (as no edge with latency can be part of the MST). These MOE edges are upcast to the base fragment leader, requiring rounds (the base fragment diameter). Each base fragment leader upcasts an edge up to the root of the shortest path tree, where intermediate nodes first wait until they have received at least one message from all their children and then upcast the lightest edge seen (starting from the component with the lowest id) to their parent. After receiving the first message from each child node, subsequent messages arrive in a pipelined order at intervals of (as edges have capacity ). As each intermediate node forwards at most messages, the maximum possible time required for upcasting to the root is .

The overall time complexity is (as the number of base fragments cannot be greater than ).

The message complexity of the matching base build algorithm is (by Lemma 9). Thereafter, we determine the message complexity of merging all the base fragments using the shortest path tree. Creating the shortest path tree takes messages (c.f. Lemma 2). The mst-components are merged using the shortest path tree in iterations. In every iteration, each node first determines its MOE edge using messages. These MOE edges are upcast to the base fragment leader, requiring messages (as they are transmitted on an MST fragment). As a total of messages are upcast along the shortest path tree, the message complexity is , where is the hop diameter of the shortest path tree. Since there are a total of iterations, messages are required.

The overall message complexity becomes . ∎

## 4 Special Case: Uniform Latencies, Different Weights

In this section, we consider all edges to have the exact same latency, while each edge has a different weight (ties can be broken using the node ids). It is here that we illustrate the role of edge capacities in obtaining a faster solution. Given that all edges have the same latency , one would expect an slowdown compared to the results for the standard non-latency model. This is, in fact, true if we consider the worst-case capacity , where a message can be sent over an edge only after the previous message has been delivered. However, when , a new message can be sent over an edge in every round. Instead of a direct multiplicative slowdown to (where is the graph diameter without latencies), we obtain an upper bound of due to this pipelining of messages, where the difference between the two bounds is that the former uses the hop diameter while the latter uses the latency diameter . The reason for this is that when all edge latencies are the same, we have . More generally, for any given latency , we give an algorithm that constructs an MST in time. This section best illustrates the power of having a larger edge capacity, which our algorithm leverages when pipelining messages over an edge.

MST Algorithm for Uniform Latencies.  As before, to obtain an algorithm that is simultaneously time and message optimal, we base our algorithm on Elkin's MST algorithm [6]. The key idea is to build an initial set of fragments (called base fragments) up to some required diameter and later construct the MST by merging these base fragments using a BFS tree, while also retaining the underlying layer of base fragments. However, the diameter up to which base fragments are built depends on the graph diameter . If is sufficiently small, say less than a particular parameter , base fragments of diameter are constructed and used to obtain the MST; otherwise, if , we only construct base fragments of diameter .

In contrast to the case without latencies and capacities, where , here we consider the graph diameter sufficiently small when , and based on this we divide our algorithm into two cases. We observe that this careful determination of the parameter (involving latencies and capacity) is sufficient to account for the case where edges have uniform latency. The choice of is also responsible for the speed-up obtained from edge capacities. The intuition behind choosing this particular value of is to balance the cost of creating the base fragments with the cost of the mergings done using the BFS tree (determined by the number of base fragments).

The algorithm begins by creating a BFS tree. Since all edges have uniform latency, we can use the BFS tree construction algorithm for the standard CONGEST model to create a BFS tree in time and with messages [6]. This can be easily done by scaling one round of the standard CONGEST model to rounds here. The time taken will be given by which is equal to .
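Since all latencies are equal, latency-weighted distances are simply the uniform latency times hop distances, so a plain layer-by-layer BFS suffices; a centralized sketch (names are ours):

```python
from collections import deque

def bfs_distances(adj, root, latency):
    """Hop-by-hop BFS from the root. With uniform edge latency, a node's
    latency distance is latency * (hop distance), so the latency diameter D
    equals latency times the hop diameter."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + latency
                queue.append(v)
    return dist
```

This is why one round of the standard CONGEST BFS can simply be scaled to the uniform latency, with no shortest-path machinery needed as in Section 3.2.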

Creating Base Fragments.  The uniform base build algorithm begins with each node as a singleton fragment, and thereafter, in every phase, fragments merge in a controlled and balanced manner, ensuring that the number of fragments at least halves in each phase. When , we start by building base fragments of diameter , whereas when , the base fragments are built up to a fragment diameter of .

In fact, we show that the total number of fragments that remain after the matching base build part is , and the diameter of each fragment is at most .

Similar to the analysis of the algorithm in Section 3.2, we define the set of fragments , the fragment graph , and the edge set . The pseudocode for building base fragments (when ) is shown in Algorithm 2 and uses similar techniques as in Section 3.2 and the controlled-GHS algorithm of [20].

###### Lemma 11.

At the start of phase , each fragment has diameter at most . Specifically, at the end of the uniform base build algorithm each fragment has diameter at most . (hop diameter )

###### Proof.

We show via induction on the phase number that, at the start of phase , the diameter of each fragment is at most . The base case, i.e., at the beginning of phase , is trivially true, since , which is greater than , the diameter of a singleton fragment. For the induction hypothesis, assume that the diameter of each fragment at the start of phase is at most . We show that when the phase ends (i.e., at the start of phase ), the diameter of each fragment is at most . We see that a fragment grows by merging with other fragments over a matching edge of the fragment graph. Also, from the algorithm, note that at least one of the fragments taking part in the merging has diameter at most , since only fragments with diameter at most find MOE edges; the MOE edge may lead to a fragment with larger diameter, i.e., at most .

Additionally, some other fragments (with diameter at most ) can possibly join with either merging fragments of the matching edge, if they did not have any adjacent matching edge (see Line of Algorithm 2).

Therefore, the resulting diameter of the newly merged fragment at the end of phase is at most , since the diameter of the combined fragment is determined by at most fragments, out of which at most one has diameter and the other three have diameter at most and these are joined by MOE edges (which contributes to ). Thus, the diameter at the end of phase is at most , for .

Since the uniform base build algorithm runs for phases, the diameter of each fragment at the end of the algorithm is at most . ∎

###### Lemma 12.

At the start of phase , each fragment has size at least and the number of fragments remaining is at most . Specifically, at the end of the uniform base build algorithm the number of fragments remaining is at most .

###### Proof.

We prove the above lemma via induction on the phase number . For the base case, i.e., at the start of phase , there exist only singleton fragments, which have size at least . For the induction hypothesis, we assume that the statement is true for phase , i.e., at the start of phase , the size of each fragment is at least , and show that the statement also holds for phase , i.e., at the start of phase , the size of each fragment is at least . To show this, consider all the fragments in phase ; each fragment has diameter either or less than that. It is easy to see that if a fragment has diameter , it has more than nodes, as the latency of each edge is . For the second case, where the fragment diameter is , we know from the algorithm (see lines and of Algorithm 2) that all such fragments merge with at least one more fragment. This other fragment has size at least (from the induction hypothesis), and therefore the size of the resulting fragment at least doubles, i.e., becomes at least .

Since fragments are disjoint, this implies that the number of fragments at the start of phase , is at most . Thus, after phases, the number of fragments is at most . ∎

###### Lemma 13.

Uniform Base Build algorithm outputs at most MST fragments each of diameter at most in rounds and requiring messages.

###### Proof.

Each phase of the uniform base build algorithm performs three major functions, namely finding the MOE, convergecasting within the fragment, and merging with an adjacent fragment over the matched MOE edge. For finding the MOE, in each phase, every node checks each of its neighbors (in time) in non-decreasing order of the weight of the connecting edge, starting from the last checked edge (from the previous phase). Thus, each node contacts each of its neighbors at most once, except for the last checked node (which takes one message per phase). Hence the total message complexity (over phases) is

$$\sum_{v\in V} 2d(v) \;+\; \sum_{i=1}^{\log\sqrt{n/(c\ell)}}\,\sum_{v\in V} 1 \;=\; O(m) + n\cdot\tfrac{1}{2}\log\tfrac{n}{c\ell} \;=\; O\!\Big(m + n\log\tfrac{n}{c\ell}\Big),$$

where refers to the degree of a node.

The fragment leader determines the MOE for a particular phase , by convergecasting over the fragment, which requires at most rounds since the diameter of any fragment is bounded by (by Lemma 11). The fragment graph, being a rooted tree, uses a round deterministic symmetry-breaking algorithm [3, 19] to obtain the required matching edges in the case without latencies. Taking into account the required scale-up in case of the presence of latencies (one round for the non-latency case would be simulated as round in this case), the symmetry breaking algorithm is simulated by the leaders of neighboring fragments by communicating with each other; since the diameter of each fragment is bounded by , the time needed to simulate one round of the symmetry breaking algorithm in phase is rounds. Also, as only the MST edges (MOE edges) are used in communication, the total number of messages needed is per round of simulation. Since there are iterations, the total time and message complexity for getting the maximal matching is and respectively. Afterwards, adding selected edges into (Line of the uniform base build algorithm) can be done with additional message complexity and time complexity in phase . The mergings are done over the matching edges and require one more round of convergecast to inform the nodes regarding the new fragment leader. This also takes time and messages. Since there are phases in the uniform base build algorithm, the overall message complexity of the algorithm is and the overall time complexity is . ∎

Merging Components Using BFS Tree.  Here the mst-components are merged using a BFS tree in iterations. The merging happens as shown in Section 3.2; however, now the upcasting can be done in a synchronous fashion as the latencies are uniform. Upcasting the MOE edges requires time and at most messages, where is the hop diameter of the BFS tree. However, as is , it implies that , which further implies that . This trick of differentiating based on helps in limiting the number of messages to .

###### Lemma 14.

For the case when , merging components using the BFS tree requires rounds and messages.

###### Proof.

In the first step of the iteration, determining the MOE requires time and messages. Upcasting the MOEs to the base fragment leader requires rounds (the base fragment diameter) and messages (as only the lightest MOE edge is upcast along a fragment). Next, the base fragment leaders upcast to the root of the BFS tree in synchronous phases. For each base fragment ( many), only one MOE edge is upcast to the root of the BFS tree. Therefore, the maximum possible time required for upcasting to the root is and the number of messages sent is at most , where is the hop diameter of the BFS tree. However, as in this case is , it implies that , which further implies that . Once the root of the BFS tree obtains the MOE edges of all the components, it locally computes the component mergings (by locally simulating the uniform base build algorithm) and thereafter informs all the fragment leaders of their updated component ids, requiring rounds and messages. Finally, the base fragment leaders inform all the nodes of their base fragment, which requires at most rounds and messages.

The cost of one iteration of merging using the BFS Tree is time and messages. Since there are iterations the time complexity becomes and the message complexity is . ∎

The overall time and message complexity is determined by the cost of creating the base fragments along with the cost of merging the mst-components using the BFS tree. Therefore, the running time for the case is calculated as and the message complexity is .

In the case where , the algorithm runs in a similar fashion, except that here the base fragments grow until their diameter equals the graph diameter . In this case, the uniform base build algorithm runs for phases (as all edge latencies are , in the worst case ) and outputs at most fragments, requiring a running time of and messages. Thereafter, the mergings over the BFS tree require time (as ) and messages. We see that for this case as well, we obtain a time complexity of and a message complexity of . These results are proved using the following lemmas.

###### Lemma 15.

Uniform Base Build algorithm for the case where runs for phases and outputs at most MST fragments, each of diameter at most in rounds and requiring messages.

###### Proof.

From Lemma 11, we see that for the uniform base build algorithm, at the start of phase , each fragment has diameter at most . Therefore, after phases here, the fragment diameter is at most .

From Lemma 12, we see that for the uniform base build algorithm, at the start of phase , each fragment has size at least and the number of fragments remaining is at most . Therefore, after phases here, the fragment size is at least and the number of fragments remaining is at most . Since here , this implies that the number of fragments is .

Thereafter, to determine the time and message complexity, we give an analysis similar to that of Lemma 13. Each phase of the uniform base build algorithm performs three major functions, namely finding the MOE, convergecasting within the fragment, and merging with an adjacent fragment over the matched MOE edge. For finding the MOE, in each phase, every node checks each of its neighbors (in time) in non-decreasing order of the weight of the connecting edge, starting from the last checked edge (from the previous phase). Thus, each node contacts each of its neighbors at most once, except for the last checked node (which takes one message per phase). Hence the total message complexity (over phases) is

$$\sum_{v\in V} 2d(v) \;+\; \sum_{i=1}^{\log(D/\ell)}\,\sum_{v\in V} 1 \;=\; O(m) + n\log\tfrac{D}{\ell} \;=\; O\!\Big(m + n\log\tfrac{D}{\ell}\Big),$$

where refers to the degree of a node.

The fragment leader determines the MOE for a particular phase , by convergecasting over the fragment, which requires at most rounds since the diameter of any fragment is bounded by (by Lemma 11). The fragment graph, being a rooted tree, uses a round deterministic symmetry-breaking algorithm [3, 19] to obtain the required matching edges in the case without latencies. Taking into account the required scale-up in case of the presence of latencies (one round for the non-latency case would be simulated as round in this case), the symmetry breaking algorithm is simulated by the leaders of neighboring fragments by communicating with each other; since the diameter of each fragment is bounded by , the time needed to simulate one round of the symmetry breaking algorithm in phase is rounds. Also, as only the MST edges (MOE edges) are used in communication, the total number of messages needed is per round of simulation. Since there are iterations, the total time and message complexity for getting the maximal matching is and respectively. Afterwards, adding selected edges into (Line of the uniform base build algorithm) can be done with additional message complexity and time complexity in phase . The mergings are done over the matching edges and require one more round of convergecast to inform the nodes regarding the new fragment leader. This also takes time and messages. Since there are phases in the uniform base build algorithm when , the overall message complexity of the algorithm is