Distributed Computation in the Node-Congested Clique

05/18/2018 ∙ by John Augustine, et al. ∙ Indian Institute of Technology Madras ∙ University of Freiburg ∙ ETH Zurich ∙ Universität Paderborn ∙ University of Houston

The Congested Clique model of distributed computing, which was introduced by Lotker, Patt-Shamir, Pavlov, and Peleg [SPAA'03, SICOMP'05] and was motivated as "a simple model for overlay networks", has received extensive attention over the past few years. In this model, the nodes of the system are connected as a clique and communicate in synchronous rounds, where per round each node can send O(log n) bits to each other node, all at the same time. The fact that this model allows each node to send and receive a linear number of messages at the same time seems to limit the relevance of the model for overlay networks. Towards addressing this issue, in this paper, we introduce the Node-Congested Clique as a general communication network model. Similarly to the Congested Clique model, the nodes are connected as a clique and messages are sent in synchronous communication rounds. However, here, per round, every node can send and receive only O(log n) many messages of size O(log n). To initiate research on our network model, we present distributed algorithms for the Minimum Spanning Tree, BFS Tree, Maximal Independent Set, Maximal Matching, and Coloring problems for an input graph G=(V,E), where each clique node initially only knows a single node of G and its incident edges. For the Minimum Spanning Tree problem, our runtime is polylogarithmic. In all other cases, the runtime of our algorithms mainly depends on the arboricity a of G, which is a constant for many important graph families such as planar graphs. At the core of these algorithms lies a distributed algorithm that assigns directions to the edges of G so that, at the end, every node is incident to at most O(a) outgoing edges.







1. Introduction

Nowadays, most distributed systems and applications do not have a dedicated communication infrastructure, but instead share a common physical network with many others. The logical network formed on top of this infrastructure is called an overlay network. For these systems, the amount of information that a node can send out in a single round does not scale linearly with the number of its incident edges. Instead, it rather depends on the bandwidth of the node's connection to the communication infrastructure as a whole. For these networks, it is therefore more reasonable to impose a bound on the amount of information that a node can send and receive in one round, rather than imposing a bound on the amount of information that can be sent along each of its incident edges. Also, the topology of the overlay network may change over time, and these changes are usually under the control of the distributed application. To capture these aspects, we propose to study the so-called Node-Capacitated Clique model. The model is inspired in part by the Congested Clique model first introduced by Lotker, Patt-Shamir, Pavlov, and Peleg (Lotker et al., 2005), which has received significant attention recently (Jurdziński and Nowicki, 2018a; Lenzen, 2013; Ghaffari and Parter, 2016; Hegeman et al., 2015; Korhonen, 2016; Lotker et al., 2005; Jurdziński and Nowicki, 2018b; Censor-Hillel et al., 2015; Dolev et al., 2012; Becker et al., 2017; Censor-Hillel et al., 2017; Hegeman and Pemmaraju, 2014; Hegeman et al., 2014; Gall, 2016; Konrad, 2018; Ghaffari et al., 2018; Ghaffari, 2017; Becker et al., 2018; Ghaffari and Nowicki, 2018).

Similarly to the Congested Clique model, the nodes of the Node-Capacitated Clique are interconnected by a complete graph. However, in the Node-Capacitated Clique every node can only send and receive at most O(log n) messages consisting of O(log n) bits in each round. This limitation is added precisely to address the issue explained above. In particular, it rules out the possibility that one node is in contact with up to n−1 other nodes at the same time; a property of the Congested Clique that seems to severely limit its practicability. We remark that the capacity bound of O(log n) messages per node per round is a natural choice: it is small enough to ensure scalability, while any smaller bound would require unnecessarily complicated techniques to ensure that nodes do not receive more messages than the capacity bound allows.

Compared to traditional overlay network research, the Node-Capacitated Clique model has the advantage that it abstracts away the issue of designing and maintaining a suitable overlay network, for which many solutions have already been found in recent years. Nevertheless, it is closely related to overlay networks: every overlay network algorithm (i.e., an algorithm in which overlay edges can be established by introducing nodes to each other, and which satisfies the capacity bound of O(log n) messages per node and round) can be simulated in the Node-Capacitated Clique without any overhead. Furthermore, any algorithm for our model can be simulated with a multiplicative runtime overhead of O(log n) in the CRCW PRAM model (by assigning each processor O(log n) memory cells, and letting nodes write into randomly chosen cells of other processors), which in turn can be simulated with only O(log n) overhead by a network of constant degree (Ranade, 1991). The Congested Clique model and its broadcast variant, on the other hand, are far more powerful (and arguably beyond what is possible in overlay networks): Whereas in the Congested Clique a total of Θ(n² log n) bits can be transmitted in each round, in the Node-Capacitated Clique only O(n log² n) bits may be sent. For example, the gossip problem—i.e., delivering one message from each node to every other node—can be solved in a single round in the Congested Clique, whereas it requires at least Ω(n/log n) rounds in the Node-Capacitated Clique model. Even the simple broadcast problem—i.e., delivering one message from one node to all other nodes—already takes Ω(log n/log log n) time in the Node-Capacitated Clique.

In this paper, we assume that some edges of the network are marked as edges of an input graph G, where each node knows which other nodes are its neighbors in G, and aim to solve graph problems on G using the power of the Node-Capacitated Clique. Such edges can, for instance, be seen as edges of an underlying physical network, or represent relations between nodes in social networks. Our results in that direction also turn out to be useful for some other theoretical models: they are relevant for hybrid networks (Gmyr et al., 2017) and also for the k-machine model for processing large-scale graphs (Klauck et al., 2015).

The concept of hybrid networks has only recently been considered in theory (e.g., (Gmyr et al., 2017)). In a hybrid network, nodes have different communication modes: we are given a network of cheap links of arbitrary topology that is not under the control of the nodes and may potentially be changing over time. In addition to that, the nodes have the ability to build arbitrary overlay networks of costly links that are fully under the control of the nodes. Cell phones, for example, can communicate in an ad-hoc fashion via their WiFi interfaces, which is free of charge but only has a limited range, and whose connections may change as people move. Additionally, they may use their cellular infrastructure, which comes at a price, but remains fully under their control. Although in the idealized setting this overlay network may form a clique, to save costs, the nodes might want to exchange only a small number of messages of small size in each communication round. This property is captured by the Node-Capacitated Clique. The network of cheap links, on the other hand, can be seen as an input graph in the Node-Capacitated Clique for which the nodes want to solve a graph problem of interest.

Another interesting application of the Node-Capacitated Clique is the recently introduced k-machine model (Klauck et al., 2015), which was designed for the study of data-center-level distributed algorithms for large-scale graph problems. Here, a data center with k servers is modeled as k machines that are fully interconnected and capable of executing synchronous message-passing algorithms. A standard approach in the k-machine model is to partition the input graph in a fair way so that each machine stores a set of nodes of the input graph together with their incident edges. It is quite natural to simulate algorithms designed for the Node-Capacitated Clique model in the k-machine model: any algorithm that requires T rounds in the Node-Capacitated Clique model can be simulated in the k-machine model with a multiplicative overhead of roughly n/k² (up to logarithmic factors). The details of this simulation can be found in Appendix A. To illustrate the usefulness of this simulation, we remark that the running time of the fast minimum spanning tree algorithm of Pandurangan et al. (Pandurangan et al., 2016) can be recovered simply by converting the algorithm we provide in this work to the k-machine model.

As we demonstrate in this paper, many graph problems can be solved efficiently in the Node-Capacitated Clique, which shows that many interesting problems can be solved efficiently in distributed systems based on an overlay network over a shared infrastructure as well as hybrid networks and server systems.

1.1. Model and Problem Statement

In the Node-Capacitated Clique model we consider a set V of n computational entities that we model as nodes of a graph. Each node has a unique identifier consisting of O(log n) bits, and every node knows the identifiers of all other nodes, so that, on a logical level, the nodes form a complete graph. Note that since every node knows the identifier of every other node, the nodes also know the total number of nodes n. As node identifiers are common knowledge, without loss of generality we can assume that the identifiers are from the set {1, …, n}.

The network operates in a synchronous manner with time measured in rounds. In every round, each node can perform an arbitrary amount of local computation and send up to O(log n) distinct messages consisting of O(log n) bits each to other nodes. The messages are received at the beginning of the next round. A node can receive up to O(log n) messages; if more messages are sent to it, it receives an arbitrary subset of them, and the additional messages are simply dropped by the network.

Let G = (V, E) be an undirected graph with an arbitrary edge set, but the same node set as the Node-Capacitated Clique. We aim to solve graph problems on G in the Node-Capacitated Clique model. At the beginning, each node locally knows which identifiers correspond to its neighbors in G, but has no further knowledge about the graph.
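The communication constraints above can be made concrete with a small, self-contained simulator. This is an illustrative sketch, not part of the paper: the class name, the exact capacity constant, and the policy of keeping a random subset of excess messages are our own assumptions (the model only requires that some arbitrary subset is delivered).

```python
import random

class NodeCapacitatedClique:
    """Toy synchronous simulator of the Node-Capacitated Clique model:
    per round, every node sends at most `cap` messages and keeps at most
    `cap` of the messages addressed to it (an arbitrary subset survives)."""

    def __init__(self, n, cap=None):
        self.n = n
        # Capacity on the order of log n; the exact constant is arbitrary.
        self.cap = cap if cap is not None else max(1, n.bit_length())
        self.inboxes = [[] for _ in range(n)]

    def round(self, outgoing):
        """outgoing[u] is a list of (target, message) pairs sent by node u."""
        pending = [[] for _ in range(self.n)]
        for u, msgs in enumerate(outgoing):
            if len(msgs) > self.cap:  # senders may not exceed the capacity
                raise ValueError(f"node {u} sends {len(msgs)} > {self.cap}")
            for v, msg in msgs:
                pending[v].append((u, msg))
        # Receivers keep an arbitrary subset of at most `cap` messages;
        # the rest are dropped by the network.
        self.inboxes = [
            random.sample(box, self.cap) if len(box) > self.cap else box
            for box in pending
        ]
        return self.inboxes
```

A well-formed algorithm for this model must keep both the per-node send load and the expected per-node receive load within the capacity, which is exactly what the primitives of Section 2.2 arrange.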

1.2. Related Work

The Congested Clique model has already been studied extensively in the past years. Problems studied in prior work include routing and sorting (Lenzen, 2013), minimum spanning trees (Ghaffari and Parter, 2016; Hegeman et al., 2015; Korhonen, 2016; Lotker et al., 2005; Jurdziński and Nowicki, 2018b), subgraph detection (Censor-Hillel et al., 2015; Dolev et al., 2012; Becker et al., 2018), shortest paths (Becker et al., 2017; Censor-Hillel et al., 2015), local problems (Censor-Hillel et al., 2017; Hegeman and Pemmaraju, 2014; Hegeman et al., 2014), minimum cuts (Jurdziński and Nowicki, 2018a; Ghaffari and Nowicki, 2018), and problems related to matrix multiplication (Censor-Hillel et al., 2015; Gall, 2016). Some of the upper bounds are astonishingly small, such as the constant-time upper bound for routing and sorting and for the computation of a minimum spanning tree, demonstrating the power of the Congested Clique model.

While almost no non-trivial lower bounds exist for the Congested Clique model (due to their connection to circuit complexity (Drucker et al., 2014)), various lower bounds have already been shown for the more general CONGEST model (Sarma et al., 2011; Frischknecht et al., 2012; Kutten and Peleg, 1998; Lenzen and Peleg, 2013; Nanongkai, 2014; Peleg and Rubinovich, 2000; Elkin, 2004). As pointed out in (Korhonen and Suomela, 2017), the reductions used in these lower bounds usually boil down to constructing graphs with bottlenecks, that is, graphs where large amounts of information have to be transmitted over a small cut. As this is not the case for the Node-Capacitated Clique, the lower bounds are of limited use here. Therefore, it remains interesting to determine upper and lower bounds for the Node-Capacitated Clique.

Hybrid networks have only recently been studied in theory. An example is the hybrid network model proposed in (Gmyr et al., 2017), which allows the design of much faster distributed algorithms for graph problems than with a classical communication network. Also, the problem of finding short routing paths with the help of a hybrid network approach has been considered (Jung et al., 2018). A priori, these papers do not assume that the nodes are completely interconnected, so extra measures have to be taken to build up appropriate overlays. Abstracting from that problem, the Node-Capacitated Clique allows one to focus on how to efficiently exchange information in order to solve the given problems.

The graph problems considered in this paper have already been extensively studied in many different models. In the CONGEST model, for example, a breadth-first search can trivially be performed in time O(D). There exists an abundance of algorithms to solve the maximal independent set, maximal matching, and coloring problems in the CONGEST model (see, e.g., (Barenboim et al., 2016) for a comprehensive overview). Computing a minimum spanning tree has also been well studied in that model (see, e.g., (Elkin, 2004, 2006; Peleg and Rubinovich, 2000; Sarma et al., 2011)). Whereas the running times of the above-mentioned algorithms depend on global graph parameters and additional polylogarithmic factors, algorithms have also been proposed that solve such problems more efficiently in graphs with small arboricity (Barenboim and Elkin, 2009, 2010, 2011; Barenboim et al., 2016; Kothapalli and Pemmaraju, 2011, 2012). Notably, Barenboim and Khazanov (Barenboim and Khazanov, 2018) show how to efficiently solve a variety of graph problems on such graphs in the Congested Clique, e.g., computing an O(a)-orientation, an MIS, and an O(a)-coloring, where a is the arboricity of the given graph. Their algorithms make use of the Nash-Williams forest-decomposition technique (Nash-Williams, 1964), which is also one of the key techniques used in our work.

1.3. Our Contribution

Problem                    Section
Minimum Spanning Tree      3
BFS Tree                   5.1
Maximal Independent Set    5.2
Maximal Matching           5.3
Coloring                   5.4
Table 1. An overview of our results. We use a for the arboricity and D to denote the diameter of the given graph.

We present a set of basic communication primitives and then show how they can be applied to solve certain graph problems (see Table 1 for an overview). Note that for many important graph families such as planar graphs, our algorithms have polylogarithmic runtime (except for the dependence on the diameter D).

Although many of our algorithms rely on existing algorithms from the literature, we point out that most of these algorithms cannot be executed in the Node-Capacitated Clique in a straightforward fashion. The main reason is that high-degree nodes cannot efficiently communicate with all of their neighbors directly in our model, which poses significant difficulties for the application of these algorithms. To overcome these difficulties, we present a set of basic tools that still allow for efficient communication, and combine them with variations of well-known algorithms and novel techniques. Notably, we present an algorithm to compute an orientation of an input graph with arboricity a, in which each edge gets assigned a direction, ensuring that the outdegree of any node is at most O(a). The algorithm is later used to efficiently construct multicast trees for communication between nodes. Achieving this is a highly nontrivial task in our model and requires a combination of techniques, ranging from aggregation and multicasting to shared randomness and coding techniques. We believe that many of the presented ideas might also be helpful for other applications in the Node-Capacitated Clique.

Although proving lower bounds for the presented problems seems to be a highly nontrivial task, we believe that many problems require a running time linear in the arboricity. For the MIS problem, for example, it seems that at least one bit of information needs to be communicated about every edge (typically, in order for an endpoint of the edge to learn that the edge is removed from the graph because the other endpoint has joined the MIS). However, explicitly proving such a lower bound in this model seems to require more than the current techniques for proving multi-party communication complexity lower bounds.

2. Preliminaries

In this section, we first give some basic definitions and then describe the set of communication primitives used throughout the paper.

2.1. Basic Definitions and Notation

Let G = (V, E) be an undirected graph. The neighborhood of a node v is defined as N(v) = {u ∈ V | {u, v} ∈ E}, and deg(v) = |N(v)| denotes its degree. With Δ we denote the maximum degree over all nodes in G, and with d̄ the average degree of all nodes. The diameter D of G is the maximum length of all shortest paths in G.

The arboricity a of G is the minimum number of forests into which its edges can be partitioned. Since the edges of any graph with maximum degree Δ can be greedily assigned to Δ forests, a ≤ Δ. Furthermore, since the average degree of a forest is at most 2, and the edges of G can be partitioned into a forests, the average degree of G is at most 2a. Graphs of many important graph families have small arboricity although their maximum degree might be unbounded. For example, a tree obviously has arboricity 1. Nash-Williams (Nash-Williams, 1964) showed that the arboricity of a graph G is given by a = max_H ⌈m_H / (n_H − 1)⌉, where H ranges over all subgraphs of G with at least two nodes, and n_H and m_H denote the number of nodes and edges of H, respectively. Therefore, any planar graph, which has at most 3n − 6 edges, has arboricity at most 3. In fact, any graph with genus g, which is the minimum number of handles that must be added to the plane to embed the graph without any crossings, has arboricity O(√g + 1) (Barenboim and Elkin, 2010). Furthermore, it is known that the family of graphs that exclude a fixed minor (Deo and Litow, 1998) and the family of graphs with bounded treewidth (Dujmovic and Wood, 2007) have bounded arboricity.

An orientation of G is an assignment of directions to its edges, i.e., every edge {u, v} ∈ E is directed either from u to v or from v to u. If the edge {u, v} is directed from u to v, then v is an out-neighbor of u and u is an in-neighbor of v. For a node v, define N_in(v) as the set of its in-neighbors and N_out(v) as the set of its out-neighbors. The indegree of v is defined as |N_in(v)| and its outdegree is |N_out(v)|. A k-orientation is an orientation with maximum outdegree k. For a graph with arboricity a, there always exists an a-orientation: partition the edges into a forests, root each tree of every forest arbitrarily, and direct every edge from child to parent node.
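As a centralized illustration (not the distributed algorithm of this paper), a low-outdegree orientation can also be computed by the classic peeling argument: every subgraph of a graph with arboricity a has average degree below 2a, hence a node of degree at most 2a − 1, so repeatedly orienting the remaining edges of a minimum-degree node away from it yields outdegree at most 2a − 1. Function and variable names below are our own.

```python
from collections import defaultdict

def low_outdegree_orientation(edges):
    """Centralized peeling sketch: repeatedly remove a minimum-degree node
    and orient its remaining incident edges away from it.  For a graph of
    arboricity a this gives outdegree at most 2a - 1, since every subgraph
    contains a node of degree at most 2a - 1."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    orientation = []
    remaining = set(adj)
    while remaining:
        u = min(remaining, key=lambda x: len(adj[x]))  # min-degree node
        for v in adj[u]:
            orientation.append((u, v))   # edge directed u -> v
            adj[v].discard(u)            # u leaves the remaining subgraph
        remaining.discard(u)
    return orientation
```

For a star (arboricity 1), for instance, every leaf is peeled first and oriented towards the center, so every node ends up with outdegree at most 1.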

To allow each node to efficiently gather information sent to it by other nodes, our communication primitives make heavy use of aggregate functions. An aggregate function f maps a multiset M of input values to some value f(M). For some functions it might be hard to compute f(M) in a distributed fashion, so we focus on so-called distributive aggregate functions: an aggregate function f is called distributive if there is an aggregate function g such that for any multiset M and any partition M_1, …, M_k of M, f(M) = g(f(M_1), …, f(M_k)). Classical examples of distributive aggregate functions are MAX, MIN, and SUM.
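A quick sanity check of the definition (our own toy example): SUM is distributive with g = f, and AVERAGE, which is not distributive as-is, becomes distributive after lifting each value to a (sum, count) pair.

```python
# SUM is distributive: aggregating the partial sums of any partition of a
# multiset (here with g = f) equals aggregating the whole multiset at once.
def f_sum(multiset):
    return sum(multiset)

values = [3, 1, 4, 1, 5, 9, 2, 6]
partition = [values[:3], values[3:5], values[5:]]
assert f_sum(values) == f_sum([f_sum(part) for part in partition])

# AVERAGE is not distributive as-is, but it becomes distributive after
# lifting every value x to the pair (x, 1) and aggregating pairs:
def f_avg_pair(pairs):
    return (sum(s for s, _ in pairs), sum(c for _, c in pairs))

total, count = f_avg_pair([(x, 1) for x in values])
assert total / count == sum(values) / len(values)
```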

Our algorithms make heavy use of randomized strategies. To show that the correctness and runtime of the algorithms hold with high probability (w.h.p.)¹, we use a generalization of the Chernoff bound from (Schmidt et al., 1995) (Theorem 2):

¹We say an event holds with high probability if it holds with probability at least 1 − 1/n^c for an arbitrary fixed constant c.

Lemma 2.1 ().

Let X_1, …, X_n be k-wise independent random variables with X_i ∈ [0, 1] for all i, let X = Σ_{i=1}^n X_i, and let μ = E[X]. Then, for all 0 < δ ≤ 1 with k ≤ ⌊δ²μe^{−1/3}⌋, it holds that Pr[|X − μ| ≥ δμ] ≤ e^{−⌊k/2⌋}.

2.2. Communication Primitives

Our algorithms make heavy use of a set of communication primitives, which we present in this section. Whereas the Aggregate-and-Broadcast Algorithm will be used as a general tool for aggregation and synchronization purposes, the other primitives allow nodes to send and receive messages to and from specific sets of nodes associated with them. Note that a node is not able to send or receive a large set of messages in few rounds; the center of a star, for example, would need nearly linear time to deliver individual messages to all of its neighbors. If, however, the number of distinct messages a node has to send is small, or if the messages destined for a node can be combined using an aggregate function, then the messages can be delivered efficiently using a randomized routing strategy. Due to space limitations, we only present the high-level ideas of our algorithms and state their results. The full descriptions and all proofs can be found in Appendix B.

Butterfly Simulation.

To distribute local communication load over all nodes of the network, our algorithms rely on an emulation of a butterfly network. Formally, for d ∈ N, the d-dimensional butterfly is a graph with node set V' = [d+1] × [2]^d, where we denote [k] = {0, …, k−1}, and edge set E' = {{(i, α), (i+1, β)} | i ∈ [d], and either α = β or α and β differ exactly in the i-th bit}.

The node set {i} × [2]^d represents level i of the butterfly, and the node set [d+1] × {α} represents column α of the butterfly. In our algorithms, every node with identifier i emulates the complete column of the d-dimensional butterfly whose bit string is the binary representation of i, where d = log n (assuming, for simplicity, that n is a power of two). Since a node knows the identifiers of all other nodes, it knows exactly which nodes emulate its neighbors in the butterfly. As every node in the Node-Capacitated Clique can send and receive O(log n) messages in each round, and the butterfly has constant degree and d + 1 = O(log n) levels, a communication round in the butterfly can be simulated in a single round of our model.
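For concreteness, the following sketch (our own, with hypothetical names) builds the d-dimensional butterfly and verifies the constant-degree property that makes the column-wise emulation cheap.

```python
from itertools import product

def butterfly(d):
    """Build the d-dimensional butterfly: nodes are (level, column) pairs
    with level in {0, ..., d} and column a d-bit string; levels i and i+1
    are joined by straight edges (same column) and cross edges (columns
    differing exactly in bit i)."""
    nodes = [(i, bits) for i in range(d + 1)
             for bits in product((0, 1), repeat=d)]
    edges = set()
    for i in range(d):
        for bits in product((0, 1), repeat=d):
            flipped = bits[:i] + (1 - bits[i],) + bits[i + 1:]
            edges.add(frozenset({(i, bits), (i + 1, bits)}))     # straight
            edges.add(frozenset({(i, bits), (i + 1, flipped)}))  # cross
    return nodes, edges

d = 3
nodes, edges = butterfly(d)
degree = {v: 0 for v in nodes}
for e in edges:
    for v in e:
        degree[v] += 1
# Inner levels have degree 4 and the two outer levels degree 2, i.e.,
# constant degree, so a clique node emulating an entire column handles
# only O(d) = O(log n) messages per simulated butterfly round.
assert max(degree.values()) == 4 and min(degree.values()) == 2
```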

Aggregate-and-Broadcast Problem.

We are given a distributive aggregate function f and a set S ⊆ V, where each member of S stores exactly one input value. The goal is to let every node learn the value of f applied to all input values.

Theorem 2.2 ().

There is an Aggregate-and-Broadcast Algorithm that solves any Aggregate-and-Broadcast Problem in time O(log n).

In principle, the algorithm first aggregates all values from the topmost level (i.e., level 0) down to the bottommost level (i.e., level d) of the butterfly, and then broadcasts the result upwards to all nodes in the butterfly.

Aggregation Problem.

We are given a distributive aggregate function f and a set of aggregation groups A_1, …, A_s ⊆ V with targets t_1, …, t_s, where each node v holds exactly one input value x_{v,i} for each aggregation group A_i of which it is a member.² Note that a node may be member or target of multiple aggregation groups. The goal is to aggregate these input values so that eventually each target t_i knows f({x_{v,i} | v ∈ A_i}). We define L = Σ_{i=1}^s |A_i| to be the global load of the Aggregation Problem, and the local load ℓ = max(ℓ_s, ℓ_r), where ℓ_s is the maximum number of aggregation groups any single node is a member of, and ℓ_r is the maximum number of aggregation groups any single node is the target of. Whereas the global load captures the total number of messages that need to be processed, ℓ_s and ℓ_r indicate the work required for inserting messages into the butterfly, and for sending aggregates from butterfly nodes to their targets, respectively. We require that every node knows the identifier and target of all aggregation groups it is a member of, as well as an upper bound on L.

²We only enumerate the aggregation groups from 1 to s to simplify the presentation of the algorithm. Actually, we only require each aggregation group to be uniquely identified, which can easily be achieved for all algorithms in this paper.

Theorem 2.3 ().

There is an Aggregation Algorithm that solves any Aggregation Problem in time , w.h.p.

From a very high level, the algorithm works as follows. First, the packets are sent to random nodes of the topmost level of the butterfly. Then, packets belonging to the same aggregation group are routed to an intermediate target in the bottommost level of the butterfly, chosen by applying a (pseudo-)random hash function to the group identifier, using a variant of the random rank routing protocol (Aleliunas, 1982; Upfal, 1982). Whenever two packets belonging to the same aggregation group collide on a butterfly node, they are combined using the aggregate function f. Finally, the result of each aggregation group A_i is sent from its intermediate target to its actual target t_i.
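The core idea, hashing a group identifier to a fixed intermediate target so that all packets of the same group meet and can be combined en route, can be sketched as follows. This is a centralized toy, not the routing protocol itself; the hash construction and all names are our own assumptions.

```python
import hashlib

def intermediate_target(group_id, n):
    """Deterministic pseudo-random hash of a group identifier to [0, n)."""
    digest = hashlib.sha256(str(group_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % n

def aggregate_groups(packets, n, f=max):
    """Combine packets en route: all packets of one aggregation group are
    directed to the same hashed intermediate target, and colliding packets
    of the same group are merged with the distributive function f."""
    buckets = {}
    for group_id, value in packets:
        key = (intermediate_target(group_id, n), group_id)
        buckets[key] = f(buckets[key], value) if key in buckets else value
    return buckets  # maps (target, group) -> aggregated value
```

Because every group hashes to a single target, the number of messages arriving at any target is governed by the number of groups, not by the group sizes; this is what keeps the receive load balanced, w.h.p.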

The intermediate steps of the algorithm are synchronized using a variant of the Aggregate-and-Broadcast algorithm: Every node delays its participation in an aggregation until having finished the current step. Once the aggregation finishes, all nodes become informed about a common round to start the next step. Termination of the routing protocol can easily be determined by passing down tokens in the butterfly. We also use the same techniques to achieve synchronization for all other algorithms in this paper without explicitly mentioning it.

Note that common hash functions require shared randomness. Although in the remainder of this paper we assume that all hash functions behave like perfectly random functions, it can be shown that it suffices to use Θ(log n)-wise independent hash functions (see, e.g., (Celis et al., 2013) and the references therein): whenever we aim to show that the outcome of a random experiment deviates from its expected value only slightly, w.h.p., we can immediately use Lemma 2.1; if the deviation we aim to show is higher, we can partition the events in a suitable way so that we only need Θ(log n)-wise independence for each subset of events, and the sum of the deviations does not exceed the overall desired deviation. To agree on such hash functions, all nodes have to learn O(log² n) shared random bits. This can be done by letting the node with identifier 1 broadcast O(log n) messages, each consisting of O(log n) bits, to all other nodes using the butterfly.

Multicast Tree Setup Problem.

We are given a set of multicast groups M_1, …, M_s ⊆ V with sources s_1, …, s_s such that each node is the source of at most one multicast group (but possibly a member of multiple groups). The goal is to set up a multicast tree T_i in the butterfly for each M_i, whose root r_i is a node chosen uniformly and independently at random among the nodes of the bottommost level of the butterfly, and which has a unique and randomly chosen leaf in the topmost level for each v ∈ M_i. Let L = Σ_{i=1}^s |M_i|, and define the congestion C of the multicast trees to be the maximum number of trees that share the same butterfly node. We require that each node knows the identifier and source of all multicast groups it is a member of.

Theorem 2.4 ().

There is a Multicast Tree Setup Algorithm that solves any Multicast Tree Setup Problem in time , w.h.p. The resulting multicast trees have congestion , w.h.p.

The algorithm shares many similarities with the Aggregation Algorithm; in fact, the multicast trees stem from the paths taken by the packets during an aggregation. Alongside the aggregation, every butterfly node records, for every group M_i, all edges along which packets from that group arrived during the routing towards the root r_i, and declares them to be edges of the multicast tree T_i.

Multicast Problem.

Assume we have constructed multicast trees for a set of multicast groups M_1, …, M_s ⊆ V with sources s_1, …, s_s such that each node is the source of at most one multicast group. The goal is to let every source s_i send a message m_i to all nodes in M_i. Let C be the congestion of the multicast trees and L = Σ_{i=1}^s |M_i|. We require that the nodes know an upper bound on C.

Theorem 2.5 ().

There is a Multicast Algorithm that solves any Multicast Problem in time , w.h.p.

The algorithm multicasts the messages by sending them upwards along the multicast trees, performing our routing strategy in "reverse order". We remark that, similarly to the Aggregation Algorithm, the Multicast Algorithm may easily be extended to allow a node to be the source of multiple multicasts; however, we only need the simplified variant in this paper.

Multi-Aggregation Problem.

We are given a set of multicast groups M_1, …, M_s ⊆ V with sources s_1, …, s_s such that every source s_i stores a multicast packet p_i, and every node is the source of at most one multicast group. We assume that multicast trees with congestion C have already been set up for the multicast groups. The goal is to let every node v receive f({p_i | v ∈ M_i}) for a given distributive aggregate function f.

Theorem 2.6 ().

There is a Multi-Aggregation Algorithm that solves any Multi-Aggregation Problem in time , w.h.p.

The Multi-Aggregation Algorithm combines all of the previous algorithms to allow a node to first multicast a message to a set of nodes associated with it, and then aggregate all messages destined for it. More precisely, each source s_i first multicasts its packet p_i to all leaves of its multicast tree. Every leaf then maps p_i to a packet (v, p_i) for each node v ∈ M_i it corresponds to. The resulting packets are randomly distributed among the nodes of the topmost level of the butterfly. Finally, all packets associated with the same node v are aggregated towards an intermediate target in the bottommost level using the aggregate function f, as in the Aggregation Algorithm. From there, the result is finally delivered to v. For applications beyond our paper, the algorithm may also be extended to allow nodes to be sources of multiple multicast groups, and to receive aggregates corresponding to distinct aggregations.

3. Minimum Spanning Tree

As a first example of graph algorithms for the Node-Capacitated Clique, we describe an algorithm that computes a minimum spanning tree (MST) in polylogarithmic time. More specifically, for every edge in the input graph G, one of its endpoints eventually knows whether the edge is in the MST or not. We assume that each edge of G has an integral weight in {1, …, W} for some positive integer W.

High-Level Description.

From a high level, our algorithm mimics Boruvka's algorithm with Heads/Tails clustering, which works as follows. Start with every node as its own component. For Θ(log n) iterations, every component C (1) finds its lightest, i.e., minimum-weight, edge out of the component that connects C to another component, (2) flips a Heads/Tails coin, and (3) learns the coin flip of the component on the other side of the lightest edge. If C flips Tails and the other component C' flips Heads, then the edge connecting C to C' is added to the MST, and thus effectively component C merges with component C' (and with whatever other components are merging with C' simultaneously). It is well known that, w.h.p., all nodes get merged into one component within Θ(log n) iterations and the added edges form an MST (see, e.g., (Ghaffari and Haeupler, 2016; Ghaffari et al., 2017)).
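The variant described above can be sketched as a centralized simulation (our own illustration; the contribution of the paper is realizing each step with the communication primitives):

```python
import random

def boruvka_heads_tails(n, edges, rng=random.Random(42)):
    """Centralized sketch of Boruvka's algorithm with Heads/Tails
    clustering: each component finds its lightest outgoing edge, flips a
    coin, and a Tails component merges into a Heads component across it.
    edges is a list of (weight, u, v) tuples with distinct weights."""
    comp = list(range(n))                 # comp[v] = leader of v's component
    mst = set()
    while len({comp[v] for v in range(n)}) > 1:
        leaders = {comp[v] for v in range(n)}
        coin = {c: rng.random() < 0.5 for c in leaders}   # True = Heads
        lightest = {}
        for w, u, v in sorted(edges):     # lightest outgoing edge per comp
            cu, cv = comp[u], comp[v]
            if cu != cv:
                lightest.setdefault(cu, (w, u, v))
                lightest.setdefault(cv, (w, u, v))
        merged = {}
        for c, (w, u, v) in lightest.items():
            other = comp[v] if comp[u] == c else comp[u]
            if not coin[c] and coin[other]:    # Tails merges into Heads
                mst.add((w, u, v))
                merged[c] = other
        # Heads components never merge, so one pointer hop suffices.
        comp = [merged.get(c, c) for c in comp]
    return mst
```

By the cut property, with distinct weights every added edge belongs to the unique MST, so upon termination the returned set is exactly the MST.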

Details of the Algorithm.

Over the course of the algorithm, each component C maintains a leader node whose identifier is known to every node in the component. Furthermore, we maintain a multicast tree for each component, whose source is the leader and whose multicast group consists of the nodes of the component. We ensure that the set of multicast trees has congestion O(log n). In each phase of Boruvka's algorithm with the current partition of V into components C_1, …, C_k, every leader flips Heads/Tails and multicasts the result to all nodes in its component by using the Multicast Algorithm of Theorem 2.5. As the multicast trees have congestion O(log n), and as every node is in exactly one component, this takes time O(log n), w.h.p.

For each component C, the leader then learns the lightest edge to a neighbor outside of C; this is a highly nontrivial task that we address later. Afterwards, the leader multicasts the lightest edge to every node in its component, which can again be done in time O(log n). For each component C that flips Tails, the node u incident to the lightest outgoing edge {u, v} now has to learn whether v's component has flipped Heads, and, if so, the identifier of the leader of v's component. Therefore, u joins a multicast group with source v, i.e., it declares itself a member of v's group, and the corresponding multicast trees are constructed with the help of Theorem 2.4. As every node is a member of at most one such multicast group, setting up the corresponding trees with congestion O(log n) takes time O(log n), w.h.p. By using the Multicast Algorithm, the endpoints of all lightest edges learn the result of the coin flip and the identifier of their adjacent component's leader in time O(log n).

If for the lightest edge {u, v} the component of v has flipped Heads, then u sends the identifier of the leader of v's component to its own leader, which in turn informs all nodes of C using a multicast. Note that thereby only u learns that {u, v} is an edge of the MST, but v does not. Finally, the multicast trees of the resulting components are rebuilt by letting each node join a multicast group corresponding to its new leader. As the components are disjoint, the resulting trees with congestion O(log n) are built in time O(log n), w.h.p.

Finding the Lightest Edge.

To find the lightest edge of a component, we “sketch” its incident edges. Our algorithm follows the procedure FindMin of (King et al., 2015), with the “broadcast-and-echo” subroutine inside each component replaced by multicasts and aggregations (i.e., executions of the Multicast and Aggregation Algorithm) from/to the leader to/from the entire component. As argued above, and due to Theorem 2.3, both steps can be performed in time , w.h.p. We highlight the main steps of FindMin, and refer the reader to (King et al., 2015) for the details and proof.

Initially, we bidirect each edge into two arcs in opposite directions, and define the identifier , where denotes the concatenation of two binary strings. We will apply binary search to the weights of edges so that we can find the lightest outgoing edge. Every iteration has a current range such that the lightest edge out has weight in that range. To compute the next range, the algorithm determines whether there is an edge out of , where . If so, the new range becomes ; otherwise, the new range is . (The algorithm FindMin of (King et al., 2015) actually uses a “-ary” search instead of binary search; we replace it with binary search here for simplicity of explanation.) The remaining task is to solve the following subproblem: given a range , determine whether there exists an outgoing edge with weight in .
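The binary search over the weight range can be sketched as follows. The sorted weight universe and the `exists_in_range` oracle, which stands in for the sketch-based subproblem described next, are assumptions for illustration.

```python
def lightest_outgoing_weight(weights, exists_in_range):
    """Binary search for the smallest weight of an outgoing edge.

    weights:         sorted list of all possible edge weights
    exists_in_range: oracle answering the subproblem from the text --
                     is there an outgoing edge with weight in [lo, hi]?
    """
    lo, hi = 0, len(weights) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if exists_in_range(weights[lo], weights[mid]):
            hi = mid          # lightest outgoing edge lies in the lower half
        else:
            lo = mid + 1      # it must lie in the upper half
    return weights[lo]
```

Each iteration issues one oracle call, i.e., one multicast-and-aggregation round inside the component.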

To sketch their incident edges, the nodes use a (pseudo-)random hash function that maps each edge identifier to . For a node , define


and for component , define and similarly. Observe that the unordered sets and are the same if and only if component does not have an outgoing edge with weight in the range . Also, the hash function satisfies the property that, if two sets of integers are not equal, then the values of and are not equal with constant probability. To compute the values of and , each node computes and , and an aggregation towards the leader node is performed in each component with addition mod 2 as the aggregate function. We can repeat this procedure times so that w.h.p., there is no outgoing edge out of with weight in if and only if and are equal in every trial. Note that this requires the nodes to know different hash functions; by the discussion in Section 2.2, the necessary bits can be retrieved beforehand in rounds.
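The equality test behind the sketches can be illustrated as follows; the seed-mixing `bit_hash` is an assumed stand-in for the shared (pseudo-)random hash functions of the text, and the trials correspond to the repetitions.

```python
import random

def bit_hash(trial, x):
    # pseudo-random bit for identifier x in a given trial; a stand-in
    # for the shared hash functions of the text
    return random.Random((trial << 32) ^ x).getrandbits(1)

def sketches_equal(A, B, trials=64):
    """Compare XOR sketches of two identifier sets over several trials.
    Equal sets always yield equal sketches; unequal sets disagree in
    each trial with probability 1/2 (the XOR over the symmetric
    difference is a uniform bit), so all trials agree only with
    probability 2**-trials."""
    for t in range(trials):
        fa = 0
        for x in A:
            fa ^= bit_hash(t, x)        # addition mod 2, as in the aggregation
        fb = 0
        for x in B:
            fb ^= bit_hash(t, x)
        if fa != fb:
            return False
    return True
```

In the algorithm, the two XOR values are computed by an aggregation toward the leader rather than locally, but the comparison logic is the same.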

The running time analysis from (King et al., 2015), modified to count the number of “broadcast-and-echo” subroutines, can be rewritten as follows.

Lemma 3.1 ((King et al., 2015), Lemma 2).

The leader node of each component learns the lightest edge out of its component within iterations of multicasts and aggregations, w.h.p.

Since each iteration can be performed in time , and there are phases of Boruvka’s algorithm, w.h.p., we conclude the following theorem.

Theorem 3.2 ().

The algorithm computes an MST in time , w.h.p.

4. Computing an -Orientation

One of the reasons the MST problem can be solved very efficiently is that we only require one endpoint of each edge to learn whether the edge is in the MST; otherwise, the problem seems to become significantly harder, as every node would have to learn some information about each incident edge. We observe this difficulty for the other graph problems considered in this paper as well. To approach this issue, we aim to set up multicast trees connecting each node with all of its neighbors in , allowing us to essentially simulate variants of classical algorithms. As we will see, such trees can be set up efficiently if has small arboricity by first computing an -orientation of , which is described in this section.

We present the Orientation Algorithm, which computes an -orientation in time , w.h.p. More specifically, the goal is to let every node learn a direction of all of its incident edges in . The algorithm essentially constructs a Nash-Williams forest-decomposition (Nash-Williams, 1964) using the approach of (Barenboim and Elkin, 2010). From a high-level perspective, the algorithm repeatedly identifies low-degree nodes and removes them from the graph until the graph is empty. Whenever a node leaves, all of its incident edges are directed away from it. More precisely, the algorithm proceeds in phases . Let be the number of incident edges of a node that have not yet been assigned a direction at the beginning of phase . Define to be the average degree of all nodes with , i.e., . In phase , a node is called inactive if , active if , and waiting if . In each phase, an edge gets directed from to , if is active and is waiting, or if both nodes are active and . Thus, each node waits until it becomes active in some phase, after which it remains inactive in all subsequent phases. This results in a partition of the nodes into levels , where level is the set of active nodes in phase . The lemma below follows from two facts: first, in every phase, at least half of all nodes that are not yet inactive become inactive, which can easily be shown; second, , since any subgraph of can be partitioned into forests, whose average degree is at most .
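The peeling process can be sketched sequentially as follows. This is a simplified stand-in for the distributed phases; the threshold "at most twice the current average degree" plays the role of the active/waiting rule above, and ties between two active nodes are broken by identifier.

```python
def orient_low_arboricity(adj):
    """Repeatedly remove ("peel") nodes whose remaining degree is at
    most twice the current average degree, orienting their remaining
    edges away from them. Returns the set of directed edges (u, v),
    meaning u -> v, and the number of phases used."""
    alive = {u: set(nb) for u, nb in adj.items()}
    oriented = set()
    phases = 0
    while alive:
        avg = sum(len(nb) for nb in alive.values()) / len(alive)
        active = {u for u, nb in alive.items() if len(nb) <= 2 * avg}
        for u in active:
            for v in alive[u]:
                # direct away from u if v is waiting, or break the tie
                # by identifier if both endpoints are active
                if v not in active or u < v:
                    oriented.add((u, v))
        for u in active:
            for v in alive[u]:
                if v not in active:
                    alive[v].discard(u)
        for u in active:
            del alive[u]
        phases += 1
    return oriented, phases
```

By Markov's inequality at least half of the remaining nodes are peeled in every phase, so the loop terminates after logarithmically many phases; each peeled node has small remaining degree, which bounds its out-degree.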

Lemma 4.1 ().

The Orientation Algorithm takes phases to compute an -orientation.

4.1. Identification Problem

It remains to show how a single phase can be performed efficiently in our model. Here, the main difficulty lies in having active nodes determine which of their neighbors are already inactive in order to infer the orientations of their incident edges. We approach this problem by solving the following Identification Problem: We are given a set of learning nodes and a set of playing nodes. Every playing node knows a subset of its neighbors that are potentially learning, i.e., it knows that none of its other neighbors are learning. The goal is to let every learning node determine which of its neighbors are playing.

To solve such a problem, we present the Identification Algorithm, which will later be used as a subroutine. In this subsection, we represent each edge by two directed edges and . We assume that all nodes know (pseudo-)random hash functions for some parameters and . The hash functions are used to map every directed edge to trials. We say an edge participates in trial if for some .

Let . We refer to an edge as a red edge of , if is not playing, and a blue edge of , if is playing. We identify each edge by the identifiers of its endpoints, i.e., . Let be the XOR of the identifiers of all edges that participate in trial , and be the XOR of the identifiers of all blue edges that participate in trial . Furthermore, let be the total number of edges adjacent to that participate in trial , and let be the number of blue edges that participate in trial .

Our idea is to let use these values to identify all of its red edges; then it can conclude which of its neighbors must be playing. Before describing this, we explain how the values are determined. Clearly, the values and can be computed by by itself for all . The other values are more difficult to obtain as does not know which of its edges are blue. To compute these values, we use the Aggregation Algorithm: Each playing node is in aggregation group for every potentially learning neighbor and every trial such that participates in trial . The input of for the group is , where the first coordinate is used to let compute , and the second coordinate is used to compute . Correspondingly, the aggregate function combines two inputs corresponding to the same aggregation group by taking the XOR of the first coordinates and the sum of the second coordinates. In this way, eventually receives both and .

We now show how can identify its red edges using the aggregated information. First, it determines a trial for which . Since neighbors that are not playing did not participate in the aggregation, in this case there is exactly one red edge such that is included in but not in . Therefore, can be retrieved by taking the XOR of both values. Having identified , determines all trials in which participates using the common hash functions and “removes” from by again computing the XOR of both. It then decreases by 1 and repeats the above procedure until no further edge can be identified. If always finds a trial for which , then it eventually identifies all of its red edges. Clearly, all the remaining neighbors must be playing.
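The peeling decoder can be sketched as follows. The trial assignment `trials_of` is an assumed stand-in for the shared hash functions; the per-trial XOR/count pairs play the roles of the aggregates described above.

```python
import random

def trials_of(edge_id, k, T):
    # k pseudo-random trials in [0, T) per edge identifier; a stand-in
    # for the shared hash functions of the text
    rng = random.Random(edge_id)
    return {rng.randrange(T) for _ in range(k)}

def identify_red(all_edges, blue_edges, k=3, T=64):
    """Recover the red edge identifiers of a learning node from its own
    per-trial aggregates (X, c) and the aggregated blue values (B, b)."""
    X, c = [0] * T, [0] * T
    for e in all_edges:                 # computed locally by the learner
        for t in trials_of(e, k, T):
            X[t] ^= e
            c[t] += 1
    B, b = [0] * T, [0] * T
    for e in blue_edges:                # delivered by the Aggregation Algorithm
        for t in trials_of(e, k, T):
            B[t] ^= e
            b[t] += 1
    red, progress = set(), True
    while progress:
        progress = False
        for t in range(T):
            if c[t] - b[t] == 1:        # exactly one red edge left in trial t
                e = X[t] ^ B[t]         # its identifier pops out of the XORs
                red.add(e)
                for s in trials_of(e, k, T):
                    X[s] ^= e
                    c[s] -= 1
                progress = True
    return red
```

A peeled identifier is always a genuine red edge, since the counts are exact; decoding can only fail by getting stuck, which Lemma 4.2 bounds.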

Lemma 4.2 ().

Let and assume that is incident to at most red edges. Let be the number of hash functions, and be the number of trials.

for and .


fails to identify at least red edges if at some iteration of the above procedure there are edges left such that all edges participate only in trials in which at least two of the edges participate. Here, the edges participate in at most many different trials, since otherwise there must be a trial in which only one edge participates. Therefore, the probability for that event is

where the last inequality holds because

4.2. Details of the Algorithm

Finally, we show how the Identification Algorithm can be used to efficiently realize a phase of the high-level algorithm in time , w.h.p. In our algorithm every node learns the direction of all its incident edges in the phase in which it is active; however, its neighbors might learn their direction only in subsequent phases. Each phase is divided into three stages: In Stage 1, every node determines whether it is active in this phase. In Stage 2, every active node learns which of its neighbors are inactive. Finally, in Stage 3 every active node learns which of its remaining neighbors, which must be either active or waiting, are active. From this information, and since every node knows the identifiers of all of its neighbors, every active node concludes the direction of each of its incident edges. In the following we describe the three stages of a phase in detail.

Stage 1: Determine Active Nodes.

We assume that all nodes start the stage in the same round. First, every node that is not inactive needs to compute (i.e., minus the number of inactive neighbors) to determine whether it remains waiting or becomes active in this phase. This value can easily be computed using the Aggregation Algorithm: Every inactive node , which already knows the orientation of each of its incident edges, is a member of every aggregation group such that . As the input value of each node we choose , the aggregate function is the sum, and . By performing the Aggregation Algorithm, determines the number of inactive neighbors, and, by subtracting the value from , computes . Afterwards, the nodes use the Aggregate-and-Broadcast Algorithm to compute and to achieve synchronization.
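Sequentially, the Stage 1 computation amounts to the following; the adjacency map and the set of already-inactive nodes are assumed inputs standing in for the distributed state.

```python
def remaining_degrees(adj, inactive):
    """Each node that is not yet inactive subtracts the number of its
    already-inactive neighbors (the sum delivered by the Aggregation
    Algorithm) from its total degree, yielding its remaining degree."""
    return {u: len(adj[u]) - sum(1 for v in adj[u] if v in inactive)
            for u in adj if u not in inactive}
```

Comparing these values against the average then decides which nodes become active in the phase.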

Stage 2: Identify Inactive Neighbors.

The goal of this stage is to let every active node learn which of its neighbors are inactive. The stage is divided into two steps: In the first step, a large fraction of active nodes succeeds in the identification of inactive neighbors. The purpose of the second step is to take care of the nodes that were unsuccessful in the first step, i.e., that only identified some, but not all, of their incident red edges. In both steps we use the Identification Algorithm described in the previous section, and carefully choose the parameters to achieve that each step only takes time .

At the beginning of the first step, the nodes compute by performing the Aggregate-and-Broadcast Algorithm. Let , which is a value known to all nodes, and note that . Then, the nodes perform the Identification Algorithm, where the active nodes are learning and the inactive nodes are playing. Hence, the endpoints of the red edges learned by the active nodes must either be active or waiting. If we chose and for some constant as parameters, then by Lemma 4.2 all nodes would learn all of their red edges, w.h.p., already in this step; however, this would take time . To reduce this to , we instead choose and for some constant , and accept that nodes fail to identify some of their red edges in this step. However, for this choice Lemma 4.2 implies that each node fails to identify at most red edges, w.h.p.

We now describe how these remaining edges are identified in the second step. Let . We divide into sets of high-degree nodes and of low-degree nodes and consider the nodes of each set separately. By dealing with high-degree nodes separately, we ensure that the global load required to let low-degree nodes identify their red edges reduces by a factor. First, the nodes of (of which there are only , w.h.p.) broadcast their identifiers by using a variant of the Aggregate-and-Broadcast Algorithm: Using the path system of the butterfly, every node sends its identifier to the node with identifier ; however, messages are not combined. Instead, whenever multiple identifiers contend to use the same edge in the same round, the smallest identifier is sent first. After has received all identifiers, it broadcasts them in a pipelined fashion, i.e., one after the other, to all other nodes. For every node define , i.e., is a red edge of for all . Let . For each , chooses a round from uniformly and independently at random and sends its own identifier to in that round. Afterwards, every high-degree node can identify all of its red edges. As , this takes time , w.h.p.

To let the low-degree nodes identify their red edges, we again use the Identification Algorithm. First, in order to narrow down its set of potentially learning neighbors, every inactive node determines which of its neighbors are unsuccessful low-degree nodes. To this end, we let every inactive node join multicast group for all such that is not inactive (recall that every inactive node knows the directions of all of its incident edges, and whether the other endpoint of each edge is inactive or not). Every node then informs its inactive neighbors by using the Multicast Algorithm. Since every node is a member of at most multicast groups, which is a value known to all nodes, the nodes know an upper bound on as required by the algorithm. Having narrowed down the set of learning nodes and the sets of potentially learning neighbors to the unsuccessful ones only, the Identification Algorithm is performed once again. As the parameters of the algorithm we choose and for some constant .

Stage 3: Identify Active Neighbors.

Finally, every active node has to learn which of the endpoints of its red edges are active. In the following, let be the identifier of an edge given by its endpoints and such that . The nodes use two (pseudo-)random hash functions , , where maps the identifier of an edge to a node uniformly and independently at random, and maps its identifier to a round uniformly and independently at random. Every active node sends an edge-message containing to in round for every incident edge leading to an active or waiting node. Using this strategy, two adjacent active nodes , send an edge-message containing to the same node in the same round. Whenever a node receives two edge-messages with the same edge identifier, it immediately responds to the corresponding nodes, which thereby learn that both endpoints are active.
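The rendezvous idea can be sketched as follows. The hash mapping an edge identifier to a relay node and a round is an assumption standing in for the shared hash functions of the text, and message passing is simulated by a shared inbox.

```python
import random

def rendezvous(eid, n_nodes, n_rounds):
    # map an edge identifier (a sorted endpoint pair) to a fixed
    # (relay node, round) slot; a stand-in for the shared hash functions
    rng = random.Random(eid[0] * 1_000_003 + eid[1])
    return rng.randrange(n_nodes), rng.randrange(n_rounds)

def detect_active_pairs(active, edges, n_nodes=16, n_rounds=8):
    """Every active endpoint of an edge sends the edge identifier to its
    rendezvous slot; a relay seeing the same identifier twice responds,
    so both endpoints learn that the other is active."""
    inbox = {}
    for u, v in edges:
        eid = (min(u, v), max(u, v))    # both endpoints derive the same id
        slot = rendezvous(eid, n_nodes, n_rounds)
        for w in (u, v):
            if w in active:
                inbox.setdefault((slot, eid), []).append(w)
    return {eid for (_, eid), senders in inbox.items() if len(senders) == 2}
```

Because both endpoints hash the same identifier, they meet at the same relay in the same round regardless of which relay was chosen, so correctness does not depend on the hash values, only the load does.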

4.3. Analysis

We now turn to the analysis of the algorithm. We mainly show the following lemma:

Lemma 4.3 ().

In phase of the algorithm, every node learns the directions of its incident edges. Each phase takes time , w.h.p. In every round, each node sends and receives at most messages, w.h.p.

We present the proof in three parts: first, we show the correctness of the algorithm, then analyze its runtime, and finally show that every node receives at most messages in each round.

Lemma 4.4 ().

In the first step, every active node fails to identify at most red edges, w.h.p.


Note that every active node can only be adjacent to at most active or waiting nodes, i.e., it is incident to at most red edges. Therefore, by Lemma 4.2, the probability that an active node fails to identify at least red edges is

Taking the union bound over all nodes implies the lemma. ∎

Lemma 4.5 ().

After the second step, every active node has identified all of its red edges, w.h.p.


If , then after having received the identifiers of all neighbors that are active or waiting, immediately knows its red edges. Now let . Since by Lemma 4.4 has at most remaining red edges, by Lemma 4.2 we have that the probability that fails to identify at most one of its remaining red edges is at most

Taking the union bound over all nodes implies the lemma. ∎

To bound the runtime of the complete algorithm, we now prove that each stage takes time , w.h.p.

Lemma 4.6 ().

Stage 1 takes time , w.h.p.


In the execution of the Aggregation Algorithm, every inactive node is member of at most aggregation groups and every active node is target of at most one aggregation, i.e., and . The lemma follows from Theorem 2.3. ∎

For the runtime of Stage 2 we need the following two lemmas.

Lemma 4.7 ().

, w.h.p.


Let . Note that since , we have that , and therefore . For let be the binary random variable that is , if is unsuccessful in the first step, and , otherwise. By Lemma 4.2 and since , we have

Let . is the sum of independent binary random variables with expected value . Let for some constant , then by using the Chernoff bound of Lemma 2.1 we get that

and thus , w.h.p. ∎

Lemma 4.8 ().

, w.h.p.


Let . For a node , let be the random variable that is , if is unsuccessful in the first step, and , otherwise. From the proof of Lemma 4.7, we have that . Let be the set of active nodes. Then is a sum of independent random variables with expected value . Note that for all . Therefore, we can use the Chernoff bound of Lemma 2.1 with for some constant , and get

Therefore, we have that , w.h.p. ∎

We are now ready to bound the runtime of Stage 2.

Lemma 4.9 ().

Stage 2 takes time , w.h.p.


The computation of at the beginning of the first step takes time . To perform the first execution of the Identification Algorithm, every node has to learn hash functions, which can be done in time (see Section 2.2). In the first execution of the Identification Algorithm, every active node is target of aggregation group for every trial , and every inactive neighbor of is member of all aggregation groups such that participates in trial . Therefore, every active node is target of at most and every inactive node is a member of at most aggregation groups. Since both values are known to every node, the nodes know an upper bound on . Since every inactive node is a member of at most aggregation groups, the global load is bounded by . By Theorem 2.3, the Aggregation Algorithm takes time

w.h.p., to solve the problem.

Now consider the second step. By Lemma 4.7, , w.h.p. A simple delay sequence argument can be used to show that all identifiers are broadcast within time . Informing each node in about its red edges takes an additional rounds, as for every node and .

The multicast trees to handle low-degree nodes are constructed in time , as every inactive node joins at most multicast groups, and the resulting trees have congestion , w.h.p. Correspondingly, the multicast can be performed in time , w.h.p.

We now bound the runtime of the final execution of the Identification Algorithm. First, note that the hash functions can be learned by broadcasting the bits required for each hash function (see Section 2.2) in a pipelined fashion in a binary tree, which is implicitly given in the network. Clearly, this takes time and requires each node to send and receive only messages in each round. Every inactive node is a member of at most aggregation groups, and every node is a target of at most aggregation groups. By Lemma 4.8 , w.h.p. As this is also a bound on the number of edges that participate in any trial, and each edge participates in trials, the global load is bounded by . Therefore, by Theorem 2.3, the Aggregation Algorithm takes time , w.h.p. ∎

The lemma below follows from the fact that .

Lemma 4.10 ().

Stage 3 takes time .

Finally, it remains to show that no node receives too many messages.

Lemma 4.11 ().

In each round of the algorithm, every node sends and receives at most messages, w.h.p.


By the discussion of Section 2.2, the executions of the Aggregation, Multicast Tree Setup, and Multicast Algorithm ensure that every node receives only messages in each round. It remains to show the claim for the second step of Stage 2, where high-degree nodes broadcast their identifiers and receive their red edges, and for Stage 3, where active nodes learn which of their red edges lead to other active nodes.

For the first part, note that after all high-degree nodes have broadcast their identifiers, every active or waiting node sends out messages containing its identifier in every round, w.h.p., which can easily be shown using Chernoff bounds. Second, as every high-degree node receives at most identifiers, it also follows from the Chernoff bound that every such node receives at most messages in each round, w.h.p.

Now consider Stage 3 of the algorithm. Again, by using the Chernoff bound, it can easily be shown that no node sends out more than edge-messages in any round. Therefore, every node only receives response messages in every round. It remains to show that every node receives at most edge-messages in every round, from which it follows that it only sends out response messages in every round. Let and note that . Fix a node and a round and let be the binary random variable that is if and only if and for . Then . has expected value . Using the Chernoff bound we get that , w.h.p., which implies that receives at most edge-messages in round . The claim follows by taking the union bound over all nodes and rounds. ∎

Taking Lemma 4.3 together with Lemma 4.1 yields the final theorem of this section.

Theorem 4.12 ().

The Orientation Algorithm computes an -orientation in time , w.h.p.

5. Graph Problems Beyond MST

We conclude our initial study of the Node-Capacitated Clique by presenting a set of graph problems that can be solved efficiently in graphs with bounded arboricity. The presented algorithms rely on a structure of precomputed multicast trees. More specifically, for every node we construct a multicast tree for the multicast group . Since such trees enable the nodes to send messages to all of their neighbors, in the following we refer to them as broadcast trees.

A naive approach to constructing these trees would be to simply use the Multicast Tree Setup Algorithm, where each node joins the multicast group of every neighbor. However, as , the time to construct these trees would be , which can be if is a star, for example. Instead, we first construct an -orientation of the edges as shown in the previous section, and let only join multicast groups (which translates to injecting one packet per group into the butterfly) for every out-neighbor . Additionally, for every out-neighbor it takes care of joining ’s multicast group by injecting a packet for . In the case of a star, for example (whose arboricity is one), every node, including the center, injects at most two packets. In general, we obtain the following result.
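The effect of the orientation on the injection load can be illustrated with the star example; the data structures below are assumptions for illustration.

```python
def injection_load(adj, oriented):
    """Count the packets each node injects to set up the broadcast trees.

    oriented: set of (u, v) meaning edge {u, v} is directed u -> v.
    Node u handles both memberships of each of its out-edges: one packet
    to join v's multicast group, and one on v's behalf to join u's own
    group. The naive scheme instead has every node join each neighbor's
    group itself, so its load equals its degree."""
    naive = {v: len(adj[v]) for v in adj}
    load = {v: 0 for v in adj}
    for u, _ in oriented:
        load[u] += 2
    return load, naive
```

With an out-degree bound of O(a), every node injects only O(a) packets, whereas the naive load can be linear at a high-degree node.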

Lemma 5.1 ().

Setting up broadcast trees takes time , w.h.p. The congestion of the broadcast trees is , w.h.p.

The corollary below, which follows from the analysis of Theorem 2.6, establishes one of the key techniques used by the algorithms in this section.

Corollary 0 ().

Let . Using the broadcast trees, the Multi-Aggregation Algorithm solves any Multi-Aggregation Problem with multicast groups and for all in time