We study the distributed all-pairs shortest paths problem (APSP) in the CONGEST model of distributed networks. A network is modeled by a weighted undirected $n$-node graph $G$. (As we will discuss later, we can also handle directed graphs.) Each node represents a processor with a unique ID and infinite computational power that initially knows only its adjacent edges and their weights. Nodes communicate with each other in rounds, where in each round each node can send a message of size $O(\log n)$ to each neighbor (weights play no role in the communication). The goal of APSP is for every node to know its distances from all other nodes. We want an algorithm that achieves this in the smallest number of rounds, called the time complexity. It is usually expressed in terms of $n$ and $D$, where $n$ is the number of nodes and $D$ is the diameter of the network when edge weights are omitted. Throughout we use $\tilde{O}$, $\tilde{\Omega}$, and $\tilde{\Theta}$ to hide polylogarithmic factors in $n$. See Section 3 for details of the model.
The approximate version of the problem was known to admit (i) a -approximation -time deterministic algorithm and (ii) an lower bound which holds even against randomized -approximation algorithms and when [LenzenP_stoc13, LenzenP-podc15, Nanongkai-STOC14]. The exact unweighted version was also settled with bound [LenzenP_podc13, HolzerW12, FrischknechtHW12, PelegRT12, AbboudCK16]. (More precisely, the bound for the unweighted case is . The lower bound holds against -approximation algorithms when the network is unweighted. The same lower bound also holds even for the easier problem of approximating the network diameter [FrischknechtHW12].) For the exact weighted case, nothing was known until the 2017 bound of by Elkin [Elkin-STOC17], which was later improved to [HuangNS17]; both algorithms by [Elkin-STOC17, HuangNS17] are randomized. On the lower bound side, Censor-Hillel, Khoury, and Paz [Censor-HillelKP17] pushed the bound to and proved that the standard lower bound technique cannot provide a super-linear lower bound. Despite this, it was still widely open whether there was a new technique that implies a super-linear lower bound, or whether we can in fact solve the exact weighted case in time, like the approximate and the unweighted cases.
We present a randomized (Las Vegas) -time algorithm. This essentially settles the distributed APSP problem, with the key remaining open problem being whether deterministic algorithms can achieve the same bound. Like the previous -time algorithm, our algorithm works in a more difficult model where each node must send the same message to every neighbor in each round (broadcast CONGEST), and can handle a more general class of inputs: directed graphs with zero edge weights; in fact a standard reduction shows that our algorithm can also handle negative weights in time.
In addition to the improved running time, our algorithm works in a more general setting than that required by the previous bound. The previous -time algorithm of [HuangNS17] requires that (i) the communication is bidirectional (unaffected by edge directions), and (ii) edge weights are in . Although these are typical assumptions, some works have explored the possibilities to avoid them, e.g. [Elkin-STOC17, AgarwalRKP18-PODC, AgarwalR18-Arxiv-3/2, AgarwalR18-Arxiv2-4/3] (the second assumption was also mentioned in [HuangNS17] as their main drawback, since their guarantee depends on the number of bits needed to represent edge weights). If we do without both assumptions (so communication is only along edge directions, and edge weights are arbitrary as long as a distance can be sent through a link in one round), the previously best algorithm for this more difficult setting required time [AgarwalR18-Arxiv-3/2]. (We emphasize that in the case of uni-directional communication, node can learn its distance from only if there is a directed path from to ; otherwise, it is impossible for to learn such information.) If bidirectional communications are allowed, then the bound can be improved to [AgarwalR18-Arxiv2-4/3]. Our algorithm does not require any of the above assumptions, and our bound subsumes all above results, except that the -time algorithm in [AgarwalR18-Arxiv-3/2] is deterministic.
Our algorithm is also much simpler than the previous state-of-the-art. Given that our result is essentially optimal, we believe that its simplicity is a plus.
Other related works.
As noted earlier, one aspect left to understand distributed APSP is the performance of deterministic algorithms. The current best time for deterministic algorithms is , first achieved by Agarwal et al. [AgarwalRKP18-PODC] and later tailored to work without bidirectional communication by Agarwal and Ramachandran [AgarwalR18-Arxiv-3/2]. For a summary of previous algorithms and their properties, see [AgarwalR18-Arxiv2-4/3, Table 1].
Distributed APSP is sometimes referred to as name-independent routing schemes. See, e.g., [LenzenP_stoc13, LenzenP-podc15] for discussions and results on another variant called name-dependent routing schemes, which is not considered in this paper. These papers also show an application of distributed APSP to routing table construction.
The previous lack of understanding for exact APSP in fact reflects a bigger issue in the field of distributed graph algorithms: Studies in the past few years have led to tight algorithms for several graph problems; for example, single-source shortest paths (SSSP), minimum cut, and maximum flow can be -approximated in time [HenzingerKN-STOC16, BeckerKKL16, Nanongkai-STOC14, NanongkaiS14_disc, GhaffariK13, GhaffariKKLP15] (for the maximum flow algorithm, there is an extra term in the time complexity), and the time bounds are tight up to polylogarithmic factors [DasSarmaHKKNPPW12, Elkin06, PelegR00, KorKP13]. In contrast, except for minimum spanning tree (e.g. [KuttenP98, PanduranganRS-STOC17, Elkin-arxiv17-mst]), not much was known for exact algorithms until 2017, when algorithms for exact SSSP and APSP started to appear (e.g. [GhaffariL18, ForsterN18, Elkin-STOC17, HuangNS17, AgarwalRKP18-PODC, AgarwalR18-Arxiv-3/2, AgarwalR18-Arxiv2-4/3]). Settling the exact cases for other problems remains a major open problem.
The cornerstone of our algorithm is a new technique called random filtered broadcasting. We give an overview in Section 2; loosely speaking, the technique applies to settings where one needs to broadcast a large amount of information to every vertex, but in the end each vertex only cares about the “best” message it receives. We show how to use randomization to filter out most of the messages and reduce the congestion on each edge. Although the technique is relatively simple, our results in this paper show it to be very powerful. It is also quite general, so we have strong reason to believe that it will find application in other distributed algorithms for the CONGEST model, especially those that relate to distances.
On a more concrete level, we use random filtered broadcasting to devise a primitive which leads to our APSP algorithm, but which we think may prove useful in its own right. In particular, given any sets of nodes (nodes know if they are in these sets) and assuming that every knows all distances from nodes in , and every node knows all distances from nodes in , we want every to know for every . (Note that we actually have to handle a slightly more general case where not all distances from are known to nodes in .) This was previously an obstacle for APSP. In this paper, we show how to do this in time. Armed with this black box, we are able to use a very natural framework for APSP. Additionally, if we only care about hop-distances at most , then we can reduce the number of rounds to . We hope that just as Bellman-Ford is often used as a primitive that allows one to separately handle shorter and longer hop-distances, our new algorithm for DistThrough can be used as a primitive in other distributed shortest path algorithms.
Throughout the paper we only show that the output is correct with high probability. As discussed in [HuangNS17], this can be made Las Vegas since in time we can check the correctness, as follows. First, every node lets its neighbors know about its distances from other nodes (this takes time). Then, every node checks if it can improve its distance from any node using the distance knowledge from neighbors. If the answer is “no” for every node, then the computed distances are correct. If some node answers “yes”, it can broadcast its answer to all other nodes in time.
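The local check described above can be sketched as follows. This is a minimal centralized illustration, not the distributed procedure itself; the names `can_improve`, `all_nodes_answer_no`, `est`, and `in_edges` are hypothetical, and the sketch assumes (as argued in Section 4.1) that every estimate is the length of some real path, so that estimates never undershoot the true distance.

```python
import math

def can_improve(v, est, in_edges):
    """Return True if node v answers "yes".

    est[x][s] is x's estimate of the distance from s to x (assumed layout).
    in_edges[v] lists (u, w) for every edge u -> v of weight w.
    """
    for u, w in in_edges[v]:
        for s, d_su in est[u].items():
            # Relaxing edge (u, v) would shorten v's distance from s.
            if d_su + w < est[v].get(s, math.inf):
                return True
    return False

def all_nodes_answer_no(est, in_edges):
    """The computed distances pass the check iff no node can improve."""
    return not any(can_improve(v, est, in_edges) for v in est)
```

If any node answers "yes", the run is discarded and the algorithm is repeated, which is what turns the high-probability guarantee into a Las Vegas one.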
2 High-Level Overview
We start with a randomized hierarchy: for every integer , we construct set by independently sampling each vertex with probability ; we set and . Then with high probability: , and any shortest path with at least vertices contains a vertex from .
Our algorithm then proceeds in phases, following a standard framework for shortest path algorithms. We go from phase down to phase . The guarantee at the end of phase is that every vertex knows the shortest distances from each ; that is, . Let us now consider phase . The goal is for every node to learn all distances for and . First, each vertex in runs Bellman-Ford up to hop-distance : this gives us all distances for which . On the other hand, if is large, then we know that there exists a vertex on the shortest path . Note, moreover, that because of phase we already know ; it is also not hard to ensure that we know because of the Bellman-Ford computation from .
Thus, to complete phase , all we have left is to solve the sub-problem DistThrough: we assume that we already know distances from to and from to , and the goal is to compute for every and . (In fact the Bellman-Ford computation from each only gives us accurate distances to some of the , but this ends up having no effect, so for this overview we stick to the simpler description above.)
Note that DistThrough is a very natural problem in and of itself, and also comes up in many other shortest path algorithms. The issue is that it is not clear how to approach this problem in the distributed setting. The naive solution would be to have each broadcast for each . But this incurs a congestion of , which is only efficient when is relatively small. For this reason, previous algorithms had to deviate from the simple framework described above, and typically tried to balance two different approaches, one for small-hop distances, and one for large ones; in the former case, a Bellman-Ford-style approach is efficient, while for the latter case the relevant is small, and so a broadcasting-type-approach is efficient. However, such a trade-off necessarily results in a super-linear round complexity, such as the state of the art of .
Our main contribution is to show that DistThrough can be solved in time, for any sets , regardless of their size. Not only does this lead to an optimal round complexity (up to log factors), but it also leads to a very clean and simple solution to the problem, as we are able to use the framework described above, without needing to balance multiple different approaches.
Random Filtered Broadcasting:
We solve DistThrough by using a new technique that we refer to as random filtered broadcasting. We focus on a fixed , and show how to solve it with only congestion on each edge; using Theorem 3.3, we can then parallelize the algorithms for all in time . Let us consider the naive broadcasting approach again: each vertex in knows all distances from , so it sends a message for every . Whenever a vertex receives message , it can use its knowledge of to compute . Thus, if a vertex receives for all it can compute .
To reduce the congestion on each edge, we allow vertices to filter out certain messages, i.e. to not pass them on to their neighbors. Consider the following filtering heuristic: if a vertex sees a message , but has previously seen a message with , then does not pass on the message . This reduces the total number of messages sent, and each still correctly computes ; the reason is that if is the vertex in that minimizes , then it is not hard to see that every node on will pass on message (or some equivalently good message, in case of a tie).
Unfortunately, in the worst case the congestion might be no better than before, as each vertex might receive the messages in the worst possible order – that is, in decreasing order of ; in this case, will pass on every message it sees. To overcome this, we use a randomized filter. We let , and obtain each by sampling each node in with probability . Our algorithm then proceeds in iterations, starting from down to . In iteration , we broadcast all messages for ; however, as in the above paragraph, a vertex filters out messages unless they are strictly better than all previous messages seen by – i.e. unless is smaller. The basic argument is that with high probability, will filter out all but messages in iteration ; the reason is that if we look at the that are “best” for , then with high probability at least one of them is in , and so was already seen in iteration , and will filter out all messages not in the top . We thus have a total congestion of per iteration, and so congestion to compute , and time for .
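To make the filtering rule concrete, here is a minimal centralized simulation for one fixed source `a`. It is a sketch under stated assumptions, not the paper's Algorithm 2: the sampling rate `2**-i`, the number of levels `ceil(log2 n)`, and the choice of `n` rounds per iteration are assumptions made for illustration, the exact distances come from a Floyd-Warshall precomputation that stands in for the input guarantees, and all function names are hypothetical.

```python
import math
import random

def floyd_warshall(n, edges):
    """Exact directed distances; stands in for what nodes are assumed to know."""
    d = [[math.inf] * n for _ in range(n)]
    for v in range(n):
        d[v][v] = 0
    for u, v, w in edges:
        d[u][v] = min(d[u][v], w)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d

def rand_filtered_broadcast(n, edges, d, a, B, rng):
    """Return (best, sent): best[v] = min over b in B of d[a][b] + d[b][v],
    and sent[v] = number of messages v passed on (a proxy for congestion)."""
    out = [[] for _ in range(n)]
    for u, v, _ in edges:
        out[u].append(v)          # messages travel along edge directions only
    levels = max(1, math.ceil(math.log2(n)))
    # Assumed sampling: b joins level i with probability 2**-i; level 0 is all of B.
    samples = [[b for b in B if i == 0 or rng.random() < 2.0 ** -i]
               for i in range(levels + 1)]
    best = [math.inf] * n
    sent = [0] * n
    for i in range(levels, -1, -1):            # sparsest level first
        inbox = [[] for _ in range(n)]
        for b in samples[i]:                   # round 0: b announces (b, d(a, b))
            best[b] = min(best[b], d[a][b])
            sent[b] += 1
            for v in out[b]:
                inbox[v].append((b, d[a][b]))
        for _ in range(n):                     # n rounds per iteration (assumed)
            nxt = [[] for _ in range(n)]
            for v in range(n):
                for b, dab in inbox[v]:
                    cand = dab + d[b][v]
                    if cand < best[v]:         # filter: forward only strict improvements
                        best[v] = cand
                        sent[v] += 1
                        for x in out[v]:
                            nxt[x].append((b, dab))
            inbox = nxt
    return best, sent
```

Because the last iteration uses all of `B`, the returned values match the brute-force minimum for any random seed; the randomness only affects how many messages each node forwards, which can be inspected through `sent`.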
3.1 The CONGEST Model
The communication network is modeled by an undirected unweighted -node -edge graph , where nodes model the processors and edges model the bounded-bandwidth links between the processors. Let and denote the set of nodes and (directed) edges of , respectively. The processors (henceforth, nodes) are assumed to have unique IDs in the range of and infinite computational power. Typically nodes’ IDs are assumed to be in the range of . But as observed in [HuangNS17], in time the range can be reduced to . Each node has limited topological knowledge; in particular, it only knows the IDs of its neighbors and knows no other topological information (e.g., whether its neighbors are linked by an edge or not).
Nodes may also accept some additional inputs as specified by the problem at hand. For the case of graph problems, the additional input is typically edge weights. Let be the edge weight assignment. (Note that it might be natural to include as a possible edge weight. But this is not necessary since it can be replaced by a large weight of value .) We refer to network with weight assignment as the weighted network, denoted by . The weight of each edge is known only to and .
We measure the performance of an algorithm by its running time, defined as the worst-case number of rounds of distributed communication. At the beginning of each round, all nodes wake up simultaneously. Each node then sends an arbitrary message of bits through each edge , and the message will arrive at node at the end of the round. For simplicity, we assume that nodes always know the number of the current round. In this paper, the running time is analyzed in terms of the number of nodes (). Since can be computed in time, where is the diameter of , we will assume that every node knows .
Remark on edge weights and directions:
Note that our algorithm in fact works in the most restricted model studied in the literature, where edge weights are “arbitrary”, edges are directed, and communications are unidirectional.
It was commonly assumed in the literature (e.g., [KhanP08, LotkerPR09, KuttenP98, GarayKP98, GhaffariK13, HuangNS17, GhaffariL18, ForsterN18]) that the maximum weight is ; so, each edge weight can be sent through an edge (link) in one round. A more general “arbitrary weight” model has been considered in, e.g., [Elkin-STOC17, AgarwalRKP18-PODC, AgarwalR18-Arxiv-3/2, AgarwalR18-Arxiv2-4/3]. In this model, edge weights can be arbitrary, and it is assumed that communication links have enough capacity to deliver a distance information in one round. Some algorithms do not work in this model, including the previously best -time algorithm [HuangNS17].
The case of directed graphs has also been studied in the literature. One can consider further whether the communication is bidirectional, i.e. nodes can communicate on an edge regardless of its direction, or the more restricted unidirectional case, where the communication has to be done along edge directions. The previously best -time algorithm [HuangNS17] has to assume bidirectional communication.
3.2 Notation and Problem Definition
Let be a directed network with arbitrary non-negative weights: is the set of nodes, and the set of edges. Let and . Let denote the edge from to , and let be the weight of this edge. For every pair of nodes and in , let be the shortest distance from to in . Note that since the underlying graph is directed, we might have . Let refer to the shortest path from to ; if there are multiple such paths, choose one of the shortest paths with the minimal number of edges. Let be the number of edges on .
Throughout the algorithm, each vertex will maintain for every various distance estimates. When we refer to such estimates, the superscript will always refer to the node that possesses this knowledge.
Definition 3.1 (All-pairs shortest paths (APSP)).
An algorithm for distributed APSP must terminate with every vertex knowing a value , for every .
We now define a notion of accuracy for the local information at .
Definition 3.2 (-hop-accurate).
For any positive integer , we say that a distance estimate is -hop-accurate if the following holds: 1) and 2) if then . Note: if , then -hop accuracy guarantees .
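As a reading aid, the definition can be phrased as a small predicate. This sketch assumes the usual reading of the two conditions (the estimate never undershoots the true distance, and it is exact whenever the chosen shortest path has at most h edges); the function and parameter names are hypothetical.

```python
def is_h_hop_accurate(estimate, true_dist, hops_on_shortest_path, h):
    """Check one distance estimate against the assumed two conditions."""
    # Condition 1 (assumed): an estimate is never below the true distance.
    if estimate < true_dist:
        return False
    # Condition 2 (assumed): if the shortest path has at most h edges,
    # the estimate must equal the true distance.
    if hops_on_shortest_path <= h and estimate != true_dist:
        return False
    return True
```

For example, an estimate of 10 for a true distance of 7 is acceptable when the shortest path has 5 edges and h is 3, but not when h is 5.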
We say that an event holds with high probability (w.h.p.) if it holds with probability at least , where is an arbitrarily large constant.
3.3 Distributed Algorithmic Primitives
The Bellman-Ford Algorithm.
This well-known algorithm computes SSSP from a source on a network . The algorithm runs for rounds, where is an input given by the user. The algorithm offers the following guarantee: upon termination, is -hop-accurate for every node in . See Appendix A for a brief description of the algorithm.
Scheduling of Distributed Algorithms.
Consider distributed algorithms . Let dilation be such that each algorithm finishes in dilation rounds if it runs individually. Let congestion be such that there are at most congestion messages, each of size , sent through each edge (counted over all rounds), when we run all algorithms together. We note the following result of Ghaffari [Ghaffari15-scheduling]:
Theorem 3.3 ([Ghaffari15-scheduling]).
There is a distributed algorithm that can execute altogether in time.
Negative edge weights:
If the original graph has negative weights (and no negative-weight cycles), then we can use the idea of reduced weights from Johnson’s algorithm [Johnson77] to transform the graph into a new graph with non-negative edge weights that has the same shortest paths as the original graph. The transformation requires rounds. We can thus assume for the rest of the paper that weights are non-negative. See Appendix B for more details.
4 The Algorithm
Define . Let , and for each , select each node to with probability (every node knows whether it is in or not). Let . (Note that we do not require .) The following facts follow from standard techniques.
Lemma 4.1.
W.h.p., the following holds for every .
for a large enough constant and for every pair of nodes and such that , the shortest path contains a node in .
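The "standard techniques" behind the hitting property can be sketched as a union bound. The symbols $p$ (sampling probability), $h$ (minimum number of vertices on the path), and $c$ (a constant) are placeholders chosen for illustration, not the exact parameters of the lemma; the calculation assumes independent sampling and $h \ge (c+2)\ln n$ so that $p \le 1$.

```latex
\begin{align*}
\Pr[\text{a fixed set of } h \text{ vertices contains no sampled vertex}]
  &= (1-p)^{h} \le e^{-ph}, \\
\text{and with } p = \frac{(c+2)\ln n}{h}: \qquad e^{-ph} &= n^{-(c+2)}, \\
\Pr[\text{some chosen shortest path with at least } h \text{ vertices is missed}]
  &\le n^{2} \cdot n^{-(c+2)} = n^{-c}.
\end{align*}
```

The last line takes a union bound over at most $n^2$ ordered pairs of nodes, using one fixed shortest path per pair.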
Our algorithm runs in phases, starting from down to . In each phase, we execute Algorithm 1. At the end of phase , every knows distance for every . Since , the algorithm terminates with knowledge of APSP.
In Algorithm 2, we describe the RandFilteredBroadcast algorithm. Note that in our main algorithm, the set of between-nodes is always equal to some ; but the RandFilteredBroadcast subroutine in fact works for an arbitrary set , so we describe it in full generality.
4.1 Correctness of the Main Algorithm (Algorithm 1)
In this subsection we show that the output of phase is correct: for every pair of nodes and , knows . Since for every , it is enough to show that for every . It is clear that because every distance returned by our algorithm corresponds to some path in the graph; we now complete the proof by showing that .
Consider any fixed pair and , and let be the constant in Lemma 4.1.
Case 1: .
Case 2: .
4.2 Correctness of RandFilteredBroadcast (Algorithm 2)
Lemma 4.2.
For any and every node , when Iteration terminates knows ; here we define .
It is clear that , because only considers values of the form . The harder direction is to show that .
We prove this by induction on . The base case for is trivial, since we define . For the induction step, we assume that the lemma holds for , and will show that it holds for iteration as well.
Let us fix some particular node . We now consider two cases; the first is much simpler.
Case 1: . Intuitively, this is the case that the “best” between-node for in is no better than the best node from ; since we know that Lemma 4.2 holds for iteration (inductive hypothesis), the assumption of Case 1 directly ensures that it also holds for iteration .
Case 2: . The rest of the proof is concerned with this case. Let be any vertex in , and say that , for some length .
Because of the case assumption, we know that for any we have . Moreover, it is not hard to see that because is a shortest path, we must also have
Now, the intuition behind the proof of Lemma 4.2 is that should receive the message , which by choice of will ensure that sets . The reason we expect this message to travel all the way to is because by Equation 1, for every , is a better between node for than all , so message will pass the filter in Step 2b of Algorithm 2. The one issue with this proof is that some may fail to pass along if it passed along an equally good message in an earlier round of iteration ; but this is still fine, as we will show that because is a shortest path, this message is also good for .
For every , by the end of Round of Iteration , has received a message for some such that
The proof is by induction on the number of . Node sends in Round of Iteration , so the claim is obviously true for . We now assume that the claim is true for some , with , and show that the claim must hold for as well. Let us consider the first message received by for which Equation 2 is satisfied; by the induction hypothesis, receives this message at some time , and moreover, by Equation 1, this event first occurs in iteration ; it could not have occurred in an earlier iteration . Thus, by Step 2b of Algorithm 2, we know that at time , set to , and sent message to all its neighbors, including
Thus, receives message at time . Observe that:
|(by input condition of Algorithm 2)|
|(by triangle inequality)|
|(by input condition of Algorithm 2)|
|(by Equation 2)|
|(by input condition of Algorithm 2)|
|(since is on the shortest -path)|
as desired. This concludes the proof of Lemma 4.2. ∎
We do not need this for our main result, but we note that if the input had the additional guarantee that all hop-distances between and were at most , then we would only need to run RandFilteredBroadcast for rounds instead of ; this is because in case 2 of the proof, rounds would suffice for the message to propagate from to .
4.3 Complexity of RandFilteredBroadcast (Algorithm 2)
The time complexity of Algorithm 2 is clearly , since there are iterations, and each is specified to run for rounds. Now we show that the algorithm creates low congestion on every edge, and thus can be easily parallelized.
Lemma 4.5.
W.h.p., every node sends to its neighbors messages of the form in each iteration of Algorithm 2.
Observe that if , then will not send to neighbors in Step 2b of Iteration , because by Lemma 4.2, at the end of the previous iteration () we will already have . Observe further that the definition of does not depend on the randomness used to sample ; thus, since each is sampled into with probability , we have:
In other words, w.h.p. sends message only for some nodes for which . Lemma 4.5 follows from the fact that there are such nodes. (To see this, order the nodes in by increasing values of (breaking ties arbitrarily). Observe that for the node in this order, contains all nodes that appear before in the order.) ∎
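The probabilistic step of this argument can be written out as follows, for one fixed source and under an assumed notation: $q$ is the probability with which each between-node is sampled into the previously processed (sparser) set, $k$ is the number of top-ranked between-nodes considered for a fixed node $v$, and $c$ is a constant. These symbols are placeholders for illustration and the constants are not the paper's.

```latex
\begin{align*}
\Pr[\text{none of the } k \text{ best between-nodes for } v \text{ was sampled}]
  &= (1-q)^{k} \le e^{-qk}, \\
\text{and with } k = \left\lceil \frac{(c+1)\ln n}{q} \right\rceil: \qquad
  e^{-qk} &\le n^{-(c+1)}, \\
\Pr[\text{this fails for some node } v]
  &\le n \cdot n^{-(c+1)} = n^{-c}.
\end{align*}
```

When the event holds, $v$ has already seen one of its $k$ best between-nodes, so in the current iteration it can forward only messages from nodes ranked strictly before that one, which is fewer than $k = O(\log n / q)$ messages. The ranking for $v$ is fixed before the sampling, which is what makes the first line valid.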
Corollary 4.6.
Over all iterations, Algorithm 2 terminates in rounds, and incurs a congestion of on each edge.
4.4 Complexity of the Main Algorithm (Algorithm 1)
Recall that Algorithm 1 runs in phases. We now analyze the complexity of an individual phase: summing over all the phases completes the proof of our main result.
W.h.p., phase of Algorithm 1 terminates in rounds.
Step 1 of Algorithm 1 does not require any communication, so all that remains is to analyze the number of rounds required for all the calls to RandFilteredBroadcast in Step 1. The algorithm runs instances of RandFilteredBroadcast in parallel. By Corollary 4.6 each runs in rounds and incurs congestion per edge. Thus the total congestion is , so using the parallel scheduler in Theorem 3.3 yields a total round complexity of , as desired. ∎
This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme under grant agreement No 715672. Nanongkai was also partially supported by the Swedish Research Council (Reg. No. 2015-04659.)
Appendix A Bellman-Ford
Since it figures prominently in our main algorithm, we now describe the well-known Bellman-Ford algorithm for computing SSSP from a source on network [Bellman58, Ford56]. We omit the analysis of the algorithm, since it can be found in the citations. The algorithm runs for rounds, where is an input given by the user.
For any node , let denote the knowledge of about . Initially, for every node , except that . The algorithm proceeds as follows.
(i) In round 0, every node sends to all its neighbors.
(ii) When a node receives the message about from its neighbors , it uses the new information to decrease the value of if .
(iii) If decreases, then node sends the new value of to all its neighbors.
(iv) Repeat (ii) and (iii) for rounds.
Clearly, the above algorithm takes rounds. Moreover, it can be proved that when the algorithm terminates is -hop-accurate for every node in .
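The steps above can be sketched as a round-by-round simulation. This is a minimal centralized illustration with hypothetical names (`bellman_ford_rounds`, `edges` as `(u, v, w)` triples for directed edges); it assumes that only nodes whose value just decreased send in the next round, and that all messages of a round are delivered before any value is updated.

```python
import math

def bellman_ford_rounds(n, edges, s, h):
    """Simulate h synchronous rounds of Bellman-Ford from source s."""
    out = [[] for _ in range(n)]
    for u, v, w in edges:
        out[u].append((v, w))
    dist = [math.inf] * n
    dist[s] = 0
    changed = {s}                      # nodes that must announce a new value
    for _ in range(h):
        incoming = {}                  # best offer received by each node this round
        for u in changed:
            for v, w in out[u]:
                cand = dist[u] + w
                if cand < incoming.get(v, math.inf):
                    incoming[v] = cand
        changed = set()
        for v, cand in incoming.items():
            if cand < dist[v]:         # step (ii): decrease the estimate
                dist[v] = cand
                changed.add(v)         # step (iii): announce it next round
    return dist
```

On the path 0 → 1 → 2 → 3 with unit weights plus a direct edge 0 → 3 of weight 10, two rounds leave node 3 with the estimate 10 (an upper bound), and the third round corrects it to 3, matching the hop-accuracy guarantee.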
Appendix B Non-negative weights
In this section we show that if the original graph has negative weights but no negative cycles, we can in rounds transform it to a graph that has exactly the same shortest path structure, but has non-negative weights. This justifies the assumption of non-negative weights in Section 3. (If the graph has a negative cycle, then the algorithm will discover this cycle within rounds.)
Our transformation directly follows the technique of reduced costs used in Johnson’s APSP algorithm in the static setting [Johnson77]. The algorithm will compute a node value for every node such that the following property is satisfied: for every edge , . We show how to compute the values later. Once these values are computed, the algorithm creates a new edge-weight function , where . Let be the graph with the weight function instead of , and let be the shortest distance in . It is easy to see that satisfies the following properties:
for every edge .
for every pair of nodes and we have .
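A minimal centralized sketch of the reduced-cost transformation follows. It uses the textbook form of Johnson's reweighting, w'(u, v) = w(u, v) + p(u) - p(v), with p computed by Bellman-Ford from a virtual source joined to every node by a zero-weight edge; this textbook form and the names `potentials` and `reweight` are assumptions for illustration, and the sketch says nothing about how the values are computed in the distributed setting.

```python
def potentials(n, edges):
    """Node values p with p[v] <= p[u] + w for every edge (u, v, w)."""
    p = [0] * n                      # virtual source: zero-weight edge to every node
    for _ in range(n - 1):
        for u, v, w in edges:
            if p[u] + w < p[v]:
                p[v] = p[u] + w
    for u, v, w in edges:            # any further improvement means a negative cycle
        if p[u] + w < p[v]:
            raise ValueError("graph contains a negative-weight cycle")
    return p

def reweight(edges, p):
    """Reduced weights w'(u, v) = w(u, v) + p[u] - p[v], all non-negative."""
    return [(u, v, w + p[u] - p[v]) for u, v, w in edges]
```

Along any path from u to v the potentials telescope, so every such path changes in length by the same amount p[u] - p[v]; shortest paths are therefore preserved and original distances can be recovered by subtracting that amount.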
Thus the overall algorithm proceeds as follows. First it executes process compute-, described below: at the end of this process, each vertex knows its own value . Then each vertex broadcasts to the entire graph: by Lemma B.1, this takes a total of rounds.
Lemma B.1 (Broadcasting [Peleg00_book]).
Suppose each holds messages of bits each, for a total of messages. Then all nodes in the network can receive these messages within rounds.
The algorithm then executes the main distributed APSP algorithm described in this paper on instead of : by Property 1 of