# Fast Approximate Shortest Paths in the Congested Clique

We design fast deterministic algorithms for distance computation in the Congested Clique model. Our key contributions include:

- A (2+ε)-approximation for all-pairs shortest paths (APSP) in O(log²n / ε) rounds on unweighted undirected graphs. With a small additional additive factor, this also applies to weighted graphs. This is the first sub-polynomial constant-factor approximation for APSP in this model.

- A (1+ε)-approximation for multi-source shortest paths (MSSP) from O(√n) sources in O(log²n / ε) rounds on weighted undirected graphs. This is the first sub-polynomial algorithm obtaining this approximation for a set of sources of polynomial size.

Our main techniques are new distance tools that are obtained via improved algorithms for sparse matrix multiplication, which we leverage to construct efficient hopsets and shortest paths. Furthermore, our techniques extend to additional distance problems for which we improve upon the state of the art, including diameter approximation and an exact single-source shortest paths algorithm for weighted undirected graphs in Õ(n^{1/6}) rounds.


## 1 Introduction

Computing distances in a graph is a fundamental task widely studied in many computational settings. Notable examples are computation of all-pairs shortest paths (APSP), single-source shortest paths (SSSP), and computing specific parameters such as the diameter of a graph. In this work, we study distance computations in the Congested Clique model of distributed computing.

In the Congested Clique model, we have a fully-connected communication network of n nodes, where nodes communicate by sending O(log n)-bit messages to each other node in synchronous rounds. The Congested Clique model has been receiving much attention during the past decade or so, due to both its theoretical interest in focusing on congestion alone as a communication resource, and its relation to practical settings that use fully connected overlays [13, 42, 14, 31, 47, 21, 32, 29, 28, 50, 51, 52, 36, 33, 40, 43, 41]. In particular, there have been many recent papers studying distance problems in the Congested Clique [13, 42, 14, 48, 36, 52, 8, 24].

### 1.1 Distance computation in the congested clique

Many state-of-the-art results for distance computations in the Congested Clique model exploit the well-known connection between computing distances and matrix multiplication [13, 42, 14]. Specifically, the n-th power of the adjacency matrix of a graph, taken over the min-plus or tropical semiring (see e.g. [13] for details), corresponds to shortest-path distances. Hence, iteratively squaring a matrix O(log n) times allows computing all the distances in the graph. This approach gives the best known algorithms for APSP in the Congested Clique, including (1) an Õ(n^{1/3})-round algorithm for exact APSP in weighted directed graphs [13], (2) O(n^{0.158})-round algorithms for exact APSP in unweighted undirected graphs and (1+o(1))-approximate APSP in weighted directed graphs [13], as well as (3) an O(n^{0.2096})-round algorithm for exact APSP in directed graphs with constant weights [42]. Additionally, in [14], this connection is used to show an improved APSP algorithm for sparse graphs.

For approximating the distances, faster algorithms with larger constant approximation factors can be obtained by computing a (2k−1)-spanner, which is a sparse graph that preserves distances up to a multiplicative factor of 2k−1, and having all nodes learn the entire spanner. Using the Congested Clique spanner constructions of [52], this approach gives a (2k−1)-approximation for APSP in Õ(n^{1/k}) rounds, which is still polynomial for any constant k.

This raises the following fundamental question:

###### Question 1.

Can we obtain constant-factor approximations for APSP in sub-polynomial time?

If we restrict our attention to SSSP, a sub-polynomial constant approximation is indeed possible by a recent gradient-descent-based algorithm that obtains a (1+ε)-approximation in poly(log n / ε) rounds [8]. However, this algorithm provides distances only from a single source.

### 1.2 Our contributions

##### All-pairs shortest paths.

As our first main result, we address the above fundamental question by providing the first constant-factor approximations for APSP in the Congested Clique model that run in polylogarithmic time. Specifically, we show the following.

###### Theorem 2.

There is a deterministic (2+ε)-approximation algorithm for unweighted undirected APSP in the Congested Clique model that takes O(log²n / ε) rounds.

We also obtain a nearly (2+ε)-approximation in O(log²n / ε) rounds in weighted undirected graphs, in the sense that for any distance estimate between two nodes u and v, there is a further additive error in the approximation, bounded by the weight of the heaviest edge on the shortest u–v path.

Our approximation is almost tight for sub-polynomial algorithms in the following sense. As noted by [41], (2−ε)-approximate APSP in unweighted undirected graphs is essentially equivalent to fast matrix multiplication, so obtaining a better approximation in complexity below O(n^{0.158}) rounds would result in a faster algorithm for matrix multiplication in the Congested Clique. Likewise, a sub-polynomial-time algorithm with any approximation ratio for directed graphs would give a faster matrix multiplication algorithm [20], so our results are unlikely to extend to directed graphs.

##### Multi-source shortest paths.

As our second main result, we show a fast (1+ε)-approximation algorithm for the multi-source shortest paths problem (MSSP), which is polylogarithmic as long as the number of sources is O(√n). Specifically, we show the following.

###### Theorem 3.

There is a deterministic (1+ε)-approximation algorithm for the weighted undirected MSSP problem that takes

 $O\left(\left(\frac{|S|^{2/3}}{n^{1/3}} + \log n\right) \cdot \frac{\log n}{\epsilon}\right)$

rounds in the Congested Clique, where S is the set of sources. In particular, the complexity is O(log²n / ε) rounds as long as |S| = O(√n).

This is the first sub-polynomial algorithm that obtains such an approximation for a set of sources of polynomial size. Another advantage of our approach, compared to the previous (1+ε)-approximation SSSP algorithm [8], is that it is based on simple combinatorial techniques. In addition, our complexity improves upon the complexity of [8].

##### Exact SSSP and diameter approximation.

In addition to the above, our techniques allow us to obtain a nearly (3/2)-approximation for the diameter in O(log²n / ε) rounds, as well as an Õ(n^{1/6})-round algorithm for exact weighted SSSP, improving upon the previous Õ(n^{1/3})-round algorithm [13]. All our algorithms are deterministic.

### 1.3 Our techniques

The main technical tools we develop for our distance computation algorithms are a new sparse matrix multiplication algorithm, extending the recent result of [14], and a new deterministic hopset construction algorithm for the Congested Clique.

##### Distance products.

We start from the basic idea of using matrix multiplication to compute distances in graphs. Specifically, if A is the weighted adjacency matrix of a graph G, it is well known that distances in G can be computed by iterating the distance product A ⋆ A, defined as

 $(A \star A)[i,j] = \min_{k} \big(A[i,k] + A[k,j]\big),$

that is, the matrix multiplication over the min-plus semiring.
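For intuition, the iterated-squaring idea can be sketched in a few lines of centralized code (this is only an illustration, not the distributed algorithm; `min_plus_square` and `apsp_by_squaring` are hypothetical helper names):

```python
import math

INF = float('inf')  # the zero element of the min-plus semiring

def min_plus_square(A):
    """One squaring step: (A * A)[i][j] = min_k (A[i][k] + A[k][j])."""
    n = len(A)
    return [[min(A[i][k] + A[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def apsp_by_squaring(W):
    """Distances from the weighted adjacency matrix W (INF = no edge):
    ceil(log2 n) squarings cover every path of at most n - 1 hops."""
    n = len(W)
    A = [[0 if i == j else W[i][j] for j in range(n)] for i in range(n)]
    for _ in range(max(1, math.ceil(math.log2(n)))):
        A = min_plus_square(A)
    return A
```

Each squaring doubles the number of hops covered, which is exactly why O(log n) distance products suffice for all distances.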

A simple idea is to apply the recent sparse matrix multiplication algorithm of [14], whose running time depends on the density of the input matrices. In particular, this allows us to multiply two sparse matrices with m non-zero entries in O(m^{2/3}/n + 1) rounds; note that for the distance product, the zero element is ∞. However, using this algorithm for computing distances directly is inefficient, as A² can be dense even if A is sparse (e.g. for a star graph), and hence iterative squaring is not guaranteed to be efficient. Moreover, our goal is to compute distances in general graphs, not only in sparse graphs. Nevertheless, we show that while using [14] directly may not be efficient, we can use sparse matrix multiplication as a basic building block for distance computation in the Congested Clique.

##### Our distance tools.

The key observation is that many building blocks for distance computation are actually based on computations in sparse graphs or only consider a limited number of nodes. Concrete examples of such tasks include:

• k-nearest problem: Compute, for each node, the distances to the k closest nodes in the graph.

• (S, d, k)-source detection problem: Given a set of sources S, compute for each node the distances to the k nearest sources in S, using paths of at most d hops.

• distance through sets problem: Given a set of nodes A and the distances of all nodes to and from A, compute the distances between all pairs of nodes along paths that pass through nodes in A.

For all of these problems, there is a degree of sparsity we can hope to exploit if k or |S| are small enough. For example, the (S, d, k)-source detection problem requires the multiplication of a dense adjacency matrix and a possibly sparse matrix, depending on the size of S. However, for any S of polynomial size, the algorithm in [14] takes polynomial time. An interesting property of this problem, though, is that the output matrix is also sparse. If we look at the k-nearest problem, both input matrices are sparse, hence we can use the previous sparse matrix multiplication algorithm. However, this does not exploit the sparsity of this problem fully: we are interested only in computing the distances to the k nearest nodes of each node, hence there is no need to compute the full output matrix. The challenge in this case is that we do not know the identity of the k closest nodes before the computation. To exploit this sparsity, we design new matrix multiplication algorithms that, in particular, have the ability to sparsify the matrix throughout the computation, and achieve a complexity that depends only on the size of the output we are interested in.
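To make the sparsify-as-you-go idea concrete, the following centralized sketch (hypothetical names; not the Congested Clique implementation) solves the k-nearest problem by min-plus squaring while keeping only the k smallest entries of each row after every step:

```python
import heapq
import math

def sparsify_row(row, k):
    """Keep the k smallest (distance, node-id) entries of a row."""
    return dict(heapq.nsmallest(k, row.items(), key=lambda kv: (kv[1], kv[0])))

def k_nearest(adj, k):
    """adj: {u: {v: weight}} for an undirected weighted graph.
    Returns, for each node, distances to its k nearest nodes (self included).
    Rows are re-sparsified after every squaring, so intermediate matrices
    never hold more than k entries per row."""
    nodes = list(adj)
    A = {u: sparsify_row({**dict(adj[u]), u: 0}, k) for u in nodes}
    for _ in range(max(1, math.ceil(math.log2(max(k, 2))))):
        B = {}
        for u in nodes:
            row = {}
            for m, d1 in A[u].items():
                for v, d2 in A[m].items():
                    if d1 + d2 < row.get(v, math.inf):
                        row[v] = d1 + d2
            B[u] = sparsify_row(row, k)
        A = B
    return A
```

The point of the sketch is that the work per row depends on k rather than on n, mirroring the output-sensitive complexity discussed below.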

##### Sparse matrix multiplication.

To compute the above, we design new sparse matrix multiplication algorithms, which differ from [14] by also taking into account the sparsity of the output matrix. For a matrix S, let ρ_S denote the density of S, that is, the average number of non-zero entries per row. Specifically, for the distance product computation P = S ⋆ T, we obtain two variants:

• One variant assumes that the sparsity ρ_P of the output matrix is known beforehand.

• The other variant sparsifies the output matrix on the fly, keeping only the k smallest entries in each row.

In both of these scenarios, we obtain a running time of

 $O\left(\frac{(\rho_S\,\rho_T\,\rho_P)^{1/3}}{n^{2/3}} + 1\right)$

rounds, improving on the running time of the prior sparse matrix multiplication algorithm [14] whenever ρ_P = o(n).

This allows us to obtain faster distance tools, by taking into account the sparsity of the output:

• We can solve the k-nearest problem in Õ(k/n^{2/3} + 1) rounds.

• We can solve the (S, d, k)-source detection problem in O(d · (m^{2/3}/n + 1)) rounds, where m is the number of edges in the output graph; note that the dependence on d becomes linear in order to exploit the sparsity.

In concrete terms, with these output-sensitive distance tools we still get sub-polynomial running times even when the parameters are polynomial in n. For example, we can compute for each node the distances to its n^{2/3} closest nodes in polylogarithmic time. Note that though our final results are only for undirected graphs, these distance tools work for directed weighted graphs.

##### Hopsets.

An issue with our (S, d, k)-source detection algorithm is that in order to exploit the sparsity of the matrices, we must perform d multiplications to learn the distances of all the nodes at hop-distance at most d from S. Hence, to learn the distances of all the nodes from S, we would need n multiplications, which is no longer efficient. To overcome this challenge, we use hopsets, which are a central building block in many distance computations in the distributed setting [34, 48, 24, 22, 23]. A (β, ε)-hopset H is a sparse graph such that the β-hop distances in G ∪ H give (1+ε)-approximations of the distances in G. Since it is enough to look only at β-hop distances in G ∪ H, using our source detection algorithm together with a hopset allows getting an efficient algorithm for approximating distances, as long as β is small enough.

However, the time complexity of all current hopset constructions depends on the size of the hopset [24, 23, 34], in the following way: constructing a hopset with m edges takes at least m/n rounds. This is a major obstacle for efficient shortest paths algorithms, since based on recent existential results there are no hopsets where both β and the average degree are polylogarithmic [1] (see Section 1.4). Nevertheless, we show that our new distance tools allow us to build hopsets in time that does not depend on the hopset size. In particular, we show how to implement a variant of the recent hopset construction of Elkin and Neiman [24] in polylogarithmic time. The size of our hopset is Õ(n^{3/2}), hence constructing it using previous algorithms would require at least Õ(√n) rounds.
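The way a hopset is consumed is easy to illustrate with a hop-limited Bellman-Ford on G ∪ H. In the toy sketch below (hypothetical names; the shortcut edges are chosen by hand and happen to preserve distances exactly), β = 2 hops suffice once the shortcuts are added:

```python
def bellman_ford_hops(edges, n, src, beta):
    """Shortest-path distances from src using at most beta hops
    (edges given as undirected (u, v, w) triples)."""
    dist = [float('inf')] * n
    dist[src] = 0
    for _ in range(beta):
        nxt = dist[:]
        for u, v, w in edges:
            if dist[u] + w < nxt[v]:
                nxt[v] = dist[u] + w
            if dist[v] + w < nxt[u]:
                nxt[u] = dist[v] + w
        dist = nxt
    return dist
```

On a 6-node path with unit weights, two hops reach nothing beyond node 2; adding shortcut edges such as (0, 2, 2), (2, 4, 2) and (0, 4, 4) makes the 2-hop distances equal to the true distances.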

##### Applying the distance tools.

As a direct application of our source detection and hopset algorithms, we obtain a multi-source shortest paths (MSSP) algorithm, allowing us to compute (1+ε)-approximate distances from O(√n) sources in polylogarithmic time. Our MSSP algorithm, in turn, forms the basis of a nearly (3/2)-approximation for the diameter, and a nearly (2+ε)-approximation for weighted APSP. To obtain a (2+ε)-approximation for unweighted APSP, the high-level idea is to deal separately with paths that contain a high-degree node and paths with only low-degree nodes. A crucial ingredient in the algorithm is showing that in sparser graphs we can actually compute distances to a larger set of sources efficiently, which is useful for obtaining a better approximation. Our exact SSSP algorithm uses our algorithm for finding distances to the k-nearest nodes, which allows constructing efficiently the shortcut graph described in [48, 22].

### 1.4 Related work

##### Distance computation in the congested clique.

APSP and SSSP are fundamental problems that are studied extensively in various computation models. Apart from the MM-based algorithms in the Congested Clique [13, 42, 14], previous results include Õ(√n)-round algorithms for exact SSSP and (2+o(1))-approximate APSP [36, 48], which work in the more restricted broadcast Congested Clique model, where each node sends the same message to all the nodes in each round. In this model, it is known that obtaining a (2−o(1))-approximation for APSP requires Ω̃(n) rounds [36], which is almost tight. Other distance problems studied in the Congested Clique are the construction of hopsets [24, 23, 34] and spanners [52].

##### Matrix multiplication in the congested clique.

As shown by [13], matrix multiplication can be done in the Congested Clique in O(n^{1/3}) rounds over semirings, and in O(n^{1−2/ω}) = O(n^{0.158}) rounds over rings, where ω < 2.373 is the exponent of matrix multiplication [27]. For rectangular matrix multiplication, [42] gives faster algorithms. The first sparse matrix multiplication algorithms for the Congested Clique were given by [14], as discussed above.

##### Distance computation in the CONGEST model.

The distributed CONGEST model is identical to the Congested Clique model, with the difference that the communication network is identical to the input graph G, and nodes can communicate only with their neighbours in each round. Distance computation is extensively studied in the CONGEST model. The study of exact APSP in weighted graphs has been the focus of many recent papers [12, 38, 3, 22], culminating in a near-tight Õ(n)-round algorithm [12]. Such results were previously known for unweighted graphs [37, 46, 53] or for approximation algorithms [45, 48]. Approximate and exact algorithms for SSSP are studied in [8, 48, 44, 26, 34, 30, 22]. While near-tight algorithms exist for approximating SSSP [8], there is currently a lot of interest in understanding the complexity of exact SSSP and directed SSSP [26, 30, 22]. The source detection problem is studied in [46], demonstrating the applicability of this tool to many distance problems such as APSP and diameter approximation in unweighted graphs. An extension to the weighted case is studied in [45]. Algorithms and lower bounds for approximating the diameter are studied in [46, 42, 53, 37, 2].

##### Distance computation in the sequential setting.

Among the rich line of research in the sequential setting, we focus only on the work most closely related to ours. The pioneering work of Aingworth et al. [4] inspired much research on approximate APSP [20, 7, 6, 18, 10] and approximate diameter [54, 16, 5, 10], with the goal of understanding the tradeoffs between the time complexity and the approximation ratio. Many of these papers use clustering ideas and hitting set arguments as the basis of their algorithms, and our approximate APSP and diameter algorithms are inspired by such ideas.

##### Hopsets.

Hopsets are a central building block in distance computation and are studied extensively in various computing models [34, 48, 24, 22, 23, 17, 11, 35, 55]. The most closely related to our work are two recent constructions of Elkin and Neiman [24] and of Huang and Pettie [39], which are based on the emulators of Thorup and Zwick [56], and are near-optimal due to existential results [1]. Specifically, [39] construct (β, ε)-hopsets of size O(n^{1+1/(2^{k+1}−1)}) with β = O(k/ε)^k, where recent existential results show that any construction of (β, ε)-hopsets with worst-case size O(n^{1+1/(2^k−1)−δ}) must have β = Ω_k((1/ε)^k), where k ≥ 1 is an integer and δ > 0. For a detailed discussion of hopsets, see the introduction in [24, 23].

### 1.5 Preliminaries

##### Notations.

Except when specified otherwise, we assume our graphs are undirected with non-negative edge weights. Given a graph G = (V, E) and u, v ∈ V, we denote by d_G(u, v) the distance between u and v in G, and by d_G^β(u, v) the length of the shortest path of hop-distance at most β between u and v in G. If G is clear from the context, we use the notation d(u, v) for d_G(u, v).

##### Routing and sorting.

As basic primitives, we use standard routing and sorting algorithms for the Congested Clique model. In the routing task, each node holds up to n messages of O(log n) bits, and we assume that each node is also the recipient of at most n messages. In the sorting task, each node has a list of n entries from an ordered set, and we want to sort these entries so that node i receives the i-th batch of entries according to the global order of all the input entries. Both of these tasks can be solved in O(1) rounds [19, 43].

##### Semirings and matrices.

We assume we are operating over a semiring (R, +, ·, 0, 1), where 0 is the identity element for addition and 1 is the identity element for multiplication. Note that we do not require the multiplication to be commutative. For the Congested Clique algorithms, we generally assume that the semiring elements can be encoded in messages of O(log n) bits.

All matrices are assumed to be n × n matrices over the semiring R. For convenience, we identify the node set V with [n] = {1, 2, …, n}, and use V to index the matrix entries. For a matrix S, we denote the matrix entry at position (v, u) by S[v, u]. For sets U, W ⊆ V, we denote by S[U, W] the submatrix obtained by taking rows and columns restricted to U and W, respectively.

##### Hitting sets.

Let C = {C_v}_{v∈V} be a collection of subsets of V, each of size at least L. We say that A ⊆ V is a hitting set of C if in each subset C_v there is a node from A. We can construct hitting sets easily by adding each node to A

with probability Θ(log n / L)

. This gives a hitting set of expected size O((n log n)/L), such that w.h.p. there is a node from A in each subset C_v. The same parameters are obtained by a recent deterministic construction of hitting sets in the Congested Clique [52], which gives the following (see Corollary 17 in [52]).

###### Lemma 4.

Let C = {C_v}_{v∈V} be a collection of subsets of size at least L, such that C_v is known to v. There exists a deterministic algorithm in the Congested Clique model that constructs a hitting set of size O((n log n)/L) in O(log n) rounds.
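As a centralized stand-in for this deterministic construction (the actual Congested Clique algorithm of [52] works differently; `greedy_hitting_set` is a hypothetical name), a simple greedy rule already yields a small hitting set:

```python
def greedy_hitting_set(subsets):
    """Repeatedly pick the element hitting the most not-yet-hit subsets."""
    remaining = [set(s) for s in subsets]
    hitting = set()
    while remaining:
        counts = {}
        for s in remaining:
            for x in s:
                counts[x] = counts.get(x, 0) + 1
        # deterministic tie-breaking: highest count, then smallest id
        best = max(counts, key=lambda x: (counts[x], -x))
        hitting.add(best)
        remaining = [s for s in remaining if best not in s]
    return hitting
```

The standard set-cover analysis shows the greedy choice hits a large fraction of the remaining subsets in every step, which is the source of the O((n log n)/L) size bound.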

##### Partitions.

We will use the following lemmas on partitioning a set of weighted items into equal parts. Note that all the lemmas are constructive, that is, they also imply a deterministic algorithm for constructing the partition in a canonical way.

###### Lemma 5 ([14]).

Let n and k be natural numbers such that k divides n, and let w_1, …, w_n, W, and w be natural numbers such that ∑_{i=1}^n w_i ≤ W and w_i ≤ w for all i. Then there is a partition of [n] into sets I_1, …, I_k of size n/k such that

 $\sum_{i\in I_j} w_i \le W/k + w \quad\text{for all } j.$
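One way to realize such a partition (a sketch under the lemma's assumptions; `balanced_equal_partition` is a hypothetical name) is to sort the indices by weight and deal them out round-robin:

```python
def balanced_equal_partition(weights, k):
    """Partition indices 0..n-1 (k must divide n) into k sets of size n/k,
    each of total weight at most sum(weights)/k + max(weights): sort the
    indices by weight (descending) and deal them round-robin."""
    n = len(weights)
    assert n % k == 0, "k must divide n"
    order = sorted(range(n), key=lambda i: -weights[i])
    parts = [[] for _ in range(k)]
    for pos, idx in enumerate(order):
        parts[pos % k].append(idx)
    return parts
```

Each set receives one index per round of the deal, and every index after the first is no heavier than the average of the previous round, which gives the W/k + w bound.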
###### Lemma 6.

Let n and k be natural numbers, and let w_1, …, w_n, W, and w be natural numbers such that ∑_{i=1}^n w_i ≤ W and w_i ≤ w for all i. Then there is a partition of [n] into at most k sets I_1, …, I_k such that for each j, the set I_j consists of consecutive elements, and

 $\sum_{i\in I_j} w_i \le W/k + w.$
###### Proof.

Construct the sets of the partition by starting from the first element, adding new elements to the current set until its total weight exceeds W/k, and then starting a new set. Since all weights are at most w, we have ∑_{i∈I_j} w_i ≤ W/k + w for all j, and since every completed set has weight more than W/k, this process generates at most k sets. ∎
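The greedy argument above translates directly into code (a sketch; `consecutive_partition` is a hypothetical name):

```python
def consecutive_partition(weights, k, W):
    """Greedily cut indices 0..n-1 into at most k runs of consecutive
    indices, each of total weight at most W/k + max(weights)
    (assuming sum(weights) <= W)."""
    parts, current, total = [], [], 0
    for i, wi in enumerate(weights):
        current.append(i)
        total += wi
        if total > W / k:       # close the current run once it exceeds W/k
            parts.append(current)
            current, total = [], 0
    if current:
        parts.append(current)
    return parts
```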

###### Lemma 7.

Let n and k be natural numbers, and let w_1, …, w_n, u_1, …, u_n, W, U, w, and u be natural numbers such that ∑_{i=1}^n w_i ≤ W and ∑_{i=1}^n u_i ≤ U, and we have w_i ≤ w and u_i ≤ u for all i. Then there is a partition of [n] into at most k sets I_1, …, I_k such that for each j, the set I_j consists of consecutive elements, and

 $\sum_{i\in I_j} w_i \le 2(W/k + w) \qquad\text{and}\qquad \sum_{i\in I_j} u_i \le 2(U/k + u).$
###### Proof.

We begin by applying Lemma 6 separately to the sequences w_1, …, w_n and u_1, …, u_n. That is, by Lemma 6, there exist the following two partitions of the set [n] into at most k sets of consecutive indices:

• A partition I_1^w, …, I_k^w such that for any j, we have ∑_{i∈I_j^w} w_i ≤ W/k + w.

• A partition I_1^u, …, I_k^u such that for any j, we have ∑_{i∈I_j^u} u_i ≤ U/k + u.

Let f_1 ≤ f_2 ≤ … ≤ f_m be the last elements of the sets from partitions I^w and I^u in order, with duplicates removed; since both partitions end at n, we have m ≤ 2k − 1.

Define the sets I_1, …, I_k by cutting [n] at every other fencepost f_2, f_4, …, as well as at n; intuitively, this corresponds to taking the fenceposts of the partitions I^w and I^u, and keeping every other fencepost to give a new partition into at most k sets of consecutive indices. Clearly I_1, …, I_k form a partition of [n], and since each I_j overlaps at most two sets from I^w and at most two sets from I^u, we have ∑_{i∈I_j} w_i ≤ 2(W/k + w) and ∑_{i∈I_j} u_i ≤ 2(U/k + u). ∎
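The fencepost construction can be sketched as follows (hypothetical names; `cuts` is the greedy procedure of Lemma 6 returning the last index of each run):

```python
def doubly_balanced_partition(w, u, k, W, U):
    """Merge a w-balanced and a u-balanced consecutive partition by keeping
    every other fencepost: at most k consecutive runs, with w-weight at most
    2*(W/k + max(w)) and u-weight at most 2*(U/k + max(u)) per run."""
    def cuts(weights, bound):
        ends, total = [], 0
        for i, x in enumerate(weights):
            total += x
            if total > bound:
                ends.append(i)
                total = 0
        if not ends or ends[-1] != len(weights) - 1:
            ends.append(len(weights) - 1)
        return ends
    merged = sorted(set(cuts(w, W / k)) | set(cuts(u, U / k)))
    fence = merged[1::2]                  # keep every other fencepost
    if not fence or fence[-1] != len(w) - 1:
        fence.append(len(w) - 1)
    parts, start = [], 0
    for end in fence:
        parts.append(list(range(start, end + 1)))
        start = end + 1
    return parts
```

Each resulting run spans at most two runs of each of the two original partitions, which is exactly where the factor 2 in the bounds comes from.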

## 2 Matrix multiplication

### 2.1 Output-sensitive sparse matrix multiplication

Our first matrix multiplication result is an output-sensitive variant of sparse matrix multiplication. In the matrix multiplication problem, we are given two n × n matrices S and T over a semiring R, and the task is to compute the product matrix P = ST, where

 $P[u,v] = \sum_{w\in V} S[u,w]\, T[w,v].$

Following [13, 42], we assume for concreteness that in the Congested Clique model, each node v receives the row S[v, V] and the column T[V, v] as a local input, and we want to compute the output so that each node v locally knows the row P[v, V] of the output matrix.

##### Matrix densities.

For a matrix S, we denote by nz(S) the number of non-zero entries in S. Furthermore, we define the density ρ_S of S as the smallest positive integer satisfying nz(S) ≤ ρ_S n.

When discussing the density of a product matrix P = ST, we would like to simply consider the density ρ_P; however, for technical reasons, we want to ignore zero entries that appear due to cancellations. Formally, let Ŝ be the binary matrix defined as

 $\hat S[i,j] = \begin{cases} 1 & \text{if } S[i,j] \ne 0, \\ 0 & \text{if } S[i,j] = 0, \end{cases}$

and define T̂ similarly. Let P̂ = Ŝ T̂, where the product is taken over the Boolean semiring. We define the density of the product ST, denoted by ρ̂_{ST}, as the smallest positive integer satisfying nz(P̂) ≤ ρ̂_{ST} n. Note that for most of our applications, we operate over semirings where no cancellations can occur in additions, in which case ρ̂_{ST} = ρ_P.

We also note that while we assume that the input matrices are square n × n matrices, we can also use this framework for rectangular matrix multiplications, simply by padding the matrices with zeroes to make them square.
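The density definitions are simple to state in code; the sketch below (hypothetical names) uses ∞ as the zero element, matching the min-plus semiring used for distance products, where no cancellations occur:

```python
import math

INF = float('inf')  # the zero element of the min-plus (tropical) semiring

def density(M):
    """rho_M: smallest positive integer with nz(M) <= rho_M * n."""
    n = len(M)
    nz = sum(1 for row in M for x in row if x != INF)
    return max(1, math.ceil(nz / n))

def product_density(S, T):
    """rho-hat_{ST}: density of the Boolean product of the supports of S
    and T; over min-plus this coincides with the density of the product."""
    n = len(S)
    nz = 0
    for i in range(n):
        cols = set()
        for k in range(n):
            if S[i][k] != INF:
                cols.update(j for j in range(n) if T[k][j] != INF)
        nz += len(cols)
    return max(1, math.ceil(nz / n))
```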

##### Sparse matrix multiplication algorithm.

Our main result for sparse matrix multiplication is the following:

###### Theorem 8.

The matrix product P = ST can be computed deterministically in

 $O\left(\frac{(\rho_S\,\rho_T\,\hat\rho_{ST})^{1/3}}{n^{2/3}} + 1\right)$

rounds in the Congested Clique, assuming we know ρ̂_{ST} beforehand.

We note that for all of our applications, the requirement that we know ρ̂_{ST} beforehand is satisfied. However, the algorithm can be modified to work without knowledge of ρ̂_{ST} at the additional cost of an O(log n) multiplicative factor; we simply start with the estimate ρ̂_{ST} = 1, restarting the algorithm with a doubled estimate whenever the running time for the current estimate is exceeded.

The rest of this section gives the proof of Theorem 8. We start by describing the overall structure of the algorithm, and then detail the different phases of the algorithm individually.

#### 2.1.1 Algorithm description

##### Algorithm parameters.

We define the algorithm parameters a, b and c as

 $a = \frac{(\rho_T\,\hat\rho_{ST}\,n)^{1/3}}{\rho_S^{2/3}}, \qquad b = \frac{(\rho_S\,\hat\rho_{ST}\,n)^{1/3}}{\rho_T^{2/3}}, \qquad c = \frac{(\rho_S\,\rho_T\,n)^{1/3}}{\hat\rho_{ST}^{2/3}}.$

These parameters will control how we split the matrix multiplication task into independent subtasks. To see why they are chosen in this particular way, we note that we will require that abc = n, and the final running time of the algorithm will be O((ρ_S a + ρ_T b + ρ̂_{ST} c)/n + 1) rounds, which is optimized by selecting a, b and c as above; this gives the running time in Theorem 8.

For simplicity, we assume that a, b and c are integers. If not, taking the smallest greater integers causes at most constant overhead.
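A quick numerical sketch of these parameters (hypothetical helper name) also confirms that their product is n and that in the dense case a = b = c = n^{1/3}, recovering the classical 3D split:

```python
def split_parameters(rho_s, rho_t, rho_p, n):
    """The three split parameters; their product is exactly n."""
    a = (rho_t * rho_p * n) ** (1 / 3) / rho_s ** (2 / 3)
    b = (rho_s * rho_p * n) ** (1 / 3) / rho_t ** (2 / 3)
    c = (rho_s * rho_t * n) ** (1 / 3) / rho_p ** (2 / 3)
    return a, b, c
```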

##### Algorithm overview.

Our algorithm follows the basic idea of the classical 3D matrix multiplication algorithm, as presented for the Congested Clique by [13], and as adapted for sparse matrix multiplication by [14]. That is, we want to reduce the matrix multiplication task into smaller instances of matrix multiplication, and use a single node for each one of these. However, since we are working with sparse matrices, we have to take certain additional considerations into account:

• Whereas the 3D matrix multiplication algorithm splits the original multiplication into n products of n^{2/3} × n^{2/3} matrices, we would ideally like to split it into n products of submatrices whose dimensions are adapted to the densities ρ_S, ρ_T and ρ̂_{ST}.

• Unlike in the dense case, we also have to make sure that all of our products are equally sparse. While this could be achieved using randomization similarly to the triangle detection algorithms of [49, 15], we want to do this deterministically.

With the above considerations in mind, we now give an overview of our sparse matrix multiplication algorithm:

1. We compute a partition of the matrix multiplication task into n subtasks, where each subtask is a product of a submatrix of S and a submatrix of T, and the subtasks are balanced so that the submatrices of S and of T in each subtask have roughly equally many non-zero entries. This step takes O(1) rounds. (Section 2.1.2.)

2. Each node v learns the submatrices of its subtask and computes their product P_v. Note that after this step, some of the matrices P_v may be very dense. This step takes O((ρ_S ρ_T ρ̂_{ST})^{1/3}/n^{2/3} + 1) rounds. (Section 2.1.3.)

3. We balance the output matrices P_v so that each node holds equally many values that need to be summed to obtain the final output matrix P. This is achieved by duplicating those subtasks where the output is too dense. This step takes O((ρ_S ρ_T ρ̂_{ST})^{1/3}/n^{2/3} + 1) rounds. (Section 2.1.4.)

4. The intermediate values obtained in Step 3 are summed together to obtain the output matrix P. This step takes O((ρ_S ρ_T ρ̂_{ST})^{1/3}/n^{2/3} + 1) rounds. (Section 2.1.5.)

Note that the total running time of the above algorithm is O((ρ_S ρ_T ρ̂_{ST})^{1/3}/n^{2/3} + 1) rounds, which by the choice of a, b and c is as stated in Theorem 8. Note that Steps 1 and 2 are essentially streamlined versions of corresponding tools from [14], while Steps 3 and 4 are new.

#### 2.1.2 Cube partitioning

We say that a subcube of V³ = V × V × V is a set of the form V_S × V_P × V_T, where V_S, V_P, V_T ⊆ V. Note that such a subcube corresponds to a matrix multiplication task S[V_S, V_P] T[V_P, V_T]. Thus, a partition of the cube V³ into subcubes corresponds to a partition of the original matrix multiplication into smaller matrix multiplication tasks, as discussed in the overview.

###### Lemma 9.

There is a Congested Clique algorithm running in O(1) rounds that produces a globally known partition of the cube V³ into n disjoint subcubes V_S^v × V_P^v × V_T^v for v ∈ V, such that for each subcube, we have |V_S^v| · |V_P^v| · |V_T^v| = n², and the total number of non-zero entries is

1. O(ρ_S a + n) in the submatrix S[V_S^v, V_P^v], and

2. O(ρ_T b + n) in the submatrix T[V_P^v, V_T^v].

###### Proof.

We start by partitioning the input matrices into equally sparse ‘slices’. Specifically, we do the following:

1. All nodes broadcast the number of non-zero entries on their row of S. Based on this information, all nodes deterministically compute the same partition V_S^1, …, V_S^b of V into b sets of size n/b such that nz(S[V_S^i, V]) = O(ρ_S n/b + n); such a partition exists by Lemma 5.

2. Using the same procedure as above, the nodes compute a partition V_T^1, …, V_T^a of V into a sets of size n/a such that nz(T[V, V_T^j]) = O(ρ_T n/a + n).

There are now ab pairs (i, j), each corresponding to a subcube V_S^i × V × V_T^j. We now partition each of these subcubes in parallel as follows. First, we partition the nodes into ab sets of c = n/(ab) nodes, each such set K_{ij} corresponding to a pair of indices (i, j). The final partition is now computed as follows:

1. Nodes redistribute the input matrices so that each node w holds the column S[V, w] of S and the row T[w, V] of T. This can be done in O(1) rounds.

2. In parallel for each i and j, each node w sends the number of non-zero elements in S[V_S^i, w] and in T[w, V_T^j] to all nodes in K_{ij}.

3. Nodes in K_{ij} compute a partition V_P^{ij,1}, …, V_P^{ij,c} of V into c sets of consecutive indices such that the number of non-zero entries in each S[V_S^i, V_P^{ij,l}] is O(ρ_S a + n) and the number of non-zero entries in each T[V_P^{ij,l}, V_T^j] is O(ρ_T b + n); such a partition exists by Lemma 7.

4. For each l, the l-th node in K_{ij} broadcasts the first and last index of V_P^{ij,l} to all other nodes, allowing all nodes to reconstruct these sets.

The resulting subcubes V_S^i × V_P^{ij,l} × V_T^j are now known globally, and they satisfy the requirements of the claim by construction. ∎

#### 2.1.3 Intermediate products

##### Balancing.

As an auxiliary tool, we want to solve a balancing task where each node holds weighted entries with weights w_{ij} that satisfy

 $\sum_{i,j} w_{ij} \le W \qquad\text{and}\qquad w_{ij} \le n,$

and the task is to re-distribute the entries so that each node holds an equal number of entries, of total weight O(W/n + n). Concretely, we assume each weighted entry consists of the weight and O(log n) bits of data.

###### Lemma 10.

The above balancing task can be solved in O(1) rounds in the Congested Clique.

###### Proof.

As a first step, we globally learn the distribution of the different weights, and compute a globally known ordering for the weighted entries:

1. For each k, all nodes send the number of entries they hold with weight exactly k to node k, with node 1 handling the entries with both weight 0 and weight 1. Each node k then broadcasts the total to all other nodes.

2. Nodes sort the weighted entries by weight using Lenzen’s sorting algorithm [43].

Since all nodes know the distribution of the weights, all nodes can now locally compute which entries the other nodes hold. This gives us a globally consistent numbering for the weighted entries, which we can use to solve the balancing task:

1. Nodes locally compute a partition of the weighted entries into n sets of equal size with total weight O(W/n + n) each; such a partition exists by Lemma 5.

2. Each set is assigned to a separate node. Nodes redistribute the entries so that each node receives its assigned set; this takes O(1) rounds using Lenzen’s routing algorithm.

All steps clearly take O(1) rounds. ∎

##### Computing intermediate products.

We now show how to compute the intermediate products given by the cube partition of Lemma 9. The following lemma is in fact more general; we will use it as a subroutine in subsequent steps of the algorithm.

###### Lemma 11.

Let {V_S^v × V_P^v × V_T^v}_{v∈V} be a partition of V³ as in Lemma 9, and let σ : V → V be a (not necessarily bijective) function that is known to all nodes. There is a Congested Clique algorithm running in O((ρ_S ρ_T ρ̂_{ST})^{1/3}/n^{2/3} + 1) rounds such that after the completion of the algorithm, each node v locally knows the product

 $P_{\sigma(v)} = S[V_S^{\sigma(v)}, V_P^{\sigma(v)}]\, T[V_P^{\sigma(v)}, V_T^{\sigma(v)}].$
###### Proof.

For each , define as

 wij={|{v∈V:(i,j)∈VSσ(v)×VPσ(v)}|, if S[i,j] is non-zero, and0, otherwise,

that is, is the number of times appears in matrices , or if is a zero entry. Clearly we have , and since the partition satisfies the conditions of Lemma 9, we have that

 ∑i,jwij=∑v∈Vnz(S[VSσ(v),VPσ(v)])=O(ρSn2/a+n2).

All nodes can compute the values locally for their own row, since they depend only on the partition and function .

We now distribute the entries of input matrix so that each node learns the matrix :

1. Using Lemma 10, we balance the input entries so that each node holds entries with total weight . Specifically, for each entry a node receives, it receives the value along with the index .

2. Since the nodes know the partition and function , each node computes to which nodes it needs to send each of the entries it received in the first step. Since each entry is duplicated times, each node needs to send messages, and dually, each node needs to receive a submatrix of with entries. These messages can be delivered in rounds.

By an identical argument, each node can learn the matrix in rounds and compute locally. ∎
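The computation of the duplication weights $w_{ij}$ from the proof above can be simulated sequentially. This sketch simplifies the cube partition to a map from nodes to pairs of row/column index sets; the names `S_nonzero` and `blocks` are illustrative, not the paper's data structures.

```python
from collections import Counter

def duplication_counts(S_nonzero, blocks):
    """For each non-zero position (i, j) of S, count in how many
    intermediate products S[V_S, V_P] it appears -- the weights w_ij.
    `blocks` maps each node v to a pair (V_S, V_P) of index sets,
    a simplification of the cube partition of Lemma 9."""
    w = Counter()
    for V_S, V_P in blocks.values():
        rows, cols = set(V_S), set(V_P)
        for (i, j) in S_nonzero:
            if i in rows and j in cols:
                w[(i, j)] += 1
    return w
```

Each node only needs the counts for its own row, but since the partition and the function are globally known, any node can compute any of these counts locally.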

#### 2.1.4 Balanced intermediate products

We say that an intermediate value in the matrix multiplication is a value

 p_{vWu} = S[v, W] \, T[W, u] = \sum_{w \in W} S[v, w] \, T[w, u].

That is, an intermediate value is a partial sum of products for a single position of the output matrix. For concreteness, in the Congested Clique implementation we encode these as tuples consisting of the position (v, u) and the value p_{vWu}.

###### Lemma 12.

There is a Congested Clique algorithm running in rounds such that after the completion of the algorithm,

1. each node holds non-zero intermediate values, and

2. each non-zero elementary product in the matrix multiplication is included in exactly one intermediate value held by some node.

###### Proof.

As the first part of the algorithm, we compute all the intermediate product matrices and learn their densities:

1. Compute a partition of using Lemma 9.

2. Applying Lemma 11 with , each node computes the matrix

 P_v = S[V_S^v, V_P^v] \, T[V_P^v, V_T^v].

This takes rounds.

3. Each node broadcasts the number of non-zero entries in to all other nodes.

Next, we want to balance the dense intermediate product matrices between multiple nodes. We cannot do this directly, but we can instead duplicate the products:

1. Construct a function so that for each with , there are at least values satisfying . To see that this is possible, we observe that by the definition of , there are at most positions where matrices can have non-zero entries and each such position is duplicated times in the partition of the cube , implying that . Thus, we have

 \sum_{v \in V} \left\lfloor \frac{\mathrm{nz}(P_v)}{\hat{\rho}_{ST} n / c} \right\rfloor \le \sum_{v \in V} \frac{\mathrm{nz}(P_v)}{\hat{\rho}_{ST} n / c} = \frac{1}{\hat{\rho}_{ST} n / c} \sum_{v \in V} \mathrm{nz}(P_v) \le \frac{\hat{\rho}_{ST} n^2 / c}{\hat{\rho}_{ST} n / c} = n.

This step can be done locally using information obtained in the first part of the algorithm.

2. Apply Lemma 11 with . This takes rounds.

3. For each , each node with or assumes responsibility for entries of the matrix and discards the rest. More specifically, node determines such that is the th node responsible for , splits the non-zero entries of into parts and selects the th part; if both and , then selects two parts. This step can be done locally based on information obtained earlier.

After the completion of the algorithm, each node has intermediate values from at most two matrices . The total running time is rounds. ∎
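The construction of the duplication function in step 1 can be sketched as a greedy assignment. In this sketch, `threshold` is a stand-in for the quantity $\hat{\rho}_{ST} n / c$, and the node supply suffices by the counting argument displayed in the proof.

```python
def build_duplication_map(nz_counts, threshold, nodes):
    """Assign helper nodes to intermediate products: a product v with
    nz(P_v) non-zero entries receives roughly nz(P_v)/threshold copies.
    By the counting argument of the proof, the total number of copies
    is at most n, so the greedy assignment never runs out of nodes."""
    sigma = {}
    free = iter(nodes)
    for v, nz in nz_counts.items():
        copies = max(1, nz // threshold)  # at least the owner itself
        for _ in range(copies):
            sigma[next(free)] = v
    return sigma
```

Every node can run this locally, because the densities nz(P_v) were broadcast in the first part of the algorithm.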

#### 2.1.5 Balanced summation

###### Lemma 13.

Assume that the non-zero intermediate values of the matrix multiplication have been computed as in Lemma 12. Then there is a Congested Clique algorithm running in rounds that computes the output matrix .

###### Proof.

We start by initializing each row of the output matrix to all zero values. Our objective is to accumulate the intermediate values into this initial matrix, with each node being responsible for one row of the output matrix.

All nodes split their intermediate values into sets of intermediate values. We then repeat the algorithm times, each repetition accumulating one set of intermediate values from each node:

1. Nodes sort the intermediate values being processed globally by matrix position. This takes rounds using Lenzen’s sorting algorithm.

2. Each node locally sums all intermediate products it received corresponding to the same position.

3. Each node broadcasts the minimum and maximum matrix position it currently holds. Nodes use this information to deduce if the same matrix position occurs in multiple nodes. If so, all sums corresponding to that position are sent to the smallest id node having that position; each node sends at most one sum and receives at most one sum from each other node, so this step takes rounds.

4. If a node received new values from other nodes, these are now added to the appropriate sum.

5. All nodes now hold sums corresponding to at most matrix positions. Using Lenzen’s routing algorithm, we redistribute these so that node obtains sums corresponding to positions on row ; this takes rounds. Node then accumulates these sums to the output matrix.

Clearly, after completion of all repeats, we have obtained the output matrix . Since each repeat of the accumulation algorithm takes rounds, and there are repeats, the whole process takes rounds. ∎
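One repetition of the accumulation loop can be simulated sequentially as sort-then-merge. The semiring addition is passed in as `add`; for distance products it would be `min`. This is a sketch of the semantics, not of the communication pattern.

```python
def accumulate_intermediate_values(values, add=min):
    """Simulate one accumulation repetition of the summation lemma:
    sort intermediate values (i, j, p) by matrix position, then merge
    values sharing a position with the semiring addition. In the
    Congested Clique, the sort is Lenzen's sorting algorithm and the
    merge is the cross-node fix-up of step 3."""
    values = sorted(values, key=lambda t: (t[0], t[1]))
    out = {}
    for i, j, p in values:
        out[(i, j)] = add(out[(i, j)], p) if (i, j) in out else p
    return out
```

After all repetitions, the per-position sums are exactly the rows of the output matrix.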

### 2.2 Matrix multiplication with sparsification

Our second matrix multiplication result is a variant of sparse matrix multiplication where we control the density of the output matrix.

##### Problem definition.

In this section, we assume that our semiring satisfies the following conditions:

1. there is a total order on , and

2. the addition operation satisfies a + b = min(a, b), where the minimum is taken in terms of the total order.

Let S and T be the input matrices over the semiring, and denote by P their product. In the filtered matrix multiplication problem, we are given the input matrices S and T along with an output density parameter ρ, and the task is to compute a matrix that retains on each row the ρ smallest entries of the corresponding row of P, that is,

1. each row of the output matrix has at most ρ non-zero entries,

2. if is non-zero, then , and

3. if is zero, then .
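As a reference for these semantics, the following sketch computes the filtered product over the min-plus semiring, which satisfies both conditions: the total order is the usual order on numbers, and addition is min. It is a sequential specification of the problem, not the Congested Clique algorithm.

```python
import heapq

def filtered_minplus_product(S, T, rho):
    """Reference semantics of filtered matrix multiplication over the
    min-plus semiring: compute the product and keep only the rho
    smallest entries of each row. Matrices are dicts mapping positions
    (i, j) to finite weights; absent entries are +infinity."""
    prod = {}
    for (i, k), s in S.items():
        for (k2, j), t in T.items():
            if k == k2:  # paths i -> k -> j; semiring product is +
                key = (i, j)
                prod[key] = min(prod.get(key, float('inf')), s + t)
    # Filtering step: retain the rho smallest entries on each row.
    out = {}
    for i in {r for r, _ in prod}:
        row = [(v, j) for (r, j), v in prod.items() if r == i]
        for v, j in heapq.nsmallest(rho, row):
            out[(i, j)] = v
    return out
```

The distributed algorithm below produces a matrix with exactly these guarantees, but filters twice (once per intermediate product, once at the end) to keep the communication sparse.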

##### Filtered matrix multiplication algorithm.

We now prove an analogue of Theorem 8 for the filtered matrix multiplication problem. The high-level proof idea is largely the same, the main exception being that we perform two filtering steps, one after the computation of the smaller products and a second after the computation of the output matrix, to remove all but the ρ smallest entries.

Our result for sparsification is the following:

###### Theorem 14.

Filtered matrix multiplication can be computed in

 O\left( \frac{(\rho_S \rho_T \rho)^{1/3}}{n^{2/3}} + 1 \right)

rounds in the Congested Clique.

###### Proof.

Let , and be defined as

 a = \frac{(\rho_T \rho n)^{1/3}}{\rho_S^{2/3}}, \quad b = \frac{(\rho_S \rho n)^{1/3}}{\rho_T^{2/3}}, \quad c = \frac{(\rho_S \rho_T n)^{1/3}}{\rho^{2/3}}

as before, and define \hat{a} and \hat{c} analogously. Note that we now have

 \rho_S n / \hat{a} \le \rho_S n / a \quad \text{and} \quad \rho n / \hat{c} \le \rho n / c. \qquad (1)

We now follow the same algorithm as for Theorem 8, with the following minor modifications:

1. We first compute a partition of the cube using Lemma 9 with parameters , and .

2. Using Lemma 11, each node computes the intermediate product matrix

 P_v = S[V_S^v, V_P^v] \, T[V_P^v, V_T^v],

which takes rounds. Note that by Lemma 9, each has rows.

3. Each node locally computes matrix by removing all but the ρ smallest entries from each row of . Since has rows, it has non-zero entries.

4. Using Lemma 13, the nodes merge the intermediate product matrices into a matrix . This takes rounds.

5. Each node locally removes all but the ρ smallest entries on its row of to obtain the final output matrix .

The running time of the algorithm is rounds. By (1), this is rounds. ∎

## 3 Distance tools

In this section, we use our matrix multiplication algorithms to construct basic distance computation tools that will be used for our final distance computation algorithms. Though we only use the distance tools for undirected graphs, we note that they work also for directed graphs.

### 3.1 Distance products

##### Augmented min-plus semiring.

The general algorithmic idea for our distance tools is to apply the matrix multiplication algorithms over the min-plus semiring. However, to ensure that we get consistent results in terms of hop distances, a property we require for our k-nearest and source detection distance tools, we augment the basic min-plus semiring to keep track of the number of hops.

We define the augmented min-plus semiring to encode paths in distance computations as follows. The elements of are tuples , where

1. the first component is either the weight of an edge or a path, or ∞, and

2. the second component is a non-negative integer or ∞, representing the number of hops.

Let be the lexicographical order on tuples , and define the addition operator as the minimum over the total order given by . The multiplication operation is defined as . It is easy to verify that is a semiring with idempotent addition, that is, for all . Moreover, the structure satisfies the conditions of Theorem 14.
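A minimal sketch of the augmented semiring operations, representing elements as Python pairs (weight, hops); the function names are illustrative.

```python
INF = float('inf')

def aug_add(x, y):
    """Semiring addition: the minimum in the lexicographic order on
    (weight, hops), so ties in weight are broken by fewer hops.
    Python compares tuples lexicographically, so min suffices."""
    return min(x, y)

def aug_mul(x, y):
    """Semiring multiplication: concatenating two paths adds both
    the weights and the hop counts."""
    return (x[0] + y[0], x[1] + y[1])
```

Here (0, 0) is the multiplicative identity and (INF, INF) the additive identity, and idempotence of addition is immediate since min(x, x) = x.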

##### Distance products.

We call the product of and over the augmented min-plus semiring the augmented distance product of and , and denote it by . In particular, for a graph , we define the augmented weight matrix by setting

 W[u, v] = \begin{cases} (0, 0) & \text{if } u = v, \\ (w(u, v), 1) & \text{if there is an edge from } u \text{ to } v\text{, and} \\ (\infty, \infty) & \text{otherwise.} \end{cases}

As with the regular distance product, the th augmented distance product power gives the distances for all pairs of nodes using paths of at most hops, as well as the associated number of hops.
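The construction of W and one augmented distance product can be sketched as follows. This is a sequential illustration of the definitions, with edges given as a map from node pairs to weights; the function names are assumptions of the sketch.

```python
def augmented_weight_matrix(n, edges):
    """Build the augmented weight matrix W of a weighted digraph:
    (0, 0) on the diagonal, (w(u, v), 1) on edges, (inf, inf) elsewhere.
    `edges` maps ordered pairs (u, v) to edge weights."""
    INF = float('inf')
    return [[(0, 0) if u == v
             else (edges[(u, v)], 1) if (u, v) in edges
             else (INF, INF)
             for v in range(n)] for u in range(n)]

def augmented_distance_product(A, B):
    """Augmented distance product: semiring addition is the
    lexicographic minimum over (weight, hops), and multiplication
    adds weights and hop counts componentwise."""
    n = len(A)
    return [[min((A[u][w][0] + B[w][v][0], A[u][w][1] + B[w][v][1])
                 for w in range(n))
             for v in range(n)] for u in range(n)]
```

Squaring W once in this way yields, for every pair, the shortest distance over paths of at most two hops together with the hop count realizing it.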

Finally, we observe that the augmented distance product gives a consistent ordering in terms of distance from , in the following sense:

###### Lemma 15.

Let , and let be the shortest path of at most hops from to . Then for every node on the path , we have .

###### Proof.

It is sufficient to prove the claim for the node that is the last node on before , as the rest follows by a straightforward induction. Assume towards a contradiction that . If , then clearly , since is a shortest path. If , then the hop distance from to is larger by compared to the hop distance between and . In both cases, we have . ∎

##### Recovering paths.

As noted in [13], it is possible to recover a routing table from the distance product algorithms in addition to the distances. Specifically, it is easy to see that since the matrix multiplication algorithms explicitly compute the non-zero products, they can be modified to provide a witness for each non-zero entry in a product . That is, for any non-zero , we also get a witness such that . This allows us to obtain, for any distance estimate , a node such that the shortest path from to uses the edge , as discussed in [13].

### 3.2 k-nearest neighbors

In the k-nearest problem, we are given an integer k, and the task is to compute for each node the set of k nearest nodes and the distances to those nodes, breaking ties first by hop distance and then arbitrarily. More formally, we want each node to compute a set of nodes and distances for all , such that the values for are the smallest on row of in terms of the order on the augmented min-plus semiring .

Note that it follows immediately from Lemma 15 that all nodes are at most hops away from , and all nodes on the shortest path from to are also in .

###### Theorem 16.

The k-nearest problem can be solved in

 O\left( \left( \frac{k}{n^{2/3}} + 1 \right) \log k \right)

rounds in the Congested Clique.

###### Proof.

For a matrix M, let \overline{M} denote the matrix obtained by discarding all but the k smallest values on each row of M. We solve the k-nearest problem as follows:

1. All nodes discard all but the k smallest values on their own row of W to obtain \overline{W}.

2. We now observe that

 \overline{W^2} = \overline{\overline{W} \star \overline{W}}, \quad \overline{W^4} = \overline{\overline{W^2} \star \overline{W^2}}, \quad \ldots, \quad \overline{W^k} = \overline{\overline{W^{k/2}} \star \overline{W^{k/2}}}.
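The iterated filtered squaring can be simulated sequentially as below; `filter_rows` plays the role of the bar operator and the number of squarings is ⌈log₂ k⌉. All names are illustrative, and the augmented product is the sequential specification, not the Congested Clique routine.

```python
INF = float('inf')

def aug_product(A, B):
    """Augmented distance product over pairs (weight, hops):
    lexicographic min as addition, componentwise sum as product."""
    n = len(A)
    return [[min((A[u][w][0] + B[w][v][0], A[u][w][1] + B[w][v][1])
                 for w in range(n))
             for v in range(n)] for u in range(n)]

def filter_rows(M, k):
    """The bar operator: keep the k smallest entries (in the
    lexicographic order) on each row, replace the rest by (inf, inf)."""
    out = []
    for row in M:
        keep = set(sorted(range(len(row)), key=lambda j: row[j])[:k])
        out.append([row[j] if j in keep else (INF, INF)
                    for j in range(len(row))])
    return out

def k_nearest(W, k):
    """Iterated filtered squaring: after ceil(log2 k) squarings the
    filtered matrix holds, on each row, the k nearest nodes."""
    M = filter_rows(W, k)
    steps = max(1, (k - 1).bit_length())  # ceil(log2 k)
    for _ in range(steps):
        M = filter_rows(aug_product(M, M), k)
    return M
```

The consistency guarantee of Lemma 15 is what makes the filtering after every squaring safe: every node needed for a k-nearest path is itself among the k nearest.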