1 Introduction
Due to the prevalence of huge networks, scalable algorithms for fundamental graph problems have recently gained a lot of importance in the area of parallel computing. The Massively Parallel Computation (MPC) model [KSV10, GSZ11, BKS14, ANOY14, BKS17, CŁM18] constitutes a common abstraction of several popular large-scale computation frameworks—such as MapReduce [DG08], Dryad [IBY07], Spark [ZCF10], and Hadoop [Whi12]—and thus serves as the basis for the systematic study of massively parallel algorithms.
While classic parallel (e.g., PRAM) or distributed (e.g., LOCAL) algorithms can usually be implemented straightforwardly in MPC in the same number of rounds [KSV10, GSZ11], the additional power of local computation (compared to PRAM) or of global communication (compared to LOCAL) could potentially be exploited to obtain faster algorithms. Czumaj et al. [CŁM18] thus ask:
“Are the parallel round bounds “inherited” from the PRAM model tight? In particular, which problems can be solved in significantly smaller number of rounds than what the […] model suggest[s]?”

Surprisingly, for maximal matching, one of the most central graph problems in parallel and distributed computing, which goes back to the very beginning of the area, the answer to this question is not known. In fact, even worse, our understanding of the maximal matching problem in the MPC model is rather bleak. Indeed, the only subpolylogarithmic-round algorithm, due to Lattanzi, Moseley, Suri, and Vassilvitskii [LMSV11], requires the memory per machine to be substantially superlinear in the number n of nodes in the graph. (We are aware of a subpolylogarithmic-round low-memory algorithm by Ghaffari and Uitto in a concurrent work [GU18]; note, however, that only algorithms that are significantly faster than their LOCAL counterparts are considered efficient.) Superlinear memory is not only prohibitively large and hence impractical for massive graphs, but also allows an easy or even trivial solution for sparse graphs; yet both massive size and sparsity are often assumed to be the case for many practical graphs [KSV10, CŁM18]. When the local memory is restricted to be (nearly) linear in n, the round complexity of Lattanzi et al.'s algorithm drastically degrades, falling back to the trivial O(log n) bound attained by simulating the PRAM algorithms due to Luby [Lub85] and, independently, Alon, Babai, and Itai [ABI86].
For other classic problems, such as maximal independent set (MIS) or approximate maximum matching, we have a slightly better understanding. There, the local memory can be reduced to Õ(n) while still admitting poly(log log n)-round algorithms [ABB17, CŁM18, GGK18]. Yet, all these algorithms fail to go (substantially) below linear space without an (almost) exponential blowup in the running time. It is thus natural to ask whether there is a fundamental reason why the known techniques get stuck at this linear-memory barrier, and to study the low-memory MPC model with strongly sublinear space in order to address this question.
Low-Memory MPC Model for Graph Problems
We have machines with local memory of S = n^α words each, for some 0 < α < 1. (Note that we do not require α to be a constant. For the sake of simplicity of presentation, we omit the dependency on α, which enters the running times of our algorithms as a multiplicative O(1/α) factor.) A graph with n nodes, m edges, and maximum degree Δ is distributed arbitrarily across the machines. We assume the total memory in the system to be (nearly) linear in the input, i.e., Õ(m) words. The computation proceeds in synchronous rounds, each consisting of local computation in all machines in parallel, followed by global communication between the machines. We require that the total size of messages sent and received by a machine in every communication phase does not exceed its local memory capacity. The main interest lies in minimizing the number of rounds.
In this low-memory model, the best known algorithms usually stem from straightforward simulations of PRAM and LOCAL algorithms, thus requiring at least polylogarithmically many rounds. In the special case of trees, Brandt, Fischer, and Uitto [BFU18] managed to beat this bound by providing a poly(log log n)-round algorithm for MIS. The authors left it as a main open question whether there is a low-memory MPC algorithm for general graphs that runs in poly(log log n) rounds.
We make a step towards answering this question by devising a degree reduction technique that, in poly(log log n) rounds, reduces the problems of maximal matching and maximal independent set in a graph with arboricity λ to the corresponding problems in graphs with maximum degree poly(λ, log n).
Theorem 1.1.
There is a poly(log log n)-round low-memory algorithm that w.h.p. reduces maximal matching and maximal independent set in graphs with arboricity λ = n^{o(1)} to the respective problems in graphs with maximum degree poly(λ, log n). (As usual, w.h.p. stands for “with high probability” and means with probability at least 1 − n^{−c} for an arbitrary constant c. By λ = n^{o(1)} we mean subpolynomial in n; note that for polynomial λ, our results for MIS and maximal matching follow directly, without any need for degree reduction. The constants in this work are chosen merely to simplify the presentation.)

This improves over the degree reduction algorithm by Barenboim, Elkin, Pettie, and Schneider [BEPS12, Theorem 7.2], which requires polylogarithmically many rounds in the LOCAL model and can be straightforwardly implemented in the low-memory MPC model.
Our degree reduction technique, combined with the state-of-the-art algorithms for maximal matching and maximal independent set, gives rise to a number of low-memory algorithms, as overviewed next. Throughout, we state our running times based on a concurrent work by Ghaffari and Uitto [GU18], in which they prove that maximal matching and maximal independent set can be solved in Õ(√(log Δ)) low-memory rounds, improving over the bounds that can be obtained by a sped-up simulation [ASS18] of the state-of-the-art LOCAL algorithms [BEPS12, Gha16]. We apply these algorithms by Ghaffari and Uitto as a black box. By using the earlier results instead, the Õ(√(log λ)) term in the running times of our theorem statements would get replaced by a correspondingly larger one.
Theorem 1.2.
There is an Õ(√(log λ) + log^2 log n)-round low-memory algorithm that w.h.p. computes a maximal matching in a graph with arboricity λ = n^{o(1)}.
This improves over the polylogarithmic-round algorithms obtained by simulating [BEPS12] and over the Õ(√(log n))-round algorithm by [GU18]. We get the first poly(log log n)-round algorithm—and hence an almost exponential improvement over the state of the art, for linear as well as strongly sublinear space per machine—for all graphs with arboricity λ = poly(log n). This family of uniformly sparse graphs, also known as sparse-everywhere graphs, includes but is not restricted to graphs with maximum degree poly(log n), minor-closed graphs (e.g., planar graphs and graphs with bounded treewidth), and preferential attachment graphs, and thus arguably contains most graphs of practical relevance [GG06, OSSW18]. The previously known poly(log log n)-round algorithms either only worked in the special case of graphs with degree poly(log n) [GU18], or required the local memory to be strongly superlinear [LMSV11].
To the best of our knowledge, all known fast matching approximation algorithms for general graphs (even if we allow linear memory) [CŁM18, ABB17, GGK18] heavily make use of subsampling, inevitably leading to a loss of information. It is thus unlikely that these techniques are applicable to maximal matching, at least not without a significant overhead. In fact, the problem of finding a maximal matching seems to be more difficult than finding an approximate maximum matching. Indeed, currently, a constant approximation can be found almost exponentially faster than a maximal matching, and the approximation ratio can be easily improved from any constant to 1+ε using a reduction of McGregor [McG05].
Our result becomes particularly instructive when viewed in this context. It is not only the first maximal matching algorithm that breaks the linearmemory barrier for a large range of graphs, but it also enriches the bleak pool of techniques for maximal matching in the presence of low memory by one particularly simple technique.
Theorem 1.3.
There is an Õ(√(log λ) + log^2 log n)-round low-memory algorithm that w.h.p. computes a maximal independent set in a graph with arboricity λ = n^{o(1)}.
This algorithm improves over the polylogarithmic-round algorithm that is obtained by simulating the LOCAL algorithms of [BEPS12, Gha16] and over the Õ(√(log n))-round algorithm in the concurrent work by Ghaffari and Uitto [GU18]. Moreover, for graphs with arboricity λ = poly(log n), our algorithm is the first poly(log log n)-round low-memory algorithm for MIS. The previously known poly(log log n)-round algorithms for MIS either only worked in the special cases of graphs with degree poly(log n) [GU18] and trees [BFU18], or required the local memory to be Õ(n) [GGK18].
As a maximal matching automatically provides a 2-approximation for maximum matching and for minimum vertex cover, Theorem 1.2 directly implies the following result.
Corollary 1.4.
There is an Õ(√(log λ) + log^2 log n)-round low-memory algorithm that w.h.p. computes a 2-approximate maximum matching and a 2-approximate minimum vertex cover in a graph with arboricity λ = n^{o(1)}.
This is the first efficient low-memory constant-approximation algorithm for matching and vertex cover—except for the polylogarithmic-round LOCAL and PRAM simulations of [BEPS12] and [Lub85, ABI86], respectively, as well as the concurrent work in [GU18]. All the other algorithms require the space per machine to be either Õ(n) [LMSV11, CŁM18, ABB17, GGK18] or even strongly superlinear [LMSV11, AK17]. Corollary 1.4 generalizes the range of graphs that admit an efficient constant approximation for matching and vertex cover in the low-memory model from graphs with bounded maximum degree [GU18] to uniformly sparse graphs with arboricity λ = n^{o(1)}.
McGregor’s reduction [McG05] allows us to further improve the approximation ratio to 1+ε.
Corollary 1.5.
There is an Õ(√(log λ) + log^2 log n)-round low-memory algorithm that w.h.p. computes a (1+ε)-approximate maximum matching, for any constant ε > 0.
Due to a reduction by Lotker, Patt-Shamir, and Rosén [LPSR09], our constant-approximate matching algorithm can be employed to find a (2+ε)-approximate maximum weighted matching.
Corollary 1.6.
There is an Õ(√(log λ) + log^2 log n)-round low-memory algorithm that w.h.p. computes a (2+ε)-approximate maximum weighted matching, for any constant ε > 0.
Concurrent Work
In this section, we briefly discuss an independent and concurrent work by Behnezhad, Derakhshan, Hajiaghayi, and Karp [BDHK18]. The authors there arrive at the same results, with the same round complexities and the same memory requirements. As in our work, their key ingredient is a degree reduction technique that reduces the maximum degree of a graph to poly(λ, log n) in poly(log log n) rounds. While our degree reduction algorithm is based on (a variant of) the H-partition, which partitions the vertices according to their degrees, the authors in [BDHK18] show how to implement (and speed up) the degree reduction algorithm by Barenboim, Elkin, Pettie, and Schneider [BEPS12, Theorem 7.2] in the low-memory model. Note that both algorithms do not require knowledge of the arboricity λ, as further discussed in Remark 2.2.
2 Algorithm Outline and Roadmap
In the low-memory setting, one is inevitably confronted with the challenge of locality: as the space of a machine is strongly sublinear, a machine will never be able to see a significant fraction of the nodes, regardless of how sparse the graph is. Further building on the ideas of [BFU18], we cope with this imposed locality by adopting local techniques—mainly inspired by the LOCAL model [Lin92]—and enhancing them with the additional power of global communication, in order to achieve an improvement in the round complexity compared to the LOCAL algorithms while still being able to guarantee applicability in the presence of strongly sublinear memory.
The main observation behind our algorithms is the following: if the maximum degree of the graph is small, LOCAL algorithms can be simulated efficiently in the low-memory model. Our method thus basically boils down to reducing the maximum degree of the input graph, as described in Theorem 1.1, and correspondingly consists of two parts: a degree reduction part followed by a simulation part. In the degree reduction part, which constitutes the key ingredient of our algorithm, we want to find a partial solution (that is, either a matching or an independent set) so that the remainder graph—i.e., the graph after the removal of this partial solution (that is, after removing all matched nodes or after removing the independent set nodes along with all their neighbors)—has smaller degree.
Lemma 2.1.
There are poly(log log n)-round low-memory algorithms that compute a matching and an independent set, respectively, in a graph with arboricity λ = n^{o(1)} so that the remainder graph w.h.p. has maximum degree poly(λ, log n).
Note that this directly implies Theorem 1.1. Next, we show how Theorems 1.2 and 1.3 follow from Lemma 2.1 as well as from an efficient simulation of algorithms due to [GU18].
Proof of Theorems 1.2 and 1.3.
If λ, and hence Δ, is at least polynomial in n, we directly apply the algorithm by [GU18], which runs in Õ(√(log Δ)) = Õ(√(log λ)) rounds. Otherwise, we first apply the algorithm of Lemma 2.1 to obtain a partial solution that reduces the maximum degree of the remainder graph to poly(λ, log n); this takes poly(log log n) rounds. We then apply the algorithm by [GU18] on the remainder graph, which takes Õ(√(log poly(λ, log n))) = Õ(√(log λ) + √(log log n)) rounds. ∎
Remark 2.2.
Our degree reduction algorithm in Lemma 2.1 consists of several phases, each reducing the maximum degree by a polynomial factor, as long as the degree is still large enough.
Lemma 2.3.
There are poly(log log n)-round low-memory algorithms that compute a matching and an independent set, respectively, in a graph with arboricity λ and sufficiently large maximum degree Δ (at least a large enough polynomial in λ and log n) so that the remainder graph w.h.p. has maximum degree at most Δ^{1−δ} for some constant δ > 0.
We first show that indeed iterated applications of this polynomial degree reduction lead to the desired degree reduction in Lemma 2.1.
Proof of Lemma 2.1.
We iteratively apply the polynomial degree reduction from Lemma 2.3, observing that, as long as the maximum degree is still large enough compared to λ and log n, each phase reduces the maximum degree by a polynomial factor, from Δ to at most Δ^{1−δ}, resulting in at most O(log log Δ) = O(log log n) phases. ∎
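The phase count can be made explicit with a short calculation; here δ ∈ (0, 1) is a generic constant standing for the polynomial-factor improvement per phase, Δ_t denotes the maximum degree after t phases, and T the target degree bound (notation introduced only for this sketch):

```latex
\Delta_t \;\le\; \Delta^{(1-\delta)^t}
\qquad\Longrightarrow\qquad
\Delta_t \le T
\;\text{ once }\;
t \;\ge\; \log_{\frac{1}{1-\delta}} \frac{\log \Delta}{\log T} \,,
```

and since log Δ ≤ log n, this threshold is O(log log n).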
It remains to show that such a polynomial degree reduction, as claimed in Lemma 2.3, indeed is possible. This is done in two parts. First, in Section 3, we provide a centralized algorithm for a polynomial degree reduction, and then, in Section 4, we show how to implement this centralized algorithm efficiently in the lowmemory model.
3 A Centralized Degree Reduction Algorithm
In this section, we present a centralized algorithm for the polynomial degree reduction as stated in Lemma 2.3. For details on how this algorithm can be implemented in the low-memory model, we refer to Section 4. In Section 3.1, we give a formal description of the (centralized) algorithm. Then, in Section 3.2, we prove that this algorithm indeed leads to a polynomial degree reduction.
3.1 Algorithm Description
In the following, we fix a degree threshold d, chosen as a suitable polynomial in λ and log n; due to the assumptions on Δ in the lemma statement, d is both sufficiently large and polynomially smaller than Δ. We present an algorithm that reduces the maximum degree to a value polynomially smaller than Δ. This algorithm consists of three phases: a partition phase, in which the vertices are partitioned into layers so that every node has at most d neighbors in higher-index layers; a mark-and-propose phase, in which a random set of candidates is proposed, independently in every layer; and a selection phase, in which a valid subset of the candidate set is selected as partial solution by resolving potential conflicts across layers.
Partition Phase
We compute an H-partition, that is, a partition of the vertices into layers so that every vertex has at most d neighbors in layers with higher (or equal) index [NW61, NW64, BE10].
Definition 3.1 (H-Partition).
An H-partition with outdegree d, defined for any d ≥ (2+ε)λ, is a partition of the vertices into layers L_1, …, L_ℓ with the property that a vertex in layer L_i has at most d neighbors in layers L_i, …, L_ℓ. We call i the layer index of v if v ∈ L_i. For neighbors u and v with layer index of u smaller than that of v, we call v a parent of u and u a child of v. If we think of the edges as being directed from children to parents, this gives rise to a partial orientation of the edges, with no orientation of the edges connecting vertices in the same layer.
Note that, for d ≥ (2+ε)λ, such a partition can be computed easily by the following sequential greedy algorithm, also known as the peeling algorithm: iteratively, for i = 1, 2, …, put all remaining nodes with remaining degree at most d into layer L_i, and remove them from the graph.
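As an illustration, the peeling algorithm admits a very short centralized sketch; the dictionary-of-sets graph representation and the explicit parameter `d` are choices made for this sketch only:

```python
def h_partition(graph, d):
    """Greedy peeling: repeatedly put all remaining nodes of degree
    at most d into the next layer and delete them.

    graph: dict mapping each node to the set of its neighbors.
    d:     degree threshold; for d >= (2 + eps) * arboricity, the
           number of layers is O(log n).
    Returns a dict mapping each node to its (1-based) layer index.
    """
    adj = {v: set(nbrs) for v, nbrs in graph.items()}  # mutable copy
    layer = {}
    i = 1
    while adj:
        low = {v for v, nbrs in adj.items() if len(nbrs) <= d}
        assert low, "d below the peeling threshold: no progress possible"
        for v in low:
            layer[v] = i
        for v in low:                      # delete peeled nodes
            for u in adj[v]:
                if u not in low:
                    adj[u].discard(v)
            del adj[v]
        i += 1
    return layer
```

By construction, a node placed into layer L_i had degree at most d at the moment of its removal, so it has at most d neighbors in layers with index i or higher.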
Mark-and-Propose Phase
We first mark a random set of candidates (either edges or vertices) and then propose a subset of these marked candidates for the partial solution as follows.
In the case of maximal matching, every node first marks one of its outgoing edges, chosen uniformly at random; then every node proposes one of its incoming marked edges, if any, chosen uniformly at random.
In the case of maximal independent set, every node marks itself independently with a probability of order 1/d. Then, if a node is marked and none of its neighbors in the same layer is marked, this node is proposed. Note that whether a marked node gets proposed depends only on nodes in the same layer, thus on neighbors with respect to unoriented edges.
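A minimal centralized sketch of this mark-and-propose step for the MIS case follows; the marking probability `p` is a parameter whose exact value is fixed by the analysis, and the representation is our own choice:

```python
import random

def mark_and_propose(adj, layer, p, rng=random):
    """One mark-and-propose step for MIS (sketch).

    adj:   dict mapping each node to the set of its neighbors.
    layer: dict mapping each node to its layer index.
    p:     marking probability (on the order of 1/d in the analysis).

    A marked node is proposed iff none of its neighbors in the SAME
    layer is marked, so proposals depend only on unoriented edges.
    """
    marked = {v for v in adj if rng.random() < p}
    proposed = {v for v in marked
                if not any(u in marked and layer[u] == layer[v]
                           for u in adj[v])}
    return proposed
```

Setting `p = 1.0` or `p = 0.0` makes the step deterministic, which is convenient for sanity checks.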
Selection Phase
The set of proposed candidates might not be a valid solution, meaning that it might have some conflicts (i.e., two incident edges or two neighboring vertices). In the selection phase, possible conflicts are resolved (deterministically) by picking an appropriate subset of the proposed candidates, as follows. Iteratively, for i = 1, 2, …, all (remaining) proposed candidates in layer L_i are added to the partial solution and then removed from the graph.
In the case of maximal matching, we add all (remaining) proposed edges directed to a node in layer L_i to the matching and remove both their endpoints from the graph.
In the case of maximal independent set, we add all (remaining) proposed nodes in layer L_i to the independent set and remove them along with their neighbors from the graph.
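The layer-by-layer selection for the MIS case can be sketched as follows (centralized, with our own representation choices):

```python
def select_layer_by_layer(adj, layer, proposed, num_layers):
    """Selection phase for MIS (sketch): go through the layers in
    increasing order; add every still-remaining proposed node of the
    current layer to the independent set and delete it together with
    all of its neighbors."""
    alive = set(adj)
    independent = set()
    for i in range(1, num_layers + 1):
        chosen = {v for v in proposed if v in alive and layer[v] == i}
        independent |= chosen
        for v in chosen:                 # remove v and its neighbors
            alive.discard(v)
            alive -= adj[v]
    return independent
```

No two chosen nodes of the same layer can be adjacent (the propose step already rules that out), and conflicts across layers are resolved in favor of the lower layer.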
3.2 Proof of Correctness
It is easy to see that the selected solution is a valid partial solution, that is, that there are no conflicts. It remains to be shown that the maximum degree of the remainder graph indeed drops polynomially. As the outdegree is bounded by d, it is enough to show the following.
Lemma 3.2.
Every vertex with sufficiently high indegree (polynomially larger than d) gets removed, or all but a polynomially smaller number of its incoming edges get removed, with high probability.
Proof of Lemma 3.2 for matching.
Let v be a node with sufficiently high indegree. First, observe that if at least one incoming edge of v is proposed, then an edge incident to v (not necessarily incoming) will be selected to be added to the matching. This is because the only reason why v would not select a proposed incoming edge is that v has already been removed from the graph, and this happens only if its proposed edge has been selected to be added to the matching in a previous step. It thus remains to show that every vertex with sufficiently high indegree will, with high probability, have at least one incoming edge that has been proposed by the respective child. As every incoming edge of v is proposed independently with probability of order at least 1/d, the probability of not having a proposed incoming edge is exponentially small in the indegree divided by d. A union bound over all vertices with sufficiently high indegree concludes the proof. ∎
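The failure-probability estimate in the proof above can be written out explicitly; here p denotes the probability that a fixed incoming edge of v is proposed by the respective child (a quantity of order 1/d in the analysis; notation ours) and deg⁻(v) the indegree of v:

```latex
\Pr\bigl[\text{no incoming edge of } v \text{ is proposed}\bigr]
\;\le\; (1-p)^{\deg^{-}(v)}
\;\le\; e^{-p\,\deg^{-}(v)} .
```

For p = Ω(1/d) and deg⁻(v) polynomially larger than d, this is at most n^{−c} for any desired constant c, so the union bound over all vertices goes through.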
Proof of Lemma 3.2 for independent set.
Let v be a vertex in layer L_i that is still in the graph and, for some j, still has sufficiently many children in layer L_j after all layers with index smaller than j have been processed. We show that then at least one of these children will be selected to join the independent set, with high probability. Note that this concludes the proof, as in all cases either v will be removed from the graph or v will no longer have high indegree. Moreover, observe that such a child u of v (that is still there after the layers with index smaller than j have been processed) will be selected to join the independent set iff it is proposed. This is because the only reason for u not to join despite being proposed would be that a neighbor of u in an earlier layer had been selected to join the independent set; but in that case u would have been removed from the graph before layer L_j is processed, and thus would not count towards v's children at that point.
Every such child u of v is marked independently with a probability of order 1/d. The probability of u being proposed, and hence joining the independent set, is also of order at least 1/d, as u has at most d neighbors in its layer, and u is proposed iff it is marked and none of its neighbors in the same layer is marked. Vertex v thus in expectation has a large number of children that join the independent set.
Since whether a node u is proposed, and hence joins the independent set, depends on at most d other nodes (namely u's neighbors in the same layer), it follows from a variant of the Chernoff bound for random variables with bounded dependence, e.g., from Theorem 2.1 in [Pem01], that the probability that no child of v joins the independent set is polynomially small in n. A union bound over all vertices with sufficiently high indegree concludes the proof. ∎
4 Implementation of the Degree Reduction Algorithm in MPC
In this section, we show how to simulate the centralized degree reduction algorithm from Section 3 in the low-memory model. In Section 4.1, we show how to implement the partition phase efficiently in the low-memory model. Then, in Section 4.2, we describe how to perform the simulation of the mark-and-propose as well as the selection phase. Together with the correctness proof established in Section 3.2, this will conclude the proof of Lemma 2.3.
The main idea behind the simulation is to use the well-known graph exponentiation technique [LW10], which can be summarized as follows: suppose that every node knows its 2^i-hop neighborhood in iteration i. Then, in iteration i+1, each node can inform the nodes in its 2^i-hop neighborhood of the topology of its 2^i-hop neighborhood. Hence, every node can learn its 2^{i+1}-hop neighborhood in iteration i+1, allowing it to simulate any t-round LOCAL algorithm in O(log t) rounds. Using this exponentiation technique, in principle, every polylogarithmic-round algorithm can be simulated in poly(log log n) rounds. We have to be careful about the memory restrictions though. In iteration i of the exponentiation process, we need to store a copy of the 2^i-hop neighborhood of every node. If such a neighborhood contains more than n^α edges, we violate the local memory constraint. Similarly, the total memory of the system might not be enough to store all of these copies, even if every single copy fits onto a machine. In order to deal with these issues, the key observation is that a large fraction of the nodes of the graph are contained in the first layers of the H-partition. In particular, we show that if we focus on the graph remaining after some iterations of peeling, we can perform correspondingly many exponentiation steps without violating the memory constraints. Hence, we can simulate many peeling steps of the degree reduction process in exponentially fewer communication rounds.
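The doubling behind graph exponentiation can be illustrated with a small centralized sketch; here `known[v]` abstracts the neighborhood a node stores (the names and representation are ours, not part of the model):

```python
def exponentiate(adj, iterations):
    """Centralized sketch of graph exponentiation: known[v] holds the
    set of nodes v is aware of. Initially this is v's 1-hop ball; in
    each iteration every node merges the knowledge of all nodes it
    already knows, doubling the radius, so that after t iterations
    known[v] is exactly the 2^t-hop ball around v."""
    known = {v: {v} | set(adj[v]) for v in adj}
    for _ in range(iterations):
        known = {v: set().union(*(known[u] for u in known[v]))
                 for v in adj}
    return known
```

In the actual low-memory implementation, each merge corresponds to one communication round, and the memory discussion above limits how many such doublings are affordable.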
In case the maximum degree Δ is larger than the local memory n^α, one needs to pay attention to how the graph is distributed. One explicit way is to split high-degree nodes into many copies and distribute the copies among many machines. For the communication between the copies, one can imagine a (virtual) balanced tree of depth O(1/α) rooted at one of the copies. Through this tree, the copies can exchange information in O(1/α) communication rounds. For the sake of simplicity, our write-up assumes that Δ ≤ n^α.
4.1 Partition Phase
We first prove some properties of the H-partition constructed by the greedy peeling algorithm that will be useful for an efficient implementation in the low-memory model.
Lemma 4.1.
The H-partition with outdegree d ≥ (2+ε)λ, constructed by the greedy peeling algorithm, satisfies the following properties.
(i) For every i, the number of nodes in layers with index at least i+1 is at most n/(1+ε/2)^i. In other words, if we remove all nodes in layer L_i from the set of nodes in layers with index at least i, then the number of nodes drops by a factor of at least 1+ε/2.

(ii) There are at most O(log n) layers.
Proof.
We prove (i) by induction. Assume that there are n_i nodes in the graph G_i induced by the vertices in layers with index at least i. Towards a contradiction, suppose that more than 2n_i/(2+ε) nodes are in layers with index at least i+1. By construction, all these nodes must have had degree larger than d ≥ (2+ε)λ in G_i, as otherwise they would have been added to layer L_i. This results in an average degree of more than 2λ in G_i, which contradicts the well-known upper bound of 2λ on the average degree of a graph that has arboricity at most λ. Note that (ii) is a direct consequence of (i). ∎
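The average-degree bound invoked in this proof follows directly from the definition of arboricity: a graph on n_i nodes whose edge set can be partitioned into at most λ forests has at most λ(n_i − 1) edges, so

```latex
\overline{\deg}(G_i) \;=\; \frac{2\,|E(G_i)|}{n_i}
\;\le\; \frac{2\,\lambda\,(n_i - 1)}{n_i} \;<\; 2\lambda .
```

Since arboricity is monotone under taking subgraphs, the same bound applies to every induced subgraph G_i.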
In the following, we describe how to compute the H-partition with parameter d in the low-memory model. (Note that it is easy to learn the maximum degree, and hence d, in O(1) rounds of communication.) Throughout this section, we assume that Δ is sufficiently large. Observe that if Δ is small, then the H-partition with parameter d consists of few layers, in which case the arguments in this section imply that going through the layers one by one will easily yield at least as good runtimes as for the more difficult case of large Δ.
The goal of the algorithm for computing the H-partition is that each node (or, more formally, the machine storing the node) knows in which layer of the partition it is contained. The algorithm proceeds in iterations, where each iteration consists of two parts: first, the output, i.e., the layer index, is determined for a large fraction of the nodes, and second, these nodes are removed for the remainder of the computation. The latter ensures that the remaining small fraction of nodes can use essentially all of the total available memory in the next iteration, resulting in a larger memory budget per node. However, there is a caveat: when the memory budget per node exceeds n^α, i.e., the memory capacity of a single machine, then it is no longer sufficient to merely argue that the used memory of all nodes together does not exceed the total memory of all machines together. (Furthermore, in the “shuffle” step of every MPC round [KSV10], we assume that the nodes are stored in the machines in a balanced way, i.e., as long as a single node fits onto a single machine and the total memory is not exceeded, the underlying system takes care of load balancing.) We circumvent this issue by restarting the above process from scratch (in the remaining graph) each time the memory requirement per node reaches the memory capacity of a single machine. As we will see, the number of such restarts, called phases, remains suitably small.
In the following, we examine the phases and iterations in more detail.
Algorithm Details
Let k be the largest integer such that the neighborhoods gathered after k graph exponentiation steps are still guaranteed to fit into the memory of a single machine. The algorithm consists of phases, each of which consists of k iterations. Next, we describe our implementation of the graph exponentiation in more detail.
- In each iteration i = 1, 2, …, k, we do the following.

  - Let G_i be the graph at the beginning of iteration i. Each node connects its current neighborhood to a clique by adding virtual edges to G_i; in the first iteration, this step is omitted. Then perform a number of repetitions (roughly doubling from iteration to iteration) of the following process:

  - In each repetition, each node either computes its layer index in the H-partition of the current graph (with parameter d) or determines that its layer index is strictly larger than the number of layers processed so far; all nodes whose layer index has been determined (together with all their incident edges) are removed from the graph, resulting in the graph for the next repetition.

- At the end of the phase, all added virtual edges are removed.

- The algorithm terminates when each node knows its layer.
Note that each time a node is removed from the graph, the whole layer that contains this node is removed, and each time such a layer is removed, all layers with smaller index are removed at the same time or before. By the definition of the H-partition, if we remove the j layers with smallest index from a graph, then there is a 1-to-1 correspondence between the layers of the resulting graph and the layers with layer index larger than j of the original graph. More specifically, layer i of the resulting graph contains exactly the same nodes as layer i + j of the original graph. Hence, if a node knows its layer index in some intermediate graph, it can easily compute its layer index in our original input graph by keeping track of the number of deleted layers, which is uniquely determined by the iteration and the number of the phase. We implicitly assume that each node performs this computation upon determining its layer index in some intermediate graph, and in the following we only consider how to determine the layer index in the current graph.
Implementation in the MPC Model
Let us take a look at one iteration.
Connecting the neighborhoods to cliques is done by adding the edges that are missing. Edges that are added by multiple nodes are only added once (since the edge in question is stored by the machines that contain an endpoint of the edge, this is straightforward to realize). Note that during a phase, the neighborhoods of the nodes grow in each iteration (if not too many close-by nodes are removed from the graph); more specifically, after i iterations of connecting neighborhoods to cliques, the new neighborhood of a node contains exactly the nodes that were contained in its 2^i-hop neighborhood at the beginning of the phase (and were not removed so far).
In iteration i, the layer of a node is computed as follows: first, each node locally gathers the topology of its current neighborhood (without any added edges). (It is easy to keep track of which edges are original and which are added, incurring only a small constant memory overhead; later we will also argue why storing the added edges does not violate our memory constraints.) Since this step is performed after connecting the neighborhood of each node to a clique (by repeatedly connecting neighborhoods to cliques), i.e., after connecting each node to any other node in its neighborhood, only one round of communication is required for gathering the topology. Moreover, since a node that knows the topology of its t-hop neighborhood can simulate any t-round distributed process locally, it follows from the definition of the H-partition that knowledge of the topology of the gathered neighborhood is sufficient for a node to determine whether its layer index is small enough to be peeled within that radius and, if this is the case, in exactly which layer it is contained. Thus, the only tasks remaining are to bound the runtime of our algorithm and to show that the memory restrictions of our model are not violated by the algorithm.
Runtime
It is easy to see that every iteration takes O(1) rounds. Thus, in order to bound the runtime of our algorithm, it is sufficient to bound the total number of iterations. By Lemma 4.1 (ii), the number of layers in the H-partition of our original input graph is O(log n). Consider an arbitrary phase. According to the algorithm description, the number of layers removed per iteration roughly doubles, so the number of layers removed in the k iterations of the phase is of order 2^k.
Combining this with the observations about the total number of layers and the number of layers removed per phase, we see that the algorithm terminates after O((log n)/2^k) phases. Since there are k iterations per phase, the bound on the number of iterations follows.
Memory Footprint
As during the course of the algorithm edges are added and nodes collect the topology of certain neighborhoods, we have to show that adding these edges and collecting these neighborhoods does not violate our memory constraint of n^α per machine. As a first step towards this end, the following lemma bounds the number of nodes contained in graph G_j.
Lemma 4.2.
Consider an arbitrary phase. Graph G_j from that phase contains at most a (1+ε/2)^{−ℓ_j}-fraction of the nodes present at the beginning of the phase, where ℓ_j denotes the number of layers removed in that phase before iteration j.
Proof.
By Lemma 4.1 (i), removing the nodes in the layer with smallest index from the current graph decreases the number of nodes by a factor of at least 1+ε/2. We show the lemma statement by induction. Since in the first iteration the nodes in the layers with smallest index are removed, the claimed bound holds for G_2. Now assume that the bound holds for G_j, for an arbitrary j. According to the design of our algorithm, G_{j+1} is obtained from G_j by removing the nodes in the layers with smallest index. Combining this fact with our observation about the decrease in the number of nodes per removed layer, we obtain the claimed bound for G_{j+1}. ∎
Using Lemma 4.2, we now show that the memory constraints of the low-memory model are not violated by our algorithm. Consider an arbitrary phase and an arbitrary iteration i during that phase. If i = 1, then no edges are added and each node already knows the topology of its neighborhood, so no additional memory is required. Hence, assume that i ≥ 2.
Due to Lemma 4.2, the number of nodes considered in iteration is at most , where, again, . After the initial step of connecting hop neighborhoods to cliques in iteration , each remaining node is connected to all nodes that were contained in its hop neighborhood in the original graph (and were not removed so far). Hence, each remaining node is connected to at most other nodes, resulting in a memory requirement of per node, or in total. Similarly, when collecting the topology of its hop neighborhood, each node has to store edges, which requires at most memory, resulting in a total memory requirement of . Hence, the described algorithm does not exceed the total memory available in the low-memory model. Moreover, due to the choice of , the memory requirement of each single node does not exceed the memory capacity of a single machine.
4.2 Simulation of the Mark-and-Propose and Selection Phase
For the simulation of the mark-and-propose and selection phase, we rely heavily on the approach of Section 4.1. Recall that nodes were removed in chunks consisting of several consecutive layers and that, before a node was removed, it was directly connected to all nodes contained in a large neighborhood around it by adding the respective edges. For the simulation, we go through these chunks in the reverse order of their removal. Note that the chunk in which a node is contained is uniquely determined by the node's layer index. As each node computes its layer index during the construction of the partition, each node can easily determine in which part of the simulation it will actively participate.
However, there is a problem we need to address: for communication, we would like the edges that we added during the construction of the partition to be available also for the simulation. Unfortunately, during the course of the construction, we removed added edges again to free memory for adding other edges. Fortunately, there is a conceptually simple way to circumvent this problem: in the construction of the partition, we add a preprocessing step at the beginning, in which we remove the lowest layers (where is a sufficiently large constant) one by one in rounds, which increases the available memory (compared to the number of remaining nodes) by a factor of , by Lemma 4.1. Since the algorithm for constructing the partition consists of iterations, this implies that we can store all edges added during the further course of the construction simultaneously without violating the memory restriction, by an argument similar to the analogous statement for the old construction of the partition. Similarly, the number of added edges incident to any particular node does not exceed the memory capacity of a single machine. In the following, we assume that this preprocessing step has taken place and that all edges added during the construction of the partition are also available for the simulation.
Matching Algorithm
As mentioned above, we process the chunks one by one, in decreasing order w.r.t. the indices of the contained layers. After processing a chunk, we want each node contained in the chunk to know the output of all incident edges according to the centralized matching algorithm. In the following, we describe how to process a chunk, after some preliminary “global” steps.
The mark-and-propose phase of the algorithm is straightforward to implement in the low-memory model: each node (in each chunk at the same time) performs the marking of an outgoing edge as specified in the algorithm description.^{9} (Note that, formally, the algorithm for constructing the partition only returns the layer index for each node; however, from this information each node can easily determine which edges are outgoing, unoriented, or incoming according to the partial orientation induced by the partition.) The proposing is performed for all nodes before going through the chunks sequentially: each node proposes one of its marked incoming edges (if there is at least one) uniformly at random. Note that a node proposing an edge does not necessarily mean that this edge will be added to the matching; more specifically, an edge proposed by some node will be added to the matching iff the edge marked by the proposing node is not itself selected to be added to the matching.^{10} (In other words, only proposed edges can go into the matching, and whether such an edge indeed goes into the matching can be determined by going through the layers in decreasing order and only adding a proposed edge if there is no conflict.)
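A minimal sketch of the mark-and-propose step, assuming a hypothetical adjacency representation (`out_neighbors`) for the partial orientation induced by the partition; this is an illustration of the rule just described, not the paper's MPC implementation:

```python
import random

def mark_and_propose(out_neighbors, rng=None):
    """Toy simulation of the mark-and-propose phase.

    `out_neighbors` maps each node to the list of its out-neighbors
    under the partial orientation induced by the layer partition.
    Every node with at least one outgoing edge marks one of them
    uniformly at random; every node with at least one marked incoming
    edge proposes one of those uniformly at random.
    Returns (marked, proposed): marked[u] = v for the marked edge (u, v),
    and proposed[v] = u for the proposed edge (u, v).
    """
    rng = rng or random.Random(0)
    marked = {u: rng.choice(nbrs) for u, nbrs in out_neighbors.items() if nbrs}
    incoming = {}
    for u, v in marked.items():
        incoming.setdefault(v, []).append(u)
    proposed = {v: rng.choice(us) for v, us in incoming.items()}
    return marked, proposed
```

Both steps are purely local given the orientation, which is why they take only a constant number of rounds before the chunks are processed sequentially.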
After this mark-and-propose phase, the processing of the chunks begins. Consider an arbitrary chunk. Let be the iteration (in some phase) in which this chunk was removed in the construction of the partition, i.e., the chunk consists of layers. Each node in the chunk collects the topology of its hop neighborhood in the chunk, including the information contained therein about proposed edges. Due to the edges added during the construction of the partition, this can be achieved in a constant number of rounds, and by an argument analogous to the one at the end of Section 4.1, collecting the indicated information does not violate the memory restrictions of our model. Lemma 4.3 shows that the information contained in the hop neighborhood of a node is sufficient for the node to determine the output for each incident edge in the centralized matching algorithm.
Lemma 4.3.
The information about which edges are proposed in the hop neighborhood of a node uniquely determines the output of all edges incident to that node according to the centralized matching algorithm.
Proof.
From the design of the centralized matching algorithm, it follows that an edge is part of the matching iff 1) the edge is proposed and 2) either the higher-layer endpoint of the edge has no outgoing edges or the outgoing edge marked by the higher-layer endpoint is not part of the matching. Hence, in order to check whether an incident edge is in the matching, node only has to consider the unique directed chain of proposed edges (in the chunk) starting in . Clearly, the information about which edges in this chain are proposed uniquely determines the output of the first edge in the chain, from which the node can infer the output of all other incident edges. Since the number of edges in the chain is bounded by , as the chain is directed, the lemma statement follows. ∎
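The chain-following rule in this proof can be sketched as a small recursion. The names and data layout (`marked`, `proposed` dictionaries, as in the earlier toy sketch) are hypothetical:

```python
def proposed_edge_matched(u, marked, proposed, memo=None):
    """Decide whether the proposed edge ending at node u, i.e. the edge
    (proposed[u], u), is in the matching.  Rule from the proof: it is,
    iff u proposed an edge and u's own marked outgoing edge
    (u, marked[u]) is not itself in the matching.  The recursion walks
    the directed chain of proposed edges starting at u."""
    memo = {} if memo is None else memo
    if u in memo:
        return memo[u]
    if u not in proposed:
        result = False            # u proposed no incoming edge
    elif u not in marked:
        result = True             # u marked nothing, so no conflict
    else:
        v = marked[u]
        memo[u] = False           # guard against accidental cycles in toy input
        # (u, v) is in the matching iff v proposed exactly this edge and
        # the chain above v says that proposal succeeds.
        conflict = proposed.get(v) == u and proposed_edge_matched(v, marked, proposed, memo)
        result = not conflict
    memo[u] = result
    return result
```

On a directed chain a → b → c where b proposes (a, b) and c proposes (b, c), the recursion resolves the topmost edge first: (b, c) enters the matching, which in turn blocks (a, b).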
It thus follows from the bound on the number of iterations that the simulation of the selection phase for the matching algorithm can be performed in rounds of communication.
Independent Set Algorithm
The simulation of the independent set algorithm proceeds analogously to the case of the matching algorithm. First, each node performs the marking and proposing in a distributed fashion in a constant number of rounds. Then, the chunks are processed one by one, as above, where during the processing of a chunk removed in iteration , each node contained in the chunk collects its hop neighborhood, including the information about which nodes are proposed, and then computes its own output locally. By arguments analogous to the ones presented in the case of the matching algorithm, the algorithm adheres to the memory constraints of our model, and the total number of communication rounds is . The only part of the argument where a bit of care is required is the analogue of Lemma 4.3: in the case of the independent set algorithm, the output of a node may depend on each of its parents, since each of those could be part of the independent set, which would prevent the node from joining the independent set. However, all nodes in the chunk that can be reached from a node via a directed chain of edges are contained in that node's hop neighborhood; therefore, collecting one's own hop neighborhood is sufficient for determining one's output. Note that at the end of processing a chunk, if we follow the above implementation, we have to spend an extra round for removing the neighbors of all selected independent set nodes, since these may be contained in another chunk.
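The analogous resolution for the independent set can be sketched the same way; the `parents` and `proposed` structures are hypothetical toy representations of the oriented graph and the proposed nodes:

```python
def joins_independent_set(v, parents, proposed, memo=None):
    """Toy resolution of the independent-set output: a proposed node v
    joins the independent set iff none of its parents (reachable via
    directed edges into higher layers) joins it.  Processing layers in
    decreasing order guarantees the recursion terminates."""
    memo = {} if memo is None else memo
    if v in memo:
        return memo[v]
    if v not in proposed:
        result = False
    else:
        result = not any(joins_independent_set(p, parents, proposed, memo)
                         for p in parents.get(v, ()))
    memo[v] = result
    return result
```

In contrast to the matching case, the recursion here branches over all parents rather than following a single chain, which is exactly the point where the extra care noted above is needed.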
References
 [ABB17] Sepehr Assadi, MohammadHossein Bateni, Aaron Bernstein, Vahab Mirrokni, and Cliff Stein. Coresets Meet EDCS: Algorithms for Matching and Vertex Cover on Massive Graphs. arXiv preprint arXiv:1711.03076, 2017.
 [ABI86] Noga Alon, László Babai, and Alon Itai. A Fast and Simple Randomized Parallel Algorithm for the Maximal Independent Set Problem. Journal of Algorithms, 7(4):567–583, 1986.
 [AK17] Sepehr Assadi and Sanjeev Khanna. Randomized Composable Coresets for Matching and Vertex Cover. In the Proceedings of the Symposium on Parallel Algorithms and Architectures (SPAA), pages 3–12, 2017.

[ANOY14]
Alexandr Andoni, Aleksandar Nikolov, Krzysztof Onak, and Grigory Yaroslavtsev.
Parallel algorithms for geometric graph problems.
In
Proceedings of the Symposium on Theory of Computing (STOC)
, pages 574–583, 2014.  [ASS18] Alexandr Andoni, Clifford Stein, Zhao Song, Zhengyu Wang, and Peilin Zhong. Parallel Graph Connectivity in Log Diameter Rounds. arXiv preprint arxiv:1805.03055, 2018.
 [BDHK18] Soheil Behnezhad, Mahsa Derakhshan, MohammadTaghi Hajiaghayi, and Richard M. Karp. Massively Parallel Symmetry Breaking on Sparse Graphs: MIS and Maximal Matching. arXiv preprint arXiv:1807.06701, 2018.
 [BE10] Leonid Barenboim and Michael Elkin. Sublogarithmic Distributed MIS Algorithm for Sparse Graphs Using Nash-Williams Decomposition. Distributed Computing, 22(5–6):363–379, 2010.
 [BEPS12] Leonid Barenboim, Michael Elkin, Seth Pettie, and Johannes Schneider. The Locality of Distributed Symmetry Breaking. In the Proceedings of the Symposium on Foundations of Computer Science (FOCS), pages 321–330. IEEE, 2012.
 [BFU18] Sebastian Brandt, Manuela Fischer, and Jara Uitto. Breaking the Linear-Memory Barrier in MPC: Fast MIS on Trees with Memory per Machine. arXiv preprint arXiv:1802.06748, 2018.
 [BKS14] Paul Beame, Paraschos Koutris, and Dan Suciu. Skew in Parallel Query Processing. In the Proceedings of the Symposium on Principles of Database Systems (PODS), pages 212–223, 2014.
 [BKS17] Paul Beame, Paraschos Koutris, and Dan Suciu. Communication Steps for Parallel Query Processing. Journal of the ACM (JACM), 64(6):40, 2017.
 [CŁM18] Artur Czumaj, Jakub Łącki, Aleksander Mądry, Slobodan Mitrović, Krzysztof Onak, and Piotr Sankowski. Round Compression for Parallel Matching Algorithms. In Proceedings of the Symposium on Theory of Computing (STOC), pages 471–484, 2018.
 [DG08] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 51(1):107–113, 2008.
 [GG06] Gaurav Goel and Jens Gustedt. Bounded Arboricity to Determine the Local Structure of Sparse Graphs. In International Workshop on Graph-Theoretic Concepts in Computer Science, pages 159–167. Springer, 2006.
 [GGK18] Mohsen Ghaffari, Themis Gouleakis, Christian Konrad, Slobodan Mitrović, and Ronitt Rubinfeld. Improved Massively Parallel Computation Algorithms for MIS, Matching, and Vertex Cover. In Proceedings of the International Symposium on Principles of Distributed Computing (PODC), page to appear, 2018.
 [Gha16] Mohsen Ghaffari. An Improved Distributed Algorithm for Maximal Independent Set. In the Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 270–277, 2016.
 [GSZ11] Michael T. Goodrich, Nodari Sitchinava, and Qin Zhang. Sorting, Searching, and Simulation in the MapReduce Framework. In the Proceedings of the International Symposium on Algorithms and Computation (ISAAC), pages 374–383, 2011.
 [GU18] Mohsen Ghaffari and Jara Uitto. Sparsifying Distributed Algorithms with Ramifications in Massively Parallel Computation and Centralized Local Computation. Personal communication, 2018.
 [IBY07] Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks. In the Proceedings of the European Conference on Computer Systems (EuroSys), pages 59–72, 2007.
 [KSV10] Howard Karloff, Siddharth Suri, and Sergei Vassilvitskii. A Model of Computation for MapReduce. In the Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 938–948, 2010.
 [KSV13] Amos Korman, Jean-Sébastien Sereni, and Laurent Viennot. Toward More Localized Local Algorithms: Removing Assumptions Concerning Global Knowledge. Distributed Computing, 26(5–6):289–308, 2013.
 [Lin92] Nathan Linial. Locality in Distributed Graph Algorithms. SIAM Journal on Computing, 21(1):193–201, 1992.
 [LMSV11] Silvio Lattanzi, Benjamin Moseley, Siddharth Suri, and Sergei Vassilvitskii. Filtering: a Method for Solving Graph Problems in MapReduce. In the Proceedings of the Symposium on Parallel Algorithms and Architectures (SPAA), pages 85–94, 2011.
 [LPSR09] Zvi Lotker, Boaz PattShamir, and Adi Rosén. Distributed Approximate Matching. SIAM Journal on Computing, 39(2):445–460, 2009.
 [Lub85] Michael Luby. A Simple Parallel Algorithm for the Maximal Independent Set Problem. In Proceedings of the Symposium on Theory of Computing (STOC), pages 1–10, 1985.
 [LW10] Christoph Lenzen and Roger Wattenhofer. Brief Announcement: Exponential SpeedUp of Local Algorithms Using NonLocal Communication. In Proceedings of the International Symposium on Principles of Distributed Computing (PODC), pages 295–296, 2010.

 [McG05] Andrew McGregor. Finding Graph Matchings in Data Streams. In Approximation, Randomization and Combinatorial Optimization. Algorithms and Techniques, pages 170–181. Springer, 2005.
 [NW61] Crispin St. J. Nash-Williams. Edge-Disjoint Spanning Trees of Finite Graphs. Journal of the London Mathematical Society, 1(1):445–450, 1961.
 [NW64] Crispin St. J. Nash-Williams. Decomposition of Finite Graphs into Forests. Journal of the London Mathematical Society, 1(1):12–12, 1964.
 [OSSW18] Krzysztof Onak, Baruch Schieber, Shay Solomon, and Nicole Wein. Fully Dynamic MIS in Uniformly Sparse Graphs. In the Proceedings of the International Colloquium on Automata, Languages and Programming (ICALP), pages 92:1–92:14, 2018.
 [Pem01] Sriram V. Pemmaraju. Equitable Coloring Extends Chernoff-Hoeffding Bounds. In Approximation, Randomization, and Combinatorial Optimization: Algorithms and Techniques, pages 285–296. Springer, 2001.
 [Whi12] Tom White. Hadoop: The Definitive Guide. O'Reilly Media, 2012.
 [ZCF10] Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. Spark: Cluster Computing with Working Sets. HotCloud, 10:95, 2010.