# Streaming Complexity of Spanning Tree Computation

The semi-streaming model is a variant of the streaming model frequently used for the computation of graph problems. It allows the edges of an n-node input graph to be read sequentially in p passes using Õ(n) space. In this model, some graph problems, such as spanning trees and k-connectivity, can be exactly solved in a single pass; while other graph problems, such as triangle detection and unweighted all-pairs shortest paths, are known to require Ω̃(n) passes to compute. For many fundamental graph problems, the tractability in these models is open. In this paper, we study the tractability of computing some standard spanning trees. Our results are: (1) Maximum-Leaf Spanning Trees. This problem is known to be APX-complete with inapproximability constant ρ∈[245/244,2). By constructing an ε-MLST sparsifier, we show that for every constant ε > 0, MLST can be approximated in a single pass to within a factor of 1+ε w.h.p. (albeit in super-polynomial time for ε<ρ-1 assuming P≠NP). (2) BFS Trees. It is known that BFS trees require ω(1) passes to compute, but the naïve approach needs O(n) passes. We devise a new randomized algorithm that reduces the pass complexity to O(√(n)), and it offers a smooth tradeoff between pass complexity and space usage. (3) DFS Trees. The current best algorithm by Khan and Mehta [STACS 2019] takes Õ(h) passes, where h is the height of computed DFS trees. Our contribution is twofold. First, we provide a simple alternative proof of this result, via a new connection to sparse certificates for k-node-connectivity. Second, we present a randomized algorithm that reduces the pass complexity to O(√(n)), and it also offers a smooth tradeoff between pass complexity and space usage.

## 1 Introduction

Spanning trees are critical components of graph algorithms, from depth-first search trees (DFS) for finding articulation points and bridges [45], computing $st$-numbering [13], chain decomposition [42], and coloring signed graphs [18], to breadth-first search trees (BFS) for finding separators [34], computing sparse certificates of $k$-node-connectivity [8, 12], approximating diameters [10, 41], and characterizing AT-free graphs [5], and to maximum-leaf spanning trees (MLST) for connected dominating sets [36, 43] and connected maximum cuts [26, 21].

In the semi-streaming model, the tractability of spanning tree computation, except arbitrary spanning trees [3, 44, 40], is less studied. The semi-streaming model [38, 3] is a variation of the streaming model frequently used for the computation of graph problems. It allows the edges of an $n$-node input graph to be read sequentially in $p$ passes using $\tilde{O}(n)$ space, where $\tilde{O}(\cdot)$ and, similarly, $\tilde{\Omega}(\cdot)$ hide factors polylogarithmic in $n$, the number of nodes in the input graph. If the list of edges includes deletions, then the model is called the turnstile model; otherwise it is called the insertion-only model. In both models, some graph problems, such as spanning trees [3], $k$-connectivity [25], densest subgraph [37], degeneracy [15], cut-sparsifier [30], and $(\Delta+1)$-coloring [4], can be exactly solved or $(1+\varepsilon)$-approximated in a single pass, while other graph problems, such as triangle detection and unweighted all-pairs shortest paths [7], are known to require $\tilde{\Omega}(n)$ passes to compute. For many fundamental graph problems, e.g., standard spanning trees, the tractability in these models is open. BFS computation is known to require $\omega(1)$ passes [17], but only the naive $O(n)$-pass algorithm is known. It is unknown whether DFS computation requires more than one pass [14, 31], but the current best algorithm needs $\tilde{O}(h)$ passes [31], where $h$ is the height of the computed DFS trees, so $\tilde{O}(n)$ passes for dense graphs. The tractability of maximum-leaf spanning trees (MLST) is unknown even when the space restriction is relaxed, since it is APX-complete [35, 20].

Due to the lack of efficient streaming algorithms for spanning tree computation, for some graph problems that are traditionally solved using spanning trees, such as finding articulation points and bridges, researchers have had to look for alternative methods when designing streaming algorithms [16, 14]. The alternative methods, even if they are based on known results in graph theory, may still involve the design of new streaming algorithms. For the problems mentioned above, the alternative methods use newly designed sparse connectivity certificates [12, 25] that are easily computable in the semi-streaming model, rather than the classical one due to Nagamochi and Ibaraki [39]. Hence establishing the hardness of spanning tree computation helps to explain the need for these alternative methods.

In this paper, we study the tractability of computing standard spanning trees for connected simple undirected graphs, including BFS trees, DFS trees, and MLST. Unless otherwise stated, our upper bounds work in the turnstile model (and hence also in the insertion-only model), and our lower bounds hold for the insertion-only model (and hence also in the turnstile model). The space upper and lower bounds are in bits. Our results are as follows.

#### Maximum-Leaf Spanning Trees:

We show, by constructing an $\varepsilon$-MLST sparsifier (Section 2), that for every constant $\varepsilon > 0$, MLST can be approximated in a single pass to within a factor of $1+\varepsilon$ with high probability (w.h.p.), albeit in super-polynomial time for $\varepsilon < \rho - 1$ since it is APX-complete [35, 20] with inapproximability constant $\rho \in [245/244, 2)$ [9], and can be approximated in polynomial time in a single pass to within a factor of w.h.p., where is the supremum constant that MLST cannot be approximated to within using polynomial time and space. In the insertion-only model, these algorithms are deterministic. We also show a complementary hardness result (Section 5) that for every , to approximate MLST to within an additive error , any single-pass randomized streaming algorithm that succeeds with probability at least requires bits. This hardness result excludes the possibility of a single-pass semi-streaming algorithm that approximates MLST to within an additive error . Our results for MLST show that intractability in the sequential computation model (i.e., Turing machine) does not imply intractability in the semi-streaming model.

Our algorithms rely on a new sparse certificate, the $\varepsilon$-MLST sparsifier, defined as follows. Let $G$ be an $n$-node $m$-edge connected simple undirected graph. Then for any given constant $\varepsilon > 0$, $H$ is an $\varepsilon$-MLST sparsifier if it is a connected spanning subgraph of $G$ with $|E(H)| \le f(\varepsilon)\,|V(G)|$ and $\mathrm{leaf}(H) \ge (1-\varepsilon)\,\mathrm{leaf}(G)$, where $\mathrm{leaf}(G)$ denotes the maximum number of leaves (i.e., nodes of degree one) that any spanning tree of $G$ can have and $f$ is some function independent of $n$. We show that an $\varepsilon$-MLST sparsifier can be constructed efficiently in the semi-streaming model.

In the turnstile model, for every constant , there exists a randomized algorithm that can find an -MLST sparsifier with probability using a single pass, space, and time, and in the insertion-only model a deterministic algorithm that uses a single pass, space, and time.

Combining Section 1 with any polynomial-time RAM algorithms for MLST that uses space, e.g, [35, 36, 43], we obtain the following result.

In the turnstile model, for every constant , there exists a randomized algorithm that can approximate for any -node connected simple undirected graph with probability to within a factor of using a single pass, space, and polynomial time, where is the supremum constant that MLST cannot be approximated to within using polynomial time and space, and in the insertion-only model a deterministic algorithm that uses a single pass, space, and polynomial time.

Using Section 1, we show that approximate connected maximum cut can be computed in a single pass using space for unweighted regular graphs (Section 2).

#### BFS Trees:

It is known that BFS trees require $\omega(1)$ passes to compute [17], but the naive approach needs $O(n)$ passes. We devise a randomized algorithm that reduces the pass complexity to $O(\sqrt{n})$ w.h.p., and give a smooth tradeoff between pass complexity and space usage.

In the turnstile model, for each , there exists a randomized algorithm that can compute a BFS tree for any -node connected simple undirected graph with probability in passes using space, and in the insertion-only model a deterministic algorithm that uses space.

This gives a polynomial separation between single-source and all-pairs shortest paths for unweighted graphs because any randomized semi-streaming algorithm that computes unweighted all-pairs shortest paths with probability at least requires passes.

We extend Section 1 and obtain that multiple BFS trees, each starting from a unique source node, can be computed more efficiently in pass complexity in a batch than individually (see Section 3.3). We show that this batched BFS has applications to computing a -approximation of diameters for unweighted graphs (Section 3.4) and a -approximation of Steiner trees for unweighted graphs (Section 3.3).

#### DFS Trees:

It is unknown whether DFS trees require more than one pass [14, 31], but the current best algorithm, due to Khan and Mehta [31], needs $\tilde{O}(h)$ passes, where $h$ is the height of the computed DFS trees. We devise a randomized algorithm that has pass complexity $O(\sqrt{n})$ w.h.p., and give a smooth tradeoff between pass complexity and space usage.

In the turnstile model, for each , there exists a randomized algorithm that can compute a DFS tree for any -node connected simple undirected graph with probability in passes that uses space, and in the insertion-only model a deterministic algorithm that uses space.

For dense graphs, our algorithms improve upon the current best algorithms for DFS due to Khan and Mehta [31], which need passes for $n$-node $m$-edge graphs in the worst case because of the existence of $k$-cores, where a $k$-core is a maximal connected subgraph in which every node has at least $k$ neighboring nodes in the subgraph.

### 1.1 Technical Overview

#### Maximum-Leaf Spanning Trees:

We construct an -MLST sparsifier by a new result that complements Kleitman and West’s lower bounds on the maximum number of leaves for graphs with minimum degree  [32]. The lower bounds are: if a connected simple undirected graph has minimum degree for some sufficiently large , then and the leading constant can be larger for . Our complementary result (Section 2), without the restriction on the minimum degree, is: any connected simple undirected graph , except the singleton graph, has

$$\mathrm{leaf}(G) \ge \frac{1}{10}\bigl(|V(G)| - \mathrm{inode}(G)\bigr), \tag{1}$$

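where $\mathrm{inode}(G)$ denotes the number of nodes whose degree is two and whose neighbors both have degree two. Equation 1 implies that, if one can find a sparse connected spanning subgraph $H$ of $G$ so that $\mathrm{leaf}(H) \ge \mathrm{leaf}(G) - \frac{\varepsilon}{10}\bigl(|V(G)| - \mathrm{inode}(G)\bigr)$, then $H$ is an $\varepsilon$-MLST sparsifier.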

Our sparsification technique is general enough to obtain a -approximation for MLST in a single pass using space by combining any -approximation -space RAM algorithm for MLST with our -MLST sparsifier. On the other hand, since in linear time one can find an -MLST sparsifier of edges, any -approximation RAM algorithm for MLST with time complexity can be reduced to if a small sacrifice on approximation ratio is allowed. This reduces the time complexity of RAM algorithms for MLST that need superlinear time on the number of edges, such as the local search approach from for to and the leafy forest approach from to , both due to Lu and Ravi [35, 36].

#### BFS Trees:

We present a simple deterministic algorithm attaining a smooth tradeoff between pass complexity and space usage. In particular, in the insertion-only semi-streaming model, the algorithm finishes in $O(n/\mathrm{poly}\log n)$ passes. The algorithm is based on an observation that the sum of degrees of nodes in any root-to-leaf path of a BFS tree is bounded by $O(n)$ (Section 3.1).

Our more efficient randomized algorithm (Section 1) constructs a BFS tree by combining the results of multiple instances of bounded-radius BFS. To reduce the space usage, the simulations of these bounded-radius BFS computations are assigned random starting times, and the algorithm only maintains the last three layers of each BFS tree. These ideas are borrowed from results on shortest paths computation in the parallel and the distributed settings [11, 22, 27, 46].

#### DFS Trees:

We present a simple alternative proof of the result of Khan and Mehta [31] that a DFS tree can be constructed in passes using space, for any given parameter , where $h$ is the height of the computed DFS tree. The new proof is based on the following connection between the DFS computation and the sparse certificates for $k$-node-connectivity. We show in Lemma 4.1 that the first layers of any DFS tree of such a certificate can be extended to a DFS tree of the original graph $G$.

The proof of Theorem 1 is based on the parallel DFS algorithm of Aggarwal and Anderson [2]. In this paper we provide an efficient implementation of their algorithm in the streaming model, also via the sparse certificates for $k$-node-connectivity, which allows us to reduce the number of passes by batch processing.

We note that in a related work, Ghaffari and Parter [23] showed that the parallel DFS algorithm of Aggarwal and Anderson can be adapted to the distributed setting. Specifically, they showed that DFS can be computed in the CONGEST model in rounds, where $D$ is the diameter of the graph.

### 1.2 Paper Organization

In Section 2, we present how to construct an $\varepsilon$-MLST sparsifier and apply it to devise single-pass semi-streaming algorithms to approximate MLST to within a factor of $1+\varepsilon$ for every constant $\varepsilon > 0$. Then, in Section 3, we show how to compute a BFS tree rooted at a given node by an $O(\sqrt{n})$-pass $\tilde{O}(n)$-space algorithm w.h.p. and its applications to computing approximate diameters and approximate Steiner trees. In Section 4, we have a similar result for computing DFS trees; that is, an $O(\sqrt{n})$-pass $\tilde{O}(n)$-space algorithm that succeeds w.h.p. Lastly, we prove the claimed single-pass lower bound in Section 5.

## 2 Maximum-Leaf Spanning Trees

In this section, we will show how to construct an $\varepsilon$-MLST sparsifier in the semi-streaming model; that is, proving Section 1. We recall the notions defined in Section 1 before proceeding to the results. By ignorable node, we denote a node whose degree is two and whose neighbors $u$ and $v$ have degree two as well. Note that $u \ne v$ for simple graphs. Let $\mathrm{leaf}(G)$ be the maximum number of leaves (i.e., nodes of degree one) that a spanning tree of $G$ can have. Let $\mathrm{inode}(G)$ denote the number of ignorable nodes in $G$. Let $\deg_G(x)$ denote the degree of node $x$ in graph $G$. Let $S_k(G)$ denote any subgraph of $G$ so that $S_k(G)$ contains all nodes in $G$ and every node $x$ in $S_k(G)$ has degree at least $\min\{\deg_G(x), k\}$. Let $T(G)$ be any spanning tree of a connected graph $G$.
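The definition of an ignorable node can be made concrete with a small sketch. The function names below (`ignorable_nodes`, `leaf_lower_bound`, `max_leaves_bruteforce`) are illustrative and not from the paper; the graph is assumed to be given as a dictionary mapping each node to the set of its neighbors, and the brute-force routine is only meant for tiny graphs.

```python
from itertools import combinations


def ignorable_nodes(adj):
    """Nodes of degree two whose two neighbors also have degree two."""
    return {
        v for v, nbrs in adj.items()
        if len(nbrs) == 2 and all(len(adj[u]) == 2 for u in nbrs)
    }


def leaf_lower_bound(adj):
    """The quantity (|V(G)| - inode(G)) / 10 from the lower bound on leaf(G)."""
    return (len(adj) - len(ignorable_nodes(adj))) / 10


def max_leaves_bruteforce(adj):
    """Exhaustive leaf(G): try every (n-1)-edge subset that is acyclic.

    Exponential time; intended only as a sanity check on tiny graphs.
    """
    nodes = list(adj)
    edges = sorted({tuple(sorted((u, v))) for u in adj for v in adj[u]})
    best = 0
    for subset in combinations(edges, len(nodes) - 1):
        parent = {v: v for v in nodes}

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        degree = {v: 0 for v in nodes}
        acyclic = True
        for u, v in subset:
            ru, rv = find(u), find(v)
            if ru == rv:          # this edge would close a cycle
                acyclic = False
                break
            parent[ru] = rv
            degree[u] += 1
            degree[v] += 1
        if acyclic:               # n-1 acyclic edges form a spanning tree
            best = max(best, sum(1 for v in nodes if degree[v] == 1))
    return best
```

On a cycle every node is ignorable, so the lower bound degenerates to zero, while on a five-node path only the middle node is ignorable.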

We begin with a result that complements Kleitman and West’s lower bounds on the number of leaves for graphs with minimum degree for any . Our lower bound does not rely on the degree constraint. The constant in Lemma 2 may be improved, but the subsequent lemmata and theorems only require it to be .

Every connected simple undirected graph $G$, except the singleton graph, has

$$\mathrm{leaf}(G) \ge \frac{1}{10}\bigl(|V(G)| - \mathrm{inode}(G)\bigr).$$

###### Proof.

Our proof is a generalization of the dead leaf argument due to Kleitman and West [32]. Let be a tree rooted at with as leaves for some arbitrary node initially, where denotes the neighbors of , and then grow iteratively by a node expansion order, defined below. By expanding at node , we mean to select a leaf node of and add all of ’s neighbors in , say , and their connecting edges, , to . In this way, every node outside cannot be a neighbor of any non-leaf node in . We say a leaf node in is dead if it has no neighbor in . Let denote the number of non-ignorable nodes in that joins while the -th operation is applied. Let denote the change of the number of leaf nodes in while the -th operation is applied. Let denote the change of the number of dead leaf nodes in while the -th operation is applied. The subscript may be removed when the context is clear. We need to secure that holds for each of the following operations and the initial operation.

Operation 1:

If has a leaf node that has neighbors outside , then expand at . In this case, , , and .

Operation 2:

If every leaf node in has at most one neighbor outside and some node has neighbors in , then expand at one of ’s neighbors in . In this case, , , and .

Operation 3:

This operation is used only when the previous two operations do not apply. Let be some leaf in that has exactly one neighbor not in . For each , if is defined and all neighbors of other than are outside and has degree two in , then define to be the neighbor of other than . Suppose that for are defined and is not defined, then we expand at for each in order. Though can be arbitrarily large, . If is not defined and has neighbors other than in (thus in this case otherwise Operation 2 applies), then we discuss in subcases:

Subcase 1 ():

It is impossible to have for this case.

Subcase 2 ():

Then and .

Subcase 3 ():

Then and .

If is not defined and has 0 neighbor other than in , then is either 1 or . For , and . For , and .

It is clear that one can expand to get a spanning tree of by a sequence of the above operations. Because all leaves are eventually dead, . Consequently, , as desired. ∎

Given Section 2, our goal is, for every constant , find a sparse subgraph of the input graph so that:

1. The nodes incident to the edges in can be dominated by a small set of at most nodes, i.e. either in or has at least one neighbor node in using the edges in , where is any optimal MLST of .

2. is connected.

Because of the existence of the small dominating set , one can obtain a forest from by adding some edges in so that the number of leaves in is no less than that in by and the number of connected components in is no more than that in by . Since is connected, one can further obtain a spanning tree from by adding at most edges in , so the number of leaves in is no less than that in by . Pick an associated with a sufficiently small , by Equation 1 is an -MLST sparsifier. A formal proof is given below.

For every integer $k \ge 1$, every connected simple undirected graph $G$ has

$$\mathrm{leaf}(S_k(G) \cup T(G)) \ge \left(1 - \frac{30\,(1+\ln(k+1))}{k+1}\right)\mathrm{leaf}(G).$$

###### Proof.

Let be a spanning tree of that has leaves. Let be some fixed integer at least and let . Let . Note that every node has , so and all neighbors of are not ignorable nodes in .

First, we show that can be dominated by a small set of size at most using some edges in . We obtain from two parts, and . is a random node subset sampled from the non-ignorable nodes in , in which each node is included in with probability independently, for some to be determined later. Thus, . Since every node is adjacent only to the non-ignorable nodes in , the probability that is not dominated by any node in is

$$\Pr[x \text{ is not dominated}] = (1-p)^{1+\deg_H(x)} \le (1-p)^{k+1}.$$

Let be the set of nodes in that are not dominated by any node in using the edges in . Thus,

$$\mathbb{E}[|S|] = \mathbb{E}[|S_1| + |S_2|] \le \bigl(p + (1-p)^{k+1}\bigr)\bigl(|V(G)| - \mathrm{inode}(G)\bigr).$$

Then, we obtain a forest from by adding some edges in as follows. Initially, .

Operation 1:

For each , if is an isolated node in and , then add an edge from to some node in to . Such an edge must exist because dominates .

Operation 2:

For each , if is not an isolated node in and the connected component that contains has an empty intersection with , then add an edge from to some node in to . Again, such an edge must exist because dominates .

For each leaf , if , then is a leaf in (also in unless ); otherwise , if is not a leaf in , then must be an isolated node in , and by Operation 1 is connected to some node in unless . Hence, except those in , every is a leaf node in , so the number of leaves in is no less than that in by . By Operation 2, the number of connected component is at most .

Lastly, since is connected, one can obtain a spanning tree from by connecting the components in by some edges in . Thus, the number of leaves in is no less than that in by . To obtain an -MLST sparsifier, by Section 2 we need:

$$\frac{3|S|}{\frac{1}{10}\bigl(|V(G)| - \mathrm{inode}(G)\bigr)} \le 30\bigl(p + (1-p)^{k+1}\bigr) \le 30\bigl(p + e^{-p(k+1)}\bigr) \le \varepsilon$$

Setting $p = \ln(k+1)/(k+1)$ gives the desired bound, and the leading constant is positive for sufficiently large $k$. ∎
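As a worked check of the last chain of inequalities, one choice of the sampling probability that reproduces the constant in the lemma statement is $p = \ln(k+1)/(k+1)$. This value is inferred here from the stated bound, so treat it as a reconstruction rather than a quotation.

```latex
\begin{align*}
(1-p)^{k+1} &\le e^{-p(k+1)}
  && \text{since } 1-p \le e^{-p}, \\
e^{-p(k+1)} &= e^{-\ln(k+1)} = \frac{1}{k+1}
  && \text{for } p = \frac{\ln(k+1)}{k+1}, \\
30\bigl(p + e^{-p(k+1)}\bigr) &= \frac{30\,\bigl(1+\ln(k+1)\bigr)}{k+1}.
\end{align*}
```

The right-hand side tends to zero as $k$ grows, so any target $\varepsilon > 0$ is met by a constant $k$ depending only on $\varepsilon$.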

To find such a subgraph $S_k(G) \cup T(G)$, fetching a spanning tree of the input graph and grabbing $\min\{k, \deg_G(x)\}$ edges for each node $x$ in $G$ suffices. Thus, we get a single-pass semi-streaming algorithm for the insertion-only model. As for the turnstile model, we use $\ell_0$-samplers [29] for each node $x$ in $G$ to fetch at least $\min\{k, \deg_G(x)\}$ neighbors of $x$ w.h.p., and fetch a spanning tree by appealing to the single-pass semi-streaming algorithm for spanning trees in dynamic streams [3]. This gives a proof of Section 1.
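For the insertion-only case, the construction can be sketched as a single loop over the edge stream. This is a minimal illustration under assumptions, not the paper's implementation: nodes are assumed to be labeled `0..n-1`, the function name `mlst_sparsifier_insertion_only` is hypothetical, and the rule "keep an edge if it merges two components or if either endpoint still has fewer than `k` kept edges" is one simple way to obtain a spanning tree together with at least `min(k, deg(x))` edges at every node.

```python
def mlst_sparsifier_insertion_only(n, edge_stream, k):
    """One pass over an insertion-only stream; returns the kept edge set.

    Keeps (a) every edge that joins two components of the kept graph,
    which yields a spanning forest, and (b) every edge for which some
    endpoint currently has fewer than k kept edges.
    """
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    kept = set()
    kept_degree = [0] * n
    for u, v in edge_stream:
        edge = (min(u, v), max(u, v))
        if u == v or edge in kept:
            continue                         # ignore loops and repeats
        ru, rv = find(u), find(v)
        joins_components = ru != rv
        needs_degree = kept_degree[u] < k or kept_degree[v] < k
        if joins_components or needs_degree:
            if joins_components:
                parent[ru] = rv
            kept.add(edge)
            kept_degree[u] += 1
            kept_degree[v] += 1
    return kept
```

Each edge kept by rule (b) can be charged to an endpoint that had fewer than `k` kept edges at that moment, so at most `n * k` such edges are stored in addition to the at most `n - 1` forest edges.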

#### Applications.

In [21], Gandhi et al. show a connection between the maximum-leaf spanning trees and connected maximum cut. Their results imply that, for any unweighted regular graph , the connected maximum cut can be found by the following two steps:

Step 1:

Find a spanning tree whose for some constant .

Step 2:

Randomly partition the leaves in into two parts and so that each leaf is included in with probability independently.

Then, outputting and yields an -approximation for connected maximum cut. Step 1 is the bottleneck and can be implemented by combining our -MLST sparsifier (Section 1) with the 2-approximation algorithm for MLST due to Solis-Oba, Bonsma, and Lowski [43]. This gives Section 2.

In the turnstile model, for every constant , there exists a randomized algorithm that can approximate the connected maximum cut for -node unweighted regular graphs to within a factor of with probability in a single pass using space.

## 3 Breadth-First Search Trees

A BFS tree of an $n$-node connected simple undirected graph can be constructed in $O(n)$ passes using $\tilde{O}(n)$ space by simulating the standard BFS algorithm layer by layer. By storing the entire graph, a BFS tree can be computed in a single pass using $O(n^2)$ space. In Section 3.1, we show that it is possible to have a smooth tradeoff between pass complexity and space usage. In Section 3.2, we prove Section 1, which shows that the above tradeoff can be improved when randomness is allowed, even in the turnstile model. Then, in Section 3.3, we show that multiple BFS trees, each starting from a distinct source node, can be computed more efficiently in a batch than individually. Lastly, we demonstrate an application to diameter approximation in Section 3.4.

In the BFS problem, we are given an $n$-node connected simple undirected graph $G$ and a distinguished node $s$, and it suffices to compute the distance $\mathrm{dist}(s,t)$ for each node $t$. To infer a BFS tree from the distance information, it suffices to assign to each node $t \ne s$, as its parent, the smallest-identifier node from the set $\{u \in N(t) : \mathrm{dist}(s,u) = \mathrm{dist}(s,t) - 1\}$, where $N(t)$ is the set of $t$'s neighbors. This can be done with one additional pass using $\tilde{O}(n)$ space in the insertion-only model. In the turnstile model, for $p$-pass streaming algorithms with , this can be done with additional passes w.h.p. using $\ell_0$-samplers [29] for each node $t$, and this costs space. For , the space bound is and one can use -samplers for each node, so this step can be done in one additional pass. Hence in the subsequent discussion we focus on computing the distance from $s$ to each node $t$.
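The parent-assignment step in the insertion-only model can be sketched as one extra pass over the edges. The function name `bfs_parents_from_distances` and the dictionary-based representation are assumptions made for illustration; `dist` is taken to hold the already computed distances from the source.

```python
def bfs_parents_from_distances(dist, edge_stream, source):
    """One pass over the edges: give every non-source node the
    smallest-identifier neighbor that is one step closer to the source."""
    parent = {}
    for u, v in edge_stream:
        # Consider the edge in both directions.
        for child, candidate in ((u, v), (v, u)):
            if child == source:
                continue
            if dist[candidate] == dist[child] - 1:
                if child not in parent or candidate < parent[child]:
                    parent[child] = candidate
    return parent
```

Only one parent pointer per node is stored, so the working memory is linear in the number of nodes regardless of how many edges pass by.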

### 3.1 A Simple Deterministic Algorithm

We present a simple deterministic -pass -space algorithm in the insertion-only model by an observation that every root-to-leaf path in a BFS tree cannot visit too many high-degree nodes (Section 3.1). Then, one can simulate the standard BFS algorithm efficiently layer-by-layer over high-degree nodes (Section 3.1).

Let $P$ be a root-to-leaf path in some BFS tree of an $n$-node connected simple undirected graph $G$. Then

$$\sum_{x \in P} \deg_G(x) \le 3n = O(n),$$

where $\deg_G(x)$ denotes the degree of $x$ in $G$.

###### Proof.

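Suppose $P$ comprises nodes $v_1, v_2, \ldots, v_\ell$ in order. Observe that if $v_i$ and $v_j$ have $|i - j| \ge 3$, then $v_i$ and $v_j$ cannot share any neighbor node; otherwise $P$ can be shortened, a contradiction. Thus, for each $r \in \{0, 1, 2\}$ the total contribution of all $v_i$'s whose $i \equiv r \pmod 3$ to the sum is at most $n$. Summing over all possible $r$ gives the bound. ∎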

We note that Lemma 3.1 is near-optimal. To see why, let where is the union of disjoint sets and . By setting for some parameter , , for every , and , any BFS tree rooted at the node in has a root-to-leaf path of length , and each node in has degree . Pick any such that and . We have .

Given an -node connected simple undirected graph with a distinguished node , a BFS tree rooted at can be found deterministically in passes using space for every in the insertion-only model.

###### Proof.

Given a parameter , our algorithm goes as follows. In the first pass, keep arbitrary neighbors for each node in memory and then use the in-memory edges to update the distance for each by any single-source shortest path algorithm. The set of the in-memory edges is an invariant after the first pass. Hence, the memory usage is . Then, in each of the subsequent passes, processing the edges in the stream one by one, without keeping them in memory after the processing, if (resp. if ), then update (resp. ). After the edges in the stream are all processed, use the in-memory edges to update the distance for each again by any single-source shortest path algorithm but with initial distances. Our algorithm repeats until no distance has been updated in a single pass.

Observe a root-to-leaf path in some BFS tree rooted at . Suppose contains exactly edges that appears only on tape, let them be where for every . Let be the predecessor of on that is closest to among nodes in . By the definition of the above construction, it is assured that for each . Thus by Section 3.1, . Then we appeal to the argument used for the analysis of Bellman-Ford algorithm [19, 6]. For every , if , attains the minimum possible value at the same pass when attains; otherwise for some , attains the minimum possible value at most one pass after attains. Hence, passes suffices to compute for all and this argument applies to all root-to-leaf paths. Setting yields the desired bound. ∎
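The multi-pass procedure described in this proof can be simulated offline as follows. This is a hedged sketch: the list `edges` stands in for the stream and is re-read once per pass, nodes are assumed to be labeled `0..n-1`, the name `bfs_distances_multipass` is hypothetical, and the storing rule (keep an edge in the first pass if either endpoint has fewer than `k` stored edges) is one concrete way to keep roughly `k` neighbors per node.

```python
import math
from collections import deque


def bfs_distances_multipass(n, edges, source, k):
    """Returns (distances from source, number of passes over `edges`)."""
    # Pass 1: store an edge if some endpoint still has fewer than k stored edges.
    stored = [[] for _ in range(n)]
    for u, v in edges:
        if len(stored[u]) < k or len(stored[v]) < k:
            stored[u].append(v)
            stored[v].append(u)

    dist = [math.inf] * n
    dist[source] = 0

    def relax_in_memory():
        # Label-correcting relaxation over the stored edges only.
        queue = deque(x for x in range(n) if dist[x] < math.inf)
        while queue:
            x = queue.popleft()
            for y in stored[x]:
                if dist[x] + 1 < dist[y]:
                    dist[y] = dist[x] + 1
                    queue.append(y)

    relax_in_memory()
    passes = 1
    while True:
        # One more pass: relax every streamed edge, storing nothing new.
        changed = False
        for u, v in edges:
            if dist[u] + 1 < dist[v]:
                dist[v] = dist[u] + 1
                changed = True
            elif dist[v] + 1 < dist[u]:
                dist[u] = dist[v] + 1
                changed = True
        passes += 1
        if not changed:
            break                 # no update in a full pass: distances are final
        relax_in_memory()
    return dist, passes
```

Larger `k` stores more edges and tends to need fewer passes, which is the tradeoff the theorem quantifies.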

### 3.2 A More Efficient Randomized Algorithm

In this section, we prove Theorem 1. Our BFS algorithm is based on the following generic framework, which has been applied to finding shortest paths in the parallel and the distributed settings [11, 22, 27, 46]. Sample a set of approximately distinguished nodes such that each node joins independently with probability , and with probability 1. By a Chernoff bound, with high probability. We will grow a local BFS tree of radius from each node in , and then we will construct the final BFS tree by combining them. We will rely on the following lemma, which first appeared in [46].

[[46]] Let be a specified source node. Let be a subset of nodes such that each node joins with probability , and joins with probability 1. For any given parameter , the following holds with probability . For each node , there is an - shortest path such that each of its -node subpath satisfies .

For notational simplicity, in subsequent discussion we write . Lemma 3.2 shows that for each node ,

$$\mathrm{dist}(s,t) = \min_{u \in U \cap N_h(t)} \bigl(\mathrm{dist}(s,u) + \mathrm{dist}(u,t)\bigr) \tag{2}$$

with probability where .

To see this, consider the - shortest path specified in Lemma 3.2. If the number of nodes in is less than , then the above claim holds because . Otherwise, Lemma 3.2 guarantees that there is a node with probability . Using Equation 2, a BFS tree can be computed using the following steps.

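1. Compute $\mathrm{dist}(u,v)$ for each $u \in U$ and $v \in N_h(u)$. Using this information, we can infer $\mathrm{dist}(s,u)$ for each $u \in U$.

2. Compute $\mathrm{dist}(s,t)$ for each $t$ by the formula $\mathrm{dist}(s,t) = \min_{u \in U \cap N_h(t)} \bigl(\mathrm{dist}(s,u) + \mathrm{dist}(u,t)\bigr)$.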

In what follows, we show how to implement the above two steps in the streaming model, using space and passes. By a change of parameter , we obtain Theorem 1.
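Ignoring the streaming constraints, the two steps can be sketched in memory as follows. The function names (`sample_hubs`, `bounded_bfs`, `bfs_via_hubs`) and the adjacency-dictionary input are assumptions for illustration; the sketch returns exact distances only when the sampled set hits every sufficiently long subpath of some shortest path, which the lemma guarantees with high probability for a suitable sampling rate.

```python
import random
from collections import deque


def sample_hubs(nodes, prob, seed=0):
    """Each node joins the hub set independently with probability `prob`."""
    rng = random.Random(seed)
    return {v for v in nodes if rng.random() < prob}


def bounded_bfs(adj, root, radius):
    """Distances from `root` to all nodes within `radius` hops."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        x = queue.popleft()
        if dist[x] == radius:
            continue                       # do not expand past the radius
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return dist


def bfs_via_hubs(adj, source, hubs, radius):
    hubs = set(hubs) | {source}            # the source always joins the hub set
    # Step 1: local BFS from every hub, then hub-to-hub Bellman-Ford.
    local = {u: bounded_bfs(adj, u, radius) for u in hubs}
    inf = float("inf")
    d_hub = {u: inf for u in hubs}
    d_hub[source] = 0
    for _ in range(len(hubs) - 1):
        updated = False
        for u in hubs:
            for w, d_uw in local[u].items():
                if w in hubs and d_hub[u] + d_uw < d_hub[w]:
                    d_hub[w] = d_hub[u] + d_uw
                    updated = True
        if not updated:
            break
    # Step 2: combine hub distances with local distances (Equation 2).
    dist = {t: inf for t in adj}
    for u in hubs:
        for t, d_ut in local[u].items():
            dist[t] = min(dist[t], d_hub[u] + d_ut)
    return dist
```

For example, on a ten-node path with hubs at positions 3, 6, and 9 and radius 3, every node is within three hops of a hub on its shortest path from node 0, so the combined distances are exact.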

#### Step 1.

To compute for each and , we let each initiate a radius- local BFS rooted at . A straightforward implementation of this approach in the streaming model costs passes and space, since we need to maintain search trees simultaneously.

We show that the space requirement can be improved to . Since we only need to learn the distances between nodes in , we are allowed to forget distance information associated with nodes when it is no longer needed. Specifically, suppose we start the BFS computation rooted at at the th pass, where is some number to be determined. For each , the induction hypothesis specifies that at the beginning of the th pass, all nodes in have learned that . During the th pass, for each node with , we check if has a neighbor in . If so, then we learn that .

In the above BFS algorithm, if for some , then we learn the fact that during the th pass. Observe that such information is only needed during the next two passes. After the end of the th pass, for each with , we are allowed to forget that . That is, only needs to participate in the BFS computation rooted at during these three passes .

For each , we assign the starting time independently and uniformly at random from . Lemma 3.2 shows that for each node and for each pass , the number of BFS computations that involve is . The idea of using random starting time to schedule multiple algorithms to minimize congestion can be traced back from [33]. Note that is the criterion for to participate in the BFS rooted at during the th pass.

For each node , and for each integer , with high probability, the number of nodes such that is at most .

###### Proof.

Given two nodes and , and a fixed number , the probability that is at most . Let be the total number of such that . The expected value of can be upper bounded by . By a Chernoff bound, with high probability, . ∎

Recall that with high probability, and . By Lemma 3.2, we only need space per each to do the radius- BFS computation from all nodes . That is, the space complexity is . To store the distance information for each and , we need space. Thus, the algorithm for Step 1 costs space. The number of passes is .

In the insertion-only model, the implementation is straightforward. In the turnstile model, care has to be taken when implementing the above algorithm. We write to be the high probability upper bound on the number of BFS computation that a node participates in a single pass. We write . Let be random subsets of such that each joins each with probability , independently. Consider a node and consider the th pass. Let be the subset of such that if , i.e., the BFS computation rooted at hits during the th pass. We know that with high probability . By our choice of , we can infer that with high probability for each there is at least one index such that .

To implement the th pass in the turnstile model, each node virtually maintains edge set . For each insertion (resp., deletion) of an edge satisfying for some , we add (resp., remove) the edge from the set . After processing the entire data stream, we take one edge out of each edge set . In view of the above discussion, it suffices to only consider these edges when we grow the BFS trees. This can be implemented using -samplers per each node , and the space complexity is still .

#### Step 2.

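At this moment we have computed $\mathrm{dist}(s,u)$ for each $u \in U$. Now we need to compute $\mathrm{dist}(s,t)$ for each $t$ by the formula $\mathrm{dist}(s,t) = \min_{u \in U \cap N_h(t)} \bigl(\mathrm{dist}(s,u) + \mathrm{dist}(u,t)\bigr)$.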

In the insertion-only model, this task can be solved using iterations of Bellman-Ford steps. Initially, for each , and for each . During the th pass, we do the update . By Equation 2, we can infer that for each . A straightforward implementation of this procedure costs space and passes.

In the turnstile model, we can solve this task by growing a radius- BFS tree rooted at , for each , as in Step 1. During the process, each node maintains a variable which serves as the estimate of . Initially, . When the partial BFS tree rooted at hits , we update to be the minimum of the current value of and . At the end of the process, we have for each . This costs space and passes in view of the analysis of Step 1.

### 3.3 Extensions

In this section we consider the problem of solving instances of BFS simultaneously for some and the simpler problem of computing the pairwise distances between the given nodes. Both of these problems can be solved via a black box application of Section 1. Here we show that it is possible to obtain better upper bounds.

Given an -node undirected graph , for any given parameters , the pairwise distances between all pairs of nodes in a given set of nodes in can be computed with probability using passes and space in the turnstile model.

###### Proof.

Let be the input node set of size . Consider the modified Step 1 of our algorithm where each is included in with probability 1. Since , we still have with high probability. Recall that Step 1 of our algorithm calculates for each and in space and passes. Applying Equation 2 for each , we obtain the pairwise distances between all pairs of nodes in , which includes as a subset. There is no need to do Step 2. ∎

For example, if , then Theorem 3.3 implies that we can compute the pairwise distances between all pairs of nodes in a given set of nodes in space and passes.

Given an -node undirected graph , for any given parameters , one can solve instances of BFS with probability using passes and space in the turnstile model.

###### Proof.

Let be the node set of size corresponding to the roots of the BFS instances. Consider the following modifications to our BFS algorithm.

Same as the proof of Theorem 3.3, in Step 1, include each in with probability 1. The modified Step 1 still takes space and passes, and it outputs the pairwise distances between all pairs of nodes in .

Now consider Step 2. In the insertion-only model, recall that a BFS tree rooted at a node can be constructed in space and passes using iterations of Bellman-Ford steps. The cost of constructing all BFS trees is then space and passes.

In the turnstile model, we can also use the strategy of growing a radius- BFS tree rooted at , for each . During the process, each node maintains variables serving as the estimates of , for all . The complexity of growing radius- BFS trees is still space and passes. The extra space cost for maintaining these variables is . ∎

For example, if , then Theorem 3.3 implies that we can solve instances of BFS in space and passes. Note that the space complexity of is necessary to output BFS trees.

The preceding theorem on multiple BFS instances immediately gives the following corollary.

Given an -node connected undirected graph with unweighted edges and a -node subset of , for any given parameters , finding a Steiner tree in that spans can be approximated to within a factor of with probability using passes and space in the turnstile model.

Note that if we do not need to construct a Steiner tree, and only need to approximate the size of an optimal Steiner tree, then the pairwise-distance theorem can be used in place of the multiple-BFS theorem.
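One standard way to estimate the size of a Steiner tree from pairwise distances alone is the classical metric-closure heuristic: compute a minimum spanning tree of the complete graph on the terminals, weighted by their pairwise distances. Whether this is exactly the argument the authors have in mind is an assumption; the function name and the `frozenset`-keyed distance table in the sketch below are illustrative.

```python
def metric_mst_weight(terminals, dist):
    """Weight of a minimum spanning tree of the complete graph on
    `terminals`, where dist[frozenset((u, v))] is the distance between
    u and v in the input graph (Prim's algorithm)."""
    terminals = list(terminals)
    if not terminals:
        return 0
    root = terminals[0]
    # best[t] = cheapest connection from t to the tree built so far.
    best = {t: dist[frozenset((root, t))] for t in terminals[1:]}
    total = 0
    while best:
        t = min(best, key=best.get)
        total += best.pop(t)
        for u in best:
            best[u] = min(best[u], dist[frozenset((t, u))])
    return total
```

For three leaves of a star with unit edges, the pairwise distances are all 2, so the heuristic returns 4 while the optimal Steiner tree has 3 edges.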

### 3.4 Diameter Approximation

It is well known that the maximum distance label in a BFS tree gives a 2-approximation of the diameter. We show that it is possible to improve the approximation ratio to nearly 3/2 without sacrificing the space and pass complexities.

Roditty and Williams [41] showed that a nearly 3/2-approximation of the diameter can be computed with high probability as follows.

1. Let be a node set chosen by including each node in with probability , independently. Perform a BFS from each node .

2. Let be a node chosen to maximize . Break ties arbitrarily. Perform a BFS from .

3. Let be the node set consisting of the nodes closest to , where ties are broken arbitrarily. Perform a BFS from each node .

Let be the maximum distance label ever computed during the BFS computations in the above procedure. Roditty and Williams [41] proved that satisfies , where is the diameter of .
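The three steps above can be sketched offline (ignoring streaming constraints) as follows. This is a minimal illustration, not the streaming implementation: the sampling probability and the size of the set of closest nodes are elided in the text, so `sample_prob` and `num_closest` are caller-supplied assumptions, and the graph is assumed to be connected and given as an adjacency dict.

```python
import random
from collections import deque


def bfs(adj, root):
    """Hop distances from `root` in an unweighted graph."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def approx_diameter(adj, sample_prob, num_closest, rng):
    """Largest BFS distance label seen in the three-step procedure."""
    nodes = list(adj)
    best = 0
    # Step 1: sample a node set and run a BFS from each sampled node.
    sample = [v for v in nodes if rng.random() < sample_prob]
    dist_to_sample = {v: float("inf") for v in nodes}
    for s in sample:
        dist = bfs(adj, s)
        best = max(best, max(dist.values()))
        for v, d in dist.items():
            dist_to_sample[v] = min(dist_to_sample[v], d)
    # Step 2: pick a node farthest from the sample (ties arbitrary).
    w = max(nodes, key=lambda v: dist_to_sample[v])
    dist_w = bfs(adj, w)
    best = max(best, max(dist_w.values()))
    # Step 3: run a BFS from each of the nodes closest to w.
    closest = sorted(nodes, key=lambda v: dist_w[v])[:num_closest]
    for t in closest:
        best = max(best, max(bfs(adj, t).values()))
    return best
```

Because every value returned is a genuine BFS distance label, the estimate never exceeds the true diameter; the lower bound depends on the parameter choices of Roditty and Williams [41].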

The algorithm of Roditty and Williams [41] can be implemented in the streaming model by applying Theorem 3.3 with , but we can do better. Note that when we perform BFS from the nodes in and , it is not necessary to store the entire BFS trees. For example, in order to select , we only need to let each node know , which is the maximum distance label of in all BFS trees computed in Step 1. Therefore, the term in the space complexity of Theorem 3.3 can be improved to . That is, the space and pass complexities are the same as the cost for computing a single BFS tree using Section 1. We conclude the following theorem.

Given an -node connected undirected graph , a diameter approximation satisfying , where is the diameter of , can be computed with probability in passes using space, for each in the turnstile model.

## 4 Depth-First Search

A straightforward implementation of the naive DFS algorithm in the streaming model costs either passes with space or a single pass with space. Khan and Mehta [31] recently showed that it is possible to obtain a smooth tradeoff between the two extremes. Specifically, they designed an algorithm that requires at most passes using space, where is any positive integer. Furthermore, when the height of the computed DFS tree is small, they reduce the number of passes further, to . In Section 4.1, we provide a very simple alternative proof of this result, via sparse certificates for k-node-connectivity.

In the worst case, the product “space × number of passes” of the algorithms of Khan and Mehta [31] is still . In Sections 4.2 and 4.3, we show that it is possible to improve this upper bound asymptotically when the number of passes is super-constant. Specifically, for any parameters , we obtain the following DFS algorithms.

• A deterministic algorithm using passes and space in the insertion-only model. After balancing the parameters, the space complexity is