Depth First Search in the Semi-streaming Model

The depth first search (DFS) tree is a fundamental data structure for solving various graph problems. The classical DFS algorithm requires O(m+n) time for a graph having n vertices and m edges. In the streaming model, an algorithm is allowed several passes (preferably a single pass) over the input graph, with a restriction on the size of the local space used. Trivially, a DFS tree can be computed in a single pass using O(m) space. In the semi-streaming model allowing O(n) space, it can be computed in O(n) passes, where each pass adds one vertex to the DFS tree. However, it remains an open problem to compute a DFS tree in o(n) passes using o(m) space, even in any relaxed streaming environment. We present the first semi-streaming algorithms that compute a DFS tree of an undirected graph in o(n) passes using o(m) space. We first describe an extremely simple algorithm that requires at most n/k passes using O(nk) space, where k is any positive integer. We then improve this algorithm by using more involved techniques to reduce the number of passes to h/k under similar space constraints, where h is the height of the computed DFS tree. In particular, this algorithm improves the bounds for the case where the computed DFS tree is shallow (having o(n) height). Moreover, this algorithm is presented as a framework that allows the flexibility of using any algorithm to maintain a DFS tree of a stored sparser subgraph as a black box, which may be of independent interest. Both these algorithms essentially demonstrate the existence of a trade-off between the space and the number of passes required for computing a DFS tree. Furthermore, we evaluate these algorithms experimentally, which reveals their exceptional performance in practice. For both random and real graphs, they require merely a few passes even when allowed just O(n) space.

1 Introduction

Depth first search (DFS) is a well known graph traversal technique. Right from the seminal work of Tarjan [51], DFS traversal has played an important role in the design of efficient algorithms for many fundamental graph problems, namely, biconnected components, strongly connected components, topological sorting [54], dominators in directed graphs [52], etc. Even in undirected graphs, DFS traversal has various applications including computing connected components, cycle detection, edge and vertex connectivity [22] (via articulation points and bridges), bipartite matching [35], planarity testing [36], etc. In this paper, we address the problem of computing a DFS tree in the semi-streaming environment.

The streaming model [3, 26, 31] is a popular model for computation on large data sets, in which many algorithms have been developed [28, 34, 31, 37] to address significant problems. The model requires the entire input data to be accessed as a stream, typically in a single pass over the input, allowing a very small amount of storage (poly log in the input size). A streaming algorithm must judiciously choose the data to be saved in the small space, so that the computation can be completed successfully. In the context of graph problems, this model is adopted in the following fashion. For a given graph $G = (V, E)$ having $n$ vertices, an input stream sends the graph edges in $E$ in an arbitrary order only once, and the allowed size of local storage is $O(\mathrm{poly} \log n)$. The algorithm iteratively asks for the next edge and performs some computation. After the stream is over, the final computation is performed and the result is reported. At no time during the entire process should the total size of stored data exceed $O(\mathrm{poly} \log n)$.

In general only statistical properties of the graph are computable under this model, making it impractical for use in more complicated graph problems [24, 32]. A prominent exception to the above claim is the problem of counting triangles (3-cycles) in a graph [6]. Consequently, several relaxed models have been proposed with the goal of solving more complex graph problems. One such model is called the semi-streaming model [48, 25], which relaxes the storage size to $O(n \, \mathrm{poly} \log n)$. Several significant problems have been studied under this model (surveys in [49, 59, 47]). Moreover, even though it is preferred to allow only a single pass over the input stream, several hardness results [34, 16, 25, 14, 33] have reported the limitations of using a single pass (or even $O(1)$ passes). This has led to the development of various multi-pass algorithms [25, 24, 46, 2, 39, 38] in this model. Further, several streaming algorithms maintaining approximate distances [24, 7, 19] are also known to require $n^{1+\Omega(1/p)}$ space (for some constant $p$), relaxing the requirement of $O(n \, \mathrm{poly} \log n)$ space.

Now, a DFS tree of a graph can be computed in a single pass if $O(m)$ space is allowed. If the space is restricted to $O(n)$, it can be trivially computed using $O(n)$ passes over the input stream, where each pass adds one vertex to the tree. This can also be easily improved to $h$ passes, where $h$ is the height of the computed DFS tree. Although most applications of DFS trees in undirected graphs are efficiently solved in the semi-streaming environment [56, 25, 24, 4, 5, 23, 41], due to its fundamental nature DFS is considered a long standing open problem [23, 49, 50] even for undirected graphs. Moreover, computing a DFS tree in $O(\mathrm{poly} \log n)$ passes is considered hard [23]. To the best of our knowledge, it remains an open problem to compute a DFS tree using $o(n)$ passes even in any relaxed streaming environment.

In our results, we borrow some key ideas from recent sequential algorithms [10, 8] for maintaining dynamic DFS of undirected graphs. Recently, similar ideas were also used by Khan [40], who presented a semi-streaming algorithm that uses $O(n)$ space for maintaining dynamic DFS of an undirected graph, requiring $O(\log^2 n)$ passes per update.

1.1 Our Results

We present the first semi-streaming algorithms to compute a DFS tree of an undirected graph in $o(n)$ passes. Our first result can be described using the following theorem.

Theorem 1.1.

Given an undirected graph $G$, a DFS tree of the graph can be computed by a semi-streaming algorithm in at most $\lceil n/k \rceil$ passes using $O(nk)$ space, requiring $O(m\alpha(m,n))$ time per pass.

As described earlier, a simple algorithm can compute the DFS tree in $h$ passes, where $h$ is the height of the DFS tree. Thus, for graphs having a DFS tree with height $o(n)$ (see Appendix B for details), we improve our result in the following theorem.

Theorem 1.2.

Given an undirected graph $G$, a DFS tree of $G$ can be computed by a semi-streaming algorithm in $\lceil h/k \rceil$ passes using $O(nk)$ space, requiring $O(m + nk)$ amortized time per pass for any integer $k \le h$, where $h$ is the height of the computed DFS tree (see Footnote 1).

Footnote 1: Note that there can be many DFS trees of a graph having varying heights, say $h_{min}$ to $h_{max}$. Our algorithm does not guarantee the computation of a DFS tree having minimum height $h_{min}$; rather, it simply computes a valid DFS tree with height $h$, where $h_{min} \le h \le h_{max}$.

Since the space typically allowed in the semi-streaming model is $O(n \, \mathrm{poly} \log n)$, the improvement in the upper bounds of the problem by our results is considerably small (up to $\mathrm{poly} \log n$ factors). Recently, Elkin [20] presented the first $o(n)$ pass algorithm for computing shortest paths trees. Using $O(nk)$ local space, it computes the shortest path tree from a given source in $O(n/k)$ passes for unweighted graphs, and in $O(n \log n / k)$ passes for weighted graphs. The significance of such results, despite improving the upper bounds by only small factors, is substantial because they address fundamental problems. The lack of any progress for such fundamental problems despite several decades of research on streaming algorithms further highlights the significance of such results. Moreover, allowing $O(n^{1+\epsilon})$ space (as in [24, 7, 19]), such results improve the upper bound significantly, by $O(n^{\epsilon})$ factors. Furthermore, they demonstrate the existence of a trade-off between the space and the number of passes required for computing such fundamental structures.

Our final algorithm is presented in the form of a framework, which can use any algorithm for maintaining a DFS tree of a stored sparser subgraph, provided that it satisfies the property of monotonic fall. Such a framework allows more flexibility and is hopefully much easier to extend to better algorithms for computing a DFS tree, or to other problems requiring the computation of a DFS tree. Hence we believe our framework would be of independent interest.

We also augment our theoretical analysis with an experimental evaluation of our proposed algorithms. For both random and real graphs, the algorithms require merely a few passes even when the allowed space is just $O(n)$. The exceptional performance and surprising observations of our experiments on random graphs might also be of independent interest.

1.2 Overview

We now briefly describe the outline of our paper. In Section 2 we establish the terminology and notations used in the remainder of the paper. In order to present the main ideas behind our approach in a simple and comprehensible manner, we present the algorithm in four stages. Firstly, in Section 3, we describe the basic algorithm to build a DFS tree in $n$ passes, which adds a new vertex to the DFS tree in every pass over the input stream. Secondly, in Section 3.1, we improve this algorithm to compute a DFS tree in $h$ passes, where $h$ is the height of the final DFS tree. This algorithm essentially computes all the vertices in the next level of the currently built DFS tree simultaneously, building the DFS tree by one level in each pass over the input stream. Thus, in the $i$-th pass every vertex on the $i$-th level of the DFS tree is computed. Thirdly, in Section 4, we describe an advanced algorithm which uses $O(nk)$ space to add a path of length at least $k$ to the DFS tree in every pass over the input stream. Thus, the complete DFS tree can be computed in $\lceil n/k \rceil$ passes. Finally, in Section 5, we improve the algorithm to simultaneously add all the subtrees constituting the next $k$ levels of the final DFS tree starting from the leaves of the current tree $T$. Thus, $k$ levels are added to the DFS tree in each pass over the input stream, computing the DFS tree in $\lceil h/k \rceil$ passes. As described earlier, our final algorithm is presented in the form of a framework which uses, as a black box, any algorithm to maintain a DFS tree of a stored sparser subgraph, satisfying certain properties. In the interest of completeness, one such algorithm is described in Appendix C. Lastly, in Section 6, we present the results of the experimental evaluation of these algorithms. The details of this evaluation are deferred to Appendix D.

In our advanced algorithms, we employ two interesting properties of a DFS tree, namely, the components property [8] and the min-height property. These simple properties of any DFS tree prove crucial in building the DFS efficiently in the streaming environment.

2 Preliminaries

Let $G = (V, E)$ be an undirected connected graph having $n$ vertices and $m$ edges. The DFS traversal of $G$ starting from any vertex $r \in V$ produces a spanning tree rooted at $r$ called a DFS tree, in $O(m + n)$ time. For any rooted spanning tree of $G$, a non-tree edge of the graph is called a back edge if one of its endpoints is an ancestor of the other in the tree, else it is called a cross edge. A necessary and sufficient condition for any rooted spanning tree to be a DFS tree is that every non-tree edge is a back edge.
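The back-edge characterization above translates directly into a checker. The following is a minimal sketch, assuming the tree is given as a hypothetical `parent` dictionary (with the root mapped to `None`) and the graph as a list of undirected edges; the name `is_dfs_tree` is illustrative and does not come from the paper.

```python
def is_dfs_tree(parent, edges):
    """Check that every edge joins a vertex to one of its ancestors.

    parent: dict mapping each vertex to its tree parent (root maps to None).
    edges:  iterable of undirected graph edges (u, v).
    Tree edges pass trivially; a failing edge is a cross edge.
    """
    def ancestors(v):
        # Collect v and all of its ancestors up to the root.
        seen = set()
        while v is not None:
            seen.add(v)
            v = parent[v]
        return seen

    for u, v in edges:
        if u not in ancestors(v) and v not in ancestors(u):
            return False  # (u, v) is a cross edge
    return True
```

For a triangle on vertices 1, 2, 3, the path 1-2-3 rooted at 1 passes this check, while the star rooted at 1 fails because the edge (2, 3) becomes a cross edge.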

In order to handle disconnected graphs, we add a dummy vertex $r$ to the graph and connect it to all vertices. Our algorithm computes a DFS tree rooted at $r$ in this augmented graph, where each child subtree of $r$ is a DFS tree of a connected component in the DFS forest of the original graph. The following notations will be used throughout the paper.

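  • $T$: The DFS tree of $G$ incrementally computed by our algorithm.

  • $par(w)$: Parent of $w$ in $T$.

  • $T(x)$: The subtree of $T$ rooted at vertex $x$.

  • $root(T')$: Root of a subtree $T'$ of $T$, i.e., $root(T(x)) = x$.

  • $level(v)$: Level of vertex $v$ in $T$, where $level(root(T)) = 0$ and $level(v) = level(par(v)) + 1$.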

In this paper we will discuss algorithms to compute a DFS tree $T$ for the input graph $G$ in the semi-streaming model. In all the cases $T$ will be built iteratively starting from an empty tree. At any time during the algorithm, we shall refer to the vertices that are not a part of the DFS tree $T$ as unvisited and denote them by $V'$, i.e., $V' = V \setminus T$. Similarly, we refer to the subgraph induced by the unvisited vertices, $G' = G(V')$, as the unvisited graph. Unless stated otherwise, we shall refer to a connected component of the unvisited graph $G'$ as simply a component. For any component $C$, the set of edges and vertices in the component will be denoted by $E_C$ and $V_C$. Further, each component $C$ maintains a spanning tree of the component that shall be referred to as $T_C$. We refer to a path $p$ in a DFS tree $T$ as an ancestor-descendant path if one of its endpoints is an ancestor of the other in $T$. Since the DFS tree grows downwards from the root, a vertex $u$ is said to be higher than vertex $v$ if $level(u) < level(v)$. Similarly, among two edges incident on an ancestor-descendant path $p$, an edge $(x, y)$ is higher than edge $(u, v)$ if $y, v \in p$ and $level(y) < level(v)$.

We shall now describe two invariants such that any algorithm that computes the DFS tree $T$ incrementally, and satisfies these invariants at every stage, ensures the absence of cross edges in $T$ and hence the correctness of the final DFS tree $T$.

Invariants:

  • $I_1$: All non-tree edges among vertices in $T$ are back edges, and

  • $I_2$: For any component $C$ of the unvisited graph, all the edges from $C$ to the partially built DFS tree $T$ are incident on a single ancestor-descendant path of $T$.

We shall also use the components property by Baswana et al. [8], described as follows.

Figure 1: Edges $e'_1$ and $e'_2$ can be ignored during the DFS traversal (reproduced from [8]).

Lemma 2.1 (Components Property [8]).

Consider a partially completed DFS traversal where $T$ is the partially built DFS tree. Let the connected components of $G'$ be $C_1, \ldots, C_t$. Consider any two edges $e_i$ and $e'_i$ from $C_i$ that are incident respectively on a vertex $x_i$ and some ancestor (not necessarily proper) $w_i$ of $x_i$ in $T$. Then it is sufficient to consider only $e_i$ during the DFS traversal, i.e., the edge $e'_i$ can be safely ignored.

Ignoring $e'_i$ during the DFS traversal, as stated in the components property, is justified because $e'_i$ will appear as a back edge in the resulting DFS tree (refer to Figure 1). For each component $C_i$, the edge $e_i$ can be found using a single pass over all the graph edges.

3 Simple Algorithms

We shall first briefly describe the trivial algorithm to compute a DFS tree of a (directed) graph using $O(n)$ passes. Since we are limited to only $O(n)$ space, we cannot store the adjacency lists of the vertices in the graph. Recall that in the standard DFS algorithm [51], after visiting a vertex $v$, we choose any unvisited neighbour of $v$ and visit it. If no neighbour of $v$ is unvisited, the traversal retreats to the parent of $v$ and looks for its unvisited neighbour, and so on.

In the streaming model, we can use the same algorithm. However, we do not store the adjacency list of a vertex. To find an unvisited neighbour of a vertex, we perform a complete pass over the edges in $E$. The algorithm only stores the partially built DFS tree $T$ and the status of each vertex (whether it is visited, i.e., added to $T$). Thus, for each vertex $v$ (except $r$) one pass is performed to add $v$ to $T$ and another is performed before retreating to the parent of $v$. Hence, it takes $2n$ passes to complete the algorithm, since $T$ is initialized with the root $r$. Since this procedure essentially simulates the standard DFS algorithm [51], it clearly satisfies the invariants $I_1$ and $I_2$.

This procedure can be easily transformed to require only $n$ passes by avoiding an extra pass for retreating from each vertex. Let $v$ be the vertex most recently added to $T$. In each pass we find an edge $e$ (from the stream) from the unvisited vertices, $V'$, to the lowest vertex on the ancestor-descendant path connecting $r$ and $v$, i.e., the one closest to $v$. Hence $e$ would be an edge from the lowest (maximum level) ancestor of $v$ (not necessarily proper) having at least one unvisited neighbour. Recall that if $v$ does not have an unvisited neighbour we move to processing its parent, and so on until we find an ancestor having an unvisited neighbour. We can thus directly add the edge $e$ to $T$. Hence, retreating from a vertex would not require an additional pass and the overall procedure can be completed in $n$ passes, each pass adding a new vertex to $T$. Moreover, this requires $O(1)$ processing time per edge and an extra $O(n)$ time at the end of the pass to find the relevant ancestor. Refer to Appendix E for the pseudocode of the procedure. Thus, we get the following result.

Theorem 3.1.

Given a directed/undirected graph $G$, a DFS tree of $G$ can be computed by a semi-streaming algorithm in $n$ passes using $O(n)$ space, using $O(m)$ time per pass.
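The one-vertex-per-pass procedure can be sketched as follows. This is only an illustration under stated assumptions: the graph is undirected and connected with vertices `0..n-1`, the dummy root is omitted (so a connected graph needs `n - 1` passes here rather than `n`), and the stream is modelled by a hypothetical zero-argument callable `stream()` that yields every edge once per call. None of these names come from the paper.

```python
def dfs_one_vertex_per_pass(n, stream, root=0):
    """Build a DFS tree using O(n) memory, adding one vertex per pass."""
    parent, level = {root: None}, {root: 0}
    current, passes = root, 0
    while len(parent) < n:
        # Vertices on the tree path from the root to the last added vertex.
        on_path = set()
        x = current
        while x is not None:
            on_path.add(x)
            x = parent[x]
        best = None  # edge (a, b): a on the path, b unvisited, level[a] maximal
        passes += 1
        for u, v in stream():
            for a, b in ((u, v), (v, u)):
                if a in on_path and b not in parent:
                    if best is None or level[a] > level[best[0]]:
                        best = (a, b)
        if best is None:
            break  # no reachable unvisited vertex is left
        a, b = best
        parent[b], level[b] = a, level[a] + 1
        current = b
    return parent, passes
```

Only the tree, the levels and one candidate edge are kept between edges of the stream, which is what keeps the memory at $O(n)$.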

3.1 Improved algorithm

We shall now describe how this simple algorithm can be improved to compute a DFS tree of an undirected graph in $h$ passes, where $h$ is the height of the computed DFS tree. The main idea behind this approach is that each component $C$ of the unvisited graph $G'$ will constitute a separate subtree of the final DFS tree. Hence all such subtrees can be computed independently of each other, in parallel (this idea was also used by [40]).

Using one pass over the edges in $E$, the components of the unvisited graph $G'$ can be found by using the Union-Find algorithm [53, 55] on the edges of $G'$. Now, using the components property we know that it is sufficient to add the lowest edge from each component to the DFS tree $T$. At the end of the pass, for each component $C$ we find the edge $(x_C, y_C)$ incident from the lowest vertex $x_C \in T$ to some vertex $y_C \in C$ and add it to $T$. Note that in the next pass, for each component of $C \setminus \{y_C\}$ the lowest edge connecting it to $T$ would necessarily be incident on $y_C$, as $C$ was connected. Hence, instead of the lowest edge incident on $T$, we store an edge $e$ from $y \in V'$ only if $e$ is incident on some leaf of $T$. Refer to Appendix E for the pseudocode of the algorithm.

To prove the correctness of the algorithm, we shall prove using induction that the invariants $I_1$ and $I_2$ hold over the passes performed on $E$. Since $T$ is initialized as an isolated vertex $r$, both invariants trivially hold. Now, let the invariants hold at the beginning of a pass. Using $I_2$, each component $C$ can have edges only to a single ancestor-descendant path, from $r$ to $x_C$. Thus, adding the edge $(x_C, y_C)$ for each component $C$ would not violate $I_1$ at the end of the pass, given that $I_1$ holds at the beginning of the pass. Additionally, from each component $C$ we add a single vertex $y_C$ as a child of $x_C$ to $T$. Hence for any component of $C \setminus \{y_C\}$, the edges to $T$ can only be to ancestors of $y_C$ (using $I_2$ of the previous pass), and an edge necessarily to $y_C$, satisfying $I_2$ at the end of the pass. Hence, using induction both $I_1$ and $I_2$ are satisfied, proving the correctness of our algorithm.

Further, since each component $C$ in any pass necessarily has an edge to a leaf of $T$, in the $i$-th pass the new vertex $y_C$ is added to the $i$-th level of $T$. This also implies that every vertex at level $i$ of the final DFS tree is added during the $i$-th pass. Hence, after $h$ passes we get a DFS tree of the whole graph, as $h$ is the height of the computed DFS tree. Now, the total time required to compute the connected components is $O(m\alpha(m,n))$ (see Footnote 2). And computing an edge from each unvisited vertex to a leaf in $T$ requires $O(m)$ time using $O(n)$ space. Thus, we have the following theorem.

Footnote 2: The Union-Find algorithm [53, 55] requires $O(m\alpha(m,n))$ time, where $\alpha(m,n)$ is the inverse Ackermann function.

Theorem 3.2.

Given an undirected graph $G$, a DFS tree of $G$ can be computed by a semi-streaming algorithm in $h$ passes using $O(n)$ space, where $h$ is the height of the computed DFS tree, using $O(m\alpha(m,n))$ time per pass.
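A level-per-pass variant can be sketched in the same style. The sketch below is an assumption-laden illustration, not the pseudocode of Appendix E: it keeps, for every unvisited vertex, its lowest visited neighbour (rather than only edges to leaves), uses a small Union-Find with path halving, omits the dummy root, and reuses the hypothetical `stream()` callable convention. All identifiers are made up for the example.

```python
def find(uf, x):
    """Union-Find lookup with path halving."""
    while uf[x] != x:
        uf[x] = uf[uf[x]]
        x = uf[x]
    return x

def dfs_level_per_pass(n, stream, root=0):
    """Add one vertex from every component of the unvisited graph per pass."""
    parent, level = {root: None}, {root: 0}
    passes = 0
    while len(parent) < n:
        passes += 1
        uf = {v: v for v in range(n) if v not in parent}
        lowest = {}  # unvisited vertex -> visited neighbour of maximum level
        for u, v in stream():
            if u not in parent and v not in parent:
                uf[find(uf, u)] = find(uf, v)  # edge inside the unvisited graph
                continue
            for a, b in ((u, v), (v, u)):
                if a in parent and b not in parent:
                    if b not in lowest or level[a] > level[lowest[b]]:
                        lowest[b] = a
        chosen = {}  # component representative -> lowest edge (x, y)
        for y, x in lowest.items():
            c = find(uf, y)
            if c not in chosen or level[x] > level[chosen[c][0]]:
                chosen[c] = (x, y)
        if not chosen:
            break  # the remaining vertices are unreachable from the root
        for x, y in chosen.values():
            parent[y], level[y] = x, level[x] + 1
    return parent, level, passes
```

On a graph whose DFS tree from the root is shallow, the number of passes equals the height of the tree that is built, regardless of how many vertices each level contains.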

4 Computing DFS in sublinear number of passes

Since a DFS tree may have $\Theta(n)$ height, we cannot hope to compute a DFS tree in a sublinear number of passes using the previously described simple algorithms. The main difference between the advanced approaches and the simple algorithms is that, in each pass, instead of adding a single vertex (say $y$) to the DFS tree, we shall be adding an entire path (starting from $y$) to the DFS tree. The DFS traversal gives the flexibility to choose the next vertex to be visited as long as the DFS property is satisfied, i.e., invariants $I_1$ and $I_2$ are maintained.

Hence in each pass we do the following for every component $C$ in $G'$. Instead of finding a single edge $(x_C, y_C)$ (see Section 3.1), we find a path $P$ starting from $y_C$ in $C$ and attach this entire path to $T$ (instead of only $y_C$). Suppose this splits the component $C$ into components $C_1, C_2, \ldots$ of $C \setminus P$. Now, each $C_i$ would have an edge to at least one vertex on $P$ (instead of necessarily the leaf, as in Section 3.1) since $C$ was a connected component. Hence in this algorithm, for each $C_i$ we find the vertex $x_i$ which is the lowest vertex of $T$ (or $P$) to which an edge from $C_i$ is incident. Observe that $x_i$ is unique since all the neighbours of $C_i$ in $T$ are along one path from the root to a leaf. Using the components property, the selection of $x_i$ as the parent of the root of the subtree to be computed for $C_i$ ensures that invariant $I_2$ continues to hold. Thus, in each pass, from every component of the unvisited graph, we shall extract a path and add it to the DFS tree $T$.

This approach thus allows $T$ to grow by more than one vertex in each pass, which is essential for completing the tree in $o(n)$ passes. If in each pass we add a path of length at least $k$ from each component of $G'$, then the tree will grow by at least $k$ vertices in each pass, requiring overall $\lceil n/k \rceil$ passes to completely build the DFS tree. We shall now present an important property of any DFS tree of an undirected graph, which ensures that in each pass we can find a path of length at least $k$ (refer to Appendix A for the proof).

Lemma 4.1 (Min-Height Property).

Given a connected undirected graph $G$ having $m$ edges, any DFS tree of $G$ from any root vertex necessarily has a height $h \ge m/n$.
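One way to see why a bound of the form $h \ge m/n$ holds is the following short counting sketch (it is not the proof from Appendix A, and it assumes $G$ has no parallel edges or self-loops). In a DFS tree every edge of $G$, whether a tree edge or a back edge, joins a vertex to one of its proper ancestors, and a vertex at level $\ell$ has exactly $\ell$ proper ancestors. Charging each edge to its lower endpoint gives:

```latex
m \;\le\; \sum_{v \in V} \mathrm{level}(v) \;\le\; n \cdot h
\quad\Longrightarrow\quad
h \;\ge\; \frac{m}{n}.
```

In particular, a connected subgraph on at most $n$ vertices that stores $nk$ edges forces every one of its DFS trees to have height at least $k$.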

4.1 Algorithm

We shall now describe our algorithm to compute a DFS tree $T$ of the input graph in $\lceil n/k \rceil$ passes. Let the maximum space allowed for computation in the semi-streaming model be $O(nk)$. The algorithm is a recursive procedure that computes a DFS tree of a component $C$ from a root $r_C$. For each component $C$ we maintain a spanning tree $T_C$ of $C$. Initially we can perform a single pass over $E$ to compute a spanning tree of the connected graph $G$ (recall the assumption in Section 2) using the Union-Find algorithm. For each of the remaining components, its spanning tree would already have been computed and passed as an input to the algorithm.

We initiate a pass over the edges in $E$ and store the first $nk$ edges (if possible) from the component $C$ in memory as the subgraph $E'_C$. Before proceeding with the remaining stream, we use any algorithm for computing a DFS tree $T'_C$ rooted at $r_C$ in the subgraph containing the edges from $T_C$ and $E'_C$. Note that adding $T_C$ to $E'_C$ is important to ensure that the subgraph induced by $E'_C$ is connected. In case the pass was completed before $E'_C$ reached $nk$ stored edges, $T'_C$ is indeed a DFS tree of $C$ and we directly add it to $T$. Otherwise, we find the longest path $P$ in $T'_C$ starting from $r_C$, i.e., the path from $r_C$ to the farthest leaf. The path $P$ is then added to $T$.

Now, we need to compute the connected components of $C \setminus P$ and the new corresponding root for each such component. We use the Union-Find algorithm to compute these components, say $C_1, \ldots, C_f$, and compute the lowest edge $e_i$ from each $C_i$ on the path $P$. Clearly, there exists such an edge as $C$ was connected. In order to find these components and edges, we need to consider all the edges in $E_C$, which can be done by first considering $E'_C$ and then each edge from $E_C \setminus E'_C$ in the remainder of the input stream of the pass. Refer to Appendix E for the pseudocode of the algorithm.

Using the components property, choosing the new root $y_i$ corresponding to the lowest edge $e_i$ ensures that the invariant $I_2$, and hence $I_1$, is satisfied. Now, in case $|E_C| < nk$, the entire DFS tree of $C$ is constructed and added to $T$ in a single pass. Otherwise, in each pass we add the longest path $P$ from $T'_C$ to the final DFS tree $T$. Since $|E'_C| = nk$ and $C$ is a single connected component with at most $n$ vertices, the min-height property ensures that the height of any such $T'_C$ (and hence the length of $P$) is at least $k$. Since in each pass, except the last, we add at least $k$ new vertices to $T$, this algorithm terminates in at most $\lceil n/k \rceil$ passes. Now, the total time required to find the components of the unvisited graph is again $O(m\alpha(m,n))$. The remaining operations clearly require $O(|E_C|)$ time for a component $C$, requiring overall $O(m)$ time. Thus, we get the following theorem.

Theorem 1.1.

Given an undirected graph $G$, a DFS tree of $G$ can be computed by a semi-streaming algorithm in at most $\lceil n/k \rceil$ passes using $O(nk)$ space, requiring $O(m\alpha(m,n))$ time per pass.
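The in-memory step of a pass, computing a DFS tree of the stored subgraph and extracting its longest root-to-leaf path, can be sketched as follows. This is a minimal illustration assuming the spanning tree and the stored edges are given as plain edge lists; the function name and parameters are hypothetical, and the rest of the pass (recomputing components and lowest edges from the remaining stream) is not shown.

```python
def longest_root_path(root, tree_edges, stored_edges):
    """DFS over spanning tree plus stored edges; return the deepest root path."""
    adj = {root: []}
    for u, v in list(tree_edges) + list(stored_edges):
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)

    parent, depth = {root: None}, {root: 0}
    stack = [(root, iter(adj[root]))]  # iterative DFS, resumable per vertex
    while stack:
        x, neighbours = stack[-1]
        for y in neighbours:
            if y not in parent:
                parent[y], depth[y] = x, depth[x] + 1
                stack.append((y, iter(adj[y])))
                break  # descend into y before scanning x any further
        else:
            stack.pop()  # every neighbour of x is visited: retreat

    v = max(depth, key=depth.get)  # a deepest vertex of the DFS tree
    path = []
    while v is not None:
        path.append(v)
        v = parent[v]
    return path[::-1]  # listed from the root downwards
```

Including the spanning tree edges keeps the searched subgraph connected, so the DFS reaches every vertex of the component even when the stored edges alone would not.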

Remark 4.1.

Since the algorithm of Section 4.1 adds an ancestor-descendant path for each component of $G'$, it might seem that the analysis of the algorithm is not tight for computing DFS trees with $o(n)$ height. However, there exists a sequence of input edges for which the algorithm indeed takes $\Omega(n/k)$ passes to compute a DFS tree with $o(n)$ height (see Appendix B).

5 Final algorithm

We shall now further improve the algorithm so that the required number of passes reduces to $\lceil h/k \rceil$ while it continues to use $O(nk)$ space, where $h$ is the height of the computed DFS tree and $k$ is any positive integer. To understand the main intuition behind our approach, let us recall the previously described algorithms. We first described a simple algorithm (in Section 3) in which every pass over the input stream adds one new vertex as the child of some leaf of $T$, which was improved (in Section 3.1) to simultaneously add all vertices which are children of the leaves of $T$ in the final DFS tree. We then presented another algorithm (in Section 4) in which every pass over the input stream adds one ancestor-descendant path of length $k$ or more, from each component of $G'$ to $T$. We shall now improve it by adding all the subtrees constituting the next $k$ levels of the final DFS tree starting from the leaves of the current tree $T$ (or fewer than $k$ levels if the corresponding component of $G'$ is exhausted).

Now, consider any component $C$ of $G'$. Let $y_C \in C$ be a vertex having an edge $e$ to a leaf of the partially built DFS tree $T$. The computation of $T$ can be completed by computing a DFS tree of $C$ from the root $y_C$, which can be directly attached to $T$ using $e$. However, computing the entire DFS tree of $C$ may not be possible in a single pass over the input stream, due to the limited storage space available. Thus, using $O(nk)$ space we compute a special spanning tree $T_C$ for each component $C$ of $G'$ in parallel, such that the top $k$ levels of $T_C$ are the same as the top $k$ levels of some DFS tree of $C$. As a result, in the $i$-th pass all vertices on the levels $(i-1) \cdot k + 1$ to $i \cdot k$ of the final DFS tree are added to $T$. This essentially adds a tree $T'_C$ representing the top $k$ levels of $T_C$ for each component $C$ of $G'$. This ensures that our algorithm will terminate in $\lceil h/k \rceil$ passes, where $h$ is the height of the final DFS tree. Further, this special tree $T_C$ also ensures an additional property, i.e., there is a one to one correspondence between the set of trees of $T_C \setminus T'_C$ and the components of $C \setminus T'_C$. In fact, each tree of $T_C \setminus T'_C$ is a spanning tree of the corresponding component. This property directly provides the spanning trees of the components of $G'$ in the next pass.

Special spanning tree $T_C$

We shall now describe the properties of this special tree $T_C$ (and hence $T'_C$) which is computed in a single pass over the input stream. For $T'_C$ to be added to the DFS tree $T$ of the graph, a necessary and sufficient condition is that $T'_C$ satisfies the invariants $I_1$ and $I_2$ at the end of the pass. To achieve this we maintain $T_C$ to be a spanning tree of $C$, such that these invariants are maintained by the corresponding $T'_C$ throughout the pass as the edges are processed. Let $E'_C$ be the set of edges already visited during the current pass which have both endpoints in $C$. In order to satisfy $I_1$, no edge in $E'_C$ should be a cross edge in $T'_C$, i.e., no edge having both endpoints in the top $k$ levels of $T_C$ is a cross edge. In order to satisfy $I_2$, no edge in $E'_C$ from any component $C' \in C \setminus T'_C$ to $T'_C$ should be a cross edge in $T_C$. Hence, using the additional property of $T_C$, each edge from a tree $\tau$ in $T_C \setminus T'_C$ to $T'_C$ is necessarily a back edge. This is captured by the two conditions of invariant $I_T$ given below. Hence $I_T$ should hold after processing each edge in the pass. Observe that any spanning tree $T_C$ trivially satisfies $I_T$ at the beginning of the pass, as $E'_C = \emptyset$.

Invariant $I_T$: $T_C$ is a spanning tree of $C$ with the top $k$ levels being $T'_C$ such that:

  • $I_{T1}$: All non-tree edges of $E'_C$ having both endpoints in $T'_C$ are back edges.

  • $I_{T2}$: For each tree $\tau$ in $T_C \setminus T'_C$, all the edges of $E'_C$ from $\tau$ to $T'_C$ are back edges.

Thus, $I_T$ is the local invariant maintained by $T_C$ during the pass, so that the global invariants $I_1$ and $I_2$ are maintained throughout the algorithm. Now, in order to compute $T_C$ (and hence $T'_C$) satisfying the above invariant, we store a subset of $E'_C$ along with $T_C$. Let $H_C$ denote the (spanning) subgraph of $C$ formed by $T_C$ along with these additional edges. Note that all the edges of $E'_C$ cannot be stored in $H_C$ due to the space limitation of $O(nk)$. Since each pass starts with the spanning tree $T_C$ of $C$ and $E'_C = \emptyset$, initially $H_C = T_C$. As the successive edges of the stream are processed, $H_C$ is updated if the input edge belongs to the component $C$. We now formally describe $H_C$ and its properties.

Spanning subgraph $H_C$

As described earlier, at the beginning of a pass, for every component $C$ of $G'$, $H_C = T_C$. Now, the role of $H_C$ is to facilitate the maintenance of the invariant $I_T$. In order to satisfy $I_{T1}$ and $I_{T2}$, we store in $H_C$ all the edges in $E'_C$ that are incident on at least one vertex of $T'_C$. Therefore, $H_C$ is the spanning tree $T_C$ along with every edge in $E'_C$ which has at least one endpoint in $T'_C$. Thus, $H_C$ satisfies the following invariant throughout the algorithm.

Invariant $I_H$: $H_C$ comprises $T_C$ and all edges from $E'_C$ that are incident on at least one vertex of $T'_C$.
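The stored subgraph described by this invariant can be illustrated with a small filter. The sketch below assumes the spanning tree is given as hypothetical `parent` and `level` dictionaries (levels measured from the root of the component's tree, starting at 0) and treats the top $k$ levels as the vertices with level below `k`; the function name `build_H` is made up for the example.

```python
def build_H(parent, level, k, seen_edges):
    """Return the edge set of the stored subgraph as a set of frozensets.

    Keeps every tree edge, plus each seen edge that has at least one
    endpoint in the top k levels of the tree.
    """
    top = {v for v, lvl in level.items() if lvl < k}
    H = {frozenset((v, p)) for v, p in parent.items() if p is not None}
    for u, v in seen_edges:
        if u in top or v in top:
            H.add(frozenset((u, v)))
    return H
```

Edges with both endpoints below the top $k$ levels are deliberately dropped, which is what keeps the stored subgraph small.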

We shall now describe a few properties of $H_C$ and then, in the following section, show that maintaining $I_H$ for $H_C$ is indeed sufficient to maintain the invariant $I_T$ as the stream is processed. The following properties of $H_C$ are crucial to establish the correctness of our procedure to maintain $T_C$ and $H_C$, and to establish a bound on the total space required by $H_C$ (see Appendix A for proofs).

Lemma 5.1.

$T_C$ is a valid DFS tree of $H_C$.

Lemma 5.2.

The total number of edges in $H_C$, for all the components $C$ of $G'$, is $O(nk)$.

5.1 Processing of Edges

We now describe how $T_C$ and $H_C$ are maintained while processing the edges of the input stream such that $I_T$ and $I_H$ are satisfied. Since our algorithm maintains the invariants $I_1$ and $I_2$ (because of $I_T$), we know that any edge whose endpoints are not both in some component $C$ of $G'$ is either a back edge or already a tree edge in $T$. Thus, we shall only discuss the processing of an edge $(x, y)$ having both endpoints in $C$ (now added to $E'_C$), where $level(x) \le level(y)$.

  1. If $x \in T'_C$ then the edge is added to $H_C$ to ensure $I_H$. In addition, if $(x, y)$ is a cross edge in $T_C$ it violates either $I_{T1}$ (if $y \in T'_C$) or $I_{T2}$ (if $y \notin T'_C$). Thus, $T_C$ is required to be restructured to ensure that $I_T$ is satisfied.

  2. If $x \notin T'_C$ and if $x$ and $y$ belong to different trees in $T_C \setminus T'_C$, then it violates $I_{T2}$. Again in such a case, $T_C$ is required to be restructured to ensure that $I_T$ is satisfied.
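The two cases above can be sketched as a small classifier that reports, for an incoming edge inside a component, whether it must be stored and whether the tree needs restructuring. This is an illustrative sketch only: the spanning tree is assumed to be given as hypothetical `parent` and `level` dictionaries, the top $k$ levels are taken to be the vertices with level below `k`, and the restructuring itself is not shown.

```python
def classify_edge(parent, level, k, x, y):
    """Return (store_edge, needs_restructuring) for an edge (x, y)."""
    def climb(v, target_level):
        # Walk up from v until reaching the given level.
        while level[v] > target_level:
            v = parent[v]
        return v

    def is_back_edge(a, b):
        # True when the higher endpoint is an ancestor of the lower one.
        if level[a] > level[b]:
            a, b = b, a
        return climb(b, level[a]) == a

    if level[x] < k or level[y] < k:
        # Case 1: an endpoint lies in the top k levels, so the edge is stored;
        # a cross edge additionally forces restructuring.
        return True, not is_back_edge(x, y)
    # Case 2: both endpoints are below the top k levels; only an edge joining
    # two different hanging trees (distinct level-k ancestors) matters.
    return False, climb(x, k) != climb(y, k)
```

Edges inside a single hanging tree fall through with `(False, False)`: they are neither stored nor acted upon during the current pass.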

Note that after restructuring $T_C$ we need to update $H_C$ such that $I_H$ is satisfied. Consequently any non-tree edge in $H_C$ that was incident on a vertex in the original $T'_C$ has to be removed from $H_C$ if none of its endpoints are in $T'_C$ after restructuring $T_C$, i.e., one or both of its endpoints have moved out of $T'_C$. But a problem arises if a vertex moves into $T'_C$ during restructuring. There might have been edges incident on such a vertex in $E'_C$ which were not stored in $H_C$. In this case we need these edges in $H_C$ to satisfy $I_H$, which is not possible without visiting $E'_C$ again. This problem can be avoided if our restructuring procedure ensures that no new vertex enters $T'_C$. This can be ensured if the restructuring procedure follows the property of monotonic fall, i.e., the level of a vertex is never decreased by the procedure. Let $e$ be the new edge of component $C$ in the input stream. We shall show that in order to preserve the invariants $I_T$ and $I_H$ it is sufficient that the restructuring procedure maintains the property of monotonic fall and ensures that the restructured $T_C$ is a DFS tree of $H_C + e$.
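The property of monotonic fall, and the reason it prevents vertices from entering the top levels, can be checked mechanically. The following is a minimal sketch assuming the levels before and after a restructuring step are available as hypothetical dictionaries `old_level` and `new_level`; both helper names are illustrative.

```python
def satisfies_monotonic_fall(old_level, new_level):
    """True when no vertex ends up at a smaller level than before."""
    return all(new_level[v] >= old_level[v] for v in old_level)

def top_levels(level, k):
    """Vertices in the top k levels (levels 0 to k - 1)."""
    return {v for v, lvl in level.items() if lvl < k}
```

Whenever `satisfies_monotonic_fall` holds, `top_levels(new_level, k)` is a subset of `top_levels(old_level, k)`, so no previously discarded edge can become relevant again.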

Lemma 5.3.

On insertion of an edge , any restructuring procedure which updates to be a valid DFS tree of ensuring monotonic fall, satisfies the invariants and .

Proof.

The property of monotonic fall ensures that the vertex set of the new is a subset of the vertex set of the previous . Using we know that any edge of which is not present in must have both its endpoints outside . Hence, monotonic fall guarantees that continues to hold with respect to the new for the edges in . Additionally, we save in the new if at least one of its endpoints belongs to the new , ensuring that holds for the entire .

Since the restructuring procedure ensures that the updated is a DFS tree of , the invariant trivially holds as a result of . In order to prove the invariant , consider any edge from a tree to . Clearly, it will satisfy if , as is a DFS tree of . In case , it must be internal to some tree in the original (using in the original ). We shall now show that such an edge will remain internal to some tree in the updated as well, thereby not violating . Clearly the endpoints of cannot be in the updated due to the property of monotonic fall.

Assume that the endpoints of belong to different trees of updated . Now, consider the edges on the tree path in connecting the endpoints of . Since the entire tree path is in , the endpoints of each are not in original , ensuring that they are also not in the updated (by monotonic fall). Since the endpoints of (and hence the endpoints of the path ) are in different trees in updated , there must exist some which also has endpoints belonging to different trees of updated . This makes a cross edge of the updated . Since is a tree edge of original , it belongs to and hence being a cross edge implies that the updated is not a DFS tree of , which is a contradiction. Hence has both its endpoints in the same tree of the updated , ensuring that holds after the restructuring procedure. ∎

Hence, any procedure to restructure a DFS tree of the subgraph on insertion of a cross edge , that upholds the property of monotonic fall and returns a new which is a DFS tree of , can be used as a black box in our algorithm. One such algorithm is the incremental DFS algorithm by Baswana and Khan [10], which precisely fulfils our requirement. They proved the total update time of the algorithm to be . They also showed that any algorithm maintaining incremental DFS while abiding by monotonic fall would require time even for sparse graphs, if it explicitly maintains the DFS tree. If the height of the DFS tree is known to be , these bounds reduce to and respectively, where is the number of edges processed by the algorithm. Refer to Appendix C for a brief description of the algorithm.

5.2 Algorithm

We now describe the details of our final algorithm, which uses the restructuring procedure of Baswana and Khan [10] (described in Appendix C) for restructuring the DFS tree when a cross edge is inserted. Similar to the algorithm in Section 4, for each component of , a rooted spanning tree of the component is required as an input to the procedure having the root .

Initially and has a single component , as is connected (recall the assumption in Section 2). Hence for the first pass, we compute a spanning tree of using the Union-Find algorithm. Subsequently in each pass we directly get a spanning tree for each component of the new , which is the corresponding tree in , where is the component containing in the previous pass. Also, observe that the use of these trees as the new ensures that the level of no vertex ever rises in the context of the entire tree . This implies that the level of any vertex starting with the initial spanning tree never rises, i.e., the entire algorithm satisfies the property of monotonic fall. We will use this fact crucially in the analysis of the time complexity.
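The first-pass computation of a rooted spanning tree with Union-Find can be sketched as follows. This is only an illustrative sketch: the function name `spanning_tree_from_stream`, the integer vertex labels `0..n-1`, and the choice of vertex `0` as root are assumptions made for the example, and union by rank is omitted for brevity.

```python
def spanning_tree_from_stream(n, edge_stream, root=0):
    """Single pass over the stream, storing only O(n) words.

    Returns parent pointers of a spanning tree rooted at `root`
    (hypothetical helper; assumes the streamed graph is connected).
    """
    leader = list(range(n))

    def find(v):
        # Path halving keeps the union-find trees shallow.
        while leader[v] != v:
            leader[v] = leader[leader[v]]
            v = leader[v]
        return v

    adj = [[] for _ in range(n)]
    for u, v in edge_stream:
        ru, rv = find(u), find(v)
        if ru != rv:
            # The edge joins two components, so keep it as a tree edge.
            leader[ru] = rv
            adj[u].append(v)
            adj[v].append(u)

    # Orient the stored tree edges away from the root.
    parent = [None] * n
    seen = [False] * n
    seen[root] = True
    stack = [root]
    while stack:
        u = stack.pop()
        for w in adj[u]:
            if not seen[w]:
                seen[w] = True
                parent[w] = u
                stack.append(w)
    return parent
```

Only the at most n-1 edges that merge two components are ever stored, so the sketch stays within the semi-streaming space budget.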

As described earlier, we process the edges of the stream by updating the and maintaining and respectively. In case the edge is internal to some tree in (i.e., has both endpoints in the same tree in ), we simply ignore the edge. Otherwise, we add it to to satisfy . Further, the restructuring procedure of Appendix C maintains to be a DFS tree of , which restructures if the processed edge is added to and is a cross edge in . Now, in case is updated we also update the subgraph , by removing the extra non-tree edges having both endpoints in . After the pass is completed, we attach (the top levels of ) to . Now, ensures that each tree of forms the (rooted) spanning tree of the components of the new , and hence can be used for the next pass. Refer to Appendix E for the pseudocode of the algorithm.

5.3 Correctness and Analysis

The correctness of our algorithm follows from Lemma 5.3, which ensures that invariants and (and hence and ) are maintained as a result of using the restructuring procedure of Appendix C, which ensures monotonic fall of vertices. The total space used by our algorithm and the restructuring procedure is dominated by the cumulative size of for all components of , which is using Lemma 5.2. Now, in every pass of the algorithm, a DFS tree for each component of height is attached to . These trees collectively constitute the next levels of the final DFS tree . Therefore, the entire tree is computed in passes.

Let us now analyse the time complexity of our algorithm. In the first pass time is required to compute the spanning tree using the Union-Find algorithm. Also, in each pass time is required to process the input stream. Further, in order to update we are required to delete edges having both endpoints out of . Hence, whenever a vertex falls below the level, the edges incident on it are checked for deletion from (if the other endpoint is also not in ). Total time required for this is per pass. Now, Appendix C describes the details of the restructuring procedure, which maintains the DFS tree in total time, where , for processing the entire input stream in each pass.

Finally, we need to efficiently answer the query whether an edge is internal to some tree in . For this we maintain for each vertex its ancestor at level as , i.e., is the root of the tree in that contains . If , then . For an edge comparing the and efficiently answers the required query in time. However, whenever is updated we need to update for each vertex in the modified part of , requiring time per vertex in the modified part of . We shall bound the total work done to update of such a vertex throughout the algorithm to as follows.

Consider the potential function . Whenever some part of is updated, each vertex in the modified necessarily incurs a fall in its level (due to monotonic fall). Thus, the cost of updating throughout the algorithm is proportional to the number of times descends in the tree, hence increases the value of by at least one unit. Hence, updating for all in the modified part of can be accounted by the corresponding increase in the value of . Clearly, the maximum value of is , since the level of each vertex is always less than , where is the height of the computed DFS tree. Thus, the total work done to update for all is . This proves our main theorem described in Section 1.1 which is stated as follows.

Theorem 1.

Given an undirected graph , a DFS tree of can be computed by a semi-streaming algorithm using passes using space requiring amortized time per pass for any integer , where is the height of the computed DFS tree.

Remark: Note that the time complexity of our algorithm is indeed tight for our framework. Since our algorithm requires passes and any restructuring procedure following monotonic fall requires time, each pass would require time.
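The query used in the analysis above (whether an edge is internal to some tree below the top levels) can be sketched as follows. This is a minimal sketch under assumptions: `cut_level` is a hypothetical parameter standing for the level at which the lower trees are rooted, vertices above that level are mapped to themselves, and the names `tree_roots` and `is_internal` are illustrative only. The sketch recomputes all roots from scratch, whereas the analysis above updates them only in the modified part of the tree.

```python
def tree_roots(parent, cut_level):
    """For each vertex, compute its level and its ancestor at `cut_level`.

    `parent` maps every vertex to its parent (None for the root).
    Vertices at or above `cut_level` are mapped to themselves (an assumption).
    """
    level, root_of = {}, {}
    for start in parent:
        # Walk up until a vertex with a known level (or the root) is reached.
        chain = []
        v = start
        while v is not None and v not in level:
            chain.append(v)
            v = parent[v]
        # Fill in levels and roots top-down along the collected chain.
        for u in reversed(chain):
            p = parent[u]
            level[u] = 0 if p is None else level[p] + 1
            root_of[u] = u if level[u] <= cut_level else root_of[p]
    return level, root_of


def is_internal(u, v, level, root_of, cut_level):
    """True if both endpoints lie in the same tree rooted at `cut_level`."""
    return (level[u] >= cut_level and level[v] >= cut_level
            and root_of[u] == root_of[v])
```

Once the two tables are available, each query costs a constant number of dictionary lookups.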

6 Experimental Evaluation

Most streaming algorithms deal with only space, for which our advanced algorithms improve over the simple algorithms theoretically by just constant factors. However, their empirical performance demonstrates their significance in real-world applications. The evaluation of our algorithms on random and real graphs shows that in practice these algorithms require merely a few passes even when allowed to store just edges. The results of our analysis can be summarized as follows (for details refer to Appendix D).

The two advanced algorithms kPath (the algorithm of Section 4) and kLev (the algorithm of Section 5 with an additional heuristic) perform much better than the rest even when space is allowed. For both random and real graphs, kPath performs slightly worse as the density of the graph increases. On the other hand, kLev performs slightly better with increasing density, but only in random graphs. The effect of the space parameter is very large on kPath from to small constants, requiring very few passes even for and . However, kLev seems to work very well even for , and increasing the value of has a negligible effect on it. Overall, the results suggest using kPath if space is allowed for being a small constant such as or . However, if the space restrictions are extremely tight it is better to use kLev.

7 Conclusion

We presented the first o(n) pass semi-streaming algorithm for computing a DFS tree for an undirected graph, breaking the long standing presumed barrier of n passes. In our streaming model we assume that O(nk) local space is available for computation, where k is any natural number. Our algorithm computes a DFS tree in n/k passes. We improve our algorithm to require only h/k passes without any additional space requirement, where h is the height of the final tree. This improvement becomes significant for graphs having shallow DFS trees. Moreover, our algorithm is described as a framework using a restructuring algorithm as a black box. This allows more flexibility to extend our algorithm for solving other problems requiring a computation of DFS tree in the streaming environment.

Recently, in a major breakthrough Elkin [20] presented the first pass algorithm for computing Shortest Paths Tree from a single source. Using local space, it computes the shortest path tree from a given source in passes for unweighted graphs and in passes for weighted graphs.

Despite the fact that these breakthroughs provide only minor improvements (typically factors), they are significant steps toward a better understanding of such fundamental problems in the streaming environment. These simple improvements come decades after the emergence of streaming algorithms for graph problems, where such problems were considered implicitly hard in the semi-streaming environment. We thus believe that our result is a significant improvement over the known algorithm for computing a DFS tree in the streaming environment, and it can be a useful step in more involved algorithms that require the computation of a DFS tree.

Moreover, the experimental evaluation of our algorithms revealed exceptional performance of the advanced algorithms kPath and kLev (greatly affected by the additional heuristic). Thus, it would be interesting to further study these algorithms theoretically which seem to work extremely well in practice.

References

  • [1] Hamsterster full network dataset – KONECT, September 2016.
  • [2] Kook Jin Ahn and Sudipto Guha. Linear programming in the semi-streaming model with application to the maximum matching problem. Inf. Comput., 222:59–79, 2013.
  • [3] Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. Journal of Computer and System Sciences, 58(1):137–147, 1999.
  • [4] Giorgio Ausiello, Donatella Firmani, and Luigi Laura. Datastream computation of graph biconnectivity: Articulation points, bridges, and biconnected components. In Theoretical Computer Science, 11th Italian Conference, ICTCS 2009, Cremona, Italy, September 28-30, 2009, Proceedings., pages 26–29, 2009.
  • [5] Giorgio Ausiello, Donatella Firmani, and Luigi Laura. Real-time monitoring of undirected networks: Articulation points, bridges, and connected and biconnected components. Networks, 59(3):275–288, 2012.
  • [6] Ziv Bar-Yossef, Ravi Kumar, and D. Sivakumar. Reductions in streaming algorithms, with an application to counting triangles in graphs. In Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’02, pages 623–632, 2002.
  • [7] Surender Baswana. Streaming algorithm for graph spanners - single pass and constant processing time per edge. Inf. Process. Lett., 106(3):110–114, 2008.
  • [8] Surender Baswana, Shreejit Ray Chaudhury, Keerti Choudhary, and Shahbaz Khan. Dynamic DFS in undirected graphs: breaking the O(m) barrier. In ACM-SIAM Symposium on Discrete Algorithms, SODA, pages 730–739, 2016.
  • [9] Surender Baswana, Ayush Goel, and Shahbaz Khan. Incremental DFS algorithms: a theoretical and experimental study. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2018, New Orleans, LA, USA, January 7-10, 2018, pages 53–72, 2018.
  • [10] Surender Baswana and Shahbaz Khan. Incremental algorithm for maintaining a DFS tree for undirected graphs. Algorithmica, 79(2):466–483, 2017.
  • [11] Thijs Beuming, Lucy Skrabanek, Masha Y. Niv, Piali Mukherjee, and Harel Weinstein. PDZBase: A protein–protein interaction database for PDZ-domains. Bioinformatics, 21(6):827–828, 2005.
  • [12] Marián Boguñá, Romualdo Pastor-Satorras, Albert Díaz-Guilera, and Alex Arenas. Models of social networks based on social distance attachment. Phys. Rev. E, 70:056122, Nov 2004.
  • [13] Béla Bollobás. The evolution of random graphs. Transactions of the American Mathematical Society, 286 (1):257–274, 1984.
  • [14] Glencora Borradaile, Claire Mathieu, and Theresa Migler. Lower bounds for testing digraph connectivity with one-pass streaming algorithms. CoRR, abs/1404.1323, 2014.
  • [15] D. E. Boyce, K. S. Chon, M. E. Ferris, Y. J. Lee, K-T. Lin, and R. W. Eash. Implementation and evaluation of combined models of urban travel and location on a sketch planning network. Chicago Area Transportation Study, pages xii + 169, 1985.
  • [16] Adam L. Buchsbaum, Raffaele Giancarlo, and Jeffery Westbrook. On finding common neighborhoods in massive graphs. Theor. Comput. Sci., 299(1-3):707–718, 2003.
  • [17] Eunjoon Cho, Seth A. Myers, and Jure Leskovec. Friendship and mobility: User movement in location-based social networks. In Proc. Int. Conf. on Knowledge Discovery and Data Mining, pages 1082–1090, 2011.
  • [18] R. W. Eash, K. S. Chon, Y. J. Lee, and D. E. Boyce. Equilibrium traffic assignment on an aggregated highway network for sketch planning. Transportation Research Record, 994:30–37, 1983.
  • [19] Michael Elkin. Streaming and fully dynamic centralized algorithms for constructing and maintaining sparse spanners. ACM Trans. Algorithms, 7(2):20:1–20:17, 2011.
  • [20] Michael Elkin. Distributed exact shortest paths in sublinear time. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, Montreal, QC, Canada, June 19-23, 2017, pages 757–770, 2017.
  • [21] P. Erdős and A. Rényi. On the evolution of random graphs. In Publication of the Mathematical Institute of the Hungarian Academy of Sciences, pages 17–61, 1960.
  • [22] Shimon Even and Robert Endre Tarjan. Network flow and testing graph connectivity. SIAM J. Comput., 4(4):507–518, 1975.
  • [23] Martin Farach-Colton, Tsan-sheng Hsu, Meng Li, and Meng-Tsung Tsai. Finding articulation points of large graphs in linear time. In Algorithms and Data Structures, WADS, pages 363–372, Cham, 2015. Springer International Publishing.
  • [24] Joan Feigenbaum, Sampath Kannan, Andrew McGregor, Siddharth Suri, and Jian Zhang. Graph distances in the streaming model: The value of space. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’05, pages 745–754, 2005.
  • [25] Joan Feigenbaum, Sampath Kannan, Andrew McGregor, Siddharth Suri, and Jian Zhang. On graph problems in a semi-streaming model. Theor. Comput. Sci., 348(2):207–216, December 2005.
  • [26] Joan Feigenbaum, Sampath Kannan, Martin J. Strauss, and Mahesh Viswanathan. An approximate l1-difference algorithm for massive data streams. SIAM J. Comput., 32(1):131–151, January 2003.
  • [27] Christiane Fellbaum, editor. WordNet: an Electronic Lexical Database. MIT Press, 1998.
  • [28] Philippe Flajolet and G. Nigel Martin. Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31(2):182 – 209, 1985.
  • [29] Alan M. Frieze and Michal Karonski. Introduction to Random Graphs. Cambridge University Press, 2015.
  • [30] Pablo M. Gleiser and Leon Danon. Community structure in jazz. Advances in Complex Systems, 6(4):565–573, 2003.
  • [31] Sudipto Guha, Nick Koudas, and Kyuseok Shim. Data-streams and histograms. In Proceedings of the Thirty-third Annual ACM Symposium on Theory of Computing, STOC ’01, pages 471–475, 2001.
  • [32] Venkatesan Guruswami and Krzysztof Onak. Superlinear lower bounds for multipass graph processing. Algorithmica, 76(3):654–683, Nov 2016.
  • [33] Venkatesan Guruswami and Krzysztof Onak. Superlinear lower bounds for multipass graph processing. Algorithmica, 76(3):654–683, 2016.
  • [34] Monika Rauch Henzinger, Prabhakar Raghavan, and Sridhar Rajagopalan. Computing on data streams. In External Memory Algorithms, Proceedings of a DIMACS Workshop, New Brunswick, New Jersey, USA, May 20-22, 1998, pages 107–118, 1998.
  • [35] John E. Hopcroft and Richard M. Karp. An n^{5/2} algorithm for maximum matchings in bipartite graphs. SIAM J. Comput., 2(4):225–231, 1973.
  • [36] John E. Hopcroft and Robert Endre Tarjan. Efficient planarity testing. J. ACM, 21(4):549–568, 1974.
  • [37] Piotr Indyk. Stable distributions, pseudorandom generators, embeddings, and data stream computation. J. ACM, 53(3):307–323, May 2006.
  • [38] Sagar Kale and Sumedh Tirodkar. Maximum matching in two, three, and a few more passes over graph streams. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, APPROX/RANDOM 2017, August 16-18, 2017, Berkeley, CA, USA, pages 15:1–15:21, 2017.
  • [39] Michael Kapralov. Better bounds for matchings in the streaming model. In Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2013, New Orleans, Louisiana, USA, January 6-8, 2013, pages 1679–1697, 2013.
  • [40] Shahbaz Khan. Near optimal parallel algorithms for dynamic DFS in undirected graphs. In Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA 2017, Washington DC, USA, July 24-26, 2017, pages 283–292, 2017.
  • [41] Lasse Kliemann. Engineering a bipartite matching algorithm in the semi-streaming model. In Algorithm Engineering - Selected Results and Surveys, pages 352–378. Springer International Publishing, 2016.
  • [42] Donald E. Knuth. The Art of Computer Programming, Volume 4, Fascicle 0: Introduction to Combinatorial and Boolean Functions. Addison-Wesley, 2008.
  • [43] Jérôme Kunegis. KONECT - The Koblenz Network Collection. http://konect.uni-koblenz.de/networks/, October 2016.
  • [44] Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. Graph evolution: Densification and shrinking diameters. ACM Trans. Knowledge Discovery from Data, 1(1):1–40, 2007.
  • [45] Julian McAuley and Jure Leskovec. Learning to discover social circles in ego networks. In Advances in Neural Information Processing Systems, pages 548–556. 2012.
  • [46] Andrew McGregor. Finding graph matchings in data streams. In Approximation, Randomization and Combinatorial Optimization, Algorithms and Techniques, 8th International Workshop on Approximation Algorithms for Combinatorial Optimization Problems, APPROX 2005 and 9th International Workshop on Randomization and Computation, RANDOM 2005, Berkeley, CA, USA, August 22-24, 2005, Proceedings, pages 170–181, 2005.
  • [47] Andrew McGregor. Graph stream algorithms: A survey. SIGMOD Rec., 43(1):9–20, May 2014.
  • [48] Shan Muthukrishnan. Data streams: Algorithms and applications. Foundations and Trends in Theoretical Computer Science, 1(2):117–236, 2005.
  • [49] Thomas C. O’Connell. A survey of graph algorithms under extended streaming models of computation. In Fundamental Problems in Computing: Essays in Honor of Professor Daniel J. Rosenkrantz, pages 455–476, 2009.
  • [50] Jan Matthias Ruhl. Efficient algorithms for new computational models. PhD Thesis, Department of Computer Science, MIT, Cambridge, MA, 2003.
  • [51] Robert Endre Tarjan. Depth-first search and linear graph algorithms. SIAM J. Comput., 1(2):146–160, 1972.
  • [52] Robert Endre Tarjan. Finding dominators in directed graphs. SIAM J. Comput., 3(1):62–89, 1974.
  • [53] Robert Endre Tarjan. Efficiency of a good but not linear set union algorithm. J. ACM, 22(2):215–225, April 1975.
  • [54] Robert Endre Tarjan. Edge-disjoint spanning trees and depth-first search. Acta Inf., 6:171–185, 1976.
  • [55] Robert Endre Tarjan and Jan van Leeuwen. Worst-case analysis of set union algorithms. J. ACM, 31(2):245–281, 1984.
  • [56] Jeffery Westbrook and Robert Endre Tarjan. Maintaining bridge-connected and biconnected components on-line. Algorithmica, 7(5&6):433–464, 1992.
  • [57] Jaewon Yang and Jure Leskovec. Defining and evaluating network communities based on ground-truth. In Proc. ACM SIGKDD Workshop on Mining Data Semantics, page 3. ACM, 2012.
  • [58] R. Zafarani and H. Liu. Social computing data repository at ASU, 2009.
  • [59] Jian Zhang. A survey on streaming algorithms for massive graphs. In Managing and Mining Graph Data, pages 393–420. Springer US, 2010.

Appendix A Omitted Proofs

Lemma 4.1 (Min-Height Property).

Given a connected undirected graph having edges, any DFS tree of from any root vertex necessarily has a height .

Proof.

We know that each non-tree edge in a DFS tree of an undirected graph is a back edge. We shall associate each edge with its lower endpoint. Thus, in a DFS tree each vertex will be associated with a tree edge to its parent and back edges only to its ancestors. Now, each vertex can have only ancestors as the height of the DFS tree is . Hence each vertex has only edges associated with it, resulting in less than edges, i.e., or . Note that it is important for the graph to be connected; otherwise, from some root the corresponding component, and hence its DFS tree, can be much smaller. ∎
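The counting argument in this proof can be checked numerically on small graphs. The sketch below is illustrative only: it assumes the bound takes the form m ≤ n·h (each of the n vertices is charged at most h edges, where h is the height and m the number of edges), and the names `dfs_height` and `satisfies_min_height` are hypothetical.

```python
def dfs_height(adj, root):
    """Run an iterative DFS from `root` and return the height of the DFS tree."""
    level = {root: 0}
    stack = [(root, iter(adj[root]))]
    height = 0
    while stack:
        v, neighbours = stack[-1]
        for w in neighbours:
            if w not in level:
                # First unvisited neighbour becomes a child of v.
                level[w] = level[v] + 1
                height = max(height, level[w])
                stack.append((w, iter(adj[w])))
                break
        else:
            # All neighbours of v are visited, so backtrack.
            stack.pop()
    return height


def satisfies_min_height(adj, root):
    """Check m <= n * h for a connected undirected graph (assumed form of the bound)."""
    n = len(adj)
    m = sum(len(neighbours) for neighbours in adj.values()) // 2
    return m <= n * dfs_height(adj, root)
```

For instance, a complete graph on five vertices forces a DFS tree that is a single path, while a star has height one and only n-1 edges.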

Lemma 5.1.

is a valid DFS tree of .

Proof.

In order to prove this claim it is sufficient to prove that all the non-tree edges stored in are back edges in , i.e., the endpoints of every such edge share an ancestor-descendant relationship. Now, invariant ensures that any edge in having both endpoints in is a back edge. And invariant ensures that any edge between a vertex in and is a back edge. Hence, all the non-tree edges incident on (and hence all non-tree edges in ) are back edges, proving our lemma. ∎

Lemma 5.2.

The total number of edges in , for all the components of , is .

Proof.

The size of can be analysed using invariant as follows. The number of tree edges in (and hence in ) is . The non-tree edges stored by have at least one endpoint in . Using Lemma 5.1 we know that all these edges are back edges. To bound the number of such edges let us associate each non-tree edge to its lower endpoint. Hence each vertex will be associated to at most non-tree edges to its ancestors in (recall that is the top levels of ). Thus, stores tree edges and non-tree edges, i.e., total edges. Since , the total number of edges in is . ∎

Appendix B Tightness of analysis of the kPath algorithm (Section 4)

We now describe a worst case example proving the tightness of the analysis of the kPath algorithm of Section 4. To prove this we present a sequence of edges in the input stream, such that in each pass the algorithm extracts the longest path of the DFS tree, which is computed using the first edges of the input stream that are internal to the component. We shall show that the algorithm will require passes to build the DFS tree for such a graph where the height of the tree is . Note that the amortized time required by the algorithm in every pass is clearly optimal up to factors, hence the focus here is merely on the number of passes.

Figure 2: Example to prove the tightness of analysis of the kPath algorithm (Section 4).

Consider the tree shown in Figure 2 (i). At the beginning of every pass of the algorithm, the unvisited graph will be connected and the vertices will be classified into four sets and , depending on a parameter to be fixed later on. The set consists of vertices connected in the form of a path. The vertex is also the root of the DFS tree of the component. The set is also connected in the form of a path. The set of vertices are connected to all the vertices in , where . The set contains the remaining vertices of the component each having an edge to . The value of is chosen so that the total number of edges described above becomes exactly the number of edges that can be stored by the procedure, i.e., . This graph of edges will henceforth be called the partial graph for the pass. The value of ensures that the longest path from root to a leaf, is the path connecting to the leaf . Clearly, the only DFS tree possible for the partial graph is shown in Figure 2. Note that the subtree may be computed differently with possibly one vertex of between every two vertices of , but the corresponding analysis remains the same.

Now, the edges described above would appear first in the input edge sequence, and hence would be stored by the kPath algorithm in the first pass, to compute the DFS tree shown in Figure 2 (i). As a result the path connecting with is added to . The next edges in the stream connect to , and all the vertices in , as shown in Figure 2 (ii). This makes sure that the unvisited graph remains connected after the end of this pass, and the lowest edge from this component on is from , making it the root of the DFS tree of the unvisited component. The corresponding spanning tree computed for the next pass is shown in Figure 2 (iii). Notice its similarity with Figure 2 (i), where and . Hence, the above construction can be repeated for each pass adding a path of length in every pass. Also, note that the number of edges in the partial graph would have now decreased as the size of has decreased (by vertices). Hence we may have to add a vertex from to (by accordingly placing the corresponding edges next in the stream) to ensure edges in the partial graph. But will grow much slower as compared to the number of passes, as each addition to adds edges to the partial graph, whereas each pass removes edges from the partial graph. Hence the construction can continue for passes, where the size of grows by in each pass resulting in a DFS tree of height . Thus, we get the following lemma.

Lemma B.1.

There exists a sequence of edges for which the kPath algorithm of Section 4 takes passes to compute a DFS tree of height .

Remark B.1.

Showing the tightness of the number of passes without any restriction on the height of the tree is very easy. Simply order the edges of the graph by the index of the endpoint having the lower index. Every pass would consider all the edges of vertices with the lowest index, which will result in the final DFS tree being a single path of length , computed in passes. However, such an example does not highlight the importance of the kLev algorithm of Section 5, which computes a DFS tree in passes, where is the height of the DFS tree.

Appendix C Restructuring procedure [10]

We now briefly describe the restructuring procedure by Baswana and Khan [10]. For the sake of simplicity, here we only describe how a DFS tree of is restructured on insertion of a cross edge while abiding by monotonic fall, i.e., the DFS tree is restructured such that the level of each vertex only increases. Hence, we will not describe the various optimizations used to achieve the tight bound on total update time. The procedure essentially adds the given cross edge into the DFS tree, restructuring it accordingly. In particular it reverses only a single path in the tree (see Figure 3). However, this causes some back edges of the tree to become cross edges. These edges can be efficiently identified, removed from the graph, and inserted back into the graph iteratively following the same procedure.

Figure 3: Rerooting the tree at and hanging it from . Notice that some back edges may become cross edges (shown dotted) due to this rerooting (reproduced from [10]).

Let us now describe the procedure in detail. A pool of edges is initialized with the inserted edge . Then the edges in are processed until it becomes empty. Each edge where , is first removed from and checked whether it is a cross edge. This is the case exactly when both and are different from their lowest common ancestor (LCA) . In case of a back edge, the edge is simply ignored and the processing of continues. Otherwise, let be the ancestor of (not necessarily proper) that is a child of . Since is a cross edge, both and would surely exist. The DFS tree is then restructured by removing the edge and adding the edge . As a result the entire tree path from to would now hang from which was earlier hanging from , reversing the parent child relation for all the edges on this path. However, as a result of this restructuring several edges of may now become cross edges. In order to maintain as the DFS tree of , these cross edges are collected from and added to the pool of edges , which are then iteratively processed using the same procedure. Thus, the algorithm maintains the following invariant.

Invariant: is a DFS tree of .


Hence, when the list is empty, is a valid DFS tree of . Refer to [10] for the pseudocode of the restructuring procedure. Now, observe that the level of vertices can only increase as a result of the path reversal described above, ensuring monotonic fall. This also ensures the termination of the algorithm as the vertices cannot fall beyond the level . Moreover, the analysis [10] of the procedure ensures that the total work done to restructure (including maintenance of LCA structures etc.) can be associated to constant times the fall in level of vertices (similar to described in Section 5). Since each vertex can only fall by levels, the total fall of vertices is bounded by , where is the height of the final DFS tree. Further, recall that the kLev algorithm of Section 5 also satisfies monotonic fall. Hence, the total time taken by the restructuring procedure to restructure throughout the algorithm across all passes is . However, we also need to account for the time required to process an input edge whenever the procedure is invoked. This requires total time, where is the number of input edges processed by the procedure in all the passes of the algorithm, which results in the following theorem.

Theorem C.1.

Given an undirected graph , its DFS tree can be rebuilt after insertion of cross edges by a procedure abiding by monotonic fall, requiring total time across all invocations of the procedure, where is the height of the computed DFS tree and is the number of input edges processed.

Remark C.1.

Baswana and Khan [10] showed that any algorithm maintaining a DFS tree incrementally abiding monotonic fall, necessarily requires time to maintain the tree explicitly even for sparse graphs. However, if we present the bound in terms of the height of the DFS tree, the corresponding bound reduces to as every algorithm requires time to process each of the input edges. In the streaming environment, where multiple passes over input stream are performed, this bound naturally extends to , where is the number of edges processed during all the passes over the input stream.
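The path-reversal step described in this appendix can be sketched as follows. This is a simplified illustration, not the procedure of [10]: the names `insert_edge`, `parent` and `edges` are assumptions made for the example, levels and ancestors are recomputed naively by walking parent pointers, and the edges that turn into cross edges are found by rescanning all stored edges instead of the efficient bookkeeping used in [10].

```python
def ancestors(parent, v):
    """Return [v, parent(v), ..., root] following parent pointers."""
    chain = [v]
    while parent[chain[-1]] is not None:
        chain.append(parent[chain[-1]])
    return chain


def level(parent, v):
    return len(ancestors(parent, v)) - 1


def is_back_edge(parent, x, y):
    """True if x and y are in an ancestor-descendant relationship."""
    return x in ancestors(parent, y) or y in ancestors(parent, x)


def insert_edge(parent, edges, new_edge):
    """Insert `new_edge` and restructure `parent` into a DFS tree of all edges.

    `parent` must be a single rooted DFS tree of `edges` (a set of pairs);
    both arguments are updated in place.
    """
    pool = [new_edge]
    while pool:
        x, y = pool.pop()
        if is_back_edge(parent, x, y):
            edges.add((x, y))          # back edges need no restructuring
            continue
        if level(parent, x) < level(parent, y):
            x, y = y, x                # make x the deeper endpoint
        above_x = set(ancestors(parent, x))
        path, u = [], y
        while u not in above_x:        # climb from y to the child of the LCA
            path.append(u)
            u = parent[u]
        for i in range(len(path) - 1, 0, -1):
            parent[path[i]] = path[i - 1]   # reverse the tree path
        parent[y] = x                  # hang the reversed path from x
        edges.add((x, y))
        # Naive rescan: edges that became cross edges go back to the pool.
        broken = [e for e in edges if not is_back_edge(parent, *e)]
        edges.difference_update(broken)
        pool.extend(broken)
```

Because the reversed path is always hung from the deeper endpoint, every vertex that moves ends up at a strictly larger level, which is the monotonic fall property that guarantees termination.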

Appendix D Experimental Evaluation

We now perform an experimental evaluation of the algorithms to understand their significance in practice. The main criterion of evaluation is the number of passes required to completely build the DFS tree, instead of the time taken. This makes the evaluation independent of the computing platform, programming environment, and code efficiency, resulting in easier reproduction and verification of this study. For random graphs, the results of each experiment are averaged over several test cases to get the expected behaviour.

A related experimental study was performed by Baswana et al. [9] which analysed different incremental DFS algorithms on random and real graphs. For random graphs, they also presented simple single pass algorithms to build a DFS tree using space. Moreover, they also presented the following property which shall be used to describe some results during the course of our evaluation.

Theorem D.1 (DFS Height Property [9]).

The depth of a DFS tree of a random graph , with is at least

with high probability.
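The DFS height property can also be explored empirically. The following is a minimal sketch, not the code used in this study: it computes the height of one DFS forest of an undirected graph with an explicit stack, so that the height can be measured on sampled random graphs. The function name `dfs_tree_height`, the vertex labelling 0..n-1, and the convention that height counts edges on the deepest root-to-leaf path are assumptions made for illustration.

```python
def dfs_tree_height(n, edges):
    """Height (in edges) of the DFS forest of an undirected graph.

    Vertices are assumed to be labelled 0..n-1. Neighbours are explored in
    the order in which edges are given, so the result is the height of one
    particular DFS forest, not the minimum or maximum over all of them.
    """
    adj = [[] for _ in range(n)]
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)

    visited = [False] * n
    height = 0
    for start in range(n):
        if visited[start]:
            continue
        visited[start] = True
        # Each stack entry holds a vertex and an iterator over its neighbours,
        # so the stack always equals the current root-to-vertex tree path.
        stack = [(start, iter(adj[start]))]
        while stack:
            height = max(height, len(stack) - 1)
            _, neighbours = stack[-1]
            for w in neighbours:
                if not visited[w]:
                    visited[w] = True
                    stack.append((w, iter(adj[w])))
                    break
            else:
                # No unvisited neighbour remains: backtrack.
                stack.pop()
    return height
```

Averaging `dfs_tree_height` over several sampled graphs with the same number of vertices and edges gives an estimate of the expected height that can be compared against the bound stated in the theorem.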

We shall also be using the following properties regarding the thresholds of the phase transition phenomenon of Random Graphs to describe the performance of algorithms.

Theorem D.2 (Connectivity Threshold [29]).

Graph with is connected with probability at least , for any constant .

Theorem D.3 (Giant Component Threshold [29]).

For a graph with for any constant , the graph contains a single giant component having vertices and edges, while the residual components have size at most .

D.1 Datasets

In our experiments we considered the following two types of datasets.

  • Random Graphs: The initial graph is the star graph, formed by adding an edge from a dummy vertex to each vertex (recall Section 2). The update sequence is generated based on the Erdős–Rényi model [13, 21] by choosing the first edges of a random permutation of all the edges in the graph.

  • Real graphs: We use several publicly available undirected graphs from the real world. We derived these graphs from the KONECT dataset [43]. These datasets are of different types, namely, online social networks (HM [1], Apgp [12], BrightK [17], LMocha [58], FlickrE [45], Gowalla [17]), human networks (AJazz [30], ArxAP [44], Dblp [57]), recommendation networks (Douban [58], Amazon [57]), infrastructure (CU [42], CH [18, 15]), autonomous systems (AsCaida [44]), lexical words (Wordnet [27]) and protein base (Mpdz [11]).
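The random graph construction described above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the generator used in the experiments: the function names `random_edge_stream` and `with_dummy_root`, the vertex labelling 0..n-1, and the use of label n for the dummy vertex are all hypothetical.

```python
import itertools
import random


def random_edge_stream(n, m, seed=None):
    """Return the first m edges of a uniformly random permutation of all
    n*(n-1)/2 possible undirected edges on vertices 0..n-1."""
    all_edges = list(itertools.combinations(range(n), 2))
    if m > len(all_edges):
        raise ValueError("m exceeds the number of possible edges")
    rng = random.Random(seed)
    rng.shuffle(all_edges)
    return all_edges[:m]


def with_dummy_root(n, edges):
    """Prepend the star edges from a dummy vertex (labelled n here, an
    assumption of this sketch) to every vertex 0..n-1."""
    star = [(n, v) for v in range(n)]
    return star + list(edges)
```

Materialising all possible edges takes quadratic space in the number of vertices, which is acceptable for small experiments; a larger study would need a sampling method that avoids building the full list.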

d.2 Algorithms

During the course of these experiments, we modified the algorithms described previously using some obvious heuristics to improve their empirical performance. The analyzed algorithms are as follows.

  • Simple Algorithm (Simp): This algorithm refers to Procedure LABEL:alg:simple1 described in Section 3, which adds one new vertex to the DFS tree during each pass, requiring exactly passes irrespective of the dataset. However, notice that after having found a new vertex , the residual pass is wasted. This can be used to possibly find the next neighbour of if an edge exists in the residual pass, and then possibly the neighbour of , and so on. Hence, using this additional heuristic the algorithm now possibly adds more vertices in each pass, requiring fewer than passes. The algorithm without using the additional heuristic shall be referred to as SimpO.

  • Improved Algorithm (Imprv): This algorithm refers to Procedure LABEL:alg:simple described in Section 3.1, where in the pass all the vertices in the level of the final DFS tree are added. Thus, this algorithm requires exactly passes, where is the height of the computed DFS tree.

  • K Path Algorithm (kPath): This algorithm refers to Procedure LABEL:alg:advAlg1 described in Section 4, where by using edges each pass adds a path of length at least to the DFS tree, for each component of the unvisited graph. For a component of vertices, the algorithm essentially computes an auxiliary DFS tree using the first edges of the pass and adds the longest path of this tree to the final DFS tree.

  • K Level Algorithm (kLev): This algorithm refers to Procedure LABEL:alg:advAlg2 described in Section 5, where in each pass by using edges a spanning tree is computed whose top levels are the next levels of the final DFS tree. This requires exactly passes, where is the height of the final DFS tree. However, it is evident from the algorithm that if some vertex and all its ancestors are not modified during a pass, it will remain so in the final DFS tree. Hence, we use an additional heuristic which also adds such unmodified vertices at the end of the pass to the DFS tree. The algorithm without using the additional heuristic shall be referred to as kLevO.
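To make the difference between Simp and SimpO concrete, the following sketch counts passes for one plausible reading of the simple algorithm. It is an assumption of this sketch, not a transcription of the procedure from Section 3, that each pass selects the deepest visited vertex having an unvisited neighbour, and that the heuristic extends the tree immediately whenever an edge from the current vertex to an unvisited vertex appears in the stream. The names `count_passes` and `is_dfs_tree` are hypothetical, and the stream is assumed to contain the star edges from the dummy root, labelled n.

```python
def count_passes(n, stream, heuristic):
    """Build a DFS tree over vertices 0..n-1 plus a dummy root n, reading
    `stream` (a list of undirected edges) once per pass.

    Returns (number of passes, parent map of the computed tree)."""
    root = n
    parent = {root: None}
    depth = {root: 0}      # keys of `depth` double as the visited set
    top = root             # the current DFS vertex
    passes = 0
    while len(parent) < n + 1:
        passes += 1
        best = None        # (depth of u, u, v): deepest visited u with unvisited v
        grew = False
        for a, b in stream:
            for u, v in ((a, b), (b, a)):
                if u not in depth or v in depth:
                    continue
                if heuristic and u == top:
                    # An unvisited neighbour of the current vertex may
                    # always be added as its child right away.
                    parent[v], depth[v] = u, depth[u] + 1
                    top, grew = v, True
                elif best is None or depth[u] > best[0]:
                    best = (depth[u], u, v)
        if not grew:
            if best is None:
                break      # remaining vertices are unreachable from the root
            # The tree was unchanged for the whole pass, so `best` is
            # reliable: backtrack to u and add v as its child.
            _, u, v = best
            parent[v], depth[v] = u, depth[u] + 1
            top = v
    return passes, parent


def is_dfs_tree(parent, edges):
    """Check that every edge joins an ancestor and a descendant."""
    def ancestors(x):
        seen = set()
        while x is not None:
            seen.add(x)
            x = parent[x]
        return seen
    return all(a in ancestors(b) or b in ancestors(a) for a, b in edges)
```

On a path 0-1-2-3-4 streamed in order after the star edges, `count_passes(5, stream, heuristic=False)` uses 5 passes while `heuristic=True` uses a single pass, which mirrors the qualitative gap between SimpO and Simp. The number of passes with the heuristic depends on the order of edges in the stream.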

D.3 Experiments on Random Graphs

Figure 4: For various algorithms, the plot shows the number of passes required to build a DFS tree of a random graph with edges for different values of . The advanced algorithms are allowed to store edges.

We first compare, in Figure 4, the expected number of passes taken by different algorithms for random graphs having edges, for different values of up to . The number of passes required by SimpO clearly matches the number of vertices, as expected. The performance of Simp is strikingly better than that of SimpO, demonstrating the significance of the additional heuristic. The variation of Imprv essentially shows the expected height of the DFS tree for edges. The advanced algorithms are evaluated using edges, for . The algorithm kPath performs extremely well, showing the presence of a deep DFS tree of a random graph even with edges (as expected from the DFS height property), and thereafter splitting into small components. It requires the minimum number of passes (recall that kPath and kLev use an additional pass to determine the components) for the values of having , after which it still requires merely passes. Notice that this is against the expectation because when , or , the algorithm should require minimum passes. The number of passes taken by kLevO for a given value of is indeed close to times the number of passes taken by Imprv, as expected from the theoretical bounds. However, the performance of kLev is remarkably better than that of kLevO, demonstrating the significance of the additional heuristic. Apparently, the whole of the DFS tree is fixed within a few passes, after which kLevO merely adds the top levels to the final DFS tree in each pass. Thus, the role of the additional heuristic is very significant, although it is adversely affected as becomes larger with respect to . Henceforth, we shall evaluate only Simp and kLev, ignoring SimpO and kLevO as they do not seem to reveal any extra information. The following are the most surprising observations of this experiment:

Observation 1.

The advanced algorithms perform extremely well for from to ,
(a) kPath requires the passes (minimum) until and passes henceforth.
(b) kLev requires merely passes which gradually increase to .

Figure 5: For various algorithms, the plot shows the number of passes required to build a DFS tree of a random graph with for different values of up to . The advanced algorithms are allowed to store (1) edges, and (2) edges.

We now compare, in Figure 5, the expected number of passes taken by different algorithms for random graphs having vertices, for different values of up to . The number of passes required by Simp decreases sharply as the graph becomes denser. This can be explained by the additional heuristic of Simp, which has more opportunities to add vertices during a single pass in denser graphs. However, the performance of Imprv worsens sharply as the graph becomes denser. This is because the height of a DFS tree, and hence the number of passes, increases sharply with the density by the DFS height property. The advanced algorithms are evaluated using edges, for and . Notably, the performance of kPath worsens with the increase in density despite the fact that it exploits the depth of the auxiliary DFS tree to extract the longest path. However, recall that this auxiliary DFS tree is made using just the first edges for a component of size , which is clearly independent of the density of the graph. Moreover, the resulting components formed after having removed the longest path would be fewer in number and larger in size as the density of the graph increases, justifying the larger number of passes required by kPath. However, notice that performs better than , as clearly affects the depth of the auxiliary DFS tree and hence the length of the path added during a pass. Also, note that the passes required by kPath are minimum until a little earlier than , as noticed in Observation 1 (a). After this, instead of increasing gradually with density as normally expected (since it is an expected value), the number of passes increases like a staircase having steps at , and so on. However, after around 100,000 edges, the number of passes increases gradually for both values of to its final value, not adhering to the staircase structure. Finally, kLev also has a surprising performance sharply rising up to