When it comes to graphs, traditional topological data analysis has focused mostly on undirected ones. However, applications in social networks [20, 2], brain networks , and others require processing directed graphs. Consequently, topological data analysis for these applications needs to be adapted accordingly to account for directedness. Recently, some work [5, 19] have initiated to address this important but so far neglected issue.
Since topological data analysis uses persistent homology as a main tool, one needs a notion of homology for directed graphs. Of course, one can forget the directedness and consider the underlying undirected graph as a simplicial -complex and use a standard persistent homology pipeline for the analysis. However, this is less than desirable because the important information involving directions is lost. Currently, there are two main approaches that have been proposed for dealing with directed graphs. One uses directed clique complexes [19, 9] and the other uses the concept of path homology . In the first approach, a -clique in the input directed graph is turned into a -simplex if the clique has a single source and a single sink. The resulting simplicial complex is subsequently analyzed with the usual persistent homology pipeline. One issue with this approach is that there could be very few cliques with the required condition and thus accommodating only a very few higher dimensional simplices. In the worst case, only the undirected graph can be returned as the directed clique complex if each 3-clique is a directed cycle. The second approach based on path homology alleviates this deficiency. Furthermore, certain natural functorial properties, such as Künneth formula, do not hold for the clique complex .
The path homology, originally proposed by Grigoryan, Lin, Muranov and Yau in 2012  and later studied by [12, 13, 6], has several properties that make it a richer mathematical structure. For example, there is a concept of homotopy under which the path homology is preserved; it accommodates Künneth formula; and the path homology theory is dual to the cohomology theory of digraphs introduced in . Furthermore, persistent path homology developed in  is shown to respect a stability property for its persistent diagrams.
To use path homologies effectively in practice, one needs efficient algorithms to compute them. In particular, we are interested in developing efficient algorithms for computing -dimensional path homology and its persistent version because even for this case the current state of the art is far from satisfactory: Given a directed graph with vertices, the most efficient algorithm proposed in  has a time complexity (more precisely, their algorithm takes to compute the ()-dimensional persistent path-homology).
The main contribution of this paper is stated in Theorem 1.1. The reduced time complexity of our algorithm can be attributed to the fact that we compute the boundary groups more efficiently. In particular, it turns out that for -dimensional path homology, the boundary group is determined by bigons, certain triangles, and certain quadrangles in the input directed graph. The bigons and triangles can be determined relatively easily. It is the boundary quadrangles whose computation and size determine the time complexity. The authors in  compute a basis of these boundary quadrangles by constructing a certain generating set for the 2-dimensional chain group by a nice column reduction algorithm (being different from the standard simplicial homology, it is non-trivial to do reduction for path homology). We take advantage of the concept of arboricity and related results in graph theory, together with other efficient strategies, to enumerate a much smaller set generating the boundary quadrangles. Computing the cycle and boundary groups efficiently both for non-persistent and persistent homology groups is the key to our improved time complexity.
Given a directed graph with vertices and edges, set , where is the so-called arboricity of , and and are the in-degree and out-degree of , respectively. There is an time algorithm for computing the -dimensional persistent path homology for where is the exponent for matrix multiplication111That is, the fastest algorithm to multiply two matrices takes time ., and is the inverse Ackermann function.
This also gives an time algorithm for computing the -dimensional path homology of .
In particular, for a planar graph , and the time complexity becomes .
The arboricity of a graph mentioned in Theorem 1.1 denotes the minimum number of edge-disjoint spanning forests into which G can be decomposed . It is known that in general, , but it can be much smaller. For example, for planar graphs and for a graph embedded on a genus- surface . Hence, for planar graphs, we can compute -dimensional persistent path homology in time whereas the algorithm in  takes time 222The original time complexity stated in the paper is for -dimensional case. However, one can improve it to by a more refined analysis for planar graphs..
Organization of the paper.
After characterizing the -dimensional path homology group in Section 3, we first propose a simple algorithm to compute it. In Section 4, we consider its persistent version and present an improved and more efficient algorithm. In Section 5, we develop an algorithm to compute -dimensional minimal path homology basis [8, 10], and also show experiments demonstrating the efficiency of our new algorithms.
We briefly introduce some necessary background for path homology. Interested readers can refer to  for more details. The original definition can be applied to structures beyond directed graphs; but for simplicity, we use directed graphs to introduce the notations.
Given a directed graph , we denote as the directed edge from to . A self-loop is defined to be the edge from to itself. Throughout this paper, we assume that does not have self-loops. We also assume that
does not have multi-edges, i.e. for every ordered pair, there is at most one directed edge from to . For notational simplicity, we sometimes use index to refer to vertex .
Let be a field with 0 and 1 being the additive and multiplicative identities respectively. We use to denote the additive inverse of in . An elementary -path on is simply a sequence of vertices in . We denote this path by . Let denote the -linear space of all linear combinations of elementary -paths with coefficients from . It is easy to check that the set is a basis for . Each element of is called a -path, and it can be written as
Similar to simplicial complexes, there is a well-defined boundary operator :
where means the omission of index . The boundary of a path , is thus . We set and note that is the set of -linear combinations of vertices in .
Lemma 2.4 in  shows that .
Next, we restrict to real paths in directed graphs. Specifically, given a directed graph , call an elementary -path allowed if there is an edge from to for all . Define as the space of all allowed -paths, that is, . An elementary -path is called regular if for all , and is irregular otherwise. Clearly, every allowed path is regular since there is no self-loop. However, the boundary map on may create a term resulting into an irregular path. For example, is irregular because of the term . To deal with this case, the term containing consecutive repeated vertices is identified with . Thus, for the previous example, we get . The boundary map on is taken to be the boundary map for restricted on with this modification: where all terms with consecutive repeated vertices created by the boundary map are replaced with ’s.
Unfortunately, after restricting to the space of allowed paths , the inclusion that may not hold any more; that is, the boundary of an allowed -path is not necessarily an allowed -path. To this end, we adopt a stronger notion of allowed paths: an allowed path is -invariant if is also allowed. Let be the space generated by all -invariant paths. We then have (as ). This gives rise to the following chain complex of -invariant allowed paths:
We can now define the homology groups of this chain complex. The -th cycle group is defined as , and elements in are called -cycles. The -th boundary group is defined as , with elements of being called -boundary cycles (or simply -boundaries). The resulting -dimensional path homology group is .
2.1 Examples of 1-boundaries
Below we give three examples of 1-boundaries; see Figure 1.
A bi-gon is a 1-cycle consisting of two edges and from ; see Figure 1(a). Consider the 2-path . We have that its boundary is . Since both and are allowed 1-paths, it follows that any bi-gon of is necessarily a 1-boundary.
Consider the 1-cycle of (it is easy to check that ). Now consider the 2-path : its boundary is then . Note that every summand in the boundary is allowed. Thus is a 1-boundary. We call any triangle in isomorphic to a boundary triangle. Note that a boundary triangle always has one sink and one source; see the source and sink in Figure 1(b). In what follows, we use to denote a boundary triangle where is the source and is the sink.
Consider the 1-cycle from . It is easy to check that is the boundary of the -path , as . We call any quadrangle isomorphic to a boundary quadrangle.
In the remainder of the paper, we use to represent a quadrangle; i.e, a 1-cycle consisting of 4 edges whose undirected version has the form . (Note that a quadrangle may not be a boundary quadrangle). We denote a boundary quadrangle by , where and are the source and sink of this boundary quadrangle respectively.
3 Computing -dimensional path homology
Note that the -dimensional -invariant path space is the space generated by all edges  because the boundary of every edge is allowed by definition.
Now consider the -cycle group ; that is, is the kernel of applied to . We show below that a basis of can be computed by considering a spanning tree of the undirected version of , which denoted by . This is well known when is . It is easy to see that this spanning tree based construction also works for arbitrary field .
Specifically, let be a rooted spanning tree of with root , and . For every edge , let be the -cycle (under ) obtained by summing and all edges on the paths and between and , and and respectively. The cycles form a basis of -cycle group of under coefficient. Now for every such cycle in , we also have a cycle in containing same edges with which are assigned a coefficient or depending on their orientations in . We call this -cycle in also . Then we have the following proposition.
The cycles in form a basis for under any coefficient field .
First, it is obvious that the cycles are independent because every cycle contains a unique edge that does not exist in other cycles, which means any cycle cannot written as linear combination of other cycles in the set. Now what remains is to show that the cycles generate . Consider any -cycle in with coefficients in . Observe that must have at least one edge from because otherwise it will have non-zero coefficients only on edges in whose boundary cannot be . Consider an edge with non-zero coefficient in . The cycle has with zero coefficient. If is not , continue the argument again and we are guaranteed to derive a null cycle ultimately because every time we make the coefficient of an edge belonging to zero. This means that we have for some . In other words . It immediately follows that form a basis for . ∎
Now, we show a relation between -dimensional homology, cycles, bigons, triangles and quadrangles. Recall that bi-gons, boundary triangles and boundary quadrangles are specific types of -dimensional boundaries with two, three or four vertices, respectively; see Section 2.1. The following theorem is similar to Proposition 2.9 from , where the statement there is under coefficient ring . For completeness, we include the (rather similar) proof for our case in Appendix A.
Let be a directed graph. Let denote the space generated by all boundary triangles, boundary quadrangles and bi-gons in . Then we have .
The -dimensional path homology group satisfies that .
3.1 A simple algorithm
Theorem 3.1 and Corollary 3.1 provide us a simple framework to compute . Below we only focus on the computation of the rank of ; but the algorithm can easily be modified to output a basis for as well. Later in Section 4, we will develop a more efficient and sophisticated algorithm for the -dimensional persistent path homology , which as a by-product, also gives a more efficient algorithm to compute .
In the remaining of this paper, we represent each cycle in
with a vector. Assume all edges are indexed fromto as where is the number of edges. Then, each 1-cycle is an -dimensional vector, where records the coefficient for edge in .
(Step 1): cycle group .
By Proposition 3.1, for directed graph . The computation of the rank takes time (or time if we need to output a basis of it explicitly).
(Step 2): boundary group .
Note that by Theorem 3.1, we can compute the set of all bigons, boundary triangles and boundary quadrangles as a generating set of 1-boundary cycles (meaning that it generates the boundary group ) for (Step 2.a). However, such a set could have size even for a planar graph, where ; see Figure 2. (For a general graph, the number of boundary quadrangles could be .)
To make (Step 2.b) efficient, we wish to have a generating set of 1-boundary cycles with small cardinality. To this end, we leverage a classical result of  to reduce the size of .
Given an undirected graph where the number of multi-edges between any two vertices is constant, its arboricity is the minimum number of edge-disjoint spanning forests which G can be decomposed into . An alternative definition is
From this definition, it is easy to see (and well-known, see e.g, ) that:
Given an undirected graph where the number of multi-edges between any two vertices is constant:
(1). If is a planar graph, or a graph with bounded vertex degrees, then .
(2). If is a graph embedded on a genus surface, then .
(3). In general, if does not contain self-loops, then .
We will leverage some classical results from . First, to represent quadrangles, we use the following triple-list representation  : a triple-list means that for each , is adjacent to both and , where we say and are adjacent if either or are in (i.e, and are adjacent when disregarding directions). Given such a triple-list , it is easy to see that form the consecutive vertices of a quadrangle in the undirected version of graph ; and we also say that the undirected quadrangle is covered by this triple-list. We say that a vertex is in a triple-list if it is in the set .
The size of a triple-list is the total number of vertices contained in it. This triple-list thus represents number of undirected quadrangles in succinctly with size.
Proposition 3.2 ().
(1) Let be a connected undirected graph with vertices and edges. There is an algorithm listing all the triangles in in time.
(2) There is an algorithm to compute a set of triple-lists which covers all quadrangles in a connected graph G in time. The total size complexity of all triple-lists is .
Using the above result, we can have the following theorem.
Let be a directed graph with vertices and edges. We can compute a generating set of 1-boundary cycles for with cardinality in time .
We will show that we can generate all bigons, boundary triangles and boundary quadrangles, via a generating set satisfying the requirement in the theorem.
First, note that the total number of bigons are since there are directed edges and for each edge, there is at most one bi-gon created as there is no multi-edge in . Next, by Proposition 3.2, we have that the total number of undirected triangles is , which further bounds the total number of directed triangles, as well as that of the boundary triangles by . In particular, we enumerate all undirected triangles in in time, and if the vertices form a boundary triangle, we add it to the generating set . This adds all number of boundary triangles to in time.
The case for quadrangles is more involved. The total number of quadrangles could be and we aim to find a subset of that generate all boundary quadrangles to add to .
By Proposition 3.2, in time, we can compute a list of triple-lists with total size complexity that generate all undirected quadrangles. Note that this also implies that .
Now for a triple-list , we will generate three lists as follows (see Figure 3):
- Type-1 list:
a triple-list where is in if and only if , and both edges and are in . We say that a boundary quadrangle can be generated by the type-1 list if it is of the form , with .
- Type-2 list:
a triple-list where is in if any only if is in , and the edges and are in . We say that a boundary quadrangle can be generated by the type-2 list if it is of the form , with .
- Type-3 list:
a so-called quadruple-list , where each and are in , and edges , , and are in , for . We say that a boundary quadrangle can be generated by the type-3 list if it is of the form , with and .
It is easy to see that . Let denote all lists generated by triples in . Note that total size complexity of (which is the sum of the size of each triple-list or quadruple-list in ) is still .
Denote as the set of all boundary quadrangles in ; note that by definition, each element in is a 1-boundary. On the other hand, each boundary quadrangle in is generated by some list in . Indeed, given an arbitrary boundary quadrangle , By Proposition 3.2, we know that its undirected version must be covered by some triple-list . If and , then the boundary quadrangle is then generated by the type-1 list . If and , then it is generated by the type=2 list . Otherwise, by the definition of covering the undirected quadrangle , it must mean that or . Thus this boundary quadrangle is generated by the type-3 list . This proves that all boundary quadrangles in are generated by the lists in the set .
Finally, we will add boundary quadrangles to as follows: We inspect each list in .
- Case 1
: If it is a type-1 list, say of the form , we add number of boundary quadrangles of the form to for each .
- Case 2
: If it is a type-2 list of the form , we then add number of boundary quadrangles of the form to , for each .
- Case 3
: If it is a type-3 list of the form , then we add number of boundary quadrangles of the form or into , for each and .
Note that for each case above, the subset of quadrangles we add to can generate all the boundary quadrangles generated by the corresponding list for or . Indeed, for (Case 1): each boundary quadrangle generated by can be written as the linear combination of and , both of which are added to . (Case 2) can be argued in a symmetric manner. For (Case 3), consider a boundary quadrangle generated by . It can be written as the combination of , and , all of which on the righthand side are added to .
Hence in summary, the set of quadrangles we add to will generate all boundary quadrangles . Furthermore, note that in each case, the number of quadrangles we add to is linear in the size of the list from being considered. Hence the total number of boundary quadrangles ever added to is bounded by the total size of which is further bounded by . Putting everything together, the theorem then follows. ∎
It then follows from Theorem 3.1 that (Step 2.a) can be implemented in time, producing a generating set of cardinality . Finally, representing each boundary cycle in as a vector of dimension , we can then compute the rank of cycles in in , where is the exponent for matrix multiplication .
Putting everything together, we have that
Given a directed graph with and , Algorithm 1 computes the rank of the -dimensional path homology group in time. The algorithm can be extended to compute a basis for with the same time complexity.
4 Computing persistent path homology
The concept of arboricity used in the previous section does not consider edge directions. Indeed, our algorithm to compute a generating set as given in the proof of Theorem 3.2 first computes a (succinct) representation of all quadrangles, whether they contribute to boundary quadrangles or not. On the other hand, as Figure 4 illustrates, a graph can have no boundary quadrangle despite the fact that the graph is dense (with edges and thus arboricity). Another way to view this is that the example has no allowed -path, and thus no -invariant 2-paths and consequently no 1-boundary cycles. Our algorithm will be more efficient if it can also respect the number of allowed elementary -paths.
In fact, a more standard and natural way to compute a basis for the 1-boundary group proceeds by taking the boundary of -invariant 2-paths. The complication is that unlike in the simplicial homology case, it is not immediately evident how to compute a basis for (the space of -invariant 2-paths). Nevertheless, Chowdhury and Mémoli presented an elegant algorithm to show that a basis for (and ) can still be computed using careful column-based matrix reductions . The time complexity of their algorithm is which depends on the number of elementary -paths 333The time complexity given in the paper  assumes that the input directed graph is complete, and takes to compute . However, a more refined analysis of their time complexity shows that it can be improved to . .
In this section, we present an algorithm that can take advantage of both of the previous approaches (the algorithm of  and Algorithm 1). Similar to , we will now consider the persistent path homology setting, where we will add directed edges in one by one incrementally. Hence our algorithm can compute the persistent w.r.t. a filtration. However different from , instead of reducing a matrix with columns corresponding to all elementary allowed 2-paths, we will follow a similar idea as in Algorithm 1 and add a generating set of boundary cycles each time we consider a new directed edge.
4.1 Persistent path homology
We now introduce the definition of the persistent path homology . The persistent vector space is a family of vector spaces together with linear maps so that: (1) is the identity for every ; and (2) for each .
Let be a weighted directed graph where is the vertex set, is the edge set, and is the weight function . For every , a directed graph can be constructed as . This gives rise to a filtration of graphs using the natural inclusion map .
 The -dimensional persistent path homology of a weighted directed graph is defined as the persistent vector space The -dimensional path persistence diagram of is the persistence diagram of .
To compute the path homology of an unweighted directed graph , we can order edges in arbitrarily with the index of an edge in this order being its weight. The rank of can then be retrieved from the -dimensional persistent homology group induced by this filtration by considering only those homology classes that “never die”.
4.2 A more efficient algorithm
In what follows, to simplify presentation, we assume that we are given a directed graph , where edges are already sorted in increasing order of their weights. Let denote the subgraph of spanned by the edges ; and set . We now present an algorithm to compute the -dimensional persistent path homology induced by the nesting sequence . In particular, in Algorithm 2, as we insert each new edge , moving from to , we maintain a basis for and for , updated from and and output new persistent pairs. On a high level, this algorithm follows the standard procedure in .
4.2.1 Procedure GenSet.
Note that is obtained from by inserting a new edge to . At this point, we have already maintained a basis for . Our goal is to compute a set of generating boundary cycles such that contains a basis for .
We first inspect the effect of adding edge to . Two cases can happen:
(Case-A): The endpoints and are in different connected components in (the undirected version of) , and after adding , those two components are merged into a single one in . In this case, no cycle is created, nor does the boundary group change. Thus and . We say that edge is negative in this case (as it kills in ).
The algorithm maintains the set of negative edges seen so far, which is known to form a spanning forest of . (Here, we abuse the notation slightly and say that a set of directed edges span a tree for a set of vertices if they do so when directions are ignored.) The algorithm maintains via a union-find data structure.
(Case-B): The endpoints and are already in the same connected component in . After adding this edge , new cycles are created in . Hence is positive in this case (as it creates an element in ; although different from the standard simplicial homology, it may not necessarily create an element in as we will see later).
Whether is positive or negative can be easily determined by performing two Find operations in the union-find data structure representing . A Union() operation is performed to update to if is negative.
We now describe how to handle (Case-B). After adding edge , multiple cycles containing can be created in . Nevertheless, by Proposition 3.1, the dimension of increases only by . On the other hand, the addition of may create new boundary cycles. Interestingly, it could increase the rank of by more than . See Figure 5 for an example where increases by ; and note that this number can be made arbitrarily large.
As mentioned earlier, in this case, we wish to compute a set of generating boundary cycles such that contains a basis for .
Similar to Algorithm 1, using Theorem 3.1, we choose some bigons, boundary triangles and boundary quadrangles and add them to . In particular, since only accounts for the newly created boundary cycles, we only need to consider bigons, boundary triangles and boundary quadrangles that contain . We now describe the construction of , which is initialized to be .
(i) Bigons. At most one bigon can be created after adding (namely, the one that contains ). We add it to if this bigon exists.
(ii) Boundary triangles. There could be three types of newly created boundary triangles containing . The first case is when is the source and is the sink; see Figure 6(a). In this case multiple 2-paths may exist from to , , forming multiple boundary triangles of this type containing . However, we only need to add one triangle of them into , say since every other triangle can be written as a linear combination of and an existing boundary quadrangle in .
For the second case (see Figure 6(b)) where is the source but is not the sink, we include all such boundary triangles to . We also add all boundary triangles of the last type in which is the sink but is not the source to ; see Figure 6(c). It is easy to see that can generate all new boundary triangles containing .
(iii) Boundary quadrangles. Given an edge , there are two types of the boundary quadrangles incident to it: one has as the source; the other has as the sink. We focus on the first case; see Figure 7(a). The second case can be handled symmetrically.
In particular, we will first compute a set and then select a subset of quadrangles from for adding to .
Specifically, take any successor of , that is, there is an edge forming an allowed 2-path in . Before introducing the edge , there may be multiple allowed 2-paths in ; see Figure 7 (a). For each such 2-path , a new boundary quadrangle containing will be created. However, among all such 2-paths , we will pick just one 2-path, say and only add the quadrangle formed by and to . Observe that any other boundary quadrangle containing 2-path , say , can be written as a linear combination of the quadrangle and boundary quadrangle which is already in (and in the span of which is a basis for ). In other words, generates any other boundary quadrangle containing 2-path .
We perform this for each successor of . Hence this step adds at most number of boundary quadrangles to the set .
Not all quadrangles in will be added to . In particular, suppose we have quadrangles incident to the newly inserted edge as well as another vertex , i.e. there are edges and , ; see Figure 7 (b). If there does not exist any other vertex such that edges , then we add all quadrangles in to . If this is not the case, let be another vertex such that and are already in ; see Figure 7 (b). In this case, we only add one quadrangle from set , say, to the generating set .
It is easy to check that any other quadrangle , , can be written as the combination of , and . As the latter two quadrangles are boundary quadrangles from , they can already be generated by . The entire process takes time