Efficient Implementation of Color Coding Algorithm for Subgraph Isomorphism Problem

08/29/2019 ∙ by Josef Malík, et al. ∙ Czech Technical University in Prague 0

We consider the subgraph isomorphism problem where, given two graphs G (source graph) and F (pattern graph), one is to decide whether there is a (not necessarily induced) subgraph of G isomorphic to F. While many practical heuristic algorithms have been developed for the problem, as pointed out by McCreesh et al. [JAIR 2018], for each of them there are rather small instances which they cannot cope. Therefore, developing an alternative approach that could possibly cope with these hard instances would be of interest. A seminal paper by Alon, Yuster and Zwick [J. ACM 1995] introduced the color coding approach to solve the problem, where the main part is a dynamic programming over color subsets and partial mappings. As with many exponential-time dynamic programming algorithms, the memory requirements constitute the main limiting factor for its usage. Because these requirements grow exponentially with the treewidth of the pattern graph, all existing implementations based on the color coding principle restrict themselves to specific pattern graphs, e.g., paths or trees. In contrast, we provide an efficient implementation of the algorithm significantly reducing its memory requirements so that it can be used for pattern graphs of larger treewidth. Moreover, our implementation not only decides the existence of an isomorphic subgraph, but it also enumerates all such subgraphs (or given number of them). We provide an extensive experimental comparison of our implementation to other available solvers for the problem.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Many real-world domains incorporate large and complex networks of interconnected units. Examples include social networks, the Internet, or biological and chemical systems. These networks raise interesting questions regarding their structure. One of those questions asks whether a given network contains a particular pattern, which typically represents a specific behaviour of interest [1, 7, 15]. The problem of locating a particular pattern in the given network can be restated as a problem of locating a subgraph isomorphic to the given pattern graph in the network graph.

Formally, the Subgraph Isomorphism (SubIso) problem is, given two undirected graphs  and , to decide whether there is a (not necessarily induced) subgraph of  isomorphic to . Or, in other words, whether there is an adjacency-preserving injective mapping from vertices of to vertices of . Since we do not require the subgraph to be induced (or the mapping to preserve non-adjacencies), some authors call this variant of the problem Subgraph Monomorphism.

For many applications it is not enough to just learn that the pattern does occur in the network, but it is necessary to actually obtain the location of an occurrence of the pattern or rather of all occurrences of the pattern [21, 27]. Because of that, we aim to solve the problem of subgraph enumeration, in which it is required to output all subgraphs of the network graph isomorphic to the pattern graph. In Subgraph Enumeration (SubEnum), given again two graphs and , the goal is to enumerate all subgraphs of  isomorphic to . Note, that SubEnum is at least as hard as SubIso. We call the variants, where the problem is required to be induced IndSubIso and IndSubEnum, respectively.

As Clique, one of the problems on the Karp’s original list of 21 NP-complete problems [18], is a special case of SubIso, the problem is NP-complete. Nevertheless, there are many heuristic algorithms for SubEnum, many of them based on ideas from constraint programming (see Section 1.1), which give results in reasonable time for most instances. However, for each of them there are rather small instances which they find genuinely hard, as pointed out by McCreesh et al. [25]. Therefore, developing an alternative approach that could possibly cope with these hard instances would be of interest.

In this paper we focus on the well known randomized color coding approach [2], which presumably has almost optimal worst case time complexity. Indeed, its time complexity is with memory requirements of , where  and  denote the number of vertices in the network graph  and the pattern graph , respectively, and is the treewidth of graph —a measure of tree-likeness (see Section 1.2 for exact definitions). Moreover, we presumably cannot avoid the factor exponential in treewidth in the worst case running time, as Marx [23] presented an ETH111Exponential Time Hypothesis [17]-based lower bound for Partitioned Subgraph Isomorphism problem.

Proposition 1 (Marx [23])

If there is a recursively enumerable class  of graphs with unbounded treewidth, an algorithm , and an arbitrary function  such that  correctly decides every instance of Partitioned Subgraph Isomorphism with the smaller graph  in  in time , then ETH fails.

As the memory requirements of the color coding approach grow exponentially with treewidth of the pattern graph, existing implementations for subgraph enumeration based on this principle restrict themselves to paths [16] or trees [27], both having treewidth 1. As the real world applications might employ networks of possibly tens to hundreds of thousands of vertices and also pattern graphs with structure more complicated than trees, we need to significantly reduce the memory usage of the algorithm.

Using the principle of inclusion-exclusion, Amini et al. [3, Theorem 15] suggested a modification of the color coding algorithm, which can decide whether the pattern occurs in the graph in expected time with memory requirements reduced to .222While the formulation of Theorem 15 in [3] might suggest that the algorithm actually outputs a witnessing occurrence, the algorithm merely decides whether the number of occurrences is non-zero (see the proof of the theorem). While single witnessing occurrence can be found by means of self-reduction (which is complicated in case of randomized algorithm), the inclusion-exclusion nature of the algorithm does not allow to find all occurrences of pattern in the graph, which is our main goal.

Therefore, our approach rather follows the paradigm of generating only those parts of a dynamic programming table that correspond to subproblems with a positive answer, recently called “positive instance driven” approach [30]. This further prohibits the use of the inclusion-exclusion approach of Amini et al. [3], since the inclusion-exclusion approach tends to use most of the table and the term is itself prohibitive in the memory requirements for .

Because of the time and memory requirements of the algorithm, for practical purposes we restrict ourselves to pattern graphs with at most  vertices.

Altogether, our main contribution is twofold:

  • We provide a practical implementation of the color coding algorithm of Alon, Yuster, and Zwick [2] capable of processing large networks and (possibly disconnected) pattern graphs of small, yet not a priory bounded, treewidth.

  • We supply a routine to extract the occurrences of the subgraphs found from a run of the algorithm.

It is important to note that all the modifications only improve the practical memory requirements and running time. The theoretical worst case time and space complexity remain the same as for the original color coding algorithm and the algorithm achieves these, e.g., if the network graph is complete. Also, in such a case, there are occurrences of the pattern graph in the network implying a lower bound on the running time of the enumeration part.

In Section 2 we describe our modifications to the algorithm and necessary tools used in the process. Then, in Section 5, we benchmark our algorithm on synthetic and realistic data and compare its performance with available existing implementations of algorithms for subgraph isomorphism and discuss the results obtained. Section 6 presents future research directions.

1.1 Related work

There are several algorithms tackling SubIso and its related variants. Some of them only solve the variant of subgraph counting, our main focus is however on algorithms actually solving SubEnum. Following Carletti et al. [10] and Kimmig et al. [19], we categorize the algorithms by the approach they use (see also Kotthoff et al. [20] for more detailed description of the algorithms). Many of the approaches can be used both for induced and non-induced variants of the problem, while some algorithms are applicable only for one of them.

Vast majority of known algorithms for the subgraph enumeration problem is based on the approach of representing the problem as a searching process. Usually, the state space is modelled as a tree and its nodes represent a state of a partial mapping. Finding a solution then typically resorts to the usage of DFS in order to find a path of mappings in the state space tree which is compliant with isomorphism requirements. The efficiency of those algorithms is largely based on early pruning of unprofitable paths in the state space. Indeed, McCreesh et al. [25] even measure the efficiency in the number of generated search tree nodes. The most prominent algorithms based on this idea are Ullmann’s algorithm [31], VF algorithm and its variants [8, 10, 12, 13] (the latest VF3 [8] only applies to IndSubEnum) and RI algorithm [7]. The differences between these algorithms are based both on employed pruning strategies and on the order in which the vertices of pattern graph are processed (i.e. in the shape of the state space tree).

Another approach is based on constraint programming, in which the problem is modelled as a set of variables (with respective domains) and constraints restricting simultaneous variable assignments. The solution is an assignment of values to variables in a way such that no constraint remains unsatisfied. In subgraph isomorphism, variables represent pattern graph vertices, their domain consists of target graph vertices to which they may be mapped and constraints ensure that the properties of isomorphism remain satisfied. Also in this approach, a state space of assignments is represented by a search tree, in which non-profitable branches are to be filtered. Typical algorithms in this category are LAD algorithm [28], Ullmann’s bitvector algorithm [32], and Glasgow algorithm [24]. These algorithms differ in the constraints they use, the way they propagate constraints, and in the way they filter state space tree.

There are already some implementations based on the color coding paradigm, where the idea is to randomly color the input graph and search only for its subgraphs, isomorphic to the pattern graph, that are colored in distinct colors (see Section 2.1 for more detailed description). This approach is used in subgraph counting algorithms, e.g., in ParSE [33], FASCIA [26], and in [1], or in algorithms for path enumeration described in [27] or in [16]. Each of these algorithms, after the color coding step, tries to exploit the benefits offered by this technique in its own way; although usually a dynamic programming sees its use. Counting algorithms as ParSE and FASCIA make use of specifically partitioned pattern graphs, which allow to use combinatorial computation. Weighted path enumeration algorithms [27, 16] describe a dynamic programming approach and try to optimize it in various ways. However, to the best of our knowledge there is no color coding algorithm capable of enumerating patterns of treewidth larger than 1.

Our aim is to make step towards competitive implementation of color coding based algorithm for SubEnum, in order to see, where this approach can be potentially beneficial against the existing algorithms. To this end, we extend the comparisons of SubEnum algorithms [9, 20, 25] to color coding based algorithms, including the one proposed in this paper.

1.2 Basic definitions

All graphs in this paper are undirected and simple. For a graph  we denote  its vertex set,  the size of this set,  its edge set, and  the size of its edge set.

As already said, we use the color coding algorithm. The algorithm is based on a dynamic programming on a nice tree decomposition of the pattern graph. We first define a tree decomposition and then its nice counterpart.

Definition 1

A tree decomposition of a graph  is a triple , where  is a tree rooted at node  and is a mapping satisfying: (i) ; (ii) , such that ; (iii) the nodes form a connected subtree of .

We shall denote bag  as . The width of tree decomposition is . Treewidth of graph  is the minimal width of a tree decomposition of  over all such decompositions.

Definition 2

A tree decomposition of a graph  is nice if , , and each node is of one of the following four types:

  • Leaf node has no children and ;

  • Introduce node has exactly one child  and for some ;

  • Forget node has exactly one child  and for some ;

  • Join node has exactly two children  and .

Note that for practical purposes, we use a slightly modified definition of nice tree decomposition in this paper. As the algorithm starts the computation in a leaf node, using the standard definition with empty bags of leaves [14] would imply that the tables for leaves would be somewhat meaningless and redundant. Therefore, we make bags of leaf nodes contain a single vertex.

Definition 3

For a tree decomposition , we denote by  the set of vertices in  and in  for all descendants  of  in . Formally .

Note that, by Definition 3, for the root  of  we have and .

2 Algorithm Description

In this section we first briefly describe the idea of the original color coding algorithm [2], show, how to alter the computation in order to reduce its time and memory requirements, and describe implementation details and further optimizations of the algorithm.

2.1 Idea of the Algorithm

The critical idea of color coding is to reduce the problem to its colorful version. For a graph  and a pattern graph , we color the vertices of  with exactly  colors. We use the randomized version, i.e., we create a random coloring . After the coloring, the algorithm considers as valid only subgraphs  of  that are colorful copies of  as follows.

Definition 4

Subgraph  of a graph  is a colorful copy of  with respect to coloring , if  is isomorphic to  and all of its vertices are colored by distinct colors in .

As the output of the algorithm heavily depends on the chosen random coloring of 

, in order to reach some predefined success rate of the algorithm, we need to repeat the process of coloring several times. The probability of a particular occurrence of pattern graph 

becoming colorful with respect to the random coloring is , which tends to for large . Therefore, by running the algorithm times, each time with a random coloring , the probability that an existing occurrence of the pattern will be revealed in none of the runs is at most . While using more colors can reduce the number of iterations needed, it also significantly increases the memory requirements. Hence, we stick to colors. Even though it is possible to derandomize such algorithms, e.g., by the approach shown in [14], in practice the randomized approach usually yields the results much quicker, as discussed in [27]. Moreover, we are not aware of any actual implementation of the derandomization methods.

The main computational part of the algorithm is a dynamic programming. The target is to create a graph isomorphism . We do so by traversing the nice tree decomposition of the pattern graph  and at each node of the tree decomposition, we construct possible partial mappings with regard to required colorfulness of the copy. Combination of partial mappings consistent in colorings then forms a desired resulting mapping.

The semantics of the dynamic programming table is as follows. For any tree decomposition node , any partial mapping and any color subset , we define if there is an isomorphism  of to a subgraph  of  such that:

  1. for all ;

  2. is a colorful copy of using exactly the colors in , that is, and is injective on .

If there is no such isomorphism, then we let . We denote all configurations for which as nonzero configurations.

The original version of the algorithm is based on top-down dynamic programming approach with memoization of already computed results. That immediately implies a big disadvantage of this approach—it requires the underlying dynamic programming table (which is used for memoization) to be fully available throughout the whole run of the algorithm. To avoid this inefficiency in our modification we aim to store only nonzero configurations, similarly to the recent “positive instance driven” dynamic programming approach [30].

3 Obtaining a Nice Tree Decomposition

The algorithm requires a nice tree decomposition of the pattern graph for its work. The running time of the algorithm actually does not depend on the treewidth of the graph, but rather on the width of the tree decomposition supplied. On one hand, for a given graph  and an integer , the problem of determining whether the treewidth of  is at most , is NP-complete [4]. On the other hand, as the main algorithm is exponential in the size of the pattern anyway, we can afford to use exponential time algorithms in order to obtain a tree decomposition of minimum possible width. We employ known technique based on graph triangulation, chordal graphs and elimination orderings, combining the ideas of Bodlaender et al. [5, 6]. Time complexity of this algorithm is .

After obtaining tree decomposition of the pattern graph, we construct a nice tree decomposition as described by Cygan et al. [14]—any tree decomposition of a graph  that consists of nodes with width at most , can be, in time, transformed to a nice tree decomposition of  with nodes and width bounded by . Our further modification to the nice tree decomposition does not affect the running time.

3.1 Initial Algorithm Modification

In our implementation, we aim to store only nonzero configurations, therefore we need to be able to construct nonzero configurations of a parent node just from the list of nonzero configurations in its child/children.

We divide the dynamic programming table  into lists of nonzero configurations, where each nice tree decomposition node has a list of its own. Formally, for every node , let us denote by  a list of all mappings  with a list of their corresponding color sets , for which . The list  for all is, in terms of contained information, equivalent to maintaining the whole table —all configurations not present in the lists can be considered as configurations with a result equal to zero.

Dynamic Programming Description

We now describe how to compute the lists for each type of a nice tree decomposition node.

For a leaf node , there is only a single vertex  in  to consider. We can thus map  to all possible vertices of , and we obtain a list with  partial mappings , in which the color list for each mapping contains a single color set .

For an introduce node and its child  in , we denote by  the vertex being introduced in , i.e., . For all nonzero combinations of a partial mapping and a color set in the list , we try to extend  by all possible mappings of the vertex  to the vertices of . We denote one such a mapping as . We can consider mapping  as correct, if (i) the new mapping  of the vertex  extends the previous colorset , that is, , and (ii) is edge consistent, that is, for all edges between currently mapped vertices, i.e., in our case , there must be an edge . However, because  was by construction already edge consistent, it suffices to check the edge consistency only for all edges in with  as one of their endpoints, i.e., for all edges with . After checking those two conditions, we can add to .

For a forget node and its child  in , there is not much work to do, as in the bottom-up construction, we directly obtain the parent list of nonzero configurations. We denote by , the vertex being forgotten in , i.e., . In this case, for all partial mappings  in the list , we create a new mapping  that excludes the mapping  for vertex , i.e., . Color sets corresponding to  are carried over to , as they represent colors already used in the construction. After this step, we might need to merge color lists of previously different mappings, as after the removal of the mapping , they might become the same mappings.

For a join node , we denote by  and  its children in . We traverse the children lists  and  and look for partial mappings  and , for which holds. Such mappings are the only ones to potentially form a new nonzero configuration in the parent list, as due to the fact that , we construct the new partial mapping  as . However, for each such mapping , we must also construct the new list of color sets, which would afterwards be corresponding to the mapping in the parent list. We do that by traversing color lists corresponding to mappings  and  in  and , respectively, and for particular sets  and  from the color lists of  and  construct a new color set . We must check, whether the intersection of  and  contains exactly the colors to color the vertices in . That is, for mapping , we add to  a color set , if .

Because we build the result from the leaves of the nice tree decomposition, we employ a recursive procedure on its root, in which we perform the computations in a way of a post-order traversal of a tree. From each visited node, we obtain a bottom-up dynamic programming list of nonzero configurations. After the whole nice tree decomposition is traversed, we obtain a list of configurations, that were valid in its root. Such configurations thus represent solutions found during the algorithm, from which we afterwards reconstruct results. Note that as we prepend a root with no vertices in its bag to the nice tree decomposition, there is a nonzero number of solutions if and only if, at the end of the algorithm, the list  contains a single empty mapping using all colors.

3.2 Further Implementation Optimizations

Representation of Mappings

For mapping representation, we suppose that the content of all bags of the nice tree decomposition stays in the same order during the whole algorithm. This natural and easily satisfied condition allows us to represent a mapping in a nice tree decomposition node  simply by an ordered tuple of  vertices from . From this, we can easily determine which vertex from  is mapped to which vertex in . Also, for a mapping in an introduce or a forget node, we can describe a position in the mapping, on which the process of introducing/forgetting takes place.

Representation of Color Sets

We represent color sets as bitmasks, where the -th bit states whether color  is contained in the set or not. For optimization purposes, we represent bitmasks with an integer number. As we use  colors in the algorithm and restricted ourselves to pattern graphs with at most vertices, we represent a color set with a -bit number.

Compressing the Lists

Because we process the dynamic programming lists one mapping at a time, we store these lists in a compressed way and decompress them only on a mapping retrieval basis. We serialize dynamic programming lists into a simple buffer of bytes. We store a dynamic programming list as a continuous group of records, each of which represents one partial mapping and its corresponding list of color sets. Each record contains:

  • a mapping in the form of ordered tuple of vertices,

  • the number of color sets included,

  • and color sets corresponding to the mapping in the form of non-negative integer numbers.

While deserializing, the number of vertices in a mapping can be easily determined by the size of the bag of a particular node.

As there is no requirement on the order of the color sets, we store these sets sorted in the increasing order of their number representation. This allows us to effectively use delta compression. Moreover, we use variable length encoding to store the numbers into buffer.

For several routines we use the LibUCW library,333LibUCW is downloadable from http://www.ucw.cz/libucw/. a C language library highly optimized for performance. The employed routines include data structures like growing buffers, hash tables, red-black trees, fast sorters, fast buffered input/output, and also the efficient variable length encoding of integers.

Masking Unprofitable Mappings

Our implementation supports an extended format of input graphs where one can specify for each vertex of the network, which vertices of the pattern can be mapped to it. This immediately yields a simple degree-based optimization. Before the run of the main algorithm, we perform a linear time preprocessing of input graphs and only allow a vertex to be mapped to a vertex if .

Mapping Expansion Optimizations

The main “brute-force” work of the algorithm is performed in two types of nodes—leaf and introduce nodes, as we need to try all possible mappings of a particular vertex in a leaf node or all possible mappings of an introduced vertex in a introduce node to a vertex from . We describe ways to optimize the work in introduce nodes in this paragraph.

Let  be an introduce node, the vertex introduced and  a mapping from a nonzero configuration for the child of . We always need to check whether the new mapping of  is edge consistent with the mapping  of the remaining vertices for the corresponding bag, i.e., whether all edges of  incident on  would be realized by an edge in . Therefore, if  has any neighbors in , then a vertex of  is a candidate for the mapping of  only if it is a neighbor of all vertices in the set , i.e., the vertices of , where the neighbors of  in  are mapped. Hence, we limit the number of candidates by using the adjacency lists of the already mapped vertices.

In the case we have to use different approach. The pattern graphs  tend to be smaller than the input graphs  by several orders of magnitude. Hence, if the introduced vertex is in the same connected component of  as some vertex already present in the bag, a partial mapping processed in an introduce node anchors the possible resulting component to a certain position in . During the construction of possible mapping of , it is useless to try mapping  to vertices in  that are very distant to the current position and, therefore, could by no means form a resulting subgraph isomorphic to the component of . We obtain the maximal possible distance to be considered in  as a minimal distance of  to a vertex in . I.e., we determine . Due to the limit on pattern graph size, we precompute distances between every pair of vertices by using BFS on each vertex before the start of the algorithm. Then it suffices to try vertices from  that are in distance at most from  in ; again, by a simple BFS usage.

Only if there is no vertex in the bag sharing a connected component of  with , we have to fall back to trying all possible mappings.

4 Result Reconstruction

It is usual in dynamic programming to store a witness or all witnesses for each reasonable value in the table. However, this would completely neglect the effect of compression of the lists and significantly increase the memory requirements.

Hence, to enumerate all results, we recursively traverse the computed dynamic programming lists  starting with the root  of underlying nice tree decomposition. The main idea is to ask the child  (or children ) of a node, how was a certain partial mapping  obtained during the computation. In other words, for each partial mapping  in  of interest, we will extract partial mappings  in  (or possibly also partial mappings  in ) that lead to the addition of  to . To preserve the colorfulness (and thus validity) of glued partial mappings, with a partial mapping  we also recursively pass a set of colors that can still be used in the choice of corresponding partial mappings to glue with . By applying this approach recursively and by gluing the partial information together in mappings, we obtain all possible ways an empty mapping in  could have been obtained—which is exactly the same as all possible ways a pattern graph can be mapped to the original graph. For the efficiency of the computation, we process each node only once and we thus recursively pass all partial mappings of interest and their remaining colorsets at once.

Formally, we will consider a recursive call of function , where , is a set of pairs , is a partial mapping and  is a color subset . This function returns a list  containing, for each pair in , a list of mappings that lead to the appearance of in .

In detail, the function proceeds as follows. If node  has a child , we first scan  for pairs , in which  and  and colorsets  and  are in some sense consistent. Pairs satisfying these conditions are then recursively passed to  in a list , from which we obtain a list . In case of a join node, we perform this process for both children. After that, we construct the resulting list from recursively computed list(s) by an addition of a mapping of the introduced vertex (in the case of a introduce node) or by all possible combinations of mappings (in the case of a join node). In the end, we sometimes need to flatten the resulting list, so all resulting partial mappings corresponding to a pair are in a single list . We now describe the process and consistency conditions formally for each of the node types.

For a leaf node , we simply return a list  of single element lists  for each pair . Each such list contains a partial mapping .

For an introduce node and its child  in , we denote by  the vertex being introduced in , i.e., . We create  consisting of by setting, for each pair , and . Note that must be in , as otherwise would not be present in  (or equivalently in ). Then we obtain . Let be constructed from and  be the corresponding list of . We construct the list  corresponding to by extending each by .

For a forget node and its child  in , we denote by  the vertex being forgotten in , i.e., . We add to  for each all from  such that and . Then we obtain . We construct  from  as follows. We do not need to modify any  in any , as the computation in forget node adds no new information to the mapping. However, as there may have been multiple (say ) pairs constructed from a single , we flatten all lists obtained from such pairs into a single list  corresponding to .

For a join node , we denote by  and  its children in . In this case, we create  and  consisting of and , respectively. For each , we find all and , such that , and and add them to  and , respectively. Then we obtain and . Let and , …, and be the pairs constructed from single and let and be the corresponding lists in  and , respectively. We obtain  corresponding to  as a union of lists , where  contains the union of  and  for each and each .

5 Experimental Results

The testing was performed on a 64-bit linux system with Intel Xeon CPU E3-1245v6@3.70GHz and 32GB 1333MHz DDR3 SDRAM memory. The module was compiled with gcc compiler (version 7.3.1) with -O3 optimizations enabled. Implementation and instances utilized in the testing are available at http://users.fit.cvut.cz/malikjo1/subiso/. All results are an average of 5 independent measurements.

We evaluated our implementation in several ways. Firstly, we compare available implementations on two different real world source graphs and a set of more-or-less standard target graph patterns. Secondly, we compare available implementations on instances from ICPR2014 Contest on Graph Matching Algorithms for Pattern Search in Biological Databases [11] with suitably small patterns. We also adapt the idea of testing the algorithms on Erdős-Rényi random graphs [25].

5.1 Algorithm Properties and Performance

In the first two subsection we used two different graphs of various properties as target graph . The first instance, Images, is built from an segmented image, and is a courtesy of [29]. It consists of vertices and edges. The second instance, Trans, is a graph of transfers on bank accounts. It is a very sparse network, which consists of vertices and undirected edges.

For the pattern graphs, we first use a standard set of basic graph patterns, as the treewidth of such graphs is well known and allows a clear interpretation of the results. In particular, we use paths, stars, cycles, an complete graphs on vertices, denoted , , , and with treewidth 1, 1, 2, and , respectively. We further used grids on vertices, with treewidth . Secondly, we use a special set of pattern graphs in order to demonstrate performance on various patterns. Patterns , , , and have , , , and vertices, , , , edges, and treewidth , , , and , respectively. Patterns , , and appear in both dataset, pattern in neither and pattern is disconnected. Description of these pattern graphs is shown in Table 1 and their illustration is shown in Figure 5.

Pattern Description
path on vertices
star on vertices
cycle on vertices
complete graph on vertices
grid graph on vertices
Pattern subgraph of both datasets with vertices, edges
Pattern subgraph of both datasets with vertices, edges
Pattern graph with vertices, edges
Pattern disconnected graph with vertices, edges
Table 1: Pattern graphs.

(a) Pattern

(b) Pattern

(c) Pattern

(d) Pattern
Figure 5: Pattern graph illustration

Due to randomization, in order to achieve some preselected constant error rate, we need to repeat the computation more than once. The number of found results thus depends not only on the quality of the algorithm, but also on the choice of the number of its repetitions. Hence, it is logical to measure performance of the single run of the algorithm. Results from such a testing, however, should be still taken as a rough average, because the running time of a single run of the algorithm depends on many factors.

Therefore, we first present measurements, where we average the results of many single runs of the algorithm. We average not only the time and space needed, but also the number of found subgraphs. To obtain the expected time needed to run the whole algorithm, it suffices to sum the time needed to create a nice tree decomposition and  times the time required for a single run, if there are  runs in total.

Pattern Comp. time [ms] Comp. memory [MB] Occurrences
240 12.73 3488.21
160 8.52 732.46
90 10.54 76.18
4 5.37 114.72
20 7.24 239.17
70 9.34 26.64
5 6.46 0
90 13.42 0
Pattern 80 9.14 292.48
Pattern 10 7.17 6.85
Pattern 10 5.30 0
Pattern 40 10.14 426.76
Table 2: Performance of a single run of the algorithm on Images dataset.
Pattern Comp. time [s] Comp. memory [MB] Occurrences
6.15 33.82 54572.94
0.32 34.40 562.54
0.17 34.83 91.49
0.11 30.11 15.32
0.24 33.84 0
0.09 27.91 0.72
2.16 48.68 0
Pattern 54.42 109.88 94145.82
Pattern 0.18 34.12 1127.11
Pattern 0.29 35.16 0
Pattern 0.25 47.54 178.29
Table 3: Performance of a single run of the algorithm on Trans dataset.

5.2 Comparison on Real World Graphs and Fixed Graph Patterns

We compare our implementation to three other tools for subgraph enumeration: RI algorithm [7] (as implemented in [22]), LAD algorithm [28] and color coding algorithm for weighted path enumeration [16] (by setting, for comparison purposes, all weights of edges to be equal). The comparison is done on the instances from previous subsection and only on pattern graphs which occur at least once in a particular target graph.

In comparison, note the following specifics of measured algorithms. The RI algorithm does not support outputting solutions, which might positively affect its performance. LAD algorithm uses adjacency matrix to store input graphs, and thus yields potentially limited use for graphs of larger scale. Neither of RI or LAD algorithms supports enumeration of disconnected patterns.444 When dealing with disconnected patterns, one could find the components of the pattern one by one, omitting the vertices of the host graph used by the previous component. However, this would basically raise the running time of the algorithm to the power equal to the number of components of the pattern graph. Also we did not measure the running time of the weighted path algorithm on non-path queries and also on Trans dataset, as its implementation is limited to graph sizes of at most .

We run our algorithm repeatedly to achieve an error rate of . In order to be able to measure the computation for larger networks with many occurrences of the pattern, we measure only the time required to retrieve no more than first solutions and we also consider running time greater than 10 minutes (600 seconds) as a timeout. Since we study non-induced occurrences (and due to automorphisms) there might be several ways to map the pattern to the same set of vertices. Other measured algorithms do count all of them. Our algorithm can behave also like this, or can be switched to count only occurrences that differ in vertex sets. For the sake of equal measurement, we use the former version of our algorithm.

Pattern Our algorithm RI algorithm LAD algorithm Weighted path
31.12 0.11 28.86 362.41
53.17 1.25 13.63
104.30 3.7 8.18
0.94 0.07 0.43
4.98 0.14 35.18
151.25 3.44 174.27
Pattern 43.11 0.82 36.60
Pattern 91.93 0.41 0.83
Pattern 23.54
Table 4: Comparison of running time on Images dataset (in seconds).
Pattern Our algorithm RI algorithm LAD algorithm Weighted path
11.64 2.53 59.57
44.72 4.77 34.00
295.11 24.11 28.98
19.25 0.56 24.58
4.38 0.70 2.24
Pattern 61.36 11.85 52.77
Pattern 23.91 1.63 31.67
Pattern 481.25
Table 5: Comparison of running time on Trans dataset (in seconds).

From Table 4 and Table 5, we can see that RI algorithm outperforms all other measured algorithms. We can also say our algorithm is on par with LAD algorithm, as the results of comparison of running times are similar, but vary instance from instance. Our algorithm nevertheless clearly outperforms another color coding algorithm, which on one hand solves more complicated problem of weighted paths, but on the another, is still limited only to paths. Also, our algorithm is the only algorithm capable of enumerating disconnected patterns.

The weak point of the color coding approach (or possibly only of our implementation) appears to be the search for a pattern of larger size with very few (or possibly zero) occurrences. To achieve the desired error rate, we need to repeatedly run the algorithm many times. Therefore our algorithm takes longer time to run on some instances (especially close to zero-occurrence ones), which are easily solved by the other algorithms.

5.3 ICPR2014 Contest Graphs

To fully benchmark our algorithm without limitations on time or number of occurrences found, we perform a test on ICPR2014 Contest on Graph Matching Algorithms for Pattern Search in Biological Databases [11].

In particular, we focus our attention on a Molecules dataset, containing 10,000 (target) graphs representing the chemical structures of different small organic compounds and on a Proteins dataset, which contains 300 (target) graphs representing the chemical structures of proteins and protein backbones. Target graphs in both datasets are sparse and up to 99 vertices or up 10,081 vertices for Molecules and Proteins, respectively.

In order to benchmark our algorithm without limiting its number of iterations, we focus on pattern graphs of small sizes, which offer reasonable number of iterations for an error rate of . Both datasets contain 10 patterns for each of considered sizes constructed by randomly choosing connected subgraphs of the target graphs. We obtained an average matching time of all pattern graphs of a given size to all target graphs in a particular dataset.

Targets Pattern size Our algorithm LAD algorithm RI algorithm
Molecules 4 0.01 0.01 0.01
Molecules 8 0.67 0.14 0.01
Proteins 8 19.45 8.83 0.51
Table 6: Comparison of average running time on ICPR2014 graphs

From the results in Table 6, we can see our algorithm being on par with LAD algorithm, while being outperformed by RI algorithm. However, we mainly include these results as a proof of versatility of our algorithm. As discussed in [25], benchmarks created by constructing subgraphs of target graphs do not necessarily encompass the possible hardness of some instances and might even present a distorted view on algorithms’ general performance. Thus, in the following benchmark we opt to theoretically analyze our algorithm.

5.4 Erdős-Rényi Graph Setup

In order to precisely analyze the strong and weak points of our algorithm we measure its performance is a setting where both the pattern and the target are taken as an Erdős-Rényi random graph of fixed size with varying edge density and compare the performance of our algorithm with the analysis of McCreesh et al. [25], which focused on algorithms Glasgow, LAD, and VF2.

An Erdős-Rényi graph is a random graph on vertices where each edge is included in the graph independently at random with probability

. We measure the performance on target graph of 150 vertices and pattern graph of 10 vertices with variable edge probabilities. As our algorithm cannot be classified in terms of search nodes used (as in

[25]), we measure the time needed to complete 10 iterations of our algorithm.

Figure 6: Behavior for target graph of 150 vertices and pattern graph of 10 vertices. The x-axis is the pattern edge probability, the y-axis is the target edge probability, from 0 to 1 with step of 0.03. Graph shows the time required for our algorithm to complete 10 iterations (the darker, the more time is required). Black regions indicate instances on which a timeout of 600 s occurred.

From Fig. 8

we can see our algorithm indeed follows a well observed phase transition (transition between instances without occurrence of the pattern and with many occurrences of the pattern). If we compare our results from Fig. 

6 to the results of [25], we can see that hard instances for our algorithm start to occur later (in terms of edge probabilities). However, due to the almost linear dependency of treewidth on edge probabilities (see Fig. 7), hard instances for our algorithm concentrate in the “upper right corner” of the diagram, which contains dense graphs with naturally large treewidth.

Figure 7: Correspondence of treewidth to the edge probability of a pattern graph with 10 vertices.
Figure 8: Time needed to complete 10 iterations of our algorithm on a target graph of 150 vertices with edge probability of 0.5 and pattern graph of 10 vertices with variable edge probability.
Figure 9: Time needed to complete 10 iterations of our algorithm on a target graph of 150 vertices with edge probability of 0.8 and pattern graph of 10 vertices with variable edge probability.

Therefore, it seems that our algorithm complements the portfolio of algorithms studied by Kotthoff et al. [20] by an algorithm suitable just below the phase transition (in view of Fig. 6).

6 Conclusion

We described an efficient implementation of the well known color coding algorithm for the subgraph isomorphism problem. Our implementation is the first color-coding based algorithm capable of enumerating all occurrences of patterns of treewidth larger than one. Moreover, we have shown that our implementation is competitive with existing state-of-the-art solutions in the setting of locating small pattern graphs. As it exhibits significantly different behaviour than other solutions, it can be an interesting contribution to the portfolio of known algorithms [20, 25].

As an obvious next step, the algorithm could be made to run in parallel. We also wonder whether the algorithm could be significantly optimized even further, possibly using some of the approaches based on constraint programming.

References

  • [1] Alon, N., Dao, P., Hajirasouliha, I., Hormozdiari, F., Sahinalp, S.: Biomolecular network motif counting and discovery by color coding. Bioinformatics 24, 241–249 (2008)
  • [2] Alon, N., Yuster, R., Zwick, U.: Color-coding. J. ACM 42(4), 844–856 (1995)
  • [3] Amini, O., Fomin, F.V., Saurabh, S.: Counting subgraphs via homomorphisms. SIAM J. Discrete Math. 26(2), 695–717 (2012)
  • [4] Arnborg, S., Corneil, D.G., Proskurowski, A.: Complexity of finding embeddings in a k-tree. SIAM J. Algebraic Discrete Methods 8(2), 277–284 (1987)
  • [5] Bodlaender, H.L., Fomin, F.V., Koster, A.M.C.A., Kratsch, D., Thilikos, D.M.: On exact algorithms for treewidth. ACM Trans. Algorithms 9(1), 12:1–12:23 (2012)
  • [6] Bodlaender, H.L., Koster, A.M.C.A.: Treewidth computations I. Upper bounds. Inf. Comput. 208(3), 259–275 (2010)
  • [7] Bonnici, V., Giugno, R., Pulvirenti, A., Shasha, D., Ferro, A.: A subgraph isomorphism algorithm and its application to biochemical data. BMC Bioinformatics 14, 1–13 (2013)
  • [8] Carletti, V., Foggia, P., Saggese, A., Vento, M.: Introducing VF3: A new algorithm for subgraph isomorphism. In: GbRPR 2017. LNCS, vol. 10310, pp. 128–139 (2017)
  • [9] Carletti, V., Foggia, P., Vento, M.: Performance comparison of five exact graph matching algorithms on biological databases. In: ICIAP 2013. LNCS, vol. 8158, pp. 409–417. Springer (2013)
  • [10] Carletti, V., Foggia, P., Vento, M.: VF2 plus: An improved version of VF2 for biological graphs. In: GbRPR 2015. LNCS, vol. 9069, pp. 168–177. Springer (2015)
  • [11] Carletti, V., Foggia, P., Vento, M., Jiang, X.: Report on the first contest on graph matching algorithms for pattern search in biological databases. In: GbRPR 2015. LNCS, vol. 9069, pp. 178–187. Springer (2015)
  • [12] Cordella, L.P., Foggia, P., Sansone, C., Vento, M.: Performance evaluation of the VF graph matching algorithm. In: 10th International Conference on Image Analysis and Processing, ICIAP 1999. pp. 1172–1177. IEEE Computer Society (1999)
  • [13] Cordella, L.P., Foggia, P., Sansone, C., Vento, M.: A (sub)graph isomorphism algorithm for matching large graphs. IEEE Trans. Pattern Anal. Mach. Intell. 26(10), 1367–1372 (2004)
  • [14] Cygan, M., Fomin, F.V., Kowalik, L., Lokshtanov, D., Marx, D., Pilipczuk, M., Pilipczuk, M., Saurabh, S.: Parameterized Algorithms. Springer (2015)
  • [15]

    Dahm, N., Bunke, H., Caelli, T., Gao, Y.: Efficient subgraph matching using topological node feature constraints. Pattern Recognition

    48(2), 317 – 330 (2015)
  • [16] Hüffner, F., Wernicke, S., Zichner, T.: Algorithm engineering for color-coding with applications to signaling pathway detection. Algorithmica 52(2), 114–132 (2008)
  • [17] Impagliazzo, R., Paturi, R.: On the complexity of k-SAT. J. Comput. Syst. Sci. 62(2), 367–375 (2001)
  • [18] Karp, R.M.: Reducibility among combinatorial problems. In: Symposium on the Complexity of Computer Computations, COCO 1972. pp. 85–103. The IBM Research Symposia Series, Plenum Press, New York (1972)
  • [19] Kimmig, R., Meyerhenke, H., Strash, D.: Shared memory parallel subgraph enumeration. In: 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). pp. 519–529. IEEE Computer Society (2017)
  • [20] Kotthoff, L., McCreesh, C., Solnon, C.: Portfolios of subgraph isomorphism algorithms. In: LION 10. LNCS, vol. 10079, pp. 107–122. Springer (2016)
  • [21] Kuramochi, M., Karypis, G.: Frequent subgraph discovery. In: 2001 IEEE International Conference on Data Mining. pp. 313–320. IEEE Computer Society (2001)
  • [22] Leskovec, J., Sosič, R.: Snap: A general-purpose network analysis and graph-mining library. ACM Trans. on Intel. Systems and Technology (TIST) 8(1),  1 (2016)
  • [23]

    Marx, D.: Can you beat treewidth? Theory of Computing

    6(1), 85–112 (2010)
  • [24] McCreesh, C., Prosser, P.: A parallel, backjumping subgraph isomorphism algorithm using supplemental graphs. In: CP 2015. LNCS, vol. 9255, pp. 295–312. Springer (2015)
  • [25] McCreesh, C., Prosser, P., Solnon, C., Trimble, J.: When subgraph isomorphism is really hard, and why this matters for graph databases. J. Artif. Intell. Res. 61, 723–759 (2018)
  • [26] Slota, G.M., Madduri, K.: Fast approximate subgraph counting and enumeration. In: ICPP 2013. pp. 210–219. IEEE Computer Society (2013)
  • [27] Slota, G.M., Madduri, K.: Parallel color-coding. Parallel Computing 47, 51 – 69 (2015)
  • [28] Solnon, C.: AllDifferent-based filtering for subgraph isomorphism. Artif. Intell. 174(12-13), 850–864 (2010)
  • [29] Solnon, C., Damiand, G., de la Higuera, C., Janodet, J.C.: On the complexity of submap isomorphism and maximum common submap problems. Pattern Recognition 48(2), 302 – 316 (2015)
  • [30] Tamaki, H.: Positive-instance driven dynamic programming for treewidth. In: ESA 2017. LIPIcs, vol. 87, pp. 68:1–68:13. Schloss Dagstuhl (2017)
  • [31] Ullmann, J.R.: An algorithm for subgraph isomorphism. J. ACM 23(1), 31–42 (1976)
  • [32]

    Ullmann, J.R.: Bit-vector algorithms for binary constraint satisfaction and subgraph isomorphism. J. Exp. Algorithmics

    15, 1.6:1.1–1.6:1.64 (2011)
  • [33] Zhao, Z., Khan, M., Kumar, V.S.A., Marathe, M.V.: Subgraph enumeration in large social contact networks using parallel color coding and streaming. In: ICPP 2010. pp. 594–603. IEEE Computer Society (2010)