cliquematch: Finding correspondence via cliques in large graphs

11/30/2021
by   Gautham Venkatasubramanian, et al.

The maximum clique problem finds applications in computer vision, bioinformatics, and network analysis, many of which involve the construction of correspondence graphs to find similarities between two given objects. cliquematch is a Python package designed for this purpose: it provides a simple framework to construct correspondence graphs, and implements, in C++, an algorithm to find and enumerate maximum cliques. It can process graphs of a few million edges on consumer hardware, with performance comparable to publicly available methods.


1 Introduction

Given an undirected graph $G$, a subgraph $C$ of $G$ is a clique if an edge exists between any two vertices in $C$. A clique in $G$ is a maximum clique if there exist no cliques of a larger size. The maximum clique problem bomze1999 is a special case of the maximal clique problem: a clique is maximal if it is not properly contained in any other clique, therefore all maximum cliques are also maximal. It is also related to the problem of clique enumeration.
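The difference between a maximal and a maximum clique can be checked directly on a small graph. The following is a minimal sketch in plain Python; the toy graph and the helper names `is_clique` and `is_maximal_clique` are illustrative assumptions and are not part of cliquematch.

```python
from itertools import combinations


def is_clique(adj, vertices):
    """True if every pair of the given vertices is joined by an edge."""
    return all(v in adj[u] for u, v in combinations(vertices, 2))


def is_maximal_clique(adj, vertices):
    """True if the vertices form a clique that no other vertex can extend."""
    members = set(vertices)
    if not is_clique(adj, members):
        return False
    # a clique is extendable if some outside vertex is adjacent to all members
    return not any(members <= adj[w] for w in adj if w not in members)


# Toy graph: a triangle {1, 2, 3} sharing vertex 3 with a 4-clique {3, 4, 5, 6}.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (3, 5), (3, 6), (4, 5), (4, 6), (5, 6)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

triangle = {1, 2, 3}        # maximal, but not maximum
four_clique = {3, 4, 5, 6}  # maximal and maximum
```

Here the triangle is maximal because no fourth vertex is adjacent to all of 1, 2, and 3, yet it is not maximum because a clique of size four exists.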

Finding cliques in a graph is applicable to a variety of domains, such as bioinformatics, robotics, forensics, image analysis etc. conte2004. The applications transform into a clique problem in two general ways. In pradalier2003, vertices of the graph refer to elements of a dataset, and an edge indication function computes a relationship between every pair of elements. Alternatively, in horaud1989, a correspondence graph is constructed to find similar substructures between two different objects; vertices correspond to potential mappings between similar elements of the two objects, and edges reinforce the mappings.

cliquematch (https://github.com/ahgamut/cliquematch) is a Python package designed to construct correspondence graphs and find maximum cliques. It implements a modified version of a well-known maximum clique algorithm pattabiraman2015 in C++, and uses template programming to provide a simple framework for constructing correspondence graphs. cliquematch makes use of the pybind11 library to provide Python bindings. The remainder of this section provides a simple example using cliquematch. Section 2 describes the algorithm used by cliquematch when finding maximum cliques, and Section 3 shows how problems of data association can be converted into finding maximum cliques in a graph. Section 4 provides examples showcasing how cliquematch can be used to solve such kinds of problems. Section 5 summarizes the properties of cliquematch and discusses future directions.

1.1 Basic Usage

The core functionality of cliquematch involves loading an undirected graph and finding a clique. The graph can be loaded from edge lists, adjacency matrices, adjacency lists, and text files that follow the Matrix Market Coordinate Text File format (https://math.nist.gov/MatrixMarket/formats.html).

```python
import cliquematch

G = cliquematch.Graph.from_file("cond-mat-2003.mtx")
print(G)
# cliquematch.core.Graph object at 0x559e7da730c0
# (n_vertices=31163,n_edges=120029,lower_bound=1,upper_bound=4294967295,
# time_limit=1,use_heuristic=False,use_dfs=True,search_done=False)
G.get_max_clique()
# [9986, 9987, 10066, 10068, 10071, 10072, 10074, 10076,
# 10077, 10078, 10079, 10080, 10081, 10082, 10083, 10085,
# 10287, 10902, 10903, 10904, 10905, 10906, 10907, 10908, 10909]
```
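For readers unfamiliar with the Matrix Market coordinate format, the sketch below shows how such a file maps to an undirected edge list. It is a simplified illustration in plain Python, not the parser used by cliquematch: the function name `read_mtx_edges` is hypothetical, and it assumes a well-formed file whose first non-comment line is the size line and whose entries use 1-based indices.

```python
def read_mtx_edges(lines):
    """Parse Matrix Market coordinate text into (num_vertices, edge list)."""
    # lines starting with '%' are the banner or comments
    rows = [ln.split() for ln in lines if ln.strip() and not ln.startswith("%")]
    n_rows, n_cols, _ = (int(x) for x in rows[0])  # size line: rows cols entries
    edges = set()
    for entry in rows[1:]:
        u, v = int(entry[0]), int(entry[1])  # any value column is ignored
        if u != v:  # skip self-loops
            edges.add((min(u, v), max(u, v)))
    return max(n_rows, n_cols), sorted(edges)


sample = """%%MatrixMarket matrix coordinate pattern symmetric
% a triangle plus a pendant vertex
4 4 4
2 1
3 1
3 2
4 3
""".splitlines()

num_vertices, edges = read_mtx_edges(sample)
```

Storing each edge as a sorted pair means an entry and its mirror image collapse into a single undirected edge.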

The search for a clique can be modified by: (a) setting the bounds on clique size (via lower_bound and upper_bound), (b) choosing to use the heuristic method, the depth-first search, or both (via use_heuristic and use_dfs), and (c) setting a time limit for the search (via time_limit).

The search for maximum cliques can be interrupted and resumed by using search_done and get_max_clique() in a loop, which is useful for incremental searching in the case of dense graphs. reset_search() resets the search for maximum cliques in case different bounds are required.

```python
G.reset_search()
while not G.search_done:
    answer = G.get_max_clique(
        lower_bound=1,
        upper_bound=1729,
        use_heuristic=True,
        use_dfs=True,
        time_limit=100,
        continue_search=True,
    )
```

The all_cliques() method can be used to obtain all cliques of a particular size from the graph. all_cliques() does not find all the cliques at once; the cliques are discovered upon the user's repeated requests.

```python
import cliquematch

G = cliquematch.Graph.from_file("cond-mat-2003.mtx")
for clique in G.all_cliques(size=24):
    print(clique)
```
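The idea of discovering cliques only on request can be illustrated with a Python generator. This is a sketch of the general technique under the assumption of a small adjacency-set graph; the function `cliques_of_size` is hypothetical and does not reflect how cliquematch implements all_cliques() internally.

```python
def cliques_of_size(adj, size):
    """Lazily yield every clique with exactly `size` vertices, once each."""

    def extend(clique, candidates):
        if len(clique) == size:
            yield list(clique)
            return
        for v in sorted(candidates):
            # keep only larger-numbered common neighbours, so that each
            # clique is produced exactly once, in increasing vertex order
            nxt = {w for w in candidates & adj[v] if w > v}
            if len(clique) + 1 + len(nxt) >= size:  # prune hopeless branches
                yield from extend(clique + [v], nxt)

    yield from extend([], set(adj))


# Toy graph: a triangle {1, 2, 3} sharing vertex 3 with a 4-clique {3, 4, 5, 6}.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (3, 5), (3, 6), (4, 5), (4, 6), (5, 6)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

gen = cliques_of_size(adj, 3)
first = next(gen)   # work is done only up to the first clique
rest = list(gen)    # the remaining cliques are found when requested
```

Because the generator suspends after each `yield`, a caller that needs only a few cliques never pays for the full enumeration.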

2 The Maximum Clique Problem - A Literature Review

The maximum clique problem is NP-Hard garey1979computers, and many algorithms for computing an exact solution have been discovered. These usually involve a (possibly optimal) vertex ordering and fast heuristic bounds on the maximum clique size, followed by branch-and-bound: performing a depth-first search from each vertex to find cliques, and pruning the search space to avoid unnecessary calculations. The earliest such algorithm carraghan1990 sorts vertices in ascending order of degree, with search steps being pruned if they cannot beat the current maximum. ostergaard2002 sorts vertices in descending order, and processes them in a defined sequence for better performance. MCQ tomita2003 first sorts vertices in descending order, and uses an approximate coloring for additional sorting of the vertices, which also helps pruning in the clique search. A later version tomita2010 improves on the approximate coloring used so as to maximize pruning.
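The basic branch-and-bound scheme described above can be sketched in a few lines. The version below is a simplified illustration in the spirit of carraghan1990, written in plain Python for small graphs; it is not the code of any of the cited implementations, and the function name `max_clique_bnb` is an assumption.

```python
def max_clique_bnb(adj):
    """Exact maximum clique by depth-first branch-and-bound."""
    best = []

    def search(clique, candidates):
        nonlocal best
        if len(clique) > len(best):
            best = list(clique)
        while candidates:
            # bound: even taking every candidate cannot beat the incumbent
            if len(clique) + len(candidates) <= len(best):
                return
            v = candidates.pop()
            # branch: cliques containing v can only use neighbours of v
            search(clique + [v], [w for w in candidates if w in adj[v]])

    # sorted by descending degree; pop() takes from the end of the list,
    # so vertices are branched on in ascending order of degree
    order = sorted(adj, key=lambda v: len(adj[v]), reverse=True)
    search([], order)
    return sorted(best)


# Toy graph: a triangle {1, 2, 3} sharing vertex 3 with a 4-clique {3, 4, 5, 6}.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (3, 5), (3, 6), (4, 5), (4, 6), (5, 6)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

result = max_clique_bnb(adj)
```

Once a vertex has been popped it is never reconsidered in that branch, so every clique is explored from exactly one of its vertices.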

More recent algorithms for finding maximum cliques focus on massive sparse graphs; these may require specialized hardware, and attempt to use the parallel nature of the problem. FMC pattabiraman2015 prunes vertices with degree less than the current maximum clique size as early as possible, and ignores vertices that have already been processed; it also provides a degree-based heuristic method to obtain a lower bound on the maximum clique size. PMC rossi2015 uses the core-number seidman1983network of a vertex instead of the degree, which provides a tight lower and upper bound, thereby pruning the search space more effectively, and provides a parallel-friendly implementation based on OpenMP openmp1998. BBMC segundo2011 uses bitstrings of 64-bit machine words to encode the adjacency matrix and vertex sets, to benefit from bit-parallelism in set operations. BBMCSP segundo2016 defines a sparse encoding for bitstrings; it also unrolls the initial search step to avoid unnecessary recursive calls. It is interesting to note that pruning methods based on heuristics are not optimal for some kinds of real-world graphs; RMC lu2017 describes a probabilistic algorithm for finding maximum cliques, with examples and benchmarks showcasing potential limitations in heuristic-based methods.
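To make the core-number bound concrete: a vertex in a clique of size s has at least s - 1 neighbours inside that clique, so its core number is at least s - 1, and the maximum clique size is at most the largest core number plus one. The sketch below computes core numbers by repeatedly peeling a minimum-degree vertex. It is a simple quadratic-time illustration in plain Python, not the implementation used by PMC, and `core_numbers` is a hypothetical name.

```python
def core_numbers(adj):
    """Core number of each vertex, by repeatedly peeling a minimum-degree vertex."""
    degree = {v: len(adj[v]) for v in adj}
    remaining = set(adj)
    core, k = {}, 0
    while remaining:
        v = min(remaining, key=degree.get)
        k = max(k, degree[v])  # core numbers never decrease during peeling
        core[v] = k
        remaining.remove(v)
        for w in adj[v]:
            if w in remaining:
                degree[w] -= 1
    return core


# Toy graph: a triangle {1, 2, 3} sharing vertex 3 with a 4-clique {3, 4, 5, 6}.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (3, 5), (3, 6), (4, 5), (4, 6), (5, 6)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

core = core_numbers(adj)
clique_upper_bound = max(core.values()) + 1
```

On this toy graph vertex 3 has degree 5 but core number 3, which shows why the core number can give a tighter bound than the degree.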

The algorithm used in cliquematch is mostly similar to FMC: the depth-first search and the heuristic method both filter out vertices based on degree, and the search space is pruned based on potential to beat the current maximum. cliquematch also uses bitstrings compressed into 32-bit machine words, similar to BBMC, to represent the vertex sets during the clique search. However, cliquematch differs from FMC in three ways:

  • Instead of filtering out completed vertices (FMC Pruning 2) and neighbors of a vertex with lesser degree than the current maximum clique size (FMC Pruning 3), cliquematch filters out all neighbors of a vertex $v$ whose degree is less than $d(v)$, the degree of $v$. This means that every maximum clique is now found using its vertex of least degree, and the search is amortized over all the vertices, which reduces reliance on vertex ordering.

  • The heuristic method returns a clique instead of just the lower bound. This helps users to obtain a clique quickly in case the branch-and-bound method is too slow.

  • The branch-and-bound method is repurposed to also provide clique enumeration. This allows users to find all cliques of a given size, which is useful because there might be multiple maximum cliques. The clique enumeration is done in a lazy manner; new cliques are found incrementally upon request.
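The first two differences can be illustrated together with a greedy, degree-based heuristic that returns an actual clique. This is a sketch of the idea only, written in plain Python under the assumption that ties are broken by vertex label; it is not the C++ code in cliquematch, and the names `greedy_clique` and `heuristic_clique` are hypothetical.

```python
def greedy_clique(adj, v):
    """Greedily grow a clique around v, using only neighbours of degree >= deg(v)."""

    def deg(u):
        return len(adj[u])

    # degree filter: neighbours of lesser degree than v are discarded
    candidates = {w for w in adj[v] if deg(w) >= deg(v)}
    clique = [v]
    while candidates:
        # pick the highest-degree candidate, breaking ties by label
        u = max(candidates, key=lambda w: (deg(w), w))
        clique.append(u)
        candidates = (candidates & adj[u]) - {u}
    return sorted(clique)


def heuristic_clique(adj):
    """Return the largest clique found by one greedy pass from every vertex."""
    return max((greedy_clique(adj, v) for v in adj), key=len)


# Toy graph: a triangle {1, 2, 3} sharing vertex 3 with a 4-clique {3, 4, 5, 6}.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (3, 5), (3, 6), (4, 5), (4, 6), (5, 6)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

from_hub = greedy_clique(adj, 3)     # vertex 3 has the highest degree
from_member = greedy_clique(adj, 4)  # vertex 4 has least degree in the 4-clique
found = heuristic_clique(adj)
```

Starting from vertex 3 yields only the single vertex, because all of its neighbours have lesser degree; the 4-clique is instead recovered from one of its least-degree members, which matches the behaviour described in the first bullet.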

The performance of cliquematch on various benchmark graphs is comparable to existing C++ implementations, see Appendix A.

3 Correspondence Graphs

Finding maximum cliques in graphs can be applied to data association problems, where the aim is to find similarity between two objects by comparing their components. Such problems are found in bioinformatics gardiner1997, robotics pradalier2003, forensics fingerprint1999, image analysis horaud1989 etc. These problems can be solved by constructing a correspondence graph, a general form of the association graph kozen1978clique used for subgraph isomorphism. Given two objects $P$ and $Q$, the correspondence graph constructs the largest possible correspondence between $P$ and $Q$ by extending mappings in a pairwise manner.

Figure 1: A sample correspondence graph for sets of points in two dimensions. The corresponding points are marked in red, green, blue, and purple. The vertices of the correspondence graph which refer to these are marked with the same colors. Note that the edges between the colored vertices form a maximum clique, and the configurations of the corresponding points (thicker lines) are the same in both $P$ and $Q$.
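Definition 1 (Correspondence Graph).

Let $P$ and $Q$ be two sets of elements of length $n_1$ and $n_2$ respectively. Let $G = (V, E)$ be an undirected graph, where $V = P \times Q$. An edge is drawn between $(p_{i_1}, q_{j_1})$ and $(p_{i_2}, q_{j_2})$ iff, for a given boolean function $f$,

$$f(p_{i_1}, p_{i_2}, q_{j_1}, q_{j_2}) = 1 \tag{1}$$

Finding a maximum clique $C$ in $G$ is equivalent to finding the largest correspondence between $P$ and $Q$, as shown in the following step-by-step argument.

(a) $C \subseteq V = P \times Q$.

(b) Let $P' \subseteq P$, and $Q' \subseteq Q$ be such that $P' = \{p : (p, q) \in C\}$ and $Q' = \{q : (p, q) \in C\}$.

(c) $C$ is a clique, so there exists an edge between every pair of vertices in $C$. Remember that an edge can be drawn only if Equation 1 is satisfied, therefore

$$f(p_{i_1}, p_{i_2}, q_{j_1}, q_{j_2}) = 1 \quad \text{for all distinct } (p_{i_1}, q_{j_1}),\, (p_{i_2}, q_{j_2}) \in C, \text{ with } p_{i_1}, p_{i_2} \in P' \text{ and } q_{j_1}, q_{j_2} \in Q'$$

because every vertex in $C$ is a pair of elements, one from $P'$ and the other from $Q'$. Hence, there exists a correspondence between $P'$ and $Q'$.

(d) $C$ is a maximum clique, so there exists no clique in $G$ that is larger than $C$. Thus, $P'$ and $Q'$ are subsets of $P$ and $Q$ having the largest possible correspondence.

Note that $f$ requires two pairs of elements, $(p_{i_1}, q_{j_1})$ and $(p_{i_2}, q_{j_2})$, when constructing an edge of the correspondence graph, and thus there is a pairwise correspondence between elements of $P'$ and $Q'$. $f$ can be optimized to benefit from properties of $P$ and $Q$. A common use case is when $P$ and $Q$ are point-clouds in an $n$-dimensional space (see Figure 1), in which case the function $f$ can be:

$$f(p_{i_1}, p_{i_2}, q_{j_1}, q_{j_2}) = \begin{cases} 1 & \text{if } \lvert d_P(p_{i_1}, p_{i_2}) - d_Q(q_{j_1}, q_{j_2}) \rvert < \epsilon \\ 0 & \text{otherwise} \end{cases}$$

where $d_P$ is a distance metric on $P$, $d_Q$ is a distance metric on $Q$, and $\epsilon$ is a small positive real number. Therefore, the edge construction rule in Equation 1 is modified to:

$$\lvert d_P(p_{i_1}, p_{i_2}) - d_Q(q_{j_1}, q_{j_2}) \rvert < \epsilon \tag{2}$$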
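The distance-based edge rule in Equation 2 can be tried out on a small pair of point sets. The following is a minimal sketch in plain Python, assuming Euclidean distances, a strict inequality, and the extra constraint that no point is mapped twice; the names `correspondence_edges` and `max_clique` are hypothetical, and the brute-force clique search is suitable only for tiny inputs. It does not use cliquematch.

```python
from itertools import combinations
from math import dist


def correspondence_edges(P, Q, eps):
    """Adjacency sets over vertices (i, j), pairing point i of P with point j of Q."""
    vertices = [(i, j) for i in range(len(P)) for j in range(len(Q))]
    adj = {v: set() for v in vertices}
    for (i1, j1), (i2, j2) in combinations(vertices, 2):
        if i1 == i2 or j1 == j2:
            continue  # assumption: a point may be mapped at most once
        # edge rule: the two mappings must preserve the pairwise distance
        if abs(dist(P[i1], P[i2]) - dist(Q[j1], Q[j2])) < eps:
            adj[(i1, j1)].add((i2, j2))
            adj[(i2, j2)].add((i1, j1))
    return adj


def max_clique(adj):
    """Exact maximum clique by depth-first search with a simple size bound."""
    best = []

    def search(clique, candidates):
        nonlocal best
        if len(clique) > len(best):
            best = clique
        for k, v in enumerate(candidates):
            rest = [w for w in candidates[k + 1:] if w in adj[v]]
            if len(clique) + 1 + len(rest) > len(best):
                search(clique + [v], rest)

    search([], sorted(adj))
    return best


# Q holds the first three points of P translated by (10, 5), plus one outlier.
P = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0), (7.0, 7.0)]
Q = [(10.0, 5.0), (13.0, 5.0), (10.0, 9.0), (20.0, 0.0)]
mapping = max_clique(correspondence_edges(P, Q, eps=1e-6))
```

Each vertex in the resulting clique is an index pair, so the clique reads directly as the list of matched points; the unmatched fourth point of each set does not appear.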

4 Applications

cliquematch can construct correspondence graphs where $P$ and $Q$ are either 2D numpy arrays or Python lists, via the below classes:

  • A2AGraph, where $P$ and $Q$ are numpy arrays

  • L2LGraph, where $P$ and $Q$ are lists of arbitrary objects, and

  • A2LGraph and L2AGraph, for cases that may require mapping a list of objects to numpy arrays of related data.

The user is required to define the function $f$ or the metrics $d_P$, $d_Q$ for cliquematch to perform the construction of the graph. cliquematch uses pybind11 for Python wrappers, so one can define $f$, $d_P$, and $d_Q$ as regular Python functions or Callable objects for fast prototyping. Note that accessing elements of $P$ and $Q$ is done only within these functions.

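```python
import numpy as np


def euclidean(P, i1, i2):
    # Euclidean distance between rows i1 and i2 of the array P
    return np.sqrt(np.sum((P[i1] - P[i2]) ** 2))


class MyCustomCondition(object):
    def __call__(self, P, i1, i2, Q, j1, j2):
        # placeholder test; replace with the actual condition on the two mappings
        my_condition_works = (i1 != i2) and (j1 != j2)
        if my_condition_works:
            return True
        return False
```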

Once $G$ has been constructed as per the given conditions, cliquematch searches for cliques (the search parameters can be defined as per Subsection 1.1) and returns the subsets with largest correspondence, as seen in the following examples.

4.1 Image Registration and Matching using interest points

Image registration can be converted into point-cloud registration by selecting a suitable function to obtain interest points, following which simple distance metrics can be used to construct a correspondence graph as in Equation 2. Once a maximum clique has been found, the sets of corresponding points can be used to obtain a matching score, or to perform a registration of the image.

CCMM segundo2015 performs feature matching for color images by computing SURF descriptors bay2006 to obtain interest points. The algorithm to construct a correspondence graph can be described as follows:

  • $P$, $Q$ are the sets of SURF keypoints in the first and second images.

  • $d_P$ and $d_Q$ are the Euclidean metrics.

  • Additionally, apply a condition function that allows an edge from $(p_{i_1}, q_{j_1})$ to $(p_{i_2}, q_{j_2})$ if and only if $q_{j_1}$ is one of the top $k$ SURF descriptor matches of $p_{i_1}$ and $q_{j_2}$ is one of the top $k$ SURF descriptor matches of $p_{i_2}$, where $k$ is some integer.

  • The correspondence graph is constructed using the distance metrics $d_P$ and $d_Q$, for some appropriate values of $\epsilon$ and $k$.
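The descriptor-based condition in the steps above can be sketched as a callable with the same argument order as the condition class shown in Section 4. This is an illustration in plain Python, not the CCMM example script: the class name `TopKCondition`, the helper `top_k_matches`, and the use of squared Euclidean distance between descriptors are all assumptions, and the one-dimensional descriptors stand in for real SURF vectors.

```python
def top_k_matches(desc_P, desc_Q, k):
    """For each descriptor in desc_P, the indices of its k nearest descriptors in desc_Q."""

    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        set(sorted(range(len(desc_Q)), key=lambda j: sqdist(d, desc_Q[j]))[:k])
        for d in desc_P
    ]


class TopKCondition:
    """Callable taking (P, i1, i2, Q, j1, j2), like the condition class above."""

    def __init__(self, desc_P, desc_Q, k):
        # precompute the matches once, so each call is a pair of set lookups
        self.matches = top_k_matches(desc_P, desc_Q, k)

    def __call__(self, P, i1, i2, Q, j1, j2):
        # both mappings i1 -> j1 and i2 -> j2 must be plausible descriptor matches
        return j1 in self.matches[i1] and j2 in self.matches[i2]


# Toy one-dimensional descriptors for three keypoints in each image.
desc_P = [(0.0,), (1.0,), (5.0,)]
desc_Q = [(0.1,), (0.9,), (5.2,)]
condition = TopKCondition(desc_P, desc_Q, k=1)
```

Precomputing the top matches keeps the condition cheap, which matters because it is evaluated for every candidate edge of the correspondence graph.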

Figure 2: The CCMM algorithm tested on images from the Dinosaur dataset vggdatasets. The points in red are the corresponding points.

An implementation of the above steps using cliquematch.A2AGraph and OpenCV is available on GitHub (https://github.com/ahgamut/cliquematch/blob/master/examples/ccmm.py); a sample result is shown in Figure 2. A similar procedure can be followed for registering or matching a pair of images, based on the kinds of interest points extracted:

  • fingerprint1999 compares fingerprint images by selecting corresponding pairs of minutiae: the vertices of the correspondence graph are mappings of minutiae, and edges are drawn with respect to a function computing angle, distance, and ridge counts.

  • park2020 extracts SURF points and computes maximum cliques on a correspondence graph to perform alignment of footwear outsole impressions.

  • theiler2012 performs registration of laser scans by computing tie points, and uses description vectors for each point along with the Euclidean distance metric to ensure construction of a sparse graph. The correspondence computed via finding maximum cliques is used to register the scans.

4.2 Matching of Molecular Structures

The structure of molecules can be represented as an attributed graph, and therefore matching the 3-D structures of two different molecules can be converted into finding a clique in their correspondence graph. gardiner1997 provides a procedure for structure matching of molecules via a correspondence graph, which can be described as follows (implementation available at https://github.com/ahgamut/cliquematch/blob/master/examples/molecule.py):

  • $P$, $Q$ are the sets of atoms in the first and second molecules to be matched.

  • $d_P$ and $d_Q$ are the Euclidean metrics to measure inter-atomic distances.

  • Additionally, apply a condition function that allows an edge if and only if a bond exists between the pairs of atoms being mapped.

  • The correspondence graph is constructed using the distance metrics $d_P$ and $d_Q$, for some appropriate value of $\epsilon$. The condition function can be modified to account for additional properties (e.g. match ring bonds to ring bonds, valence).

Figure 3: Molecule structure matching using inter-atomic distances. The molecules were obtained from the datasets provided in sutherland2003.

An illustrative example using cliquematch is shown in Figure 3. A similar procedure is followed for matching protein molecules using their secondary structure elements (SSEs). butenko2006 gives an overview of applying clique-based methods in biochemistry.

4.3 Subgraph Isomorphism

Finding a subgraph isomorphism between two graphs can be solved by constructing a correspondence graph as described in kozen1978clique:

  • Let $G_1$ and $G_2$ be simple, unweighted, undirected graphs such that $G_1$ is isomorphic to a subgraph of $G_2$. The vertices of the graphs are the sets $V_1$ and $V_2$.

  • Define a correspondence graph $G = (V, E)$ where $V = V_1 \times V_2$.

  • Define a boolean condition function to construct edges in $G$, based on the adjacency of the mapped vertices in $G_1$ and $G_2$ (see kozen1978clique).

  • Once $G$ has been constructed, finding a maximum clique will give the vertices of the isomorphic subgraphs.

cliquematch provides an IsoGraph class which encapsulates the above functionality.
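One common way to realize such a construction is sketched below in plain Python. It assumes the induced variant of the problem, where two mappings are compatible when they preserve both adjacency and non-adjacency; the exact condition used by kozen1978clique or by IsoGraph may differ. The names `association_graph` and `max_clique` are hypothetical, and the brute-force search is meant only for tiny graphs.

```python
from itertools import combinations


def association_graph(adj1, adj2):
    """Adjacency sets over vertices (u, v), mapping u in the first graph to v in the second."""
    adj = {(u, v): set() for u in adj1 for v in adj2}
    for (u1, v1), (u2, v2) in combinations(sorted(adj), 2):
        # assumption: compatible mappings preserve adjacency and non-adjacency
        if u1 != u2 and v1 != v2 and ((u2 in adj1[u1]) == (v2 in adj2[v1])):
            adj[(u1, v1)].add((u2, v2))
            adj[(u2, v2)].add((u1, v1))
    return adj


def max_clique(adj):
    """Exact maximum clique by depth-first search with a simple size bound."""
    best = []

    def search(clique, candidates):
        nonlocal best
        if len(clique) > len(best):
            best = clique
        for k, v in enumerate(candidates):
            rest = [w for w in candidates[k + 1:] if w in adj[v]]
            if len(clique) + 1 + len(rest) > len(best):
                search(clique + [v], rest)

    search([], sorted(adj))
    return best


# First graph: a triangle. Second graph: a triangle a-b-c with a pendant vertex d.
g1 = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}}
g2 = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
mapping = max_clique(association_graph(g1, g2))
```

A clique whose size equals the number of vertices in the first graph maps every vertex of that graph, so here the three matched pairs cover the triangle a-b-c and leave the pendant vertex unmatched.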

5 Conclusions and Future Work

I have described the capabilities of cliquematch, a Python package that finds maximum cliques in large sparse graphs, and shown that its performance is comparable to other publicly available methods. I also provided examples showing that cliquematch can be used to solve data association problems in different domains, by constructing a correspondence graph and finding a maximum clique.

Multiple aspects of cliquematch can be developed further: the core search algorithm can be modified to find maximum cliques in a weighted graph. The computation time can be improved with better heuristics, like vertex coreness and approximate coloring. The construction of correspondence graphs is applicable to many problem domains. There is also scope for clique augmentation or lenient matching methods (para-cliques) using the cliquematch design, and for providing GPU or streaming-specific implementations.

Acknowledgments

Some of this work was done at the National Institute of Standards and Technology (NIST) in support of the Forensic Footwear Research Project. I would like to thank Dr. Martin Herman, Dr. Steve Lund, and Dr. Hari Iyer of NIST for helpful discussions.

Appendix A Performance

| Graph | Nodes | Edges | Max clique size | cliquematch time | FMC time | PMC time | cliquematch heuristic time | cliquematch heuristic size | FMC heuristic time | FMC heuristic size |
|---|---|---|---|---|---|---|---|---|---|---|
| Erdos02 | 6927 | 8472 | 7 | **0.0003** | 0.0013 | 0.0022 | 0.0002 | 6 | 0.0009 | 7 |
| Erdos972 | 5488 | 7085 | 7 | **0.0002** | 0.0007 | 0.0015 | 0.0002 | 7 | 0.0002 | 6 |
| Erdos982 | 5822 | 7375 | 7 | **0.0004** | 0.0005 | 0.0015 | 0.0002 | 7 | 0.0002 | 7 |
| Erdos992 | 6100 | 7515 | 8 | **0.0002** | 0.0004 | 0.0017 | 0.0002 | 8 | 0.0002 | 8 |
| Fault_639 | 638802 | 14626683 | 18 | 21.3265 | **14.456** | - | 1.2731 | 18 | 2.4945 | 18 |
| brock200_2 | 200 | 9876 | 12 | 0.4408 | 0.6408 | **0.0018** | 0.0031 | 10 | 0.0023 | 9 |
| c-fat200-5 | 200 | 8473 | 58 | 0.1485 | 0.4204 | **0.0003** | 0.0004 | 58 | 0.0106 | 58 |
| ca-AstroPh | 18772 | 198110 | 57 | **0.0024** | 0.0802 | 0.0137 | 0.0068 | 57 | 0.0286 | 57 |
| ca-CondMat | 23133 | 93497 | 26 | **0.001** | 0.0048 | 0.0083 | 0.0016 | 26 | 0.004 | 26 |
| ca-GrQc | 5242 | 14496 | 44 | **0.0002** | 0.0005 | 0.0018 | 0.0002 | 44 | 0.0011 | 44 |
| ca-HepPh | 12008 | 118521 | 239 | **0.0041** | 0.0138 | 0.016 | 0.0045 | 239 | 0.2589 | 239 |
| ca-HepTh | 9877 | 25998 | 32 | **0.0002** | 0.0007 | 0.0036 | 0.0001 | 32 | 0.0001 | 32 |
| caidaRouterLevel | 192244 | 609066 | 17 | **0.0672** | 0.2193 | 0.0784 | 0.0258 | 17 | 0.0723 | 15 |
| coPapersCiteseer | 434102 | 16036720 | 845 | **0.028** | 0.8812 | 1.8326 | 0.0501 | 845 | 16.2965 | 845 |
| com-Youtube | 1134890 | 2987624 | 17 | **2.467** | 10.6184 | - | 0.2301 | 16 | 0.3597 | 13 |
| cond-mat-2003 | 31163 | 120029 | 25 | **0.0024** | 0.013 | 0.0104 | 0.0032 | 25 | 0.0054 | 25 |
| cti | 16840 | 48232 | 3 | 0.0283 | 0.0052 | **0.0048** | 0.0058 | 3 | 0.0014 | 3 |
| hamming6-4 | 64 | 704 | 4 | 0.0018 | **0.0005** | 6.1989 | 0.0001 | 4 | 6.6042 | 4 |
| johnson8-4-4 | 70 | 1855 | 14 | 0.6535 | 0.1582 | **0.0003** | 0.0005 | 14 | 0.0005 | 14 |
| keller4 | 171 | 9435 | 11 | **10.8767** | 15.7847 | - | 0.0022 | 9 | 0.0027 | 9 |
| loc-Brightkite | 58228 | 214078 | 37 | 7.0798 | 2.9293 | **0.0271** | 0.0077 | 36 | 0.0155 | 31 |

Table 1: Comparing cliquematch performance on some benchmark graphs. The Nodes and Edges columns give the number of nodes and edges in the graph, and the Max clique size column gives the size of the maximum clique found. All the branch-and-bound methods agreed on the maximum clique size in every benchmark. The cliquematch, FMC, and PMC time columns give the time taken by each program in the branch-and-bound search: the least time is in bold text. The heuristic size and heuristic time columns give the size of the clique found and the time taken by the heuristic method in cliquematch, and similarly for FMC. A minus sign (-) indicates that the program returned an error without completing the calculation.

The benchmark graphs were obtained from the Stanford SNAP collection snapnets, the University of Florida Sparse Matrix collection ufsparse, and the DIMACS Challenges (dimacs2, dimacs10). I thank the authors of FMC (http://cucis.ece.northwestern.edu/projects/MAXCLIQUE/download.html) and PMC (https://github.com/ryanrossi/pmc) for making their source code publicly available.

I used gcc 7.5.0 to compile the programs at optimization level -O3. For cliquematch I set the BENCHMARKING flag to 1 before compilation. I compiled and tested the programs on a 64-bit Ubuntu 18.04 system with Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz and 4GB RAM. The following command line parameters were used:

  • fmc -t 0 -p was used to run the FMC branch-and-bound algorithm.

  • pmc -t 1 -r 1 -a 0 -h 0 -d (single CPU thread, reduce wait time of 1 second, full algorithm, skip heuristic, search in descending order) was used to run the PMC branch-and-bound algorithm.

  • fmc -t 1 was used to run the FMC heuristic algorithm.

  • A small Python script similar to the code block in Subsection 1.1 was used to run the cliquematch algorithms.