Learning Multi-Stage Sparsification for Maximum Clique Enumeration

09/12/2019 ∙ by Marco Grassia, et al. ∙ University of Catania

We propose a multi-stage learning approach for pruning the search space of maximum clique enumeration, a fundamental computationally difficult problem arising in various network analysis tasks. In each stage, our approach learns the characteristics of vertices in terms of various neighborhood features and leverages them to prune the set of vertices that are likely not contained in any maximum clique. Furthermore, we demonstrate that our approach is domain independent: the same small set of features works well on graph instances from different domains. Compared to the state-of-the-art heuristics and preprocessing strategies, the advantages of our approach are that (i) it does not require any estimate on the maximum clique size at runtime and (ii) we demonstrate it to be effective also for dense graphs. In particular, for dense graphs, we typically prune around 30 % of the vertices, resulting in speedups of up to 53 times for state-of-the-art solvers while generally preserving the size of the maximum clique (though some maximum cliques may be lost). For large real-world sparse graphs, we routinely prune over 99 % of the vertices, resulting in several tenfold speedups at best, typically with no impact on solution quality.




1 Introduction

A large number of optimization problems in diverse domains such as data mining, decision-making, planning, routing and scheduling are computationally hard (i.e., NP-hard). No efficient polynomial-time algorithms are known for these problems that can solve every instance of the problem to optimality, and many researchers consider that such algorithms may not even exist. A common way to deal with such optimization problems is to design heuristics that leverage the structure in real-world instance classes for these problems. This is a time-consuming process where algorithm engineers and domain experts have to identify the key characteristics of the instance classes and carefully design algorithms that perform well on instances with those characteristics.

In recent years, researchers have started exploring if machine learning techniques can be used to (i) automatically identify characteristics of the instance classes and (ii) learn algorithms specifically leveraging those characteristics. In particular, recent advances in deep learning and graph convolutional networks have been used in an attempt to directly learn the output of an optimization algorithm based on small training examples (see e.g., [Vinyals et al.2015, Bello et al.2016, Nowak et al.2017]). These approaches have shown promising early results on some optimization problems such as the Travelling Salesman Problem (TSP). However, there are two fundamental challenges that limit the widespread adoption of these techniques: (i) the requirement of large amounts of training data, whose generation requires solving the NP-hard optimization problem on numerous instances, and (ii) the resultant lack of scalability (most of the reported results are on small test instances).

Recently, [Lauri and Dutta2019] proposed a probabilistic preprocessing framework to address the above challenges. Instead of directly learning the output of the NP-hard optimization problem, their approach learns to prune away a part of the input. The reduced problem instance can then be solved with exact algorithms or constraint solvers. Because their approach merely needs to learn the elements of the input it can confidently prune away, it needs significantly less training. This also enables it to scale to larger test instances. They considered the problem of maximum clique enumeration and showed that on sparse real-world instances, their approach pruned 75-98% of the vertices. Despite the conceptual novelty, the approach still suffered from (i) poor pruning on dense instances, (ii) poor accuracy on larger synthetic instances and (iii) non-transferability of training models across domains. In this paper, we build upon their work and show that we can achieve a significantly better accuracy-pruning trade-off, both on sparse and dense graphs, as well as cross-domain generalizability using a multi-stage learning methodology.

Maximum clique enumeration

We consider the maximum clique enumeration (MCE) problem, where the goal is to list all maximum (as opposed to maximal) cliques in a given graph. The maximum clique problem is one of the most heavily-studied combinatorial problems arising in various domains such as in the analysis of social networks [Faust and Wasserman1995, Fortunato2010, Palla et al.2005, Papadopoulos et al.2012], behavioral networks [Bernard et al.1979], and financial networks [Boginski et al.2005]. It is also relevant in clustering [Stix2004, Yang et al.2016] and cloud computing [Wang et al.2014, Yao et al.2013]. The listing variant of the problem, MCE, is encountered in computational biology [Abu-Khzam et al.2005, Eblen et al.2012, Yeger-Lotem et al.2004, Bomze et al.1999] in problems like the detection of protein-protein interaction complexes, clustering protein sequences, and searching for common cis-regulatory elements [Baldwin et al.2004].

It is NP-hard to even approximate the maximum clique problem within $n^{1-\epsilon}$ for any $\epsilon > 0$ [Zuckerman2006]. Furthermore, unless an unlikely collapse occurs in complexity theory, the problem of identifying if a graph of $n$ vertices has a clique of size $k$ is not solvable in time $f(k) n^{o(k)}$ for any function $f$ [Chen et al.2006]. As such, even small instances of this problem can be non-trivial to solve. Further, under reasonable complexity-theoretic assumptions, there is no polynomial-time algorithm that preprocesses an instance of $k$-clique to have only $f(k)$ vertices, where $f$ is any computable function depending solely on $k$ (see e.g., [Cygan et al.2015]). These results indicate that it is unlikely that an efficient preprocessing method for MCE exists that can reduce the size of the input instance drastically while guaranteeing to preserve all the maximum cliques. In particular, it is unlikely that polynomial-time sparsification methods (see e.g., [Batson et al.2013]) would be applicable to MCE. This has led researchers to focus on heuristic pruning approaches.

A typical preprocessing step in a state-of-the-art solver is the following: (i) quickly find a large clique (say of size $k$), (ii) compute the core number of each vertex of the input graph $G$, and (iii) delete every vertex of $G$ with core number less than $k-1$. This can be equivalently achieved by repeatedly removing all vertices with degree less than $k-1$. For example, the solver pmc [Rossi et al.2015], which is regarded as “the leading reference solver” [San Segundo et al.2016], uses this as its only preprocessing method. However, there are two major downsides to this preprocessing step. First, it is crucially dependent on $k$, the size of the large clique found. Since the maximum clique size is NP-hard to approximate within a factor of $n^{1-\epsilon}$, maximum clique estimates with no formal guarantees are used. Second and more important, it is typical that even if the estimate $k$ was equal to the size of a maximum clique in $G$, the core number of most vertices could be considerably higher than $k-1$. This is particularly true in the case of dense graphs and it results in little or no pruning of the search space. Similarly, other preprocessing strategies (see e.g., [Eblen2010] for more discussion) depend on NP-hard estimates of specific graph properties and are not useful for pruning dense graphs.
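The degree-based form of this preprocessing step can be sketched as follows. This is a minimal illustration and not the implementation used by pmc; the function name `prune_by_degree` and the adjacency-dictionary representation are assumptions made for the example.

```python
from collections import deque


def prune_by_degree(adj, k):
    """Repeatedly remove vertices of degree < k - 1.

    adj: dict mapping each vertex to the set of its neighbors (assumed format).
    k:   size of a clique already found (an estimate of the clique number).
    Returns the set of vertices that survive the pruning.
    """
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    removed = set()
    queue = deque(v for v, d in degree.items() if d < k - 1)
    while queue:
        v = queue.popleft()
        if v in removed:
            continue
        removed.add(v)
        # Removing v lowers the degree of its remaining neighbors,
        # which may push them below the threshold as well.
        for u in adj[v]:
            if u not in removed:
                degree[u] -= 1
                if degree[u] < k - 1:
                    queue.append(u)
    return set(adj) - removed


# A triangle {0, 1, 2} with a path 2-3-4 attached: the path is peeled away.
sparse = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3}}
survivors_sparse = prune_by_degree(sparse, 3)

# A complete graph on 5 vertices, pruned with k = 3: every degree is 4,
# so nothing can be removed. Dense graphs behave similarly.
dense = {v: {u for u in range(5) if u != v} for v in range(5)}
survivors_dense = prune_by_degree(dense, 3)
```

On the sparse toy graph only the triangle survives, whereas on the dense toy graph no vertex is removed, which mirrors the second downside discussed above.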

Our Results

We demonstrate 30 % vertex pruning rates on average for dense networks, for which exact state-of-the-art methods are not able to prune anything, while typically only compromising the number of maximum cliques and not their size. For sparse networks, our preprocessor typically prunes well over 99 % of the vertices without compromising the solution quality. In both cases, these prunings result in speedups as high as several tenfold for state-of-the-art MCE solvers. For example, after the execution of our multi-stage preprocessor, we correctly list all the 196 maximum cliques (of size 24) in a real-world social network (socfb-B-anon) with 3 M vertices and 21 M edges in only 7 seconds of solver time, compared with 40 minutes of solver time with the current state-of-the-art preprocessor (see Table 3).

2 Preliminaries and Related Work

Let $G = (V, E)$ be an undirected simple graph. A clique is a subset $S \subseteq V$ such that every two distinct vertices of $S$ are adjacent. We say that the vertices of $S$ form a $k$-clique when $|S| = k$. The clique number of $G$, denoted by $\omega(G)$, is the size of a maximum clique in $G$. A $k$-coloring of $G$ is a function $c : V \to \{1, \ldots, k\}$. A coloring is a $k$-coloring for some $k \leq |V|$. A coloring $c$ is proper if $c(u) \neq c(v)$ for every edge $\{u, v\} \in E$. The chromatic number of $G$, denoted by $\chi(G)$, is the smallest $k$ such that $G$ has a proper $k$-coloring. It is easy to see that $\omega(G) \leq \chi(G)$, as at least $k$ colors are needed to color a $k$-clique. Finally, a $k$-core of a graph $G$ is a maximal subgraph of $G$ where every vertex in the subgraph has degree at least $k$ in the subgraph. The core number of a vertex $v$ is the largest $k$ for which a $k$-core containing $v$ exists.
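To make the definition of the core number concrete, the following sketch computes it for every vertex with the standard peeling procedure (repeatedly remove a vertex of minimum remaining degree). It is a simple quadratic-time illustration; the name `core_numbers` and the adjacency-dictionary input are assumptions of this example.

```python
def core_numbers(adj):
    """Return a dict mapping each vertex to its core number.

    adj: dict mapping each vertex to the set of its neighbors (assumed format).
    """
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Peel off a vertex of minimum remaining degree.
        v = min(remaining, key=degree.__getitem__)
        # The core number never decreases along the peeling order.
        k = max(k, degree[v])
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                degree[u] -= 1
    return core


# A triangle {0, 1, 2} with a path 2-3-4 attached.
example = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3}}
cores = core_numbers(example)
```

In this toy graph the triangle vertices have core number 2 and the path vertices have core number 1, consistent with the fact that every vertex of a $k$-clique has core number at least $k-1$.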

Machine learning and NP-hard problems

There has been work on using machine learning to help tackle hard problems with different approaches. Some solve a problem by augmenting existing solvers [Liang et al.2016], predicting a suitable solver to run for a given instance [Fitzgerald et al.2015, Loreggia et al.2016], or attempting to discover new algorithms [Khalil et al.2017]. In contrast, some methods address the problems more directly. Examples include approaches to TSP [Hopfield and Tank1985, Fort1988, Durbin and Willshaw1987], with recent work in [Vinyals et al.2015, Bello et al.2016, Nowak et al.2017].

Maximal clique enumeration

We note that there are algorithms [Eppstein et al.2010, Cheng et al.2011] for maximal clique enumeration, in contrast to our problem of maximum clique enumeration. The two sets of algorithms are required in very different applications, and the runtime of maximal clique enumeration is generally significantly higher.

Probabilistic preprocessing

Recently, [Lauri and Dutta2019] proposed a probabilistic preprocessing framework for fine-grained search space classification. It treats individual vertices of $G$ as classification problems, and the problem of learning a preprocessor reduces to that of learning a mapping $\gamma : V \to \{0, 1\}$ from a set of $L$ training examples $T = \{\langle f(v_i), y_i \rangle\}_{i=1}^{L}$, where $v_i \in V$ is a vertex, $y_i \in \{0, 1\}$ a class label, and $f : V \to \mathbb{R}^d$ a mapping from a vertex to a $d$-dimensional feature space. To learn the mapping $\gamma$ from $T$, a probabilistic classifier $P$ is used which outputs a probability distribution over $\{0, 1\}$ for a given $f(u)$ for $u \in V$. Then, on input graph $G$, all vertices from $G$ that are predicted by $P$ to not be in a solution with probability at least $q$ (for some confidence threshold $q$) are pruned away. Here, $q$ trades off the pruning rate with the accuracy of the pruning.
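The pruning step of this framework can be sketched as below. The classifier is replaced by a precomputed dictionary of probabilities, so the names `prune_vertices` and `prob_not_in_solution`, as well as the example values, are assumptions made purely for illustration.

```python
def prune_vertices(adj, prob_not_in_solution, q):
    """Drop every vertex predicted to be outside all solutions with probability >= q.

    adj:                  dict mapping each vertex to the set of its neighbors.
    prob_not_in_solution: dict mapping each vertex to the classifier's estimated
                          probability that it is in no maximum clique
                          (a stand-in for the output of a trained classifier).
    q:                    confidence threshold in (0, 1].
    Returns the adjacency dict of the induced subgraph on the kept vertices.
    """
    keep = {v for v in adj if prob_not_in_solution[v] < q}
    return {v: adj[v] & keep for v in keep}


graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
# Hypothetical classifier output: vertex 3 is very likely not in a maximum clique.
probabilities = {0: 0.10, 1: 0.20, 2: 0.40, 3: 0.97}

cautious = prune_vertices(graph, probabilities, q=0.95)   # removes only vertex 3
aggressive = prune_vertices(graph, probabilities, q=0.30)  # also removes vertex 2
```

A higher threshold prunes fewer vertices but is less likely to remove a vertex of a maximum clique; a lower threshold prunes more at a higher risk of error.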

This framework showed that there is potential for learning a heuristic preprocessor for instance size pruning. However, the speedups obtained were limited and the training models were not transferable across domains. We build upon this work and show that we can achieve cross-domain generalizability and considerable speedups, both on sparse and dense graphs, using a multi-stage learning methodology.

3 Proposed framework

In this section, we introduce our multi-stage preprocessing approach and then give the features that we use for pruning.

Multi-stage sparsification

A major difficulty with the probabilistic preprocessing described above is that when training on sparse graphs, the learnt model focused too heavily on pruning out the easy cases, such as low-degree vertices and not on the difficult cases like vertices with high degree and high core number. To improve the accuracy on difficult vertices, we propose a multi-stage sparsification approach. In each stage, the approach focuses on gradually harder cases that were difficult to prune by the classifier in earlier stages.

Let $\mathcal{G}$ be the input set of networks. Consider a graph $G = (V, E) \in \mathcal{G}$. Let $M$ be the set of all maximum cliques of $G$, and denote by $V(M)$ the set of all vertices in $M$. The positive examples in the training set $T_1$ consist of all vertices that are in some maximum clique ($V(M)$) and the negative examples are the ones in the set $V \setminus V(M)$. Since the training dataset can be highly skewed, we under-sample the larger class to achieve balanced training data. A probabilistic classifier $P_1$ is trained on the balanced training data in stage 1. Then, in the next stage, we remove all vertices that were predicted by $P_1$ to be in the negative class with a probability above a predefined threshold $q$. We focus on the set $\mathcal{G}'$ of subgraphs (of graphs in $\mathcal{G}$) induced on the remaining vertices and repeat the above process. The positive examples in the training set $T_2$ consist of all vertices in some maximum clique ($V(M)$) and the negative examples are the ones in the set $V' \setminus V(M)$, where $V'$ is the set of remaining vertices. The training dataset is again balanced by under-sampling, and we use that balanced dataset to learn the probabilistic classifier $P_2$. We repeat the process for $\ell$ stages.

As we show later, the multi-stage sparsification results in significantly more pruning compared to a single-stage probabilistic classifier.
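The two ingredients of the multi-stage procedure, balancing the training set by under-sampling and applying one classifier per stage, might be sketched as follows. Each stage classifier is modeled as a plain callable returning the probability of the negative class; this interface, the function names, and the toy degree-based "classifiers" are all assumptions of the example rather than the models used in the experiments.

```python
import random


def balanced_training_set(vertices, positives, rng):
    """Under-sample the larger class so both classes have the same size."""
    pos = [v for v in vertices if v in positives]
    neg = [v for v in vertices if v not in positives]
    size = min(len(pos), len(neg))
    return rng.sample(pos, size), rng.sample(neg, size)


def multi_stage_prune(adj, stage_classifiers, q):
    """Apply the stage classifiers one after another.

    stage_classifiers: list of callables (graph, vertex) -> probability that
                       the vertex belongs to the negative class (hypothetical
                       interface standing in for trained models).
    q:                 confidence threshold used at every stage.
    """
    current = {v: set(nbrs) for v, nbrs in adj.items()}
    for predict_negative in stage_classifiers:
        # Features are recomputed on the current, already pruned graph.
        keep = {v for v in current if predict_negative(current, v) < q}
        current = {v: current[v] & keep for v in keep}
    return current


# Toy graph: a 4-clique {0, 1, 2, 3}, vertex 4 attached to 0 and 1, vertex 5 to 4.
toy = {0: {1, 2, 3, 4}, 1: {0, 2, 3, 4}, 2: {0, 1, 3}, 3: {0, 1, 2},
       4: {0, 1, 5}, 5: {4}}
# Toy stand-ins for trained classifiers: later stages target harder vertices.
stages = [
    lambda g, v: 1.0 if len(g[v]) <= 1 else 0.0,
    lambda g, v: 1.0 if len(g[v]) <= 2 else 0.0,
]
pruned = multi_stage_prune(toy, stages, q=0.9)
pos_sample, neg_sample = balanced_training_set(range(10), {0, 1, 2}, random.Random(0))
```

Vertex 4 survives the first stage but becomes an easy case once vertex 5 is gone, which illustrates why recomputing features on the pruned graph lets later stages remove vertices that an earlier stage could not.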

Figure 1: While the shown proper 3-coloring is optimal, we can swap the non-white colors in either triangle to see that .

Graph-theoretic features

We use the following graph-theoretic features: (F1) number of vertices, (F2) number of edges, (F3) vertex degree, (F4) local clustering coefficient (LCC), and (F5) eigencentrality.

The crude information captured by features (F1)-(F3) provides a reference for the classifier for generalizing to different distributions from which the graph might have been generated. Feature (F4), the LCC of a vertex, is the fraction of its neighbors with which the vertex forms a triangle, encapsulating the well-known small world phenomenon. Feature (F5), eigencentrality, represents a high degree of connectivity of a vertex to other vertices, which in turn have high degrees as well. The eigenvector centrality $x$ is the eigenvector of the adjacency matrix $A$ with the largest eigenvalue $\lambda$, i.e., it is the solution of $A x = \lambda x$. The $i$th entry of $x$ is the eigencentrality of vertex $v_i$. In other words, this feature provides a measure of local “denseness”. A vertex in a dense region shows higher probability of being part of a large clique.
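Features (F4) and (F5) can be illustrated with a small pure-Python sketch. The LCC below follows the common definition (fraction of pairs of neighbors that are adjacent), and the eigencentrality is approximated by power iteration on the adjacency matrix plus the identity, which has the same eigenvectors. Both choices, and the function names, are assumptions of this example; the experiments rely on igraph instead.

```python
import math
from itertools import combinations


def local_clustering(adj, v):
    """Fraction of pairs of neighbors of v that are themselves adjacent."""
    nbrs = adj[v]
    if len(nbrs) < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2.0 * links / (len(nbrs) * (len(nbrs) - 1))


def eigencentrality(adj, iterations=200):
    """Approximate the dominant eigenvector of the adjacency matrix.

    Iterating with A + I instead of A keeps the same eigenvectors while
    avoiding oscillation on bipartite graphs. Assumes a connected graph.
    """
    x = {v: 1.0 for v in adj}
    for _ in range(iterations):
        nxt = {v: x[v] + sum(x[u] for u in adj[v]) for v in adj}
        norm = math.sqrt(sum(val * val for val in nxt.values()))
        x = {v: val / norm for v, val in nxt.items()}
    return x


# A triangle {0, 1, 2} with a pendant vertex 3 attached to vertex 2.
paw = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
lcc = {v: local_clustering(paw, v) for v in paw}
centrality = eigencentrality(paw)
```

In this toy graph the pendant vertex has the lowest value for both features, while the triangle vertices score higher, in line with the intuition that both features measure local denseness.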

Statistical features

In addition, we use the following statistical features: (F6) the $\chi^2$ value over vertex degree, (F7) average $\chi^2$ value over neighbor degrees, (F8) $\chi^2$ value over LCC, and (F9) average $\chi^2$ value over neighbor LCCs.

The intuition behind (F6)-(F9) is that for a vertex $v$ present in a large clique, its degree and LCC would deviate from the underlying expected distribution characterizing the graph. Further, the neighbors of $v$ also present in the clique would demonstrate such behaviour. Indeed, statistical features have been shown to be robust in approximately capturing local structural patterns [Dutta et al.2017].

Statistical significance is captured by the notion of p-value [Read and Cressie1988], and well-estimated [Read and Cressie1989] by Pearson’s chi-square statistic, $\chi^2$, computed as $\chi^2 = \sum_{i} (O_i - E_i)^2 / E_i$, where $O_i$ and $E_i$ are the observed and expected number of occurrences of the possible outcomes $i$.
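As a small worked example of the statistic, the sketch below evaluates Pearson's chi-square for given observed and expected counts. How the observed and expected counts are derived from degrees and LCCs is not detailed here, so the numbers used are purely illustrative and the function name is an assumption.

```python
def pearson_chi_square(observed, expected):
    """Pearson's chi-square statistic: sum of (O_i - E_i)^2 / E_i."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Two outcomes, 10 observations: 8 and 2 observed where 5 and 5 were expected.
# (8 - 5)^2 / 5 + (2 - 5)^2 / 5 = 1.8 + 1.8 = 3.6
statistic = pearson_chi_square([8, 2], [5, 5])
```

The larger the deviation of the observed counts from the expected ones, the larger the statistic, which is what makes it a useful signal for vertices whose degree or LCC is atypical for the graph.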

Local chromatic density

Let $G = (V, E)$ be a graph. We define the local chromatic density of a vertex $v \in V$, denoted by $\chi_d(v)$, as the minimum ratio of the number of distinct colors appearing in the neighborhood $N(v)$ to the number of colors used, over all optimal proper colorings of $G$. Put differently, the local chromatic density of $v$ is the minimum possible number of colors in the immediate neighborhood of $v$ in any optimal proper coloring of $G$, normalized by $\chi(G)$ (see Figure 1).

We use the local chromatic density as the feature (F10). A vertex $v$ with high $\chi_d(v)$ means that the neighborhood of $v$ is dense, as it captures the adjacency relations between the vertices in $N(v)$. Thus, a vertex in such a dense region has a higher chance of belonging to a large clique.

However, the problem of computing $\chi_d(v)$ is computationally difficult. In the decision variant of the problem, we are given a graph $G$, a vertex $v$, and a ratio $r$. The task is to decide whether there is an optimal proper coloring of $G$ witnessing $\chi_d(v) \leq r$. The omitted proof is by a polynomial-time reduction from graph coloring.

Theorem 1.

Given a graph $G = (V, E)$, a vertex $v \in V$, and a ratio $r$, it is NP-hard to decide whether $\chi_d(v) \leq r$.

Despite its computational hardness, we can in practice compute $\chi_d(v)$ by a heuristic. Indeed, to compute $\chi_d(v)$ for every $v \in V$, we first compute a proper coloring for $G$ using e.g., the well-known linear-time greedy heuristic of [Welsh and Powell1967]. After a proper coloring has been computed, we compute the described ratio for every vertex from that.
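The heuristic can be sketched as follows: color the vertices greedily in order of decreasing degree (in the spirit of Welsh and Powell), then take, for each vertex, the number of distinct colors among its neighbors divided by the total number of colors used. The use of the open neighborhood and the function names are assumptions of this illustration.

```python
def greedy_coloring(adj):
    """Greedy proper coloring, visiting vertices by decreasing degree."""
    color = {}
    for v in sorted(adj, key=lambda u: len(adj[u]), reverse=True):
        used = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c  # smallest color not used by an already colored neighbor
    return color


def local_chromatic_density(adj, color):
    """Heuristic estimate: distinct neighbor colors / total colors used."""
    total = len(set(color.values()))
    return {v: len({color[u] for u in adj[v]}) / total for v in adj}


# A triangle {0, 1, 2} with a pendant vertex 3 attached to vertex 2.
paw = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
coloring = greedy_coloring(paw)
density = local_chromatic_density(paw, coloring)
```

Since the greedy coloring is not guaranteed to be optimal, the resulting values are only an approximation of the local chromatic density as defined above.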

Category   bio    soc    socfb  web    all
W/o        0.95   0.89   0.90   0.96   0.87
With       0.98   0.99   0.95   0.99   0.96

Table 1: The effect of introducing the feature (F10), the local chromatic density, into the feature set. The row “W/o” is the vertex classification accuracy of the classifier of [Lauri and Dutta, 2019] without (F10), while the row “With” is the same with (F10).

Learning over edges

Instead of individual vertices, we can view the framework also over individual edges. In this case, the goal is to find a mapping $\gamma' : E \to \{0, 1\}$, and the training set $T'$ contains feature vectors corresponding to edges instead of vertices. We also briefly explore this direction in this work.

Edge features

We use the following features (E1)-(E9) for an edge $\{u, v\} \in E$. (E1) Jaccard similarity is the number of common neighbors of $u$ and $v$ divided by the number of vertices that are neighbors of at least one of $u$ and $v$. (E2) Dice similarity is twice the number of common neighbors of $u$ and $v$, divided by the sum of their degrees. (E3) Inverse log-weighted similarity is the number of common neighbors of $u$ and $v$ weighted by the inverse logarithm of their degrees. (E4) Cosine similarity is the number of common neighbors of $u$ and $v$ divided by the geometric mean of their degrees. The next three features are inspired by the vertex features: (E5) average LCC over $u$ and $v$, (E6) average degree over $u$ and $v$, and (E7) average eigencentrality over $u$ and $v$. (E8) is the number of length-two paths between $u$ and $v$. Finally, we use (E9) local edge-chromatic density, i.e., the number of distinct colors on the common neighbors of $u$ and $v$ divided by the total number of colors used in any optimal proper coloring.
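A few of these edge features can be computed directly from neighbor sets, as in the sketch below. For (E3) the weights are taken to be the inverse logarithms of the degrees of the common neighbors, which is the usual reading of this similarity but remains an assumption here, as do the function name and the dictionary keys.

```python
import math


def edge_features(adj, u, v):
    """Similarity-based features for the edge {u, v}.

    adj: dict mapping each vertex to the set of its neighbors (assumed format).
    """
    common = adj[u] & adj[v]
    union = adj[u] | adj[v]  # neighbors of at least one endpoint
    du, dv = len(adj[u]), len(adj[v])
    return {
        "jaccard": len(common) / len(union),
        "dice": 2 * len(common) / (du + dv),
        # A common neighbor has degree >= 2, so the logarithm is positive.
        "inverse_log_weighted": sum(1 / math.log(len(adj[w])) for w in common),
        "cosine": len(common) / math.sqrt(du * dv),
        # Each common neighbor gives exactly one path of length two.
        "length_two_paths": len(common),
    }


# In a complete graph on 4 vertices every edge has 2 = n - 2 common neighbors.
k4 = {v: {u for u in range(4) if u != v} for v in range(4)}
features = edge_features(k4, 0, 1)
```

For an edge inside a clique, all of these values are comparatively high, which is the signal the edge classifier can exploit.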

The intuition behind (E1)-(E4) is well-established for community detection; see e.g., [Harenberg et al.2014] for more. For (E8), observe that the number of length-two paths is high when the edge is part of a large clique, and at most $n-2$ when $\{u, v\}$ is an edge of a complete graph on $n$ vertices. Notice that (E9) could be converted into a deterministic rule: the edge $\{u, v\}$ can be safely deleted if the common neighbors of $u$ and $v$ see less than $k-2$ colors in any proper coloring of the input graph $G$, where $k$ is an estimate for $\omega(G)$. To our best knowledge, such a rule has not been considered previously in the literature. Further, notice that there are situations in which this rule can be applied whereas the similar vertex rule uncovered from (F10) cannot. To see this, let $G$ be a graph consisting of two triangles $\{a, b, c\}$ and $\{x, y, z\}$, connected by an edge $\{a, x\}$, and let $k = 3$. The vertex rule can delete neither $a$ nor $x$, but the described edge rule removes $\{a, x\}$.
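The deterministic edge rule and its vertex counterpart can be checked on a small instance of the kind described above: two triangles joined by a single edge, with a clique-size estimate of 3. The vertex labels, the hand-picked proper coloring, and the function names are assumptions of this sketch; the vertex rule is taken to delete a vertex whose neighbors see fewer than $k-1$ colors.

```python
def edge_rule_deletions(adj, color, k):
    """Edges whose common neighbors see fewer than k - 2 colors."""
    deletable = []
    for u in adj:
        for v in adj[u]:
            if u < v:  # visit each undirected edge once
                seen = {color[w] for w in adj[u] & adj[v]}
                if len(seen) < k - 2:
                    deletable.append((u, v))
    return deletable


def vertex_rule_deletions(adj, color, k):
    """Vertices whose neighbors see fewer than k - 1 colors."""
    return [v for v in adj if len({color[u] for u in adj[v]}) < k - 1]


# Triangles {0, 1, 2} and {3, 4, 5}, connected by the edge {2, 3}.
two_triangles = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
                 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
# A proper 3-coloring chosen by hand for the example.
coloring = {0: 0, 1: 1, 2: 2, 3: 0, 4: 1, 5: 2}

removed_edges = edge_rule_deletions(two_triangles, coloring, k=3)
removed_vertices = vertex_rule_deletions(two_triangles, coloring, k=3)
```

The connecting edge has no common neighbors and is therefore removed by the edge rule, whereas every vertex still sees two colors in its neighborhood and survives the vertex rule.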

4 Experimental results

In this section, we describe how multi-stage sparsification is applied to the MCE problem and our computational results.

To allow for a clear comparison, we follow closely the definitions and practices specified in [Lauri and Dutta2019]. Thus, unless otherwise mentioned and to save space, we refer the reader to that work for additional details.

All experiments ran on a machine with Intel Core i7-4770K CPU (3.5 GHz), 8 GB of RAM, running Ubuntu 16.04.

Training and test data

All our datasets are obtained from Network Repository [Rossi and Ahmed2015] (available at http://networkrepository.com/).

For dense networks, we choose a total of 30 networks from various categories with the criterion that the edge density is at least 0.5 in each. We name this category “dense”. The test instances are in Table 2, chosen based on empirical hardness (i.e., they are solvable in a reasonable amount of time).
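The density criterion can be made explicit with a short sketch, assuming the usual definition of edge density for simple undirected graphs, $2m / (n(n-1))$. The function name is an assumption, and the edge count used in the example is the rounded figure listed for keller4 in Table 2.

```python
def edge_density(num_vertices, num_edges):
    """Fraction of all possible edges present in a simple undirected graph."""
    return 2 * num_edges / (num_vertices * (num_vertices - 1))


def is_dense(num_vertices, num_edges, threshold=0.5):
    """Apply the 'edge density at least 0.5' criterion."""
    return edge_density(num_vertices, num_edges) >= threshold


# keller4: 171 vertices and roughly 9.4 K edges (rounded value from Table 2).
keller4_density = edge_density(171, 9400)
```

With these rounded numbers the density of keller4 comes out at roughly 0.65, comfortably above the 0.5 threshold.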

For sparse networks, we choose our training data from four different categories: 31 biological networks (“bio”), 32 social networks (“soc”), 107 Facebook networks (“socfb”), and 13 web networks (“web”). In addition, we build a fifth category “all” that comprises all networks from the mentioned four categories. The test instances are in Table 3.

Feature computation

We implement the feature computation in C++, relying on the igraph [Csardi and Nepusz2006] C graph library. In particular, our feature computation is single-threaded with further optimization possible.

Domain oblivious training via local chromatic density

In [Lauri and Dutta2019], it was assumed that the classifier should be trained with networks coming from the same domain, and that testing should be performed on networks from that domain. However, we demonstrate in Table 1 that a classifier can be trained with networks from various domains, yet predictions remain accurate across domains (see column “all”). The accuracy is boosted considerably by the introduction of the local chromatic density (F10) into the feature set (see Table 1). In particular, when generalizing across various domains, the impact on accuracy is almost 10 %. For this reason, rather than focusing on network categories, we only consider networks by edge density (at least 0.5 or not).

State-of-the-art solvers for MCE

To our best knowledge, the only publicly available solvers able to list all maximum cliques are cliquer [Östergård2002], based on a branch-and-bound strategy, and MoMC [Li et al.2017], which introduces incremental maximum satisfiability reasoning to a branch-and-bound strategy (pmc [Rossi et al.2015], for instance, does not have this feature). We use these solvers in our experiments.

4.1 Dense networks

In this subsection, we show results for probabilistic preprocessing on dense networks (i.e., edge density at least 0.5).

Instance       |V|     |E|       $\omega$  n. $\omega$  $k$-core        1-stage         Pruning  cliquer        MoMC
                                                        V       E       V       E
brock200-1     200     14.8 K    21 (20)   2 (16)                       0.34    0.55    0.01     0.39 (53.07)   0.04 (44.57)
keller4        171     9.4 K     11*       2304 (37)                    0.30    0.50    0.01     0.01 (38.11)   0.02 (5.68)
keller5        776     226 K     27*       1000 (5)                     0.28    0.48    0.19     t/o            1421.24 (2.53)
p-hat300-3     300     33.4 K    36*       10*                          0.38    0.58    0.02     87.1 (9.12)    0.05 (6.00)
p-hat500-3     500     93.8 K    50*       62 (40)                      0.34    0.52    0.07     t/o            2.51 (5.98)
p-hat700-1     700     61 K      11*       2*                           0.36    0.47    0.03     0.08 (1.22)    0.05 (1.30)
p-hat700-2     700     121.7 K   44*       138*                         0.36    0.45    0.11     t/o            1.35 (—)
p-hat1000-1    1 K     122.3 K   10*       276 (165)                    0.36    0.47    0.08     0.86 (2.22)    0.71 (1.67)
p-hat1500-1    1.5 K   284.9 K   12 (11)   1 (376)                      0.33    0.43    0.25     13.18 (—)      3.2 (1.54)
fp             7.5 K   841 K     10*       1001*                        0.06    0.29    0.36     0.65 (—)       5.19 (1.13)
nd3k           9 K     1.64 M    70*       720*                         0.23    0.28    1.28     t/o            7.05 (1.09)
raefsky1       3.2 K   291 K     32*       613 (362)                    0.33    0.38    0.11     2.80 (—)       0.31 (1.36)
HFE18_96_in    4 K     993.3 K   20*       2*           1e-4    1e-4    0.26    0.27    0.49     58.88 (1.05)   4.30 (1.18)
heart1         3.6 K   1.4 M     200*      45 (26)      1e-4    1e-4    0.19    0.25    0.66     t/o            19.37 (—)
cegb2802       2.8 K   137.3 K   60*       101 (38)     0.09    0.04    0.39    0.46    0.09     0.05 (—)       0.15 (1.61)
movielens-1m   6 K     1 M       31*       147*         0.05    0.007   0.22    0.23    0.98     31.31 (—)      2.85 (1.14)
ex7            1.6 K   52.9 K    18*       199 (127)    0.02    0.01    0.26    0.28    0.04     0.01 (—)       0.1 (1.29)
Trec14         15.9 K  2.87 M    16*       99*          0.16    0.009   0.34    0.15    2.19     3.62 (—)       0.35 (—)

Table 2: Experiments for dense graphs. The column “$\omega$” is the max. clique size and the column “n. $\omega$” is the number of such cliques. In both, * means the quantity is preserved in the preprocessed instance; otherwise the new quantity is in parentheses. The multicolumns “$k$-core” and “1-stage” give the vertex pruning ratio (V) followed by the edge pruning ratio (E) when preprocessed by removing vertices of core number less than $\omega(G) - 1$ and by our preprocessor, respectively. For the last three columns, all runtimes are in seconds averaged over three independent runs. The column “Pruning” is the time for feature computation and pruning. The two remaining columns give the runtime of a solver on the pruned instance, with the speedup obtained in parentheses. We denote by t/o an execution killed after an hour, and — denotes no speedup.

Classification framework for dense networks

For training, we get 4762 feature vectors from our “dense” category. As a baseline, a 4-fold cross validation over this using logistic regression from [Lauri and Dutta2019] results in an accuracy of 0.73. We improve on this by obtaining an accuracy of 0.81 with gradient boosted trees (further details omitted), found with the help of auto-sklearn [Feurer et al.2015].

Search strategies

Given the empirical hardness of dense instances, one should not expect a very high accuracy with polynomial-time computable features such as (F1)-(F10). For this reason, we set the confidence threshold here.

The failure of $k$-core decomposition on dense graphs

It is common that widely-adopted preprocessing methods like the $k$-core decomposition cannot prune any vertices on a dense network $G$, even if they had the computationally expensive knowledge of $\omega(G)$. This is so because the degree of each vertex is higher than the maximum clique size $\omega(G)$.

We showcase precisely this poor behaviour in Table 2. For most of the instances, the $k$-core decomposition with the exact knowledge of $\omega(G)$ cannot prune any vertices. In contrast, the probabilistic preprocessor typically prunes around 30 % of the vertices and around 40 % of the edges.


Given that around 30 % of the vertices are removed, how many mistakes do we make? For almost all instances we retain the clique number, i.e., $\omega(G) = \omega(G')$, where $G'$ is the instance obtained by preprocessing (see column “$\omega$” in Table 2). In fact, the only exceptions are brock200-1 and p-hat1500-1, for which $\omega(G') \geq \omega(G) - 1$ still holds. Importantly, for about half of the instances, we retain all optimal solutions.


We show speedups for the solvers after executing our pruning strategy in Table 2 (last two columns). We obtain speedups as large as 53x and 38x for brock200-1 and keller4, respectively. This might not be surprising, since in both cases we lose some maximum cliques (but note that for keller4, the size of a maximum clique is still retained). For p-hat300-3, the preprocessor makes no mistakes, resulting in speedups of up to 9x. The speedup for keller5 is at least 2.5x, since the original instance was not solved within 3600 seconds, but the preprocessed instance was solved in roughly 1421 seconds.

Most speedups are less than 2x, explained by the relative simplicity of instances. Indeed, it seems challenging to locate dense instances of MCE that are (i) structured and (ii) solvable within a reasonable time.

4.2 Sparse networks

In this subsection, we show results for probabilistic preprocessing on sparse networks (i.e., edge density below 0.5).

Classification framework for sparse networks

We use logistic regression trained with stochastic gradient descent.

Instance                 |V|     |E|     $\omega$   n. $\omega$  $k$-core        5-stage         Pruning  cliquer        MoMC
                                                                 V       E       V       E
bio-WormNet-v3           16 K    763 K   121*       18*          0.868   0.602   0.987   0.975   0.36     0.37 (—)       0.40 (3.94)
ia-wiki-user-edits-page  2 M     9 M     15*        15*          0.958   0.641   0.997   0.946   1.12     1.16 (29.94)   s
rt-retweet-crawl         1 M     2 M     13*        26*          0.979   0.863   0.997   0.989   0.38     0.41 (5.66)    s
soc-digg                 771 K   6 M     50*        192*         0.969   0.496   0.998   0.964   4.80     4.91 (1.78)    s
soc-flixster             3 M     8 M     31*        752*         0.986   0.834   0.999   0.989   1.32     1.41 (3.86)    s
soc-google-plus          211 K   2 M     66*        24*          0.986   0.785   0.998   0.972   0.35     0.35 (—)       0.41 (3.98)
soc-lastfm               1 M     5 M     14*        330 (324)    0.933   0.625   0.993   0.938   2.24     2.57 (10.56)   s
soc-pokec                2 M     22 M    29*        6*           0.824   0.595   0.975   0.940   17.59    24.40 (45.80)  s
soc-themarker            69 K    2 M     22*        40*          0.713   0.151   0.972   0.842   2.03     4.95 (—)       s
soc-twitter-higgs        457 K   15 M    71*        14*          0.852   0.540   0.986   0.943   9.52     9.85 (1.92)    s
soc-wiki-Talk-dir        2 M     5 M     26*        141*         0.993   0.830   0.999   0.970   1.09     3.47 (1.25)    s
socfb-A-anon             3 M     24 M    25*        35*          0.879   0.403   0.984   0.907   28.49    38.05 (55.95)  s
socfb-B-anon             3 M     21 M    24*        196*         0.884   0.378   0.986   0.920   28.33    35.49 (67.46)  s
socfb-Texas84            36 K    2 M     51*        34*          0.540   0.322   0.957   0.941   1.04     1.07 (1.32)    s
tech-as-skitter          2 M     11 M    67*        4*           0.997   0.971   1.000   0.998   0.28     0.28 (—)       0.36 (4.31)
web-baidu-baike          2 M     18 M    31*        4*           0.933   0.618   0.992   0.934   9.67     11.00 (7.48)   s
web-google-dir           876 K   5 M     44*        8*           1.000   0.999   1.000   1.000   0.00     0.00 (—)       0.00 (2.06)
web-hudong               2 M     15 M    267 (266)  59 (1)       1.000   0.996   1.000   0.997   0.09     0.10 (—)       0.1 (9.99)
web-wikipedia2009        2 M     5 M     31*        3*           0.999   0.988   1.000   1.000   0.03     0.03 (—)       0.03 (4.28)

Table 3: Experiments for sparse graphs. The columns are precisely as in Table 2, with the exception that we show pruning ratios for 5 stages. All ratios are rounded to three decimal places. Underlined ratios of 1.000 mean the ratio is precisely 1, otherwise it is between 1 and 0.999.

Implementing the $k$-core decomposition

Recall the exact state-of-the-art preprocessor: (i) use a heuristic to find a large clique (say of size $k$) and (ii) delete every vertex of $G$ of core number less than $k-1$. For sparse graphs, a state-of-the-art solver pmc has been reported to find large cliques, i.e., typically $k$ is at most a small additive constant away from $\omega(G)$ (see the table of results at http://ryanrossi.com/pmc/download.php). Further, given that some real-world sparse networks are scale-free (many vertices have low degree), the $k$-core decomposition can be effective in practice.

To ensure the highest possible prune ratios for the $k$-core decomposition method, we supply it with the number $\omega(G)$ instead of an estimate provided by any real-world implementation. This ensures ideal conditions: (i) the method always prunes as aggressively as possible, and (ii) we further assume its execution has zero cost. We call this method the $\omega$-oracle.

Test instance pruning

Before applying our preprocessor on the sparse test instances, we prune them using the $\omega$-oracle. This ensures that the pruning we report is highly non-trivial, while also speeding up feature computation.

Search strategies

We experiment with the following two multi-stage search strategies:

  • Constant confidence (CC): at every stage, perform probabilistic preprocessing with confidence threshold $q$.

  • Increasing confidence (IC): at the first stage, perform probabilistic preprocessing with confidence threshold $q$, progressing by $\delta$ for every later stage.

Our goal is two-fold: to find (i) a number of stages $\ell$ and (ii) parameters $q$ and $\delta$, such that the strategy never errs while pruning as aggressively as possible. We do a systematic search over parameters $\ell$, $q$, and $\delta$. For the CC strategy, we let and . For the IC strategy, we try , , and set $\delta$ so that in the last stage the confidence is 0.95.
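The two strategies differ only in the sequence of confidence thresholds used across the stages, which can be sketched as below. The starting threshold of 0.55 and the function names are hypothetical values chosen for illustration; only the final confidence of 0.95 for the IC strategy is taken from the text.

```python
def constant_confidence(q, stages):
    """CC: the same confidence threshold at every stage."""
    return [q] * stages


def increasing_confidence(q_start, stages, q_final=0.95):
    """IC: start at q_start and increase by a fixed step each stage.

    The step is chosen so that the last stage uses q_final.
    """
    if stages == 1:
        # With a single stage, the first stage is also the last one.
        return [q_final]
    delta = (q_final - q_start) / (stages - 1)
    return [q_start + i * delta for i in range(stages)]


cc_schedule = constant_confidence(0.95, 5)
ic_schedule = increasing_confidence(0.55, 5)  # steps of about 0.1
```

Either schedule could then be paired with the per-stage classifiers, using the i-th threshold when pruning at stage i.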

We find the CC strategy with to prune the highest while still retaining all optimal solutions. Thus, for the remaining experiments, we use a CC strategy with .

Our 5-stage strategy outperforms, almost always safely, the $\omega$-oracle (see Table 3). In particular, note that even if the difference between the vertex pruning ratios is small, the impact on the number of edges removed can be considerable (see e.g., all instances of the “soc” category).


We show speedups for the solvers in Table 3. We use as a baseline the solver executed on an instance pruned by the $\omega$-oracle, which renders many of the instances easy already. Most notably, this is not the case for soc-pokec, socfb-A-anon, and socfb-B-anon, all requiring at least 5 minutes of solver time. The largest speedup is for socfb-B-anon, where we go from requiring 40 minutes to only 7 seconds of solver time. For MoMC, most instances report a segmentation fault for an unknown reason (marked “s” in Table 3).

Comparison against Lauri and Dutta

The results in Table 3 are not directly comparable to those in [Lauri and Dutta2019, Table 1]. First, the authors only give vertex pruning ratios. While the difference in vertex pruning ratios might sometimes seem underwhelming, even small increases can translate to large decrements in the number of edges. On the other hand, the difference is often clear in our favor, as in socfb-Texas84 and bio-WormNet-v3 (i.e., 0.76 vs. 0.96 and 0.90 vs. 0.99). Second, the authors use estimates of $\omega(G)$, almost always less than the exact value, whereas we use the exact value provided by the $\omega$-oracle. Thus, the speedups we report are as conservative as possible, unlike theirs.

4.3 Edge-based classification

For edges, we train in the same way as described for vertices. For the category “dense”, we obtain 79472 feature vectors. For this category, the edge classification accuracy is 0.83, which is 1 % higher than the vertex classification accuracy obtained with the same classifier in Subsection 4.1. However, we note that edge feature computation is noticeably slower than vertex feature computation.

4.4 Model analysis

Gradient boosted trees (used with dense networks in Subsection 4.1) naturally output feature importances. We apply the same classifier for the sparse case to allow for a comparison of feature importance. In both cases, the importance values are distributed among the ten features and sum up to one.

Unsurprisingly, for sparse networks, the local chromatic density (F10) dominates (importance 0.22). In contrast, (F10) is ineffective for dense networks (importance 0.08), since the chromatic number tends to be much higher than the maximum clique size. In both cases, (F5) eigencentrality has relatively high importance, justifying its expensive computation.

For dense networks, (F7) average over neighbor degrees has the highest importance (importance 0.23), whereas in the sparse case it is the least important feature (importance 0.03). This is because all degrees in a dense graph are high and the degree distribution tends to be tightly bound. Hence, even slight deviations from the expected value (e.g., for vertices in large cliques) yield high statistical significance scores.

5 Discussion and conclusions

We proposed a multi-stage learning approach for pruning the search space of MCE, generalizing an earlier framework of [Lauri and Dutta2019]. In contrast to known exact preprocessing methods, our approach requires no estimate for the maximum clique size at runtime, a task that is NP-hard to even approximate and particularly challenging on dense networks. We provide an extensive empirical study to show that our approach can routinely prune over 99 % of vertices in sparse graphs. More importantly, our approach can typically prune around 30 % of the vertices on dense graphs, which is considerably more than the existing methods based on k-cores.

Future improvements

To achieve even larger speedups, one can consider parallelizing the feature computation (indeed, our current program is single-threaded). In addition, at every stage, we recompute all features from scratch. There are two obvious ways to speed up this part: (i) it is unnecessary to recompute a local feature (e.g., degree or local clustering coefficient) for a vertex if none of its neighbors were removed, and (ii) more generally, there is considerable work in the area of dynamic graph algorithms under vertex deletions. Another improvement could be to switch to more accurate but expensive methods for feature computation (e.g., (F10), which is NP-hard to compute exactly) when the graph gets small enough.
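Improvement (i) could look roughly like the following sketch, which recomputes degree and local clustering coefficient only for vertices that lost a neighbor. It is not our implementation: the adjacency representation (a dict of neighbor sets) and all function names are assumptions made for illustration.

```python
def local_clustering(adj, v):
    """Fraction of pairs of neighbors of v that are themselves adjacent."""
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0
    # Each edge among the neighbors is seen from both endpoints, hence // 2.
    links = sum(len(adj[u] & nbrs) for u in nbrs) // 2
    return 2.0 * links / (k * (k - 1))


def compute_features(adj, vertices):
    """Compute (degree, local clustering coefficient) for the given vertices."""
    return {v: (len(adj[v]), local_clustering(adj, v)) for v in vertices}


def remove_and_update(adj, features, removed):
    """Delete `removed` from the graph and refresh only the affected features.

    Both features depend solely on v's neighbors and the edges among them,
    so only surviving neighbors of removed vertices need recomputation.
    Returns the set of vertices whose features were recomputed.
    """
    removed = set(removed)
    affected = set()
    for v in removed:
        affected |= adj[v]
    affected -= removed
    for v in removed:
        for u in adj[v]:
            if u in affected:
                adj[u].discard(v)
        del adj[v]
        features.pop(v, None)
    features.update(compute_features(adj, affected))
    return affected
```

After a stage that removes few vertices from a large sparse graph, `affected` is typically much smaller than the vertex set, which is where the savings would come from. Non-local features such as eigencentrality would still need a separate treatment.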

Dynamic stopping criteria

We refrained from multiple stages of preprocessing for dense networks due to the practical hardness of the task. However, for sparse networks, it was practically always safe to perform five (or even more) stages of preprocessing with no effect on solution quality. An intriguing open problem is to propose a dynamic strategy for choosing a suitable number of stages. There are several possibilities, such as stopping when a stage prunes less than some specified threshold, or always pruning aggressively (say up to ), computing a solution, and then backtracking by restoring the vertices deleted in the previous stage, halting when the solution does not improve anymore.
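The first possibility, stopping once a stage prunes too little, might be sketched as follows. The callable `prune_stage`, the parameter names, and the default values are hypothetical; the backtracking variant is not shown.

```python
def multi_stage_prune(vertices, prune_stage, min_fraction=0.01, max_stages=10):
    """Run pruning stages until one of them removes too few vertices.

    `prune_stage(current)` is an assumed callable returning the vertices that
    the classifier would delete from the current vertex set. The loop halts
    after the first stage that removes less than `min_fraction` of the
    remaining vertices, or after `max_stages` stages.
    Returns (remaining vertices, number of stages run).
    """
    current = set(vertices)
    stages_run = 0
    while current and stages_run < max_stages:
        removed = set(prune_stage(current)) & current
        fraction = len(removed) / len(current)
        current -= removed
        stages_run += 1
        if fraction < min_fraction:
            break
    return current, stages_run
```

In practice `prune_stage` would wrap feature computation and classification on the current graph; here it is left abstract so that the stopping rule itself stays visible.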

Classification over edges

To speed up our current implementation for edge feature computation, a first step could be a well-engineered neighborhood intersection to speed up (E1)-(E4). Luckily, this is a core operation in database systems with many high-performance implementations available [Lemire et al.2016, Inoue et al.2014, Lemire and Boytsov2015]. Further path-based features are also possible, like the number of length- paths for , which is still computed cheaply via e.g., matrix multiplication. For larger , one could rely on estimates based on random walk sampling. In addition, it is possible to leave edge classification for the later stages, once the vertex classifier has reduced the size of the input graph enough. We believe that there is further potential in exploring this direction.
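As a small illustration of the two ingredients mentioned above, the sketch below intersects two sorted neighbor lists with a linear merge and counts length-2 paths via a squared adjacency matrix. It is a plain-Python stand-in, not one of the cited SIMD implementations, and the choice of length 2 and the function names are assumptions made for the example.

```python
def intersect_sorted(a, b):
    """Common elements of two sorted lists, found by a linear merge."""
    i = j = 0
    common = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            common.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return common


def length2_path_counts(matrix):
    """Square a 0/1 adjacency matrix.

    Entry [u][v] of the result is the number of length-2 paths between
    u and v, which equals the number of their common neighbors.
    """
    n = len(matrix)
    return [
        [sum(matrix[u][w] * matrix[w][v] for w in range(n)) for v in range(n)]
        for u in range(n)
    ]
```

For distinct vertices u and v, `length2_path_counts(A)[u][v]` agrees with `len(intersect_sorted(N[u], N[v]))`, so the matrix view and the intersection view can be cross-checked against each other on small graphs.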


References

  • [Abu-Khzam et al.2005] F. N. Abu-Khzam, N. E. Baldwin, M. A. Langston, and N. F. Samatova. On the relative efficiency of maximal clique enumeration algorithms, with applications to high-throughput computational biology. In International Conference on Research Trends in Science and Technology, 2005.
  • [Baldwin et al.2004] N. E. Baldwin, R. L. Collins, M. A. Langston, M. R. Leuze, C. T. Symons, and B. H. Voy. High performance computational tools for Motif discovery. In IPDPS, 2004.
  • [Batson et al.2013] Joshua Batson, Daniel A Spielman, Nikhil Srivastava, and Shang-Hua Teng. Spectral sparsification of graphs: theory and algorithms. Communications of the ACM, 56(8):87–94, 2013.
  • [Bello et al.2016] Irwan Bello, Hieu Pham, Quoc V Le, Mohammad Norouzi, and Samy Bengio. Neural combinatorial optimization with reinforcement learning. arXiv, 2016.
  • [Bernard et al.1979] H. R. Bernard, P. D. Killworth, and L. Sailer. Informant accuracy in social network data iv: a comparison of clique-level structure in behavioral and cognitive network data. Social Networks, 2(3):191–218, 1979.
  • [Boginski et al.2005] V. Boginski, S. Butenko, and P. M. Pardalos. Statistical analysis of financial networks. Computational Statistics and Data Analysis, 48(2):431–443, 2005.
  • [Bomze et al.1999] I. Bomze, M. Budinich, P. Pardalos, and M. Pelillo. The Maximum Clique Problem. In Handbook of Combinatorial Optimization, volume 4, pages 1–74. Kluwer Academic Publishers, 1999.
  • [Chen et al.2006] Jianer Chen, Xiuzhen Huang, Iyad A. Kanj, and Ge Xia. Strong computational lower bounds via parameterized complexity. JCSS, 72(8):1346–1367, 2006.
  • [Cheng et al.2011] James Cheng, Yiping Ke, Ada Wai-Chee Fu, Jeffrey Xu Yu, and Linhong Zhu. Finding maximal cliques in massive networks. TODS, 36(4):21, 2011.
  • [Csardi and Nepusz2006] Gabor Csardi and Tamas Nepusz. The igraph software package for complex network research. InterJournal, Complex Systems:1695, 2006.
  • [Cygan et al.2015] Marek Cygan, Fedor V. Fomin, Łukasz Kowalik, Daniel Lokshtanov, Dániel Marx, Marcin Pilipczuk, Michał Pilipczuk, and Saket Saurabh. Parameterized Algorithms. Springer, 2015.
  • [Durbin and Willshaw1987] Richard Durbin and David Willshaw. An analogue approach to the travelling salesman problem using an elastic net method. Nature, 326(6114):689, 1987.
  • [Dutta et al.2017] S. Dutta, P. Nayek, and A. Bhattacharya. Neighbor-Aware Search for Approximate Labeled Graph Matching using the Chi-Square Statistics. In WWW, pages 1281–1290, 2017.
  • [Eblen et al.2012] John D Eblen, Charles A Phillips, Gary L Rogers, and Michael A Langston. The maximum clique enumeration problem: algorithms, applications, and implementations. In BMC bioinformatics, volume 13, page S5. BioMed Central, 2012.
  • [Eblen2010] John David Eblen. The maximum clique problem: Algorithms, applications, and implementations. 2010.
  • [Eppstein et al.2010] David Eppstein, Maarten Löffler, and Darren Strash. Listing all maximal cliques in sparse graphs in near-optimal time. In Algorithms and Computation, pages 403–414. Springer, 2010.
  • [Faust and Wasserman1995] K. Faust and S. Wasserman. Social network analysis: Methods and applications. Cambridge University Press, 1995.
  • [Feurer et al.2015] Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Springenberg, Manuel Blum, and Frank Hutter. Efficient and robust automated machine learning. In NIPS, pages 2962–2970. Curran Associates, Inc., 2015.
  • [Fitzgerald et al.2015] Tadhg Fitzgerald, Yuri Malitsky, and Barry O’Sullivan. ReACTR: Realtime Algorithm Configuration through Tournament Rankings. In IJCAI, pages 304–310, 2015.
  • [Fort1988] J. C. Fort. Solving a combinatorial problem via self-organizing process: An application of the Kohonen algorithm to the traveling salesman problem. Biological Cybernetics, 59(1):33–40, 1988.
  • [Fortunato2010] Santo Fortunato. Community detection in graphs. Physics reports, 486(3-5):75–174, 2010.
  • [Harenberg et al.2014] Steve Harenberg, Gonzalo Bello, L Gjeltema, Stephen Ranshous, Jitendra Harlalka, Ramona Seay, Kanchana Padmanabhan, and Nagiza Samatova. Community detection in large-scale networks: a survey and empirical evaluation. Wiley Interdisciplinary Reviews: Computational Statistics, 6(6):426–439, 2014.
  • [Hopfield and Tank1985] John J Hopfield and David W Tank. “Neural” computation of decisions in optimization problems. Biological cybernetics, 52(3):141–152, 1985.
  • [Inoue et al.2014] Hiroshi Inoue, Moriyoshi Ohara, and Kenjiro Taura. Faster set intersection with SIMD instructions by reducing branch mispredictions. Proceedings of the VLDB Endowment, 8(3):293–304, 2014.
  • [Khalil et al.2017] Elias Khalil, Hanjun Dai, Yuyu Zhang, Bistra Dilkina, and Le Song. Learning combinatorial optimization algorithms over graphs. In NIPS, pages 6351–6361, 2017.
  • [Lauri and Dutta2019] Juho Lauri and Sourav Dutta. Fine-grained search space classification for hard enumeration variants of subset problems. In AAAI. 2019.
  • [Lemire and Boytsov2015] Daniel Lemire and Leonid Boytsov. Decoding billions of integers per second through vectorization. Software: Practice and Experience, 45(1):1–29, 2015.
  • [Lemire et al.2016] Daniel Lemire, Leonid Boytsov, and Nathan Kurz. SIMD compression and the intersection of sorted integers. Software: Practice and Experience, 46(6):723–749, 2016.
  • [Li et al.2017] Chu-Min Li, Hua Jiang, and Felip Manyà. On minimization of the number of branches in branch-and-bound algorithms for the maximum clique problem. Computers & Operations Research, 84:1–15, 2017.
  • [Liang et al.2016] Jia Hui Liang, Vijay Ganesh, Pascal Poupart, and Krzysztof Czarnecki. Learning rate based branching heuristic for SAT solvers. In SAT, pages 123–140. Springer, 2016.
  • [Loreggia et al.2016] Andrea Loreggia, Yuri Malitsky, Horst Samulowitz, and Vijay A Saraswat. Deep learning for algorithm portfolios. In AAAI, pages 1280–1286, 2016.
  • [Nowak et al.2017] Alex Nowak, Soledad Villar, Afonso S Bandeira, and Joan Bruna. A note on learning algorithms for quadratic assignment with graph neural networks. arXiv, 2017.
  • [Östergård2002] Patric R.J. Östergård. A fast algorithm for the maximum clique problem. DAM, 120(1):197–207, 2002.
  • [Palla et al.2005] Gergely Palla, Imre Derényi, Illés Farkas, and Tamás Vicsek. Uncovering the overlapping community structure of complex networks in nature and society. Nature, 435(7043):814, 2005.
  • [Papadopoulos et al.2012] Symeon Papadopoulos, Yiannis Kompatsiaris, Athena Vakali, and Ploutarchos Spyridonos. Community detection in social media. DMKD, 24(3):515–554, 2012.
  • [Read and Cressie1988] T. R. C. Read and N. A. C. Cressie. Goodness-of-fit statistics for discrete multivariate data. Springer Series in Statistics, 1988.
  • [Read and Cressie1989] T. Read and N. Cressie. Pearson’s X² and the likelihood ratio statistic G²: a comparative review. International Statistical Review, 57(1):19–43, 1989.
  • [Rossi and Ahmed2015] Ryan A. Rossi and Nesreen K. Ahmed. The network data repository with interactive graph analytics and visualization. In AAAI, 2015.
  • [Rossi et al.2015] Ryan A Rossi, David F Gleich, and Assefaw H Gebremedhin. Parallel maximum clique algorithms with applications to network analysis. SISC, 37(5):C589–C616, 2015.
  • [San Segundo et al.2016] Pablo San Segundo, Alvaro Lopez, and Panos M Pardalos. A new exact maximum clique algorithm for large and massive sparse graphs. Computers & Operations Research, 66:81–94, 2016.
  • [Stix2004] V. Stix. Finding all maximal cliques in dynamic graphs. Computational Optimization and applications, 27:173–186, 2004.
  • [Vinyals et al.2015] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. In NIPS, pages 2692–2700, 2015.
  • [Wang et al.2014] Chengwei Wang, Karsten Schwan, Brian Laub, Mukil Kesavan, and Ada Gavrilovska. Exploring graph analytics for cloud troubleshooting. In ICAC, pages 65–71, 2014.
  • [Welsh and Powell1967] Dominic JA Welsh and Martin B Powell. An upper bound for the chromatic number of a graph and its application to timetabling problems. The Computer Journal, 10(1):85–86, 1967.
  • [Yang et al.2016] Lei Yang, Jiannong Cao, Shaojie Tang, Di Han, and Neeraj Suri. Run time application repartitioning in dynamic mobile cloud environments. TCC, 4(3):336–348, 2016.
  • [Yao et al.2013] Yan Yao, Jian Cao, and Minglu Li. A network-aware virtual machine allocation in cloud datacenter. In NPC, pages 71–82. Springer, 2013.
  • [Yeger-Lotem et al.2004] Esti Yeger-Lotem, Shmuel Sattath, Nadav Kashtan, Shalev Itzkovitz, Ron Milo, Ron Y Pinter, Uri Alon, and Hanah Margalit. Network motifs in integrated cellular networks of transcription–regulation and protein–protein interaction. PNAS, 101(16):5934–5939, 2004.
  • [Zuckerman2006] David Zuckerman. Linear degree extractors and the inapproximability of max clique and chromatic number. In STOC, pages 681–690. ACM, 2006.