Let $G=(V,E)$ be a graph, where $V$ is a set of vertices and $E$ is a set of edges. A clique in $G$ is a subset of $V$ in which each vertex is connected to all the other vertices of the subset. The problems of finding, or even determining the size of, the maximum clique in a graph are fundamental problems in theoretical computer science and graph theory, and are known to be NP-hard. Let $G(n,p)$ denote the collection of random (undirected) Erdős-Rényi graphs, each of which is a graph with $n$ vertices, such that each edge exists in the graph with probability $p$, independently of all other edges. (We also consider the case of directed graphs, with the same notation, in which each edge is an ordered pair and exists with probability $p$, independently of all other edges.) Asymptotically almost surely (i.e. with probability that approaches 1 as the size of the graph, denoted by $n$, tends to infinity), the size of the maximum clique in a $G(n,\tfrac12)$ graph is roughly $2\log_2 n$; however, no polynomial time algorithm is known to find a clique of size $(1+\epsilon)\log_2 n$. Karp  suggested the problem of finding a clique of size $(1+\epsilon)\log_2 n$ in a $G(n,\tfrac12)$, for any $\epsilon > 0$, in polynomial time.
We focus on a slightly easier problem: the planted, or hidden, clique problem, suggested independently by Jerrum  and Kučera . Let $G(n,p,k)$ denote the collection of $G(n,p)$ graphs in which we select $k$ random vertices and replace the subgraph induced by them with a clique. The problem is to find a polynomial time algorithm that, given a graph from $G(n,p,k)$ (where $k$ is known), recovers the clique. Most algorithms were designed for $p=\tfrac12$, but can easily be extended to a more general $G(n,p,k)$ for any constant $p$.
The planted clique and related problems have important applications in a variety of areas, including community detection , molecular biology , motif discovery in biological networks [7, 8], computing the Nash equilibrium [9, 10], property testing 
, sparse principal component analysis, compressed sensing , cryptography [14, 15], and even mathematical finance .
A simple quasi-polynomial time algorithm that recovers maximum cliques of any size is to enumerate all subsets of size $2\log_2 n$; for each subset that forms a clique, take all common neighbors of the subset; one of these sets will be the planted clique. This is also the fastest known algorithm for any $k = o(\sqrt{n})$, with running time $n^{O(\log n)}$.
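As a concrete, toy-scale illustration of this enumeration idea, the following sketch plants a clique, enumerates small seed subsets, and greedily extends each clique seed through its common neighbors. All function names are ours, and the instance is kept tiny because the running time is quasi-polynomial.

```python
import itertools
import random

def plant_clique(n, k, p=0.5, seed=0):
    """Sample a G(n, p) graph as adjacency sets and plant a k-clique."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v); adj[v].add(u)
    clique = rng.sample(range(n), k)
    for u, v in itertools.combinations(clique, 2):
        adj[u].add(v); adj[v].add(u)
    return adj, set(clique)

def quasi_poly_recover(adj, seed_size):
    """Enumerate all seed_size subsets; for each seed that forms a clique,
    greedily extend it by common neighbors; return the largest clique found."""
    best = set()
    for seed in itertools.combinations(list(adj), seed_size):
        if all(v in adj[u] for u, v in itertools.combinations(seed, 2)):
            common = set(seed)
            # candidates: vertices adjacent to every seed vertex
            cand = set.intersection(*(adj[v] | {v} for v in seed)) - common
            for v in sorted(cand):
                if all(v in adj[u] for u in common):
                    common.add(v)
            if len(common) > len(best):
                best = common
    return best
```

With seed subsets of size $O(\log n)$ the outer loop has $n^{O(\log n)}$ iterations, which is the source of the quasi-polynomial running time.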
There are also polynomial time algorithms, but so far they are able to recover only cliques of size at least $\Omega(\sqrt{n})$. Kučera  observed that when $k \geq c\sqrt{n\log n}$ for an appropriate constant $c$, the vertices of the planted clique would almost surely be the ones with the largest degrees in the graph, so this case is easy. Alon et al.  were the first to present a polynomial time algorithm that finds, asymptotically almost surely, cliques of size $k \geq c\sqrt{n}$, where $c$ is sufficiently large, using spectral techniques. (They also suggested polynomial time algorithms that find planted cliques of size $\epsilon\sqrt{n}$ for any $\epsilon > 0$, but with a polynomial degree that grows as $\epsilon$ decreases; our comparison of constants here is restricted to algorithms of comparable, low-degree polynomial computation time.) Many other algorithms further improved the constant $c$. In parallel, lower bounds on the clique sizes that can be found by certain families of algorithms have been proven over the years. The best known polynomial time algorithm for recovering cliques, to the best of our knowledge, is the message-passing algorithm , which has a theoretical bound of $k \geq \sqrt{n/e}$.
In the regime $k = o(\sqrt{n})$, various computational barriers have been established for the success of certain classes of polynomial-time algorithms, such as the Sum of Squares hierarchy , the Metropolis process  and statistical query algorithms . Gamarnik and Zadik  have presented evidence of a phase transition for the presence of the Overlap Gap Property (OGP), a concept known to suggest algorithmic hardness when it appears, at $k = \Theta(\sqrt{n})$. Moreover, the assumption that detecting a planted clique of size $k = o(\sqrt{n})$ in polynomial time is impossible has become a key assumption in obtaining computational lower bounds for many other problems [5, 12, 22, 23]. However, no general algorithmic barrier such as NP-hardness has been proven for recovering the clique when $k = o(\sqrt{n})$.
Here, we present a novel graph convolutional network-based algorithm that recovers cliques of sizes $k = O(\sqrt{n})$, with performance similar to that of . To the best of our knowledge, no modern machine learning tool has previously been applied to detect planted cliques. We show that the algorithm works just as well for recovering planted dense subgraphs other than cliques (e.g. $k$-plexes and bicliques), for directed graphs, and for $G(n,p)$ with a constant $p$ other than $\tfrac12$. We thus achieve state of the art accuracy for planted clique recovery and extend the planted clique problem to dense subgraph recovery problems.
I-A Related Work
The planted clique problem has been extensively studied. We discuss here algorithms relevant to our work.
As mentioned above, the first polynomial time algorithm was introduced by Alon et al. , for cliques of sizes $k \geq c\sqrt{n}$ where $c$ is large enough. In a variant of this algorithm (this variant is used in ; the original version uses the plain adjacency matrix and its second eigenvector), one can use the following representation of the adjacency matrix:

\[
W_{ij} =
\begin{cases}
1 & (i,j) \in E \\
-f(p) & (i,j) \notin E,\ i \neq j \\
0 & i = j
\end{cases}
\]
where $f(p)$ is a positive function of $p$, to adjust this algorithm, originally proposed for $p=\tfrac12$, to any $p$. If $k$ is large enough, then the clique can be recovered using the entries of the eigenvector corresponding to the largest eigenvalue of $W$. Selecting the $k$ largest entries (by absolute value) of this eigenvector yields a subset of vertices that includes at least half the vertices of the planted clique; the other half can subsequently be identified through a simple procedure: choosing the vertices connected to at least $\tfrac{3k}{4}$ vertices of the previous subset.
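This variant can be sketched in a few lines of NumPy. This is an illustration under our own naming, not the authors' implementation; in particular, $-p/(1-p)$ is one plausible choice of $-f(p)$ (it equals $-1$ for $p=\tfrac12$).

```python
import numpy as np

def spectral_clique_recovery(A, k, p=0.5):
    """Sketch of the spectral recovery described above: form the signed
    matrix, take the leading eigenvector, keep the k largest entries by
    absolute value, then extend by neighbor counts."""
    W = np.where(A == 1, 1.0, -p / (1 - p))  # one possible choice of -f(p)
    np.fill_diagonal(W, 0.0)
    eigvals, eigvecs = np.linalg.eigh(W)     # eigenvalues in ascending order
    v = eigvecs[:, -1]                       # eigenvector of the largest eigenvalue
    core = np.argsort(-np.abs(v))[:k]        # k largest entries by |value|
    # second phase: keep vertices adjacent to at least 3k/4 core vertices
    counts = A[:, core].sum(axis=1)
    return set(np.where(counts >= 3 * k / 4)[0].tolist())
```

For $k$ well above $\sqrt{n}$, the leading eigenvector is localized on the planted clique, so the top-$k$ entries capture most of it and the counting phase recovers the rest.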
Feige and Ron  used a method of iteratively removing vertices, based on values related to the degrees of each vertex. With $O(n^2)$ complexity, it succeeds with probability $\tfrac23$ when $k \geq c\sqrt{n}$, for a sufficiently large (unspecified) $c$.
Ames and Vavasis  studied a convex optimization formulation in which the graph's adjacency matrix is approximated by a sparse low-rank matrix. The two objectives (sparsity and rank) are convexified through the usual $\ell_1$-norm and nuclear-norm (i.e. the sum of the singular values of the matrix) relaxations. They proved that this convex relaxation approach succeeds asymptotically almost surely, provided $k \geq c\sqrt{n}$ for a constant $c$ unspecified in the paper.
Another line of work recovers the clique when $k$ is large enough by maximizing a function of orthogonal vectors corresponding to each vertex, which is an approximation of the Lovász theta function of the graph. Jethava et al.  then solved analytically the Support Vector Machine (SVM) problem to approximate the Lovász theta function. Their algorithm works for graphs with edge probability $p$ and clique sizes $k \geq c\sqrt{n}$.
Dekel et al.  improved the constant, recovering cliques of sizes $k \geq c\sqrt{n}$ with a time complexity of $O(n^2)$. Their algorithm comprises three phases: first, they create a sequence of subgraphs of the input graph, $G = G_0 \supseteq G_1 \supseteq \dots \supseteq G_t$, where $G_{i+1}$ is created from $G_i$ based on the degrees of the vertices in $G_i$. Then, they find the subset of the clique contained in $G_t$. In the last phase, they use the subgraph of $G$ induced by these clique vertices and their neighbors, and the candidate clique vertices are the vertices with the highest degrees there.
Deshpande and Montanari  used a belief propagation algorithm, also known as (approximate) message passing. To the best of our knowledge, this is the best proven polynomial time algorithm, recovering cliques as small as $\sqrt{n/e}$ in theory. The main idea of this algorithm is passing "messages" between vertices and their neighbors, normalizing and repeating until the messages at each vertex converge to the probability that the vertex belongs to the clique. In more detail, their algorithm includes two main phases. The first is the message passing phase described above, after which one obtains, for each vertex, the likelihood of belonging to the clique. One builds a set of candidate vertices that are the most likely to belong to the clique; although this set contains many clique vertices, it is much larger than $k$. Therefore, a second, cleaning phase is applied, where an algorithm similar to  recovers the clique. The concept of belief propagation has been used in tasks beyond hidden clique recovery, such as dense subgraph recovery , graph similarity and subgraph matching .
The algorithm of Deshpande and Montanari was extended through statistical physics tools by Angelini . First, Angelini computed the Bethe free energy associated with the solution of the belief propagation algorithm. Then, the range of clique sizes was separated into intervals where the algorithm can or cannot be used for solving the problem, by analyzing the minima of the free energy function. Lastly, a Parallel Tempering algorithm was applied to find the cliques where belief propagation struggles to do so. Although the Parallel Tempering algorithm recovers cliques smaller than $\sqrt{n/e}$, it does so with a higher time complexity (according to the values found in the paper, longer by orders of magnitude than existing algorithms, including the algorithm we propose), which diverges (even in the average case) as the clique size approaches the threshold size of cliques in $G(n,\tfrac12)$ graphs.
Marino and Kirkpatrick  further improved greedy local search algorithms. Using search algorithms of higher computation time, they greedily search for a clique larger than a typical maximum clique of a $G(n,\tfrac12)$. Then, they apply a cleanup operation (which is also used as an early stopping mechanism) to recover the full clique from the smaller one. They assume having a hint vertex (i.e. a vertex known to belong to the clique), by which they shrink their search space and computation time, and show that they can recover cliques smaller than $\sqrt{n/e}$ in polynomial time in the presence of hint vertices.
II-A Algorithm outline
We follow the methods above and use a two stage approach. In the first stage, we detect a subset of vertices that belong, with high probability, to the planted subgraph. In the second stage, we extend and clean this subset to obtain the full subgraph. We mainly focus here on the first stage, since the second stage is similar to previous methods.
For the first stage, we propose a framework based on machine learning on graphs, which we call PYGON (Planted Subgraph recoverY using Graph cOnvolutional Network; our implementation may be found at https://github.com/louzounlab/PYGON). In short, we produce random realizations of the required planted subgraph recovery task and train a model based on Graph Convolutional Networks (GCN)  to find the planted subgraph, using topological features of the vertices as input to the GCN. We then compute the prediction accuracy on new realizations of the planted subgraph task in different random graphs. The advantage of PYGON is that a change in the target subgraph type only requires a change in the simulated subgraph planting. The weights of the predictor for the presence of such subgraphs are then learned through a machine learning formalism. In the following methods sections, we describe the details of the machine learning algorithm and the input used for each vertex.
II-B Model Structure
Given the desired graph size $n$, edge probability $p$ and planted subgraph size $k$, we train PYGON using known examples. We generate graphs, for each of which we select a random set of $k$ vertices and set the subgraph induced by them to the desired pattern. (When the subgraph may have more than one possible pattern, e.g. a random dense subgraph, we randomly select a pattern for each graph and plant this pattern in the selected vertices.) The graphs are separated into training and evaluation sets. We label the vertices such that the planted subgraph vertices are labelled 1 and vertices not in the planted subgraph are labelled 0. We then build a model to predict the labels of the vertices in all input graphs.
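The example-generation step can be sketched as follows, planting the simplest pattern, a clique (function names are ours, not the package's):

```python
import numpy as np

def make_instance(n, p, k, rng):
    """One training example: a G(n, p) adjacency matrix with a planted
    k-clique, together with 0/1 vertex labels."""
    A = (rng.random((n, n)) < p).astype(np.float32)
    A = np.triu(A, 1)
    A = A + A.T                           # symmetric, zero diagonal
    planted = rng.choice(n, size=k, replace=False)
    A[np.ix_(planted, planted)] = 1.0     # replace induced subgraph with a clique
    np.fill_diagonal(A, 0.0)
    y = np.zeros(n, dtype=np.float32)
    y[planted] = 1.0                      # label 1 = planted subgraph vertex
    return A, y

def make_dataset(num_graphs, n, p, k, seed=0):
    """Several independent realizations, to be split into train/evaluation."""
    rng = np.random.default_rng(seed)
    return [make_instance(n, p, k, rng) for _ in range(num_graphs)]
```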
II-B2 Features per Vertex
We use an input matrix of vertex features for each graph, $X \in \mathbb{R}^{n \times d}$, such that $X_{ij}$ is the value of the $j$-th feature of the $i$-th vertex. The features we tested are of relatively low complexity, describe local properties of each vertex and are known to be effective in graph topology-based machine learning tasks [35, 36]. Note that PYGON must receive a feature matrix representing each input graph, including test graphs, but it can be used even with no computed features, using the identity matrix as $X$ instead. However, better performance was obtained using calculated features. The features are calculated using .
The features we tried are the following:
Degree - We used this feature in the undirected case only, but for the directed case one can consider taking either the total degree (number of ancestors and descendants), in- or out-degrees, or any combination of them.
3-Motifs - The $i$-th 3-motif value of a vertex $v$ is the number of distinct triplets of vertices to which $v$ belongs such that the subgraph induced by them is isomorphic to the $i$-th type of 3-motif. Using a novel method for motif enumeration, implemented in , we can calculate the motifs per vertex efficiently, and distributing the computation on a Graphics Processing Unit (GPU) significantly shortens the computation time further.
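For intuition, the two connected undirected 3-motif counts per vertex (triangles and open wedges) can be computed directly from powers of the adjacency matrix. This is a plain NumPy sketch, not the GPU implementation referenced above:

```python
import numpy as np

def triangles_per_vertex(A):
    """Triangles through each vertex: diag(A^3) counts closed walks of
    length 3 from v, each triangle twice (once per direction)."""
    A3 = A @ A @ A
    return np.diag(A3) / 2

def wedges_per_vertex(A):
    """Open 3-motifs (induced paths of length 2) containing v, either as
    the center or as an endpoint of the path."""
    deg = A.sum(axis=1)
    centered = deg * (deg - 1) / 2   # v is the middle of the path
    endpoint = (A @ deg) - deg       # paths v - u - w with w != v
    # each triangle through v is counted 3 times above (1 centered + 2 endpoint)
    return centered + endpoint - 3 * triangles_per_vertex(A)
```

On $K_4$ every vertex lies in 3 triangles and 0 induced wedges; on a path of 3 vertices every vertex lies in the single wedge.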
II-B3 Modified Adjacency Matrix
From the original adjacency matrix $A$ of each graph, we develop a modified matrix $\tilde{A}$ with learnable elements, which is used in the learning process, such that:

\[
\tilde{A}_{ij} =
\begin{cases}
\alpha\,\eta & A_{ij} = 1 \\
\beta & A_{ij} = 0,\ i \neq j \\
\gamma & i = j
\end{cases}
\]

where $\alpha$, $\beta$ and $\gamma$ are coefficients affecting the weight of information passing through a vertex to its neighbors, to non-adjacent vertices and to be preserved at the vertex itself, respectively. We let $\alpha$ and $\beta$ be learned during the training process and keep $\gamma$ constant. We initialize the coefficients so that the edge entries equal $\eta$, the non-edge entries equal $-1$, and the diagonal entries equal $1$.
We multiply the edge weights by $\eta$ because when taking $p \neq \tfrac12$, the expected number of adjacent vertices differs from the number of non-adjacent vertices, creating a severe bias in the model (see III-B1 for more information regarding the problem and its solution).
II-B4 Graph Convolutional Network
PYGON is based on a multiple-layer Graph Convolutional Network (GCN). A GCN layer  is a graph neural network layer that learns feature representations of vertices, based on their input features and the topology of the graph. Formally, if $H^{(l)}$ represents the feature matrix input to the $l$-th layer and the graph is represented by a matrix $\hat{A}$ (usually a modified version of the adjacency matrix), then the GCN layer outputs a new feature representation of the vertices:

\[
H^{(l+1)} = \sigma\left(\hat{A}\, H^{(l)}\, W^{(l)}\right)
\]

where $\sigma$ is a nonlinear activation function and $W^{(l)}$ is the learnable weight matrix of the $l$-th layer.
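In matrix form the layer is a one-liner (a sketch with ReLU as the activation; all names are ours):

```python
import numpy as np

def gcn_layer(A_hat, H, W, activation=lambda x: np.maximum(x, 0)):
    """One GCN layer: H' = sigma(A_hat @ H @ W).
    A_hat: (n, n) graph matrix, H: (n, d_in) features, W: (d_in, d_out)."""
    return activation(A_hat @ H @ W)
```

Stacking such layers (each taking the previous layer's output as its feature matrix) mixes information from progressively larger neighborhoods.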
II-B5 Learning Model
PYGON receives as input an initial feature matrix $X$ and an adjacency-like matrix $\tilde{A}$, and outputs a prediction vector $\hat{y} \in [0,1]^n$. The inputs are fed into a GCN layer with a ReLU activation function and then a dropout layer, producing the input features of the next GCN layer. After the desired number of such layers, the last layer is a GCN with a sigmoid activation (without dropout) that outputs $\hat{y}$. The architecture of the model (PYGON and the extension stage) is presented in Fig. 1.
II-C Learning Process
PYGON is trained in epochs. At every epoch, we shuffle the order of the graphs and train the model on them, one graph at a time. In other words, the backpropagation training step is performed multiple times per epoch, once for each training graph. Our motivation for doing so is similar to that of batched stochastic gradient descent in many machine learning tasks: on one hand, each graph alone is a unit made of many vertices, making the training step on it quite robust; on the other hand, we want to make as many "random" training steps as possible to converge quickly.
II-C2 Early Stopping
We set a maximal number of epochs to train PYGON, but training can stop earlier. During the training process, we evaluate the performance on the evaluation set; if for 40 epochs the loss on the evaluation set has not decreased below the current minimal loss, training is stopped. Finally, our trained model is the model from the epoch with the best evaluation loss.
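The patience-based stopping rule can be sketched generically; the `model`, `train_step` and `eval_loss` objects below are placeholders, not the actual implementation:

```python
import copy

def train_with_early_stopping(model, train_step, eval_loss,
                              max_epochs=1000, patience=40):
    """Stop when the evaluation loss has not improved for `patience` epochs,
    and return the model snapshot from the best epoch."""
    best_loss = float("inf")
    best_model = copy.deepcopy(model)
    since_best = 0
    for epoch in range(max_epochs):
        train_step(model)                 # one epoch over all training graphs
        loss = eval_loss(model)           # loss on the evaluation set
        if loss < best_loss:
            best_loss = loss
            best_model = copy.deepcopy(model)   # keep the best snapshot
            since_best = 0
        else:
            since_best += 1
            if since_best >= patience:
                break                     # early stop
    return best_model, best_loss
```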
II-D Hyper-parameter Selection
We tuned the hyper-parameters of PYGON using Neural Network Intelligence (NNI) . NNI helps choose the best model hyper-parameters out of a specified range of values, based on many training trials, each using a different hyper-parameter combination. After choosing the ranges of values, NNI runs PYGON many times, selecting the hyper-parameter combination for each experiment using a tuner (i.e. a learning tool that explores the hyper-parameter space to find the combination that yields the best performance, as reported by the user). Eventually, the process converges to a neighborhood of hyper-parameters with which PYGON performs best. Obviously, the data used for NNI were not used for reporting the results.
We give empirical evidence that PYGON can recover various dense subgraphs of sizes as small as $O(\sqrt{n})$. We study not only planted cliques, but also several other dense subgraphs, in both directed and undirected graphs. The formalism proposed is the same for all dense subgraphs, and we propose that the PYGON formalism is appropriate for the recovery of any dense subgraph in a $G(n,p)$. We study the following subgraphs:
Undirected cliques - We choose $k$ vertices and draw an edge between every pair of them (where no edge already connects the pair).
Directed Acyclic Cliques (DAC) - Let $S$ be a set of $k$ vertices in a directed graph. Choose a random order of the vertices: $v_1, v_2, \dots, v_k$. We call the induced subgraph whose edges are all the possible edges from a vertex $v_i$ to a vertex $v_j$ with $i < j$ a Directed Acyclic Clique (DAC). This subgraph is a Directed Acyclic Graph (DAG), and its undirected form is a clique.
$k$-plexes - A $k$-plex is an undirected graph of size $s$ such that every vertex is connected to at least $s - k$ other vertices. Hence, it is a generalization of a clique, which can be interpreted as a 1-plex. We study a specific case of $k$-plexes as our planted subgraph: a 2-plex of size $k$, such that each vertex is connected to $k - 2$ vertices, except possibly for one vertex that is connected to all other vertices (to solve parity problems).
Bicliques - We plant a complete induced bipartite graph of size $k$, with sides of sizes $\lceil k/2 \rceil$ and $\lfloor k/2 \rfloor$. Note that this is not the known problem of recovering a hidden biclique in a random bipartite graph, because our background graph is still an undirected $G(n,p)$.
Random dense subgraphs - Let $q > p$. We choose a subset $S$ of $k$ vertices. We build the entire (undirected) graph such that the probability of an edge between a pair of vertices from $S$ is $q$, whereas for pairs of vertices of which at least one is not in $S$, the edge probability is $p$. The existence of each edge is independent of all other edges.
For each such subgraph, we compute the theoretical minimal size of the planted subgraph such that its expected frequency in a $G(n,p)$ is smaller than 1. These sizes serve as a baseline when comparing our performance with that of the existing algorithms. Fig. 2 shows the threshold sizes for the different subgraphs as functions of the graph size and edge probability.
III-A Cutoff for detection of planted dense subgraphs
In the following calculations, we denote by $X_k$ the random variable enumerating the corresponding subgraphs of size $k$ in a $G(n,p)$. From the linearity of expectation and the independence of the edges, $\mathbb{E}[X_k]$ can be calculated as:

\[
\mathbb{E}[X_k] = \sum_{|S| = k} \mathbb{E}\left[\mathbb{1}_{\{G[S] \cong F\}}\right] = \binom{n}{k} \Pr\left[H \cong F\right]
\]

where $\mathbb{1}_{\{\cdot\}}$ is the indicator random variable for the occurrence of the corresponding event, the planted subgraph pattern is denoted by $F$, $\cong$ denotes graph isomorphism, $H$ is a random variable denoting an induced subgraph of size $k$, and the probability is evaluated with respect to $G(n,p)$.
The calculation for each subgraph relies only on the first moment method, and is hence insufficient as a proof of an exact information-theoretic lower bound. However, it is sufficient for providing intuition on the threshold values. Moreover, one can show that the true cutoff values are of the same order of magnitude as the ones we present: letting $k_0$ be the value we find in each case, multiplied by $(1+\epsilon)$, one can easily verify that in these cases $\mathbb{E}[X_{k_0}] \to 0$, so using Markov's inequality, $\Pr[X_{k_0} \geq 1] \leq \mathbb{E}[X_{k_0}] \to 0$.
III-A1 Undirected Clique
Matula  was the first to show that the threshold function for the size of the maximal clique in an undirected $G(n,p)$ graph is

\[
k^* = 2\log_{1/p} n - 2\log_{1/p}\log_{1/p} n + O(1).
\]

This is a stronger claim than finding the minimal $k$ for which $\mathbb{E}[X_k] < 1$.
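For intuition, the first moment bound for the undirected clique already recovers the leading term of Matula's threshold (a sketch, with constants suppressed):

```latex
\mathbb{E}[X_k] \;=\; \binom{n}{k}\, p^{\binom{k}{2}}
\;\le\; n^k\, p^{\frac{k(k-1)}{2}}
\;=\; \exp\!\left( k \ln n \;-\; \frac{k(k-1)}{2}\, \ln \frac{1}{p} \right)
```

which drops below 1 as soon as $k - 1 > 2\log_{1/p} n$, i.e. for $k$ above roughly $2\log_{1/p} n$.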
III-A2 Directed Acyclic Cliques
Here and in the next two cases, we only find the minimal $k$ for which $\mathbb{E}[X_k] < 1$. Note that the building process of the DAC indicates the number of possible ways to form one on a specific set of vertices (one per ordering), and therefore:

\[
\mathbb{E}[X_k] = \binom{n}{k}\, k!\; p^{\binom{k}{2}}
\]

From this expression, one can see that the threshold value of $k$ is $k^* \approx 2\log_{1/p} n$.
III-A3 2-plexes

We denote the size of the 2-plex by $k$, and randomly split the $k$ chosen vertices into pairs (possibly with one vertex left alone); the non-edges of the 2-plex form such a split. The number of such splits is $\frac{k!}{2^{\lfloor k/2 \rfloor}\, \lfloor k/2 \rfloor!}$, hence

\[
\mathbb{E}[X_k] = \binom{n}{k}\, \frac{k!}{2^{\lfloor k/2 \rfloor}\, \lfloor k/2 \rfloor!}\; p^{\binom{k}{2} - \lfloor k/2 \rfloor}\, (1-p)^{\lfloor k/2 \rfloor}
\]

Now, using Stirling's approximation and $\binom{n}{k} \approx \frac{n^k}{k!}$, the dominant terms are the same as in the clique case. As a result, the threshold value of $k$ is again $k^* \approx 2\log_{1/p} n$, up to lower order terms.
III-A4 Bicliques

For calculating $\mathbb{E}[X_k]$, we choose the vertices in two steps: first, from all $n$ vertices choose $\lceil k/2 \rceil$ for one side, and then, from the remaining $n - \lceil k/2 \rceil$ choose the other $\lfloor k/2 \rfloor$ for the second side. Therefore, up to a constant,

\[
\mathbb{E}[X_k] \approx \binom{n}{\lceil k/2 \rceil} \binom{n - \lceil k/2 \rceil}{\lfloor k/2 \rfloor}\; p^{\lceil k/2 \rceil \lfloor k/2 \rfloor}\, (1-p)^{\binom{\lceil k/2 \rceil}{2} + \binom{\lfloor k/2 \rfloor}{2}}
\]

Approximating as before, the threshold value of $k$ is again logarithmic in $n$, $k^* = \Theta(\log n)$.
III-A5 Dense random subgraphs
Here we plant a subgraph isomorphic to a random $G(k,q)$ in a $G(n,p)$. In this case, we do not limit ourselves to a specific subgraph pattern, hence the detection threshold for the presence of such a subgraph is not a well-defined question. Instead, we answer a slightly different question and take the answer as our baseline: we denote here by $X_k$ the number of subgraphs of size $k$ with at least $q\binom{k}{2}$ edges (assuming $q > p$), i.e. a density at least as large as the density a typical $G(k,q)$ is expected to have. The threshold value will be a lower bound for the intuitive cutoff value of finding "a copy of $G(k,q)$" in our graph.
The calculation here considers the number of choices of the $k$ vertices and the number of choices of the $q\binom{k}{2}$ edges that should exist between the chosen vertices. We do not care about the other edges of the subgraph. Therefore, using the same approximations as in the other cases:

\[
\mathbb{E}[X_k] \approx \binom{n}{k} \binom{\binom{k}{2}}{q\binom{k}{2}}\; p^{q\binom{k}{2}}
\]

Again, the threshold order of magnitude of $k$ is logarithmic in $n$, $k^* = \Theta(\log n)$.
III-B Clique recovery
III-B1 In dense graphs, a GCN must include a penalty for the absence of edges
GCNs are often used for the embedding of the graph vertices. To create this embedding, the GCN layer uses both the existing feature representation and the graph topology, input to this layer as $H$ and $\hat{A}$ respectively, such that each vertex $v$ will be represented as:

\[
h_v' = \sigma\Big( \big(\textstyle\sum_{u} \hat{A}_{vu}\, h_u \big)\, W \Big)
\]

Note that the term in the inner parentheses determines which vertices are used to build the new representation of each vertex, and how. For example, if $\hat{A}$ is simply $A$ (or $A^T$ for directed graphs), then the vertex representation is built from the previous representations of its first neighborhood, and if $\hat{A} = A + I$ (or $A^T + I$ for directed graphs), then the neighbors and the vertex itself are used for the new representation.
GCNs are typically used on real-world graphs, which are much sparser than $G(n,p)$ graphs with $p$ close to 0.5. Hence, the embedding of vertices in these graphs is built based on a small number of vectors, since the neighborhoods of the vertices are small. However, in dense graphs, the embedding is built based on $O(n)$ vertices. In addition, in the task of detecting dense subgraphs, the absence of edges is of great importance; therefore, we use $\hat{A}$ with negative weights in the entries corresponding to missing edges (i.e. if $A_{uv} = 0$, then $\hat{A}_{uv}$ is negative), and the output vector of each vertex is built based on the input vectors of all vertices.
Putting negative weights on missing edges creates a problem: considering a $G(n,p)$ instance with $p \neq \tfrac12$, the GCN layer creates a vector representation that is biased, due to the difference between the number of positive and negative contributions. Due to this bias, the training process fails. Moreover, PYGON contains several GCN layers, and the bias grows exponentially with the number of layers. This problem can be fixed by multiplying the weights of the positive edges by a constant $\eta$, such that for a random vertex $v$:

\[
\mathbb{E}\left[\eta\,|N(v)| - \big(n - 1 - |N(v)|\big)\right] = 0
\]

where $N(v)$ is the neighborhood of $v$, leading to $\eta = \frac{1-p}{p}$, as in (2c). Setting the initial edge weights by $\eta$ sets an unbiased starting point, from which the model can be trained to adapt to the task at hand.
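The effect of this correction can be checked numerically (a sketch; variable names are ours):

```python
import numpy as np

n, p = 2000, 0.3
rng = np.random.default_rng(0)
A = (rng.random((n, n)) < p).astype(float)
A = np.triu(A, 1)
A = A + A.T                                   # symmetric G(n, p), zero diagonal

uncorrected = np.where(A == 1, 1.0, -1.0)     # +1 on edges, -1 on non-edges
np.fill_diagonal(uncorrected, 0.0)

eta = (1 - p) / p                             # edge weights re-scaled by eta
corrected = np.where(A == 1, eta, -1.0)
np.fill_diagonal(corrected, 0.0)

# per-vertex bias: roughly (n-1)(2p-1) without the correction, ~0 with it
print(uncorrected.sum(axis=1).mean())
print(corrected.sum(axis=1).mean())
```

Without the correction, every message-passing step adds a deterministic offset of order $n$ to each vertex representation; with $\eta = \frac{1-p}{p}$, the offset vanishes in expectation.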
Fig. 3 shows the importance of using this correction. The expected bias, $(n-1)(2p-1)$ per vertex without the correction, grows with the deviation of $p$ from $\tfrac12$, and the performance of PYGON without the correction decreases dramatically, since training the model becomes practically impossible.
In every test of PYGON, we tried to find a planted subgraph of known pattern and size (the values of $k$ we used were large enough compared to the cutoff value, so that the planted subgraphs were almost surely the only subgraphs of those patterns and sizes), in graphs with specific, known $n$ and $p$. We initialized 20 random realizations (i.e. 20 graphs with planted subgraphs of the desired size) per instance, separated them into 5 equal folds and ran 5 cross-validations, such that in every run, 3 folds (i.e. 12 graphs) were used as training graphs, 1 as validation and 1 as test. The performance was measured by averaging the desired measure over the 20 graphs when they were used as test graphs.
III-B2 Performance analysis
PYGON outputs a score for each vertex in the test graphs. Using this score, one can order the vertices of each graph, where the vertices that form the planted clique should ideally be the top vertices (those with the highest scores from the model). Choosing the top scoring vertices is the step before the cleaning/extending algorithm, which should produce the final set of vertices.
We compared the results of PYGON to the results of . Although  shows better performance, we do not use a hint vertex as they do; hence, they have crucial information that we do not have. Furthermore, they apply search algorithms whose time complexity is higher than that of the other existing algorithms, while our main performance tests use an algorithm of lower time complexity, making the comparison to  even more unfair.
To compare the performance of PYGON to  in clique recovery, we generated instances of graphs of sizes 128 to 8192, equally spaced on a logarithmic scale (base 2). We looked for the value of $k$ for which PYGON recovers 50% of the planted clique among the top-scored vertices. The hyper-parameters of our model are specified in Appendix A. The results are shown in Fig. 4. A simple regression shows that PYGON can recover cliques of sizes proportional to $\sqrt{n}$ using 3-motifs as the initial features, with a slightly larger constant when using the degree as the initial feature. The reason we chose a 50% recovery criterion is that using a cleaning procedure similar to , we reached success rates of roughly 50% as well.
III-C Recovery of other dense subgraphs
As mentioned previously, PYGON is capable of recovering various dense subgraphs without adaptations to specific subgraphs. Using the exact same framework (only trained on the relevant subgraph each time) and hyper-parameters, we tested the threshold value of $k$ for which we find at least half the subgraph among the top-scored vertices, for the subgraphs presented above. The results are shown in Fig. 5, where we present the performance of PYGON with either 3-motifs or degrees as the initial features. Indeed, PYGON can recover the subgraphs presented above for sizes $k = O(\sqrt{n})$; namely, the performance of PYGON is not affected by the type of the subgraph it is required to recover.
As no other known algorithm performs on both cliques and other subgraphs, we also compare the performance of PYGON to other clique recovery algorithms (Alon et al. , Dekel et al.  and Deshpande and Montanari ). In Fig. 6, one can see that the only model unaffected by the structure of the planted subgraph and by whether the graph is directed is PYGON.
III-D Time complexity
The theoretical time complexity of our algorithm is dominated by the feature calculations and the GCN layers. As shown in II-B2, calculating 3-motifs as the initial features is the most expensive step, whereas using degrees as the initial features, the feature-calculation complexity decreases to $O(n^2)$. According to , the time complexity of a GCN layer is quadratic in $n$ (for a dense graph matrix), scaled by the input and output feature dimensions. Therefore, the total time complexity of PYGON is dominated by the motif calculation when 3-motifs are used as initial features, and is quadratic in $n$ (up to the feature dimensions) when degrees are used.
Recall that the time complexity of PYGON with degrees as initial features is of the same order of magnitude as that of , and lower than that of . Even when we use 3-motifs as initial features, the use of a GPU in our algorithm (both for the feature calculations and for the neural network) shortens the computation time substantially, such that the running times of PYGON and some existing methods are similar (42 seconds for DM vs. less than 50 for PYGON, on a test of 20 graphs). Note that PYGON still requires training, a stage that tends to be longer than the other algorithms' full runs; however, once the model is trained, any new graph can be fed into it without additional steps.
As mentioned in II-B2, one can choose different initial features, including the option of not calculating features at all (in this case, each vertex is initially represented by a one-hot vector). The option of using cheap features should be considered, since the feature calculations take the majority of the computation time PYGON needs (especially when using expensive features such as 3-motifs). Fig. 7 shows the performance of PYGON using different sets of initial features. One can see that the initial features affect the performance, although not by much; hence saving time by running PYGON with less expensive initial features (e.g. degrees only, or scores from a few iterations of belief propagation) is possible.
III-E Extension of subgraph candidates to full subgraph
In many existing algorithms, the last one or few stages are intended to take the remaining set, which contains many vertices belonging to the planted subgraph, then clean and extend it to recover the whole subgraph. Using the outputs of PYGON, one can apply a similar stage with any desired cleaning algorithm to find the planted subgraph.
In our use, we took the set of candidates before cleaning to be the vertices that received the highest scores from PYGON. The size of this set balances two considerations: on one hand, selecting too many vertices increases the computational cost of the cleaning algorithm; on the other hand, selecting too few vertices risks losing too much of the signal, causing the algorithm to fail in finding the entire subgraph. We tested several candidate-set sizes and chose the most appropriate one for the algorithm we used.
Specifically, for the task of recovering the clique, we used the method described in Algorithm 1, which is similar to the cleaning algorithm of . This algorithm is applied per graph, independently of all other graphs. Fig. 8 shows the recovery rate of the algorithm applied to the outputs of PYGON, compared to the percentage of clique vertices caught by PYGON. One can see that the performances are very similar, and indeed, when PYGON finds at least half the clique among the proposed candidates, the cleaning algorithm shows almost exactly the same performance on the task of fully recovering the clique.
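A minimal cleaning-and-extension procedure in this spirit (a sketch, not the paper's exact Algorithm 1) can be written as:

```python
import numpy as np

def clean_candidates(A, candidates, k):
    """Iteratively drop the candidate with the fewest neighbors inside the
    candidate set until k vertices remain, then extend the core with any
    vertex adjacent to all core members."""
    cand = set(candidates)
    while len(cand) > k:
        # within-candidate degree of each remaining candidate
        inner = {v: sum(A[v, u] for u in cand if u != v) for v in cand}
        cand.remove(min(inner, key=inner.get))   # drop the loosest vertex
    core = set(cand)
    for v in range(A.shape[0]):                  # extension phase
        if v not in core and all(A[v, u] == 1 for u in core):
            core.add(v)
    return core
```

Clique vertices keep a within-candidate degree of at least $k-1$ throughout, while noise candidates have roughly half that, so the removal order strips the noise first.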
We presented PYGON, a novel method to recover planted dense subgraphs in $G(n,p)$ graphs. This is the first method to use modern machine learning tools on this problem. The framework of PYGON combines known elements of machine learning, particularly on graphs, into a model for recovering subgraphs. The novelty of this approach is the new point of view on the problem of recovering subgraphs: the model is trained on several realizations of the problem with known planted subgraphs, creating a tool that can help recover planted subgraphs in new realizations. (Training on similar realizations of the problem is crucial; when training PYGON on realizations of one problem and testing on a different problem, the performance decreases significantly compared to training and testing on the same problem.)
The learning process of PYGON involves the commonly used GCN layers, which pass messages through the graph to embed each vertex according to the topology of the graph, and predict the probability of each vertex belonging to the planted subgraph. The idea of passing messages between vertices had already been used in , but there are two main differences between that algorithm and ours: first, PYGON uses a limited, fixed number of message passes (i.e. a finite number of GCN layers); second, the messages we pass are not constant from one pass to another (not even in the embedding dimension between iterations). We have shown that, despite passing messages for fewer iterations and using an algorithm of lower time complexity, our performance is close to the performance of .
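A single GCN propagation step of the kind used here follows the standard Kipf–Welling formulation. The following is a minimal numpy sketch (the function name, the ReLU nonlinearity and dense matrices are assumptions for illustration; the actual layers also include dropout and learned dimensions):

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN propagation step: H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W).
    A: adjacency matrix, H: vertex features, W: learned weight matrix."""
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))  # symmetric degree normalization
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)
```

Stacking a fixed number of such layers corresponds to the limited number of message passes described above.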
Although the focus here was on the planted clique problem, PYGON works for different planted subgraphs, in both directed and undirected graphs, unlike any other known algorithm. Moreover, the existing methods are built to detect density, not a specific structure of a planted subgraph, at least in the initial recovery of the subgraph core. The learning approach has helped us recover subgraphs of specific structures, of the same order of magnitude as the cliques we were able to recover.
However, PYGON is not built to solve some other known subgraph recovery problems, such as recovering a planted subgraph in a bipartite graph (e.g. the planted biclique problem), the k-densest subgraph problem, the maximum common subgraph problem, and recovering a planted non-dense specific subgraph. In addition, we have tested PYGON only in cases where a single subgraph was planted, and did not address cases with hint vertices, such as in . We believe that the learning approach can be extended to most of these problems, if not all of them, and may help in the cases of multiple planted subgraphs or known partial information about the planted subgraph(s). Such extensions would require changing the structure of the learning layers (which are currently a specific implementation of a GCN layer).
We have shown that PYGON, like all other existing approaches, does not break through the threshold of $O(\sqrt{n})$, for any of the dense subgraphs, further suggesting that recovering planted cliques of smaller sizes might be impossible in polynomial time. We propose the following conjecture:
Let $G$ be a $G(n, p)$ graph, for $p$ independent of $n$, in which we plant a subgraph of size $k$. Then, if $k$ is smaller than $O(\sqrt{n})$, there is no PAC-learning algorithm, running in polynomial time of a fixed degree, that can detect the planted subgraph.
Roughly speaking, our conjecture is that for a planted subgraph of size smaller than $O(\sqrt{n})$, no fixed-degree polynomial time algorithm can learn from any number of examples and output (with large probability) a function that classifies, with small average error, whether each vertex belongs to the subgraph.
Lastly, we used 3-motifs here as input features. One may think of this usage as a step beyond Kučera's use of degrees (i.e. 2-motifs) to detect cliques of size $\Omega(\sqrt{n \log n})$. We believe that for planted cliques, and possibly other subgraphs, of any known size larger than the detection threshold of the subgraph, one can find an appropriate motif size $m$ (significantly smaller than the clique size, but probably dependent on it), such that the use of $m$-motifs can help recover the planted subgraph with high probability and in quasi-polynomial time, shorter than that of the known exhaustive-search algorithm. For now, this question is left unanswered.
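For undirected graphs, per-vertex 3-motif features reduce to counting, around each vertex, open and closed triples. A minimal sketch of such a count (the function name and the adjacency-dict representation are assumptions; the paper computes motif counts with a dedicated package):

```python
def vertex_3motif_features(adj):
    """For each vertex, return (degree, triangle count): the degree is the
    2-motif count, and the triangle count is the number of closed 3-motifs
    the vertex participates in. adj maps each vertex to its neighbor set."""
    feats = {}
    for v, nbrs in adj.items():
        # each triangle through v is a pair of adjacent neighbors of v,
        # counted once from each endpoint, hence the division by 2
        tri = sum(len(adj[u] & nbrs) for u in nbrs) // 2
        feats[v] = (len(nbrs), tri)
    return feats
```

Clique vertices stand out because their triangle counts grow quadratically with the clique size, while their degrees grow only linearly.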
Appendix A Model Hyper-Parameters
The following PYGON hyper-parameters were found using NNI :
In most figures, we take the initial feature matrix to be built from 3-motifs (hence, for undirected graphs it has 2 columns, and for directed graphs it has 6 columns). Several specific figures also include PYGON models with only degrees as the input features. Either input matrix is normalized by taking a logarithm and then z-scoring over the columns (over all the graphs together). The learned model is composed of 5 GCN layers, with hidden layers of sizes 225, 175, 400 and 150 (the input dimension is the number of features and the output dimension is 1). The dropout rate is 0.4. The model is trained for at most 1000 epochs (due to early stopping, the training process stops much earlier, after at most 250 epochs). The optimizer is Adam (with the default hyper-parameters of the PyTorch implementation), with a learning rate of 0.005 and an $L_2$ regularization coefficient of 0.0005.
Appendix B Best tested loss function definition
In order to try to find a preferable loss function, we used a more general definition of the loss function minimized in the training process of PYGON, composed of a weighted binary cross-entropy term, a pairwise loss term and a binomial regularization term.
The first term is a weighted binary cross-entropy function. We used weights to correct the severe imbalance between the positive and negative samples (i.e. vertices from the planted subgraph and vertices outside it, respectively).
The last two terms are additions to the loss function as in , specifically for the task of recovering cliques. The pairwise loss enforces that the model scores of two non-adjacent vertices are never both close to 1, and encourages two adjacent vertices to both have high scores (though the penalty for violating the latter is lower than for the former). The binomial regularization keeps the number of vertices predicted to be clique vertices from growing too large.
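An illustrative combination of the three terms can be sketched as follows. This is a hedged sketch, not the exact loss of the cited work: the precise functional forms of the pairwise and binomial terms are assumptions, and the function name and weights are for illustration only:

```python
import numpy as np

def pygon_style_loss(p, y, edges, n, k, w_pos=1.0, lam_pair=0.0, lam_binom=0.0):
    """Illustrative combined loss:
    - class-weighted binary cross-entropy (w_pos upweights clique vertices),
    - a pairwise penalty on non-adjacent vertex pairs both scoring near 1,
    - a binomial-style regularizer pulling the predicted clique size toward k.
    p: predicted probabilities, y: 0/1 labels, edges: set of frozenset pairs."""
    eps = 1e-9
    bce = -np.mean(w_pos * y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    pair = sum(p[i] * p[j]
               for i in range(n) for j in range(i + 1, n)
               if frozenset((i, j)) not in edges)
    binom = (p.sum() - k) ** 2
    return bce + lam_pair * pair + lam_binom * binom
```

With `lam_pair = lam_binom = 0` this reduces to the weighted binary cross-entropy, which, as noted below in Appendix B, is the crucial term in practice.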
In PYGON, the pairwise loss and the binomial regularization are either ineffective or damage the performance, whereas the weighted binary cross-entropy is crucial, as shown in Fig. 9.
Appendix C PYGON with GAT Layer
In order to demonstrate another level of flexibility in PYGON, we show here that when the learning layer is replaced from GCN to a Graph Attention Network (GAT) layer, the performance of the algorithm (denoted PYGAT) is similar. GAT is a graph neural network layer that receives as input a feature representation of the vertices of a graph and finds a new embedding of the vertices by message passing. This is done by applying a self-attention mechanism to learn the messages passed over each edge of the graph.
Here, we use a specific implementation of this layer, aiming to pass information between existing neighbors and (different) information between non-adjacent vertices. The input feature matrix is passed through a dropout layer, then a linear layer. Two learnable parameter vectors are matched to each vertex, one treating the vertex as a source and one as a target. From the inner products of the feature matrix with the learnable parameters, concatenated (source scores with target scores) and passed through LeakyReLU, softmax and then dropout, an attention coefficient is obtained for each edge. The features are then multiplied element-wise by the attention coefficients and summed, for each vertex, over its neighbors, to produce the output features. Before passing to the next layer, a skip connection is added to these features.
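The attention-coefficient computation above can be sketched for a single head as follows. This is a minimal numpy sketch of the GAT-style scoring (function and parameter names are assumptions; dropout, the linear layer and the skip connection are omitted for brevity):

```python
import numpy as np

def gat_attention(H, W, a_src, a_dst, adj, alpha=0.2):
    """Per-edge attention coefficients of a single GAT head:
    e_ij = LeakyReLU(a_src . (W h_i) + a_dst . (W h_j)),
    normalized by a softmax over each vertex's neighbors.
    adj maps each vertex to a list of its neighbors."""
    Z = H @ W                                  # projected features
    s = Z @ a_src                              # score of each vertex as source
    t = Z @ a_dst                              # score of each vertex as target
    att = {}
    for i, nbrs in adj.items():
        e = np.array([s[i] + t[j] for j in nbrs])
        e = np.where(e > 0, e, alpha * e)      # LeakyReLU
        e = np.exp(e - e.max())                # numerically stable softmax
        att[i] = dict(zip(nbrs, e / e.sum()))
    return att
```

The output features of the layer are then neighbor sums of the projected features weighted by these coefficients.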
Our version uses two-headed attention: one head is based on the original graph and the other on the complement graph, such that the output features of the layer are a concatenation of the output features of each attention head. Only the output layer is a GAT layer based solely on the original graph.
The hyper-parameters we used in PYGAT are the following:
The initial feature matrix is built from either 3-motifs or degrees, normalized exactly as in PYGON. The learned model is made of 4 GAT layers, with hidden layers of sizes 60, 120, 170 and 100. The dropout rate is 0.05. The model is trained for at most 1000 epochs, with the same early-stopping mechanism as in PYGON. The optimizer is SGD (with the default hyper-parameters of the PyTorch implementation), with a learning rate of 0.13 for the 3-motif runs and 0.175 for the degree runs, and an $L_2$ regularization coefficient of 0.0025 for the 3-motif runs and 0.005 for the degree runs.
Fig. 10 shows the performance of PYGON and PYGAT. As one may see, PYGON performs slightly better, but the differences are small.
-  R. M. Karp, “Reducibility among combinatorial problems,” in Complexity of Computer Computations, R. Miller and J. Thatcher, Eds. Plenum Press, 1972, vol. 40, pp. 85–103.
-  ——, “The probabilistic analysis of some combinatorial search algorithms,” in Algorithms and Complexity: New Directions and Recent Results, J. F. Traub, Ed. Academic Press, 1976, pp. 1–19.
-  M. Jerrum, “Large cliques elude the metropolis process,” Random Structures & Algorithms, vol. 3, no. 4, pp. 347–359, 1992.
-  L. Kučera, “Expected complexity of graph partitioning problems,” Discrete Applied Mathematics, vol. 57, no. 2-3, pp. 193–212, 1995.
-  B. Hajek, Y. Wu, and J. Xu, “Computational lower bounds for community detection on random graphs,” in Conference on Learning Theory, 2015, pp. 899–928.
-  P. A. Pevzner and S.-H. Sze, “Combinatorial approaches to finding subtle signals in DNA sequences,” in ISMB, vol. 8, 2000, pp. 269–278.
-  R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon, “Network motifs: simple building blocks of complex networks,” Science, vol. 298, no. 5594, pp. 824–827, 2002.
-  H. Javadi and A. Montanari, “A statistical model for motifs detection,” 2015.
-  E. Hazan and R. Krauthgamer, “How hard is it to approximate the best Nash equilibrium?” SIAM Journal on Computing, vol. 40, no. 1, pp. 79–91, 2011.
-  P. Austrin, M. Braverman, and E. Chlamtáč, “Inapproximability of NP-complete variants of Nash equilibrium,” Theory of Computing, vol. 9, no. 1, pp. 117–142, 2013.
-  N. Alon, A. Andoni, T. Kaufman, K. Matulef, R. Rubinfeld, and N. Xie, “Testing k-wise and almost k-wise independence,” in Proceedings of the thirty-ninth annual ACM symposium on Theory of computing, 2007, pp. 496–505.
-  Q. Berthet and P. Rigollet, “Complexity theoretic lower bounds for sparse principal component detection,” in Conference on Learning Theory, 2013, pp. 1046–1066.
-  P. Koiran and A. Zouzias, “Hidden cliques and the certification of the restricted isometry property,” IEEE Transactions on Information Theory, vol. 60, no. 8, pp. 4999–5006, 2014.
-  A. Juels and M. Peinado, “Hiding cliques for cryptographic security,” Designs, Codes and Cryptography, vol. 20, no. 3, pp. 269–280, 2000.
-  B. Applebaum, B. Barak, and A. Wigderson, “Public-key cryptography from different assumptions,” in Proceedings of the forty-second ACM symposium on Theory of computing, 2010, pp. 171–180.
-  S. Arora, B. Barak, M. Brunnermeier, and R. Ge, “Computational complexity and information asymmetry in financial products,” Communications of the ACM, vol. 54, no. 5, pp. 101–107, 2011.
-  N. Alon, M. Krivelevich, and B. Sudakov, “Finding a large hidden clique in a random graph,” Random Structures & Algorithms, vol. 13, no. 3-4, pp. 457–466, 1998.
-  Y. Deshpande and A. Montanari, “Finding hidden cliques of size $\sqrt{N/e}$ in nearly linear time,” Foundations of Computational Mathematics, vol. 15, no. 4, pp. 1069–1128, 2015.
-  B. Barak, S. Hopkins, J. Kelner, P. K. Kothari, A. Moitra, and A. Potechin, “A nearly tight sum-of-squares lower bound for the planted clique problem,” SIAM Journal on Computing, vol. 48, no. 2, pp. 687–735, 2019.
-  V. Feldman, E. Grigorescu, L. Reyzin, S. S. Vempala, and Y. Xiao, “Statistical algorithms and a lower bound for detecting planted clique,” Journal of the ACM (JACM), vol. 64, no. 2, pp. 1–37, 2017.
-  D. Gamarnik and I. Zadik, “The landscape of the planted clique problem: dense subgraphs and the overlap gap property,” 2019.
-  M. Brennan, G. Bresler, and W. Huleihel, “Reducibility and computational lower bounds for problems with planted sparse structure,” in Proceedings of Conference On Learning Theory (COLT), vol. 75, 2018, pp. 44–166.
-  N. Baldin and Q. Berthet, “Optimal link prediction with matrix logistic regression,” 2018.
-  D. Ron and U. Feige, “Finding hidden cliques in linear time,” Discrete Mathematics & Theoretical Computer Science, pp. 189–204, 2010.
-  B. P. Ames and S. A. Vavasis, “Nuclear norm minimization for the planted clique and biclique problems,” Mathematical programming, vol. 129, no. 1, pp. 69–89, 2011.
-  L. Lovász, “On the Shannon capacity of a graph,” IEEE Transactions on Information Theory, vol. 25, no. 1, pp. 1–7, 1979.
-  U. Feige and R. Krauthgamer, “Finding and certifying a large hidden clique in a semirandom graph,” Random Structures & Algorithms, vol. 16, no. 2, pp. 195–208, 2000.
-  V. Jethava, A. Martinsson, C. Bhattacharyya, and D. Dubhashi, “Lovász $\vartheta$ function, SVMs and finding dense subgraphs,” The Journal of Machine Learning Research, vol. 14, no. 1, pp. 3495–3536, 2013.
-  Y. Dekel, O. Gurel-Gurevich, and Y. Peres, “Finding hidden cliques in linear time with high probability,” Combinatorics, Probability and Computing, vol. 23, no. 1, pp. 29–49, 2014.
-  B. Hajek, Y. Wu, and J. Xu, “Recovering a hidden community beyond the Kesten–Stigum threshold in $O(|E| \log^* |V|)$ time,” Journal of Applied Probability, vol. 55, no. 2, pp. 325–352, 2018.
-  D. Koutra, A. Parikh, A. Ramdas, and J. Xiang, “Algorithms for graph similarity and subgraph matching,” in Proc. Ecol. Inference Conf, vol. 17, 2011.
-  M. C. Angelini, “Parallel tempering for the planted clique problem,” Journal of Statistical Mechanics: Theory and Experiment, vol. 2018, no. 7, p. 073404, 2018.
-  R. Marino and S. Kirkpatrick, “Revisiting the challenges of MaxClique,” 2018.
-  T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in Proceedings of the 6th International Conference on Learning Representations, 2017.
-  R. Abel, I. Benami, and Y. Louzoun, “Topological based classification using graph convolutional networks,” 2019.
-  R. Naaman, K. Cohen, and Y. Louzoun, “Edge sign prediction based on a combination of network structural topology and sign propagation,” Journal of Complex Networks, vol. 7, no. 1, pp. 54–66, 2019.
-  I. Levinas, R. Kuperman, and Y. Louzoun, “Graph-measures package.” [Online]. Available: https://github.com/louzounlab/graph-measures
-  Microsoft Corporation, “Neural network intelligence.” [Online]. Available: https://github.com/microsoft/nni
-  D. W. Matula, The Largest Clique Size in a Random Graph. Department of Computer Science, Southern Methodist University, 1976.
-  P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph attention networks,” in International Conference on Learning Representations, 2018.