A Note on Learning Algorithms for Quadratic Assignment with Graph Neural Networks

06/22/2017 · by Alex Nowak, et al.

Many inverse problems are formulated as optimization problems over certain appropriate input distributions. Recently, there has been a growing interest in understanding the computational hardness of these optimization problems, not only in the worst case, but in an average-complexity sense under this same input distribution. In this note, we are interested in studying another aspect of hardness, related to the ability to learn how to solve a problem by simply observing a collection of previously solved instances. These are used to supervise the training of an appropriate predictive model that parametrizes a broad class of algorithms, with the hope that the resulting "algorithm" will provide good accuracy-complexity tradeoffs in the average sense. We illustrate this setup on the Quadratic Assignment Problem, a fundamental problem in Network Science. We observe that data-driven models based on Graph Neural Networks offer intriguingly good performance, even in regimes where standard relaxation based techniques appear to suffer.


1 Introduction

Many tasks, spanning from discrete geometry to statistics, are defined in terms of computationally hard optimization problems. Loosely speaking, computational hardness appears when the algorithms that compute the optimum solution scale poorly with the problem size, say faster than any polynomial. For instance, in high-dimensional statistics we may be interested in the task of estimating a given object from noisy measurements under a certain generative model. In that case, the notion of hardness contains both a statistical aspect, which asks above which signal-to-noise ratio the estimation is feasible, and a computational one, which restricts the estimation to be computed in polynomial time. An active research area in Theoretical Computer Science and Statistics is to understand the interplay between those statistical and computational detection thresholds; see [1] and references therein for an instance of this program in the community detection problem, or [3, 7, 4] for examples of statistical inference tradeoffs under computational constraints.

Instead of investigating a designed algorithm for the problem in question, we consider a data-driven approach to learn algorithms from solved instances of the problem. In other words, given a collection of problem instances drawn from a certain distribution, we ask whether one can learn an algorithm that achieves good accuracy at solving new instances of the same problem – also drawn from the same distribution – and to what extent the resulting algorithm can reach those statistical/computational thresholds.

The general approach is to cast an ensemble of algorithms as neural networks with specific architectures that encode prior knowledge on the algorithmic class, parameterized by a set of trainable weights. The network is trained to minimize the empirical loss over a collection of solved instances, for a given measure of error, using stochastic gradient descent. This leads to yet another notion of learnability hardness, which measures to what extent the problem can be solved with no prior knowledge of the specific algorithm to solve it, but only a vague idea of which operations it should involve.
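As a rough illustration of this setup, the sketch below shows such a training loop in PyTorch; the model, loss, and data interfaces are placeholder choices of our own, not a description of any specific implementation.

```python
import torch

def train(model, solved_instances, epochs=10, lr=1e-2):
    """Minimal empirical-risk-minimization loop over a set of solved instances.

    `model` is any torch.nn.Module parametrizing the algorithm class, and
    `solved_instances` is a list of (input, solution) pairs drawn from the
    input distribution of interest."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()   # one possible measure of error
    for _ in range(epochs):
        for x, y in solved_instances:
            opt.zero_grad()
            loss = loss_fn(model(x), y)     # compare the prediction with the known solution
            loss.backward()
            opt.step()                      # stochastic gradient step
    return model
```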

In this revised version of [20] we focus on a particular NP-hard problem, the Quadratic Assignment Problem (QAP), and study data-driven approximations to solve it. Since the problem is naturally formulated in terms of graphs, we consider the so-called Graph Neural Network (GNN) model [27]. This neural network alternates between applying linear combinations of local graph operators – such as the graph adjacency or the graph Laplacian – and pointwise non-linearities, and has the ability to model some forms of non-linear message passing and spectral analysis, as illustrated for instance by the data-driven Community Detection methods in the Stochastic Block Model [6]. Existing tractable algorithms for the QAP include spectral alignment methods [30] and methods based on semidefinite programming relaxations [33, 10]. Our preliminary experiments suggest that the GNN approach taken here may be able to outperform the spectral and SDP counterparts on certain random graph models, at a lower computational budget. We also provide an initial analysis of the learnability hardness by studying the optimization landscape of a simplified GNN architecture. Our setup reveals that, for the QAP, the landscape complexity is controlled by the same concentration of measure phenomena that control the statistical hardness; see Section 4.

The rest of the paper is structured as follows. Section 2 presents the problem set-up and describes existing relaxations of the QAP. Section 3 describes the graph neural network architecture, Section 4 presents our landscape analysis, and Section 5 presents our numerical experiments. Finally, Section 6 describes some open research directions motivated by our initial findings.

2 Quadratic Assignment Problem

Quadratic assignment is a classical problem in combinatorial optimization. For A, B n × n symmetric matrices it can be expressed as

maximize trace(A X B X^T) subject to X ∈ Π,   (1)

where Π is the set of permutation matrices of size n × n. Many combinatorial optimization problems can be formulated as quadratic assignment. For instance, the network alignment problem consists, given the adjacency matrices A and B of two networks, in finding the best matching between them, i.e.:

minimize ||A X − X B||_F^2 subject to X ∈ Π.   (2)

By expanding the square in (2) one can obtain an equivalent optimization of the form (1). The value of (2) is 0 if and only if the graphs A and B are isomorphic. The minimum bisection problem can also be formulated as a QAP. This problem asks, given a graph, to partition it into two equal-sized subsets such that the number of edges across the partition is minimized. This problem, which is natural to consider in community detection, can be expressed as finding the best matching between a graph made of two equal-sized disconnected cliques and the given graph.
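For concreteness, the following NumPy snippet evaluates objectives (1) and (2) on a toy instance in which one graph is an exact relabelling of the other; it is only a sanity check of the formulation, not part of any algorithm discussed here.

```python
import numpy as np

def qap_objective(A, B, X):
    """Quadratic assignment objective trace(A X B X^T) for a permutation matrix X."""
    return np.trace(A @ X @ B @ X.T)

def alignment_cost(A, B, X):
    """Network alignment objective ||A X - X B||_F^2; it is 0 iff X maps A onto B."""
    return np.linalg.norm(A @ X - X @ B, "fro") ** 2

# Toy example: B is a relabelling of A, so the true permutation has alignment cost 0.
rng = np.random.default_rng(0)
A = (rng.random((5, 5)) < 0.4).astype(float)
A = np.triu(A, 1); A = A + A.T                 # symmetric adjacency, no self-loops
perm = rng.permutation(5)
X = np.eye(5)[perm]                            # permutation matrix
B = X.T @ A @ X                                # relabelled copy of A
print(alignment_cost(A, B, X))                 # 0.0
print(qap_objective(A, B, X), np.trace(A @ A)) # equal when B is an exact relabelling of A
```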

The quadratic assignment problem is known to be NP-hard and also hard to approximate [24]. Several algorithms and heuristics have been proposed to address the QAP, with different levels of success depending on the properties of A and B [11, 19, 23]. We refer the reader to [9] for a recent review of different methods and a numerical comparison. According to the experiments performed in [9], the most accurate algorithm for recovering the best alignment between two networks, in the distributions of problem instances considered below, is a semidefinite programming relaxation (SDP) first proposed in [33]. However, such a relaxation requires 'lifting' the variable X to an n^2 × n^2 matrix and solving an SDP that becomes practically intractable even for moderately sized n. Recent work [10] has further relaxed the semidefinite formulation to reduce its complexity, and proposed an augmented Lagrangian alternative to the SDP which is significantly faster but not as accurate, and which consists of an optimization algorithm with O(n^2) variables.

There are known examples where the SDP is not able to prove that two non-isomorphic graphs are actually not isomorphic (i.e. the SDP produces pseudo-solutions that achieve the same objective value as an isomorphism but that do not correspond to permutations [21, 32]). Such adverse examples consist of highly regular graphs whose spectra have repeated eigenvalues, the so-called unfriendly graphs [2]. We find the QAP to be a good case study for our investigations for two reasons. On the one hand, it is a problem that is known to be NP-hard but for which there are natural statistical models of inputs, such as models where one of the graphs is a relabelled small random perturbation of the other, on which the problem is believed to be tractable. On the other hand, producing algorithms capable of achieving this task for large perturbations appears to be difficult. It is worth noting that, for statistical models of this sort, when seen as inverse problems, the regimes in which recovering the original labeling is possible, impossible, or possible but potentially computationally hard are not fully understood.

3 Graph Neural Networks

The Graph Neural Network, introduced in [27] and further simplified in [18, 8, 28], is a neural network architecture based on local operators of a graph G = (V, E), offering a powerful balance between expressivity and sample complexity; see [5] for a recent survey on models and applications of deep learning on graphs.

Given an input signal x ∈ R^V on the vertices of G, we consider graph-intrinsic linear operators that act locally on this signal. The adjacency operator is the map x ↦ A x, where (A x)_i = Σ_{j: j∼i} x_j, with j ∼ i iff (i, j) ∈ E. The degree operator is the diagonal linear map x ↦ D x, where (D x)_i = deg(i) · x_i. Similarly, powers of A encode multi-hop neighborhoods of each node and allow us to aggregate local information at different scales, which is useful in regular graphs. We also include the average operator (U x)_i = (1/|V|) Σ_j x_j, which allows us to broadcast information globally at each layer, thus giving the GNN the ability to recover average degrees, or more generally moments of local graph properties. Denoting by F this family of generators (the identity, the degree and adjacency operators and their powers, and the average operator), a GNN layer receives as input a signal x^(k) on the vertices and produces an output x^(k+1) as

x^(k+1) = ρ( Σ_{B ∈ F} B x^(k) θ_B^(k) ),   (3)

where the θ_B^(k) are trainable parameters and ρ is a point-wise non-linearity, chosen in this work to be z ↦ max(0, z) for the first half of the output coordinates and the identity for the rest. We thus consider a layer with concatenated “residual connections” [13], which both eases the optimization when using a large number of layers and gives the model the ability to perform power iterations. Since the spectral radius of the learned linear operators in (3) can grow as the optimization progresses, the cascade of GNN layers can become unstable during training. In order to mitigate this effect, we use spatial batch normalization [14] at each layer. The network depth is chosen to be of the order of the graph diameter, so that all nodes obtain information from the entire graph. In sparse graphs with small diameter, this architecture offers excellent scalability and computational complexity.
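A minimal PyTorch sketch of a layer of the form (3) is given below. It uses a reduced generator family consisting of the identity, degree, adjacency and average operators, and the class and variable names are illustrative choices rather than the released implementation.

```python
import torch
import torch.nn as nn

class GNNLayer(nn.Module):
    """One layer of the form (3): x_out = rho(sum_{B in F} B @ x @ theta_B),
    with rho = ReLU on the first half of the output channels and identity on the rest."""
    def __init__(self, d_in, d_out):
        super().__init__()
        # One trainable matrix per generator in the reduced family {I, D, A, U}.
        self.thetas = nn.ParameterList(
            [nn.Parameter(torch.randn(d_in, d_out) / d_in ** 0.5) for _ in range(4)]
        )
        self.bn = nn.BatchNorm1d(d_out)          # spatial batch normalization

    def forward(self, A, x):                     # A: (n, n) adjacency, x: (n, d_in)
        n = A.shape[0]
        D = torch.diag(A.sum(dim=1))             # degree operator
        U = torch.full((n, n), 1.0 / n)          # average (global broadcast) operator
        generators = [torch.eye(n), D, A, U]
        z = sum(B @ x @ theta for B, theta in zip(generators, self.thetas))
        z = self.bn(z)
        half = z.shape[1] // 2
        return torch.cat([torch.relu(z[:, :half]), z[:, half:]], dim=1)
```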

Cascading layers of the form (3) gives us the ability to approximate a broad family of graph inference algorithms, including some forms of spectral estimation. Indeed, power iterations are recovered by bypassing the nonlinear components and sharing the parameters across the layers. Some authors have observed [12] that GNNs are akin to message passing algorithms, although the formal connection has not been established. GNNs also offer natural generalizations to process higher-order interactions, for instance using graph hierarchies such as line graphs [6] or using tensorized representations of the permutation group [17], but these are outside the scope of this note.

The choice of graph generators encodes prior information on the nature of the estimation task. For instance, in the community detection task, the choice of generators is motivated by a model from Statistical Physics, the Bethe free energy [26, 6]. In the QAP, one needs generators that are able to detect distinctive and stable local structures. Multiplicity in the spectrum of the graph adjacency operator is related to the (in)effectiveness of certain relaxations [2, 19] (the so-called (un)friendly graphs), suggesting that generator families containing non-commutative operators may be more robust on such examples.

4 Landscape of Optimization

In this section we sketch a research direction for studying the landscape of the optimization problem in a simplified setup. Let us assume that A is the adjacency matrix of a random weighted graph (symmetric) drawn from some distribution of interest, and let B = Π A Π^T + N, where Π is a permutation matrix and N represents a symmetric noise matrix.

For simplicity, let us assume that the optimization problem consists in finding a polynomial operator of degree d, p_β(A) = Σ_{k=0..d} β_k A^k. If A denotes the adjacency matrix of a graph, then its embedding is defined as E_A = p_β(A) Y, where Y is a random matrix with independent Gaussian entries and y_j is the j-th column of Y. Each column of Y is thought of as a random initialization vector for an iteration resembling a power method. This model is a simplified version of (3) in which ρ is the identity and the parameters are shared across layers. Note that p_β(Π A Π^T) = Π p_β(A) Π^T, so it suffices to analyze the case where Π is the identity.
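The following NumPy sketch instantiates this simplified model under the notation above (a polynomial of the adjacency matrix applied to random initialization vectors) and numerically checks the permutation equivariance that justifies taking Π to be the identity; the function names and parameter values are illustrative only.

```python
import numpy as np

def embedding(A, beta, Y):
    """Polynomial embedding E_A = p_beta(A) Y with p_beta(A) = sum_k beta_k A^k."""
    P = sum(b * np.linalg.matrix_power(A, k) for k, b in enumerate(beta))
    return P @ Y

rng = np.random.default_rng(0)
n, m, d = 8, 3, 2
A = rng.standard_normal((n, n)); A = (A + A.T) / 2      # symmetric "adjacency"
Y = rng.standard_normal((n, m)) / np.sqrt(n)             # random initialization vectors
beta = rng.standard_normal(d + 1)
Pi = np.eye(n)[rng.permutation(n)]                       # permutation matrix

# Equivariance: the embedding of the relabelled graph, computed with the
# correspondingly relabelled initialization, is the relabelled embedding.
lhs = embedding(Pi @ A @ Pi.T, beta, Pi @ Y)
rhs = Pi @ embedding(A, beta, Y)
print(np.allclose(lhs, rhs))                             # True
```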

We consider a loss comparing the embeddings E_A and E_B of the two graphs. Since A and B are symmetric, we can consider an eigenbasis for A with its respective eigenvalues and an eigenbasis for B with its respective eigenvalues, and decompose the columns of Y in these bases. For each realization of A and B, the loss can then be written, after a symmetrization that leaves the quadratic form unchanged, as a quotient of two quadratic forms in the coefficients β, which we denote (4); the objective is the expectation of this quotient over the input distribution.

For fixed A and B (let us not consider the expected value for a moment), a quotient of two quadratic forms β^T M β / β^T N β, where N is positive definite with Cholesky decomposition N = C C^T, is maximized at β = C^{-T} w, with w the top eigenvector of C^{-1} M C^{-T}.
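This fact can be checked numerically; in the snippet below, M and N are generic stand-ins for the two quadratic forms.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
M = rng.standard_normal((d, d)); M = (M + M.T) / 2        # symmetric numerator form
C0 = rng.standard_normal((d, d))
N = C0 @ C0.T + d * np.eye(d)                              # positive definite denominator form

C = np.linalg.cholesky(N)                                  # N = C C^T
evals, evecs = np.linalg.eigh(np.linalg.inv(C) @ M @ np.linalg.inv(C).T)
beta = np.linalg.solve(C.T, evecs[:, -1])                  # beta = C^{-T} w, w top eigenvector

quotient = lambda b: (b @ M @ b) / (b @ N @ b)
# The recovered beta attains the maximal generalized eigenvalue of the pair (M, N).
print(np.isclose(quotient(beta), evals[-1]))               # True
```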

Suppose that A is appropriately normalized, for instance A = n^{-1/2} W with W a Wigner matrix, i.e. a random symmetric matrix with i.i.d. normalized Gaussian entries. As n tends to infinity, the denominator of (4) rapidly concentrates around its expected value, where the convergence is almost sure according to the semicircle law (see for instance [29]).

For n large enough, due to concentration of measure, with high probability the denominator is close to its expectation, and similarly for the terms involving B (assuming that the noise level in N is small enough to allow concentration). Intuitively, we expect the denominator to concentrate faster than the numerator, due to the latter's dependency on the inner products between the eigenvectors of A and those of B; a formal argument is needed to make this statement precise. With this in mind, we can approximate the loss by a quotient whose denominator is deterministic.

If the denominator is replaced by this deterministic value, then, since the denominator is fixed, one can use its Cholesky decomposition to find the solution, similarly to the case where A and B are fixed. In this case the critical points are in fact eigenvectors of the corresponding matrix, and one can show that the loss has no poor local minima, since it amounts to optimizing a quadratic form on the sphere. When the denominator is random, one could expect that its concentration around the mean somewhat controls the presence of poor local minima. An initial approach is to use this concentration to bound the distance between the critical points of the loss and those of its 'mean field' equivalent.

One obtains an upper bound on this distance consisting of two terms. A more rigorous analysis shows that both terms in the upper bound tend to zero as n grows, due to the fast concentration of the denominator around its mean. Another possibility is to rely instead on topological tools such as those developed in [31] to control the presence of energy barriers in the landscape.

5 Numerical Experiments

Figure 1: Comparison of recovery rates for the SDP [25], LowRankAlign [9] and our data-driven GNN, for the Erdos-Renyi model (left) and random regular graphs (right). All graphs have n nodes and edge density p_e. The recovery rate is measured as the average number of correctly matched nodes relative to the ground truth. Experiments have been repeated 100 times for every noise level, except for the SDP, which was repeated 5 times due to its high computational complexity.

We consider the GNN and train it to solve random planted problem instances of the QAP. Given a pair of graphs G_1, G_2 with n nodes each, we consider a siamese GNN encoder producing normalized embeddings E_1, E_2. Those embeddings are used to predict a matching as follows. We first compute the outer product E_1 E_2^T, which we then map to a stochastic matrix by taking the softmax along each row/column. Finally, we use a standard cross-entropy loss to predict the corresponding permutation index. We perform experiments with the proposed data-driven model for the graph matching problem; code is available at https://github.com/alexnowakvila/QAP_pt. Models are trained using Adamax [16] with batches of size 32. We note that the complexity of this algorithm is at most O(n^2).
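A schematic version of this matching head is sketched below in PyTorch; it assumes the siamese encoder has already produced the embeddings, and it is a simplified reading of the procedure rather than the released code.

```python
import torch
import torch.nn.functional as F

def matching_loss(E1, E2, perm):
    """E1, E2: (n, d) normalized node embeddings of the two graphs,
    perm: (n,) long tensor with node i of graph 1 matched to perm[i] in graph 2."""
    scores = E1 @ E2.T                   # (n, n) outer product of embeddings
    # cross_entropy applies the row-wise softmax internally, so matching node i
    # is treated as an n-way classification problem over the nodes of graph 2.
    return F.cross_entropy(scores, perm)

def predict_matching(E1, E2):
    """Greedy decoding of the predicted permutation (argmax per row)."""
    return (E1 @ E2.T).argmax(dim=1)
```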

5.1 Matching Erdos-Renyi Graphs

In this experiment, we consider A to be a random Erdos-Renyi graph with edge density p_e. The graph B is a small perturbation of A according to the following error model, considered in [9]:

B = A ⊙ (1 − Q) + (1 − A) ⊙ Q',   (5)

where Q and Q' are binary random matrices whose entries are drawn from i.i.d. Bernoulli distributions such that P(Q_ij = 1) = p and P(Q'_ij = 1) = p · p_e / (1 − p_e), with p the noise level. This choice guarantees that the expected degrees of A and B are the same. We train a GNN with 20 layers and 20 feature maps per layer on a data set of 20k examples. We fix the input embeddings to be the degree of the corresponding node. In Figure 1 we report its performance in comparison with the SDP from [25] and the LowRankAlign method from [9].
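The snippet below generates a pair (A, B) following the error model as written in (5); the graph size, edge density and noise level are illustrative values only.

```python
import numpy as np

def perturb(A, p_edge, noise, rng):
    """Error model (5): remove each edge w.p. `noise`, add each non-edge
    w.p. noise * p_edge / (1 - p_edge), so the expected degree is preserved."""
    n = A.shape[0]
    Q  = (rng.random((n, n)) < noise).astype(float)
    Qp = (rng.random((n, n)) < noise * p_edge / (1.0 - p_edge)).astype(float)
    Q  = np.triu(Q, 1);  Q  = Q + Q.T       # keep the noise symmetric
    Qp = np.triu(Qp, 1); Qp = Qp + Qp.T
    return A * (1 - Q) + (1 - A) * Qp

rng = np.random.default_rng(0)
n, p_edge = 50, 0.2                                           # illustrative parameters
A = np.triu(rng.random((n, n)) < p_edge, 1); A = (A + A.T).astype(float)  # Erdos-Renyi graph
B = perturb(A, p_edge, noise=0.05, rng=rng)
print(abs(A.sum() - B.sum()) / A.sum())      # relative change in edge count stays small
```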

5.2 Matching Random Regular Graphs

Regular graphs are an interesting example because they tend to be harder to align due to their more symmetric structure. Following the same experimental setup as in [9], A is a random regular graph generated using the method from [15], and B is a perturbation of A according to the noise model (5). Although B is in general not a regular graph, the “signal” to be matched to, A, is a regular graph. Figure 1 shows that in this case the GNN is able to extract stable and distinctive features, outperforming the non-trainable alternatives. We used the same architecture as in Section 5.1, but now, due to the constant node degree, the embeddings are initialized with the 2-hop degree.
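Reading the 2-hop degree as the number of length-2 walks leaving each node, a one-line NumPy version of this initialization, under that reading, is:

```python
import numpy as np

def two_hop_degree(A):
    """Number of walks of length 2 starting at each node: row sums of A @ A."""
    return (A @ A).sum(axis=1)
```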

6 Discussion

Problems are often labeled as being as hard as their hardest instance. However, many hard problems can be efficiently solved for a large subset of inputs. This note attempts to learn an algorithm for the QAP from solved problem instances drawn from a distribution of inputs, and evaluates the algorithm's effectiveness on new inputs drawn from the same distribution. This approach is general and not restricted to the QAP. In fact, another notable example is the community detection problem under the Stochastic Block Model (SBM) [6]. That problem is another particularly good case study because there exist very precise predictions for the regimes where the recovery problem is (i) impossible, (ii) possible and efficiently solvable, or (iii) believed to be possible but not solvable in polynomial time.

If one believes that a problem is computationally hard for most instances in a certain regime, then this would mean that no choice of parameters for the GNN could give a good algorithm. Conversely, even when efficient algorithms exist, it does not necessarily follow that one of them is expressible by a GNN. On top of this, even if such an algorithm exists, it is not clear whether it can be learned with Stochastic Gradient Descent on a loss function that simply compares predictions with known solved instances. However, experiments in [6] suggest that GNNs are capable of learning algorithms for community detection under the SBM essentially up to the optimal thresholds, when the number of communities is small. Experiments with a larger number of communities show that GNN models are currently unable to outperform Belief Propagation, a tractable baseline estimation method that achieves optimal detection up to the computational threshold. While this preliminary result is inconclusive, it may guide future research attempting to elucidate whether the limiting factor is indeed a gap between statistical and computational thresholds, or between learnable and computational thresholds.

The performance of these algorithms depends on which operators are used in the GNN. Adjacency matrices and Laplacians are natural choices for the types of problems we considered, but different problems may require different sets of operators. A natural question is to find a principled way of choosing the operators, possibly by querying graph hierarchies. Going back to the QAP, it would be interesting to understand the limits of this problem, both statistically [22] and computationally. In particular, the authors would like to better understand the limits of the GNN approach, and more generally of any approach that first embeds the graphs and then solves a linear assignment problem.

In general, understanding whether the regimes in which GNNs produce meaningful algorithms match the believed computational thresholds for some classes of problems is, in our opinion, a thrilling research direction. This approach has the advantage that the algorithms are learned automatically. However, they may not generalize: if the GNN is trained with examples below a certain input size, it is not clear that it will handle much larger inputs, which may require larger networks. Addressing this question requires non-asymptotic deviation bounds, which are, for example, well understood for problems such as planted clique.

References