Gumbel-softmax Optimization: A Simple General Framework for Combinatorial Optimization Problems on Graphs

09/16/2019 ∙ by Jing Liu, et al. ∙ Beijing Normal University 0

Many problems in real life can be converted to combinatorial optimization problems (COPs) on graphs, that is to find a best node state configuration or a network structure such that the designed objective function is optimized under some constraints. However, these problems are notorious for their hardness to solve because most of them are NP-hard or NP-complete. Although traditional general methods such as simulated annealing (SA), genetic algorithms (GA) and so forth have been devised to these hard problems, their accuracy and time consumption are not satisfying in practice. In this work, we proposed a simple, fast, and general algorithm framework called Gumbel-softmax Optimization (GSO) for COPs. By introducing Gumbel-softmax technique which is developed in machine learning community, we can optimize the objective function directly by gradient descent algorithm regardless of the discrete nature of variables. We test our algorithm on four different problems including Sherrington-Kirkpatrick (SK) model, maximum independent set (MIS) problem, modularity optimization, and structural optimization problem. High-quality solutions can be obtained with much less time consuming compared to traditional approaches.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Combinatorial optimization problems (COPs) ask for optimizing a given objective function whose domain is a discrete but large configuration space[1]. COPs arise in many disciplines and many real-world optimization problems can be converted to COPs. In computer science, there exist a large number of COPs defined on graphs, e.g., maximal independent set (MIS) and minimum vertex cover (MVC)[1]. In these problems, one is asked to give a largest (or smallest) subset of the graph under some constraints. In statistical physics, finding the ground state configuration of spin glasses model where the energy is minimized is another type of COPs[2]. Many problems of network science can also be treated as COPs, for example, modularity optimization[3] asks to specify which community one belongs to that maximize the modularity value. While, structural optimization asks to reveal the underlying network structure under some constraints. For example, finding the network structure to recover the observed network dynamics if the dynamical rules are given. This is also called network reconstruction problem[4, 5]. In general, the space of possible solutions of mentioned problems is typically very large and grows exponentially with system size, thus impossible to solve by exhaustion.

Many methods have been proposed to solve COPs. Traditional general optimization methods such as simulated annealing (SA)[6], genetic algorithm (GA)[7], extremal optimization (EO) [8]

usually suffer from slow convergence and are limited to system size up to thousand. Although there exist many other heuristic solvers such as local search

[9], they are usually domain-specific and require special knowledge.

Recently a great deal of effort has been devoted to applying machine learning to solve COPs. Khalil et al.[10]

used a combination of reinforcement learning and graph embedding called S2V-DQN to address combinatorial optimization problems on graphs. Li et al.

[11] used graph convolution networks (GCNs)[12]

with classic heuristics and achieved state-of-art performance compared to S2V-DQN. However, these two algorithms all belong to supervised learning, thus contain two stages of problem solving: first training the solver and then testing. Although relatively good solutions can be obtained efficiently, it takes a long time for training the solver and the quality of solutions depends heavily on the quality and the amount of the data for training, which is hardly for large graphs. Some techniques have been developed in reinforcement learning area which seeks to solve the problems directly without training and testing stages. For example, REINFORCE algorithm is a typical gradient estimator for discrete optimization. However, it suffers from high variance and time-consuming problems. Recently reparameterization trick developed in machine learning community such as Gumbel-softmax

[13, 14] provides another approach for differentiable sampling. It allows us to pass gradients through samples directly and this method has not been applied to solving COPs on graphs so far.

In this work, we present a novel optimization method based on Gumbel-softmax, called Gumbel-softmax optimization (CSO) method. Under the naive mean field assumption, we assume that the joint probability of a configuration is factorized by the product of marginal probabilities. The marginal probabilities, which stand for the states of nodes or edges, are parameterized by a set of learnable parameters. We can sample a configuration through Gumbel-softmax and compute the objective function. Then we run backpropagation algorithm and update parameters. Furthermore, GSO can be implemented in parallel easily on GPU to simultaneously solve the optimization problem under different initial conditions, and one solution with the best performance can be selected. We first introduce four different problem in Section

2. Then the details of GSO algorithm is explained in Section 3. In Section 4 we introduce our main results on all four problems, and compare with different benchmark algorithms. The results show that GSO can achieved satisfying solutions and also benefit from time consumption. Finally some concluding remarks are given in Section 5.

2 Preliminary

The combinatorial optimization problems that we concerned can be formulated as follows. Let’s consider variables where and each variable can take discrete values: . The combination of the values for all variables is defined as a configuration or a state . The objective function is a map between the configuration space and the real number field, . Thus, the goal of the combinatorial optimization problem is to find a configuration such that the objective function is minimized, i.e., for all . Usually, a set of constraints must be considered which can be formulated as boolean functions on . Nevertheless, they can be absorbed into the objective functions by introducing the punishment parameter. Therefore, we only consider the optimization problem without constraints in this paper for the simplicity.

Here, we consider two kinds of COPs on graphs: one is node optimization problem, the other is structure optimization problem. The node optimization problem can be stated as follows. Given a graph where is the set of vertices in and is the set of edges. Each vertex is a variable in the optimization.

The structure optimization problem is to find an optimal network structure such the objective function defined on graphs can be optimized. If we treat the adjacency matrix of the graph as the state of the system, and each entry of the matrix is a variable which taking value from , then structural optimization problem is also a COP.

In this work we focus on four typical optimization problems:

Sherrington-Kirkpatrick (SK) model

SK model[15] is a celebrated spin glasses model defined on a complete graph where all vertices are connected. The objective function, or the ground state energy of this model is defined as


where and is the coupling strength between two vertices. We are asked to give a configuration where the ground state energy is minimized. The number of possible configurations is and in fact, finding the ground state configuration of SK model is an NP-hard problem[2].

Maximal Independent Set (MIS)

MIS problem is another well-known NP-hard problem. The reason why we focus on MIS problem is that many other NP-hard problems such as Minimum Vertex Cover (MVC) problem and Maximal Clique (MC) problem can be reduced to an MIS problem. An MIS problem asks to find the largest subset such that no two vertices in are connected by an edge in . The Ising-like objective function consists of two parts:


where and the second term provides an penalty for two connected vertices being chosen.

is the hyperparameter for the strenghth of the punishment. The first term is the number of vertices in


Modularity optimization

Modularity is a graph clustering index for detecting community structure in complex networks[16]. Many modularity optimization methods, e.g., hierarchical agglomeration[17], spectral algorithm[3] and extremal optimization[18] have been proposed and it is suggested that maximizing modularity is strongly NP-complete[19]. In general cases where a graph is partitioned into communities, the objective is to maximize the following modularity :


where is the degree of vertex , , where is the number of communities, if and otherwise.

Structural optimization

The problem of structural optimization is to find an optimal network structure for some objective function defined on a network[20, 21]. This problem can also be considered as a special example of a COP if the element of the adjacency matrix is the optimized variable, and a configuration is the adjacency matrix of a potential graph, where is the number of nodes in the network which is a fixed value.

For example, we can find an optimal network to fit the observed node dynamics, in which, we are asked to optimize the network structure to match the given observed time series data of all nodes , where

is the node state vector at

, and is the length of the time window for our observation. The network dynamical rule such that is known but the underlying real network structure is unknown for us. Our task is to recover the real network structure by minimizing the following objective function:


Conventionally, this problem is also called network reconstruction problem in [4, 5].

3 Methodology

We will show that one particular computational approach can solve all the problems mentioned above in a unified way. Notice that although the objective functions in Equation 1, 2, and 3 all have Ising-like form, Equation 4 is much more complicated than others because the dynamical rule may be very complex and even non-analytic. However, we will show that our computational framework can work on all of them.

Naive mean field approximation

The basic idea of our method is to factorise the possible solutions into independent Bernouli variables, this is called naive mean field approximation in statistical physics. We assume that vertices in the network are independent and the joint probability of a configuration can be written as a product distribution[22]:


The marginal probability can be parameterized by a set of parameters which is easily generated by Sigmoid or softmax function.

We sample the possible solution according to Equation 5, and evaluate the objective function

under the given parameters. Then by using the automatic differential techniques empowered by current popular deep learning frameworks such as PyTorch


and TensorFlow

[24], we can obtain the gradient vector for all parameters of variables. And the parameters will be updated according to the gradient vector. However, the sampling process in the above procedure is not differentiable which is required in all deep learning frameworks to compute gradient vector because the sampling operation introduces stochasticity.

Gumbel-softmax technique

Traditionally we can resort to Monte Carlo gradient estimation techniques such as REINFORCE[25]. Gumbel-softmax[13], a.k.a. concrete distribution[14]

provides an alternative approach to tackle the difficulty of non-differentiability. Consider a categorical variable

that can take discrete values . This variable can be parameterised as a K-dimensional vector where is the probability that . Instead of sampling a hard one-hot vector, Gumbel-softmax technique give a K-dimensional sample vector where the i-th entry is



is a random variable following standard Gumbel distribution and

is the temperature parameter. Notice that as , the softmax function will approximate argmax function and the sample vector will approach a one-hot vector. And the one-hot vector can be regarded as a sample according to the distribution because the unitary element will appear on the element in the one-hot vector with probability , therefore, the computation of gumbel-softmax function can simulate the sampling process. Furthermore, this technique allows us to pass gradients directly through the “sampling” process because all the operations in Equation 6 are differentiable. In practice, it is common to adopt a annealing schedule from a high temperature to a small temperature.

Gumbel-softmax optimization

By introducing the Gumbel-softmax technique, the whole process of optimization solution can be stated as follows (also see Figure 1):

  1. Initialization for vertices: .

  2. Sample from simultaneously via gumbel-softmax technique and then calculate the objective function.

  3. Backpropagation to compute gradients and update parameters by gradient descent.

Figure 1: In (a), we cannot use backpropagation algorithm because the sampling from is stochastic and we cannot compute the gradients. In (b), however, due to the Gumbel-softmax technique, the gradients can flow from to .

We point out that our method can be implemented in parallel on GPU: we can simultaneously initialize different initial values and calculate objective functions. When the training procedure is finished, we select the result with best performance from candidates. We present our proposed method in Algorithm 1.

Input: Problem size , batch size , learning rate , and Graph for node state optimization problem or dynamical rule and node observations for structural optimization problem.
Output: solution with the best performance
Initialize ;
until Convergence;
Select solution with the best performance;
Algorithm 1 Gumbel-softmax Optimization (GSO)

Notice that to parallelize the algorithm on GPU, and , and their first dimensions are all batch dimensions.

4 Experiment

Experimental settings

We conduct experiments on all four problems described in Section 2. For SK model, we optimize the ground state energy with various sizes ranging from 256 to 8192. For MIS problem, we use citation network datasets: Cora, Citeseer and PubMed and treat them as undirected networks. For modularity optimization, we use four real-world datasets studied in [17, 3, 18]: Zachary, Jazz, C.elegans and E-mail. We compare our proposed method to other classical optimization methods and state-of-the-art deep learning approaches, e.g., (1) simulated annealing (SA)[6]: a general optimization method inspired by Metropolis-Hastings algorithm; (2) Extremal optimization (EO)[8]: a heuristic designed to address combinatorial optimization problems; (3) Structure2Vec Deep Q-learning (S2V-DQN)[10]: a reinforcement learning method to address optimization problems over graphs; (4) GCNs[11]: a supervised learning method based on graph convolutional networks (GCNs).

In practice, we choose the batch size . During optimization, we use Gumbel-softmax estimator to generate continuous samples to calculate the objective function but discrete samples for the final output. We use Adam optimizer[26] with different learning rate , annealing rate and report results with best performance.

As for the task of structural optimization, we choose Coupled Map Lattice on a given but unknown graph as our dynamical rule, which is widely used to study the chaotic dynamics of spatially extended systems:


where is the coupling constant, is the degree of node and is the set of neighbor nodes of on graph . Here, is the logistic map function . We conduct our experiments on random 4-regular graphs. We first sample an adjacency matrix and then estimate the error of dynamics according to Eq 4.


N I EO[27] SA GSO ()
time time (s) time (s)
256 5000 -0.74585(2) s -0.7278(2) 1.28 -0.7270(2) 0.75
512 2500 -0.75235(3) h -0.7327(2) 3.20 -0.7403(2) 1.62
1024 1250 –0.7563(2) h -0.7352(2) 15.27 -0.7480(2) 3.54
2048 400 - - -0.7367(2) 63.27 -0.7524(1) 5.63
4096 200 - - -0.73713(6) 1591.93 -0.7548(2) 8.38
8192 100 - - - - -0.7566(4) 26.54
Table 1: Results on optimization of ground state energy of SK model for different system size and instances . [27] only reported results of system size up to . In the implementation of simulated annealing, the program failed to finish within 96 hours for . We also present the comparison of time consumption.
N I GD (Adam) GD (L-BFGS) GSO () GSO ()
time (s) time (s) time (s) time (s)
256 5000 -0.6433(3) 2.84 -0.535(2) 2.29 -0.7270(2) 0.75 -0.7369(1) 0.69
512 2500 -0.6456(3) 2.87 -0.520(3) 2.56 -0.7403(2) 1.62 -0.7461(2) 1.61
1024 1250 -0.6466(4) 3.22 -0.501(5) 2.73 -0.7480(2) 3.54 -0.7522(1) 4.09
2048 400 -0.6493(2) 3.53 -0.495(8) 3.06 -0.7524(1) 5.63 -0.75563(5) 12.19
4096 200 -0.6496(5) 4.62 -0.49(1) 3.55 -0.7548(2) 8.38 -0.75692(2) 39.64
8192 100 -0.6508(4) 16.26 -0.46(2) 4.82 -0.7566(4) 26.54 -0.75769(2) 204.26
Table 2: Results on optimization of ground state energy of SK model for different system size and instances . We use different optimizer in gradient descent (GD). We also present the results of parallel version of our proposed method where we choose .

Table 1 and 2 show the results of different approaches on the task of optimizing ground state of energy of SK model of sizes 256, 512, 1024, 2048, 4096 and 8192. Note that: (1) Although extremal optimization (EO) has obtained better results, this method is limited to system size up to 1024 and is extremely time-consuming. (2) Under product distribution, the gradients of ground state energy can be obtained explicitly: from and where is the average magnetization, we have:


We can optimize this objective function through gradient descent (GD) with different optimizer, e.g. Adam or L-BFGS but the results are not satisfying. The reason is that there are too many frustrations and local minima in the energy landscape. (3) With the implementation of the parallel version, the results can be improved greatly.

Graph size S2V-DQN[10] GCNs[11] GD (L-BFGS) Greedy GSO
Cora 2708 1381 1451 1446 1451 1451
Citeseer 3327 1705 1867 1529 1818 1802
PubMed 19717 15709 15912 15902 15912 15861
Table 3: Results on MIS problems.

Table 3 presents the performance of MIS problem on three citation networks. Our proposed method has successfully found the global optimal solution on Cora dataset and obtained much better results compared to the sophisticated S2V-DQN[10] on all three dataset. Although our results are not competitive with [11], we must stressed that it is a supervised learning algorithm. Besides, it also adopts graph reduction techniques and a parallelized local search algorithm. Our method, however, requires none of these tricks.

Graph size [3] EO[27] GSO
No. comms No. comms No. comms
Zachary 34 0.3810 2 0.4188 4 0.4198 4
Jazz 198 0.4379 4 0.4452 5 0.4451 4
C. elegans 453 0.4001 10 0.4342 12 0.4304 8
E-mail 1133 0.4796 13 0.5738 15 0.5275 8
Table 4: Results on modularity optimization. We report the maximum modularity and the corresponding number of communities.
Figure 2: Partition result on Zachary network. The network is divided into four communities with modularity .

On the task of modularity optimization, our method also obtained satisfying results, as shown in Table 4. Comparing to hierarchical agglomeration[17], our proposed method has achieved much higher modularity on all datasets. In Figure 2 we present the partition obtained by our GSO method.

N , Accuracy (%)
10 0.2, 3.5 96.80
10 0.2, 3.8 100
30 0.2, 3.5 93.11
30 0.2, 3.8 100
Table 5: Results on structural optimization.

Here we present our results on the task of structural optimization in Table 5. We set coupling constant fixed. Because is the onset of chaos in the logistic map function, we choose and to present non-chaotic and chaotic dynamics respectively. For a random 4-regular graph with 10 and 30 nodes, our GSO method has obtained very good reconstruction accuracy and the performance obtained on chaotic dynamics is better than that on non-chaotic dynamics.

Time consumption

Figure 3: Time consumption of GSO compared to other approaches. (a): The time for simulated annealing (SA), gradient descent (GD) and our proposed GSO method on optimization of ground state energy of SK model. (b) A log-log plot of time versus system size and the slope is .

Table 1, 2 and Figure 3 shows the time consumption of different methods on SK model. We compare our proposed method with two baselines: gradient descent (GD) with Adam optimizer and simulated annealing (SA). Although gradient descent takes the least amount of time, its performance is the worst. Compared to simulated annealing, our method takes advantages in both performance and time consumption. As for extremal optimization, it is reported in [27] that it takes nearly 20 hours for and the algorithmic cost is , which is extremely inefficient and time-consuming. We also plot the dependence of time with system size in Figure 3. The slope of the log-log plot of time versus is 1.46, which indicates that the algorithmic cost is less than .

5 Conclusion

In this work, we have presented a novel optimization method, Gumbel-softmax optimization (CSO), for solving combinatorial optimization problems on graph. Our method is based on Gumbel-softmax, which allows the gradients passing through samples directly. We treat all vertices in the network are independent and thus the joint distribution is factorized as a product distribution, which enables Gumbel-softmax sample efficiently. Our experiment results show that our method has good performance on all four tasks and also take advantages in time complexity. Comparing to traditional general optimization method, GSO can tackle large graphs easily and efficiently. Though not competitive to state-of-the-art deep learning based method, GSO have the advantage of not requiring neither training the solver nor specific domain knowledge. In general, it is an efficient, general and lightweight optimization framework for solving combinatorial optimization problems on graphs.

However, there is much space to improve our algorithm on accuracy. In this paper, we take the naive mean field approximation as our basic assumption, however, the variables are not independent on most problems. Therefore, much more sophisticated variational inference must considered in the future. Another way to improve accuracy is to combine other skills such as local search in our framework.


  • [1] Richard M Karp. Reducibility among combinatorial problems. In Complexity of computer computations, pages 85–103. Springer, 1972.
  • [2] Marc Mézard, Giorgio Parisi, and Miguel Virasoro. Spin glass theory and beyond: An Introduction to the Replica Method and Its Applications, volume 9. World Scientific Publishing Company, 1987.
  • [3] Mark EJ Newman. Modularity and community structure in networks. Proceedings of the national academy of sciences, 103(23):8577–8582, 2006.
  • [4] Marc Timme. Revealing network connectivity from response dynamics. Physical review letters, 98(22):224101, 2007.
  • [5] Jose Casadiego, Mor Nitzan, Sarah Hallerberg, and Marc Timme. Model-free inference of direct network interactions from nonlinear collective dynamics. Nature communications, 8(1):2192, 2017.
  • [6] Scott Kirkpatrick, C Daniel Gelatt, and Mario P Vecchi. Optimization by simulated annealing. science, 220(4598):671–680, 1983.
  • [7] Lawrence Davis. Handbook of genetic algorithms. 1991.
  • [8] Stefan Boettcher and Allon Percus. Nature’s way of optimizing. Artificial Intelligence, 119(1-2):275–286, 2000.
  • [9] Diogo V Andrade, Mauricio GC Resende, and Renato F Werneck. Fast local search for the maximum independent set problem. Journal of Heuristics, 18(4):525–547, 2012.
  • [10] Elias Khalil, Hanjun Dai, Yuyu Zhang, Bistra Dilkina, and Le Song. Learning combinatorial optimization algorithms over graphs. In Advances in Neural Information Processing Systems, pages 6348–6358, 2017.
  • [11] Zhuwen Li, Qifeng Chen, and Vladlen Koltun. Combinatorial optimization with graph convolutional networks and guided tree search. In Advances in Neural Information Processing Systems, pages 539–548, 2018.
  • [12] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
  • [13] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings., 2017.
  • [14] Chris J. Maddison, Andriy Mnih, and Yee Whye Teh.

    The concrete distribution: A continuous relaxation of discrete random variables.

    In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017.
  • [15] David Sherrington and Scott Kirkpatrick. Solvable model of a spin-glass. Physical review letters, 35(26):1792, 1975.
  • [16] Santo Fortunato. Community detection in graphs. Physics reports, 486(3-5):75–174, 2010.
  • [17] Mark EJ Newman. Fast algorithm for detecting community structure in networks. Physical review E, 69(6):066133, 2004.
  • [18] Jordi Duch and Alex Arenas. Community detection in complex networks using extremal optimization. Physical review E, 72(2):027104, 2005.
  • [19] Ulrik Brandes, Daniel Delling, Marco Gaertler, Robert Görke, Martin Hoefer, Zoran Nikoloski, and Dorothea Wagner. Maximizing modularity is hard. arXiv preprint physics/0608255, 2006.
  • [20] Muzaffar M Eusuff and Kevin E Lansey. Optimization of water distribution network design using the shuffled frog leaping algorithm. Journal of Water Resources planning and management, 129(3):210–225, 2003.
  • [21] DE Boyce, A Farhi, and R Weischedel. Optimal network problem: a branch-and-bound algorithm. Environment and Planning A, 5(4):519–533, 1973.
  • [22] Martin J Wainwright, Michael I Jordan, et al. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1–2):1–305, 2008.
  • [23] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In NIPS-W, 2017.
  • [24] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.
  • [25] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
  • [26] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [27] Stefan Boettcher. Extremal optimization for sherrington-kirkpatrick spin glasses. The European Physical Journal B-Condensed Matter and Complex Systems, 46(4):501–505, 2005.