1 Introduction
Combinatorial optimization problems (COPs) ask for optimizing a given objective function whose domain is a discrete but large configuration space[1]. COPs arise in many disciplines, and many real-world optimization problems can be converted into COPs. In computer science, a large number of COPs are defined on graphs, e.g., maximal independent set (MIS) and minimum vertex cover (MVC)[1]. In these problems, one is asked to find a largest (or smallest) subset of vertices of a graph under some constraints. In statistical physics, finding the ground state configuration of a spin glass model, where the energy is minimized, is another type of COP[2]. Many problems in network science can also be treated as COPs. For example, modularity optimization[3] asks for an assignment of nodes to communities that maximizes the modularity value, while structural optimization asks to reveal an underlying network structure under some constraints, e.g., finding the network structure that recovers the observed network dynamics when the dynamical rules are given; this is also called the network reconstruction problem[4, 5]. In general, the space of possible solutions of these problems grows exponentially with system size, making exhaustive search infeasible.
Many methods have been proposed to solve COPs. Traditional general-purpose optimization methods such as simulated annealing (SA)[6], the genetic algorithm (GA)[7], and extremal optimization (EO)[8] usually suffer from slow convergence and are limited to system sizes of around a thousand. Although there exist many other heuristic solvers such as local search[9], they are usually domain-specific and require special knowledge. Recently, a great deal of effort has been devoted to applying machine learning to solve COPs. Khalil et al.[10]
used a combination of reinforcement learning and graph embedding, called S2V-DQN, to address combinatorial optimization problems on graphs. Li et al.
[11] used graph convolutional networks (GCNs)[12] together with classic heuristics and achieved state-of-the-art performance compared to S2V-DQN. However, both algorithms involve two stages of problem solving: first training the solver and then testing. Although relatively good solutions can be obtained efficiently at test time, training the solver takes a long time, and the quality of the solutions depends heavily on the quality and amount of training data, which is hard to obtain for large graphs. Some techniques developed in the reinforcement learning area seek to solve problems directly, without separate training and testing stages. For example, the REINFORCE algorithm is a typical gradient estimator for discrete optimization. However, it suffers from high variance and long computation times. Recently, reparameterization tricks developed in the machine learning community, such as Gumbel-softmax
[13, 14], provide another approach: differentiable sampling. They allow us to pass gradients through samples directly, and this method has not been applied to solving COPs on graphs so far. In this work, we present a novel optimization method based on Gumbel-softmax, called the Gumbel-softmax optimization (GSO) method. Under the naive mean field assumption, we assume that the joint probability of a configuration factorizes into a product of marginal probabilities. The marginal probabilities, which describe the states of nodes or edges, are parameterized by a set of learnable parameters. We sample a configuration through Gumbel-softmax and compute the objective function; we then run the backpropagation algorithm and update the parameters. Furthermore, GSO can easily be implemented in parallel on a GPU to solve the optimization problem under many different initial conditions simultaneously, after which the solution with the best performance is selected. We first introduce four different problems in Section
2. Then the details of the GSO algorithm are explained in Section 3. In Section 4 we present our main results on all four problems and compare with different benchmark algorithms. The results show that GSO achieves satisfying solutions and is also competitive in running time. Finally, some concluding remarks are given in Section 5.

2 Preliminary
The combinatorial optimization problems that we are concerned with can be formulated as follows. Consider $N$ variables $s_1, s_2, \dots, s_N$, where each variable can take $K$ discrete values: $s_i \in \{1, 2, \dots, K\}$. The combination of the values of all variables is called a configuration or a state $\mathbf{s} = (s_1, s_2, \dots, s_N)$. The objective function is a map from the configuration space to the real numbers, $E: \{1, \dots, K\}^N \to \mathbb{R}$. The goal of the combinatorial optimization problem is then to find a configuration $\mathbf{s}^*$ such that the objective function is minimized, i.e., $E(\mathbf{s}^*) \le E(\mathbf{s})$ for all $\mathbf{s}$. Usually a set of constraints must also be satisfied, which can be formulated as boolean functions of $\mathbf{s}$. Nevertheless, they can be absorbed into the objective function by introducing penalty terms. Therefore, for simplicity, we only consider optimization problems without explicit constraints in this paper.
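To make the formulation concrete, the sketch below enumerates all configurations of a toy binary problem by brute force; the objective and the penalty term (which absorbs the hypothetical constraint that the first two variables differ) are illustrative inventions, not a problem from this paper. The $K^N$ cost of this loop is exactly why exhaustive search is infeasible at scale.

```python
import itertools

def brute_force_minimize(objective, n, k=2):
    """Exhaustively search all k**n configurations (feasible only for tiny n)."""
    best_s, best_e = None, float("inf")
    for s in itertools.product(range(k), repeat=n):
        e = objective(s)
        if e < best_e:
            best_s, best_e = s, e
    return best_s, best_e

# Toy objective: maximize sum(s), with a penalty absorbing the constraint s_1 != s_2.
def toy_objective(s, alpha=10.0):
    return -sum(s) + alpha * (s[0] == s[1])

s_star, e_star = brute_force_minimize(toy_objective, n=4)
```

For $n = 4$ this visits only $2^4 = 16$ configurations, but the count doubles with every added variable.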
Here, we consider two kinds of COPs on graphs: node optimization problems and structure optimization problems. The node optimization problem can be stated as follows. Given a graph $G = (V, E)$, where $V$ is the set of vertices of $G$ and $E$ is the set of edges, each vertex is a variable in the optimization.
The structure optimization problem is to find an optimal network structure such that an objective function defined on graphs is optimized. If we treat the adjacency matrix of the graph as the state of the system, with each entry of the matrix a variable taking values from $\{0, 1\}$, then the structural optimization problem is also a COP.
In this work we focus on four typical optimization problems:
Sherrington-Kirkpatrick (SK) model
The SK model[15] is a celebrated spin glass model defined on a complete graph, where all vertices are connected. The objective function, i.e., the ground state energy of this model, is defined as
$$E(\mathbf{s}) = -\frac{1}{\sqrt{N}} \sum_{1 \le i < j \le N} J_{ij} s_i s_j \qquad (1)$$
where $s_i \in \{-1, +1\}$ and $J_{ij} \sim \mathcal{N}(0, 1)$ is the coupling strength between two vertices. We are asked to give a configuration that minimizes the ground state energy. The number of possible configurations is $2^N$ and, in fact, finding the ground state configuration of the SK model is an NP-hard problem[2].
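Equation 1 is straightforward to evaluate for a given configuration. Below is a small numpy sketch (problem size and seed are arbitrary) that draws Gaussian couplings and computes the SK energy; note that flipping all spins leaves the energy unchanged, since every term contains a product of two spins.

```python
import numpy as np

def sk_energy(s, J):
    """SK objective, Eq. (1): E = -(1/sqrt(N)) * sum_{i<j} J_ij * s_i * s_j."""
    N = len(s)
    iu = np.triu_indices(N, k=1)          # each pair (i, j) with i < j counted once
    return -np.sum(J[iu] * s[iu[0]] * s[iu[1]]) / np.sqrt(N)

rng = np.random.default_rng(0)
N = 64
J = rng.standard_normal((N, N))
J = (J + J.T) / np.sqrt(2)                # symmetric couplings with unit variance
s = rng.choice([-1, 1], size=N)
E = sk_energy(s, J)
```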
Maximal Independent Set (MIS)
The MIS problem is another well-known NP-hard problem. The reason why we focus on the MIS problem is that many other NP-hard problems, such as the Minimum Vertex Cover (MVC) problem and the Maximal Clique (MC) problem, can be reduced to an MIS problem. The MIS problem asks to find the largest subset $\mathcal{S} \subseteq V$ such that no two vertices in $\mathcal{S}$ are connected by an edge in $E$. The Ising-like objective function consists of two parts:
$$E(\mathbf{s}) = -\sum_{i} s_i + \alpha \sum_{(i,j) \in E} s_i s_j \qquad (2)$$
where $s_i \in \{0, 1\}$ and the second term penalizes two connected vertices both being chosen. $\alpha$ is a hyperparameter controlling the strength of the punishment. The first term counts the number of vertices in $\mathcal{S}$.
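A minimal check of Eq. 2 on a path graph (the value of $\alpha$ here is an arbitrary illustrative choice): the valid independent set scores lower than the configuration that violates both edges.

```python
import numpy as np

def mis_objective(s, edges, alpha=3.0):
    """Ising-like MIS objective, Eq. (2): E = -sum_i s_i + alpha * sum_{(i,j) in E} s_i s_j."""
    s = np.asarray(s)
    penalty = sum(s[i] * s[j] for i, j in edges)
    return -s.sum() + alpha * penalty

# Path graph 0-1-2: the optimal independent set is {0, 2}.
edges = [(0, 1), (1, 2)]
e_opt = mis_objective([1, 0, 1], edges)   # valid set of size 2
e_bad = mis_objective([1, 1, 1], edges)   # both edges violated
```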
Modularity optimization
Modularity is a graph clustering index for detecting community structure in complex networks[16]. Many modularity optimization methods, e.g., hierarchical agglomeration[17], the spectral algorithm[3] and extremal optimization[18], have been proposed, and it has been shown that maximizing modularity is strongly NP-complete[19]. In the general case where a graph is partitioned into $K$ communities, the objective is to maximize the following modularity $Q$:
$$Q = \frac{1}{2M} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2M} \right] \delta(c_i, c_j) \qquad (3)$$
where $A_{ij}$ is the adjacency matrix, $k_i$ is the degree of vertex $i$, $M$ is the number of edges, $c_i \in \{1, 2, \dots, K\}$ is the community label of vertex $i$, where $K$ is the number of communities, and $\delta(c_i, c_j) = 1$ if $c_i = c_j$ and $0$ otherwise.
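Equation 3 can be evaluated directly from the adjacency matrix and a label vector. The toy graph below (two triangles joined by one edge, an illustration rather than one of the paper's datasets) checks that the natural split scores higher than putting every node in a single community, which gives $Q = 0$ by construction.

```python
import numpy as np

def modularity(A, c):
    """Modularity Q of Eq. (3) for adjacency matrix A and community labels c."""
    k = A.sum(axis=1)                  # vertex degrees
    two_m = k.sum()                    # 2M = sum of degrees
    delta = (c[:, None] == c[None, :])
    return np.sum((A - np.outer(k, k) / two_m) * delta) / two_m

# Two triangles joined by a single edge; the natural split is one triangle per community.
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1
Q_split = modularity(A, np.array([0, 0, 0, 1, 1, 1]))
Q_all = modularity(A, np.zeros(6, dtype=int))
```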
Structural optimization
The problem of structural optimization is to find an optimal network structure for some objective function defined on a network[20, 21]. This problem can also be considered a special case of a COP if the elements of the adjacency matrix are the optimized variables, so that a configuration is the adjacency matrix $A \in \{0, 1\}^{N \times N}$ of a potential graph, where $N$ is the (fixed) number of nodes in the network.
For example, we can search for an optimal network to fit observed node dynamics. In this case, we are asked to optimize the network structure to match given observed time series data of all nodes, $X = (X^1, X^2, \dots, X^T)$, where $X^t$ is the node state vector at time $t$ and $T$ is the length of the time window of our observation. The network dynamical rule $\hat{X}^{t+1} = f(X^t, A)$ is known, but the underlying real network structure is unknown to us. Our task is to recover the real network structure by minimizing the following objective function:

$$L(A) = \sum_{t=1}^{T-1} \left\| X^{t+1} - f(X^t, A) \right\|_2^2 \qquad (4)$$
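Equation 4 compares the observed trajectory with the one predicted from a candidate adjacency matrix. The sketch below uses a simple linear diffusion rule as a stand-in for $f$ (the dynamics, graphs and step size are hypothetical illustrations, not the experimental setup used later): the true adjacency matrix gives zero loss on data it generated itself, while a wrong one does not.

```python
import numpy as np

def reconstruction_loss(A, X, f):
    """Eq. (4): sum of squared errors between observed and predicted node states."""
    return sum(np.sum((X[t + 1] - f(X[t], A)) ** 2) for t in range(len(X) - 1))

# Hypothetical dynamical rule for illustration only: linear diffusion on the graph.
def diffusion(x, A, eps=0.1):
    return x + eps * (A @ x - A.sum(axis=1) * x)

rng = np.random.default_rng(0)
A_true = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)   # path graph
X = [rng.uniform(size=3)]
for _ in range(20):
    X.append(diffusion(X[-1], A_true))                              # observed series
A_wrong = np.array([[0, 0, 1], [0, 0, 1], [1, 1, 0]], dtype=float)  # wrong structure
loss_true = reconstruction_loss(A_true, X, diffusion)
loss_wrong = reconstruction_loss(A_wrong, X, diffusion)
```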
3 Methodology
We will show that one particular computational approach can solve all the problems mentioned above in a unified way. Notice that although the objective functions in Equations 1, 2 and 3 all have an Ising-like form, Equation 4 is much more complicated, because the dynamical rule may be very complex and even non-analytic. However, we will show that our computational framework works on all of them.
Naive mean field approximation
The basic idea of our method is to factorize the distribution over possible solutions into independent Bernoulli variables; this is called the naive mean field approximation in statistical physics. We assume that the vertices in the network are independent, so that the joint probability of a configuration can be written as a product distribution[22]:
$$P(s_1, s_2, \dots, s_N) = \prod_{i=1}^{N} p_i(s_i) \qquad (5)$$
The marginal probabilities $p_i$ can be parameterized by a set of learnable parameters $\theta_i$, which are easily mapped to probabilities by the sigmoid or softmax function.
We sample a possible solution according to Equation 5 and evaluate the objective function under the given parameters. Then, using the automatic differentiation techniques provided by current popular deep learning frameworks such as PyTorch[23] and TensorFlow[24], we can obtain the gradient vector with respect to all parameters, and the parameters are updated accordingly. However, the sampling step in the above procedure is not differentiable, because the sampling operation introduces stochasticity, whereas differentiability is required by all deep learning frameworks to compute the gradient vector.

Gumbel-softmax technique
Traditionally, one can resort to Monte Carlo gradient estimation techniques such as REINFORCE[25]. Gumbel-softmax[13], a.k.a. the concrete distribution[14], provides an alternative approach to tackle the difficulty of non-differentiability. Consider a categorical variable $x$ that can take $K$ discrete values. This variable can be parameterized as a $K$-dimensional probability vector $\boldsymbol{\pi} = (\pi_1, \dots, \pi_K)$, where $\pi_i$ is the probability that $x = i$. Instead of sampling a hard one-hot vector, the Gumbel-softmax technique gives a $K$-dimensional sample vector whose $i$-th entry is

$$\hat{x}_i = \frac{\exp\left((\log \pi_i + g_i)/\tau\right)}{\sum_{j=1}^{K} \exp\left((\log \pi_j + g_j)/\tau\right)} \quad \text{for } i = 1, \dots, K \qquad (6)$$
where $g_i \sim \mathrm{Gumbel}(0, 1)$ is a random variable following the standard Gumbel distribution and $\tau$ is the temperature parameter. Notice that as $\tau \to 0$, the softmax function approaches the argmax function and the sample vector approaches a one-hot vector. This one-hot vector can be regarded as a sample from the categorical distribution $\boldsymbol{\pi}$, because its unitary element appears at position $i$ with probability $\pi_i$; the Gumbel-softmax computation therefore simulates the sampling process. Furthermore, this technique allows us to pass gradients directly through the “sampling” process, because all the operations in Equation 6 are differentiable. In practice, it is common to adopt an annealing schedule from a high temperature to a small temperature.

Gumbel-softmax optimization
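The sampling primitive of Eq. 6, which the optimization procedure below relies on, can be sketched in a few lines of numpy (the probability vector and the temperature values here are arbitrary illustrative choices):

```python
import numpy as np

def gumbel_softmax(log_pi, tau, rng):
    """Draw one Gumbel-softmax sample vector, Eq. (6)."""
    g = -np.log(-np.log(rng.uniform(size=log_pi.shape)))  # standard Gumbel noise
    z = (log_pi + g) / tau
    z = z - z.max()                                       # for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
pi = np.array([0.2, 0.3, 0.5])
soft = gumbel_softmax(np.log(pi), tau=1.0, rng=rng)   # smooth, differentiable sample
hard = gumbel_softmax(np.log(pi), tau=0.01, rng=rng)  # low temperature: nearly one-hot
```

At low temperature the argmax of the sample follows the categorical distribution $\boldsymbol{\pi}$ (the Gumbel-max trick), while every entry remains a differentiable function of $\log \pi_i$.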
By introducing the Gumbel-softmax technique, the whole optimization process can be stated as follows (see also Figure 1):

1. Initialize the parameters $\theta_i$ for all vertices.

2. Sample from all marginals simultaneously via the Gumbel-softmax technique and then calculate the objective function.

3. Backpropagate to compute the gradients and update the parameters by gradient descent.
We point out that our method can be implemented in parallel on a GPU: we can simultaneously initialize a batch of different initial values and calculate their objective functions. When the optimization procedure is finished, we select the result with the best performance among the candidates. We present our proposed method in Algorithm 1.
Notice that, to parallelize the algorithm on a GPU, the parameters and the samples are stored as batched tensors whose first dimensions are batch dimensions.
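The three steps above can be sketched end-to-end for the MIS objective of Eq. 2, using the binary (two-category) case of Gumbel-softmax. To keep the sketch dependency-free, the backward pass of this tiny computation graph is written out by hand rather than by calling an autodiff framework; the graph, batch size, learning rate and annealing schedule are illustrative choices, not the settings used in the experiments.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gso_mis(edges, n, batch=8, steps=300, lr=0.5, alpha=3.0, seed=0):
    """Gumbel-softmax optimization for MIS: a batch of parallel runs, best one kept."""
    rng = np.random.default_rng(seed)
    theta = 0.1 * rng.standard_normal((batch, n))     # step 1: initialize parameters
    nbr = [[] for _ in range(n)]
    for i, j in edges:
        nbr[i].append(j)
        nbr[j].append(i)
    for step in range(steps):
        tau = max(0.1, 0.99 ** step)                  # temperature annealing
        # step 2: differentiable binary samples via logistic (Gumbel-difference) noise
        u = rng.uniform(1e-6, 1 - 1e-6, size=theta.shape)
        noise = np.log(u) - np.log(1 - u)
        s = sigmoid((theta + noise) / tau)
        # step 3: hand-written backprop of E = -sum_i s_i + alpha * sum_{(i,j)} s_i s_j
        dE_ds = -np.ones_like(s)
        for i in range(n):
            for j in nbr[i]:
                dE_ds[:, i] += alpha * s[:, j]
        theta -= lr * dE_ds * s * (1 - s) / tau       # gradient descent step
    hard = (theta > 0).astype(int)                    # discretize each run

    def energy(x):
        return -x.sum() + alpha * sum(x[i] * x[j] for i, j in edges)

    return min(hard, key=energy)                      # select the best candidate

# Star graph: center 0 joined to leaves 1..4; the optimal independent set is the leaves.
edges = [(0, i) for i in range(1, 5)]
sol = gso_mis(edges, n=5)
```

Each of the batch runs performs the same computation from a different random initialization, which is exactly the GPU-parallel strategy described above; here the batch dimension is simply the first axis of `theta`.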
4 Experiment
Experimental settings
We conduct experiments on all four problems described in Section 2. For the SK model, we optimize the ground state energy with sizes ranging from 256 to 8192. For the MIS problem, we use the citation network datasets Cora, Citeseer and PubMed, and treat them as undirected networks. For modularity optimization, we use four real-world datasets studied in [17, 3, 18]: Zachary, Jazz, C. elegans and Email. We compare our proposed method against classical optimization methods and state-of-the-art deep learning approaches: (1) simulated annealing (SA)[6], a general optimization method inspired by the Metropolis-Hastings algorithm; (2) extremal optimization (EO)[8], a heuristic designed for combinatorial optimization problems; (3) Structure2Vec Deep Q-learning (S2V-DQN)[10], a reinforcement learning method for optimization problems over graphs; and (4) GCNs[11], a supervised learning method based on graph convolutional networks.
In practice, we run a batch of instances in parallel. During optimization, we use the Gumbel-softmax estimator to generate continuous samples for calculating the objective function, but discrete samples for the final output. We use the Adam optimizer[26] with different learning rates and annealing rates, and report the results with the best performance.
For the task of structural optimization, we choose the Coupled Map Lattice (CML) on a given but unknown graph as our dynamical rule; it is widely used to study the chaotic dynamics of spatially extended systems:
$$x_i^{t+1} = (1 - s)\, f(x_i^t) + \frac{s}{k_i} \sum_{j \in \mathrm{nei}(i)} f(x_j^t) \qquad (7)$$

where $s$ is the coupling constant, $k_i$ is the degree of node $i$ and $\mathrm{nei}(i)$ is the set of neighbors of node $i$ on graph $G$. Here, $f$ is the logistic map function $f(x) = \lambda x (1 - x)$. We conduct our experiments on random 4-regular graphs. We first sample an adjacency matrix and then estimate the error of the dynamics according to Eq. 4.
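The dynamics of Eq. 7 can be simulated directly from an adjacency matrix. A minimal sketch, using a small ring instead of a random 4-regular graph purely for illustration:

```python
import numpy as np

def logistic(x, lam):
    return lam * x * (1 - x)

def cml_step(x, A, s=0.2, lam=3.5):
    """One update of the coupled map lattice, Eq. (7)."""
    k = A.sum(axis=1)            # node degrees
    fx = logistic(x, lam)
    coupling = A @ fx / k        # average of f over each node's neighbors
    return (1 - s) * fx + s * coupling

# Ring of 4 nodes as a stand-in for a random regular graph (illustrative only).
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=4)
for _ in range(50):
    x = cml_step(x, A)
```

Since the logistic map with $\lambda \le 4$ maps $[0, 1]$ into itself and Eq. 7 is a convex combination, the node states remain in $[0, 1]$ for any trajectory length.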
Performance
Table 1: Ground state energies of the SK model obtained by EO, SA and GSO (the column I gives the number of instances).

| N | I | EO[27] | time | SA | time (s) | GSO | time (s) |
|---|---|--------|------|----|----------|-----|----------|
| 256 | 5000 | −0.74585(2) | s | −0.7278(2) | 1.28 | −0.7270(2) | 0.75 |
| 512 | 2500 | −0.75235(3) | h | −0.7327(2) | 3.20 | −0.7403(2) | 1.62 |
| 1024 | 1250 | −0.7563(2) | h | −0.7352(2) | 15.27 | −0.7480(2) | 3.54 |
| 2048 | 400 | – | – | −0.7367(2) | 63.27 | −0.7524(1) | 5.63 |
| 4096 | 200 | – | – | −0.73713(6) | 1591.93 | −0.7548(2) | 8.38 |
| 8192 | 100 | – | – | – | – | −0.7566(4) | 26.54 |
Table 2: Ground state energies of the SK model obtained by gradient descent (GD) and by GSO in its single-run and parallel versions.

| N | I | GD (Adam) | time (s) | GD (L-BFGS) | time (s) | GSO (single) | time (s) | GSO (parallel) | time (s) |
|---|---|-----------|----------|-------------|----------|--------------|----------|----------------|----------|
| 256 | 5000 | −0.6433(3) | 2.84 | −0.535(2) | 2.29 | −0.7270(2) | 0.75 | −0.7369(1) | 0.69 |
| 512 | 2500 | −0.6456(3) | 2.87 | −0.520(3) | 2.56 | −0.7403(2) | 1.62 | −0.7461(2) | 1.61 |
| 1024 | 1250 | −0.6466(4) | 3.22 | −0.501(5) | 2.73 | −0.7480(2) | 3.54 | −0.7522(1) | 4.09 |
| 2048 | 400 | −0.6493(2) | 3.53 | −0.495(8) | 3.06 | −0.7524(1) | 5.63 | −0.75563(5) | 12.19 |
| 4096 | 200 | −0.6496(5) | 4.62 | −0.49(1) | 3.55 | −0.7548(2) | 8.38 | −0.75692(2) | 39.64 |
| 8192 | 100 | −0.6508(4) | 16.26 | −0.46(2) | 4.82 | −0.7566(4) | 26.54 | −0.75769(2) | 204.26 |
Tables 1 and 2 show the results of different approaches on the task of optimizing the ground state energy of the SK model for sizes 256, 512, 1024, 2048, 4096 and 8192. Note that: (1) although extremal optimization (EO) obtains better energies, this method is limited to system sizes up to 1024 and is extremely time-consuming; (2) under the product distribution, the gradient of the expected ground state energy can be obtained explicitly: from $\langle E \rangle = -\frac{1}{\sqrt{N}} \sum_{i<j} J_{ij} m_i m_j$, where $m_i = \langle s_i \rangle$ is the average magnetization of vertex $i$, we have

$$\frac{\partial \langle E \rangle}{\partial m_i} = -\frac{1}{\sqrt{N}} \sum_{j \ne i} J_{ij} m_j \qquad (8)$$

We can optimize this objective function through gradient descent (GD) with different optimizers, e.g., Adam or L-BFGS, but the results are not satisfying. The reason is that there are too many frustrations and local minima in the energy landscape. (3) With the parallel implementation, the results improve greatly.
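The explicit gradient of Eq. 8 makes this mean-field baseline easy to reproduce. The sketch below parameterizes $m_i = \tanh(\theta_i)$ to keep the magnetizations in $[-1, 1]$ (an illustrative choice; the size, learning rate and seed are arbitrary) and runs plain gradient descent, which settles into one of the many local minima discussed above.

```python
import numpy as np

def mean_field_energy(m, J):
    """<E> = -(1/sqrt(N)) sum_{i<j} J_ij m_i m_j under the product distribution."""
    N = len(m)
    iu = np.triu_indices(N, k=1)
    return -np.sum(J[iu] * m[iu[0]] * m[iu[1]]) / np.sqrt(N)

rng = np.random.default_rng(1)
N = 128
J = rng.standard_normal((N, N))
J = (J + J.T) / np.sqrt(2)                 # symmetric Gaussian couplings
np.fill_diagonal(J, 0)
theta = 0.01 * rng.standard_normal(N)
for _ in range(500):
    m = np.tanh(theta)
    grad_m = -(J @ m) / np.sqrt(N)         # Eq. (8): d<E>/dm_i
    theta -= 0.05 * grad_m * (1 - m ** 2)  # chain rule through m = tanh(theta)
e_per_spin = mean_field_energy(np.sign(np.tanh(theta)), J) / N
```

The rounded configuration reaches an energy per spin well below zero but noticeably above the EO values in Table 1, illustrating why plain GD under-performs.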
Table 3: Sizes of the independent sets found by different methods on three citation networks.

| Graph | size | S2V-DQN[10] | GCNs[11] | GD (L-BFGS) | Greedy | GSO |
|-------|------|-------------|----------|-------------|--------|-----|
| Cora | 2708 | 1381 | 1451 | 1446 | 1451 | 1451 |
| Citeseer | 3327 | 1705 | 1867 | 1529 | 1818 | 1802 |
| PubMed | 19717 | 15709 | 15912 | 15902 | 15912 | 15861 |
Table 3 presents the performance on the MIS problem on three citation networks. Our proposed method successfully finds the global optimal solution on the Cora dataset and obtains much better results than the sophisticated S2V-DQN[10] on all three datasets. Although our results are not competitive with [11], we must stress that it is a supervised learning algorithm; besides, it also adopts graph reduction techniques and a parallelized local search algorithm. Our method requires none of these tricks.
Table 4: Modularity values Q and numbers of communities found by different methods.

| Graph | size | Q [3] | No. comms | Q EO[27] | No. comms | Q GSO | No. comms |
|-------|------|-------|-----------|----------|-----------|-------|-----------|
| Zachary | 34 | 0.3810 | 2 | 0.4188 | 4 | 0.4198 | 4 |
| Jazz | 198 | 0.4379 | 4 | 0.4452 | 5 | 0.4451 | 4 |
| C. elegans | 453 | 0.4001 | 10 | 0.4342 | 12 | 0.4304 | 8 |
| Email | 1133 | 0.4796 | 13 | 0.5738 | 15 | 0.5275 | 8 |
On the task of modularity optimization, our method also obtains satisfying results, as shown in Table 4. Compared to hierarchical agglomeration[17], our proposed method achieves much higher modularity on all datasets. In Figure 2 we present the partitions obtained by our GSO method.
Table 5: Reconstruction accuracy for coupled map lattice dynamics on random 4-regular graphs.

| N | s, λ | Accuracy (%) |
|---|------|--------------|
| 10 | 0.2, 3.5 | 96.80 |
| 10 | 0.2, 3.8 | 100 |
| 30 | 0.2, 3.5 | 93.11 |
| 30 | 0.2, 3.8 | 100 |
Here we present our results on the task of structural optimization in Table 5. We keep the coupling constant fixed at $s = 0.2$. Because $\lambda \approx 3.57$ is the onset of chaos in the logistic map, we choose $\lambda = 3.5$ and $\lambda = 3.8$ to obtain non-chaotic and chaotic dynamics, respectively. For random 4-regular graphs with 10 and 30 nodes, our GSO method obtains very good reconstruction accuracy, and the performance on chaotic dynamics is better than that on non-chaotic dynamics.
Time consumption
Tables 1 and 2 and Figure 3 show the time consumption of the different methods on the SK model. We compare our proposed method with two baselines: gradient descent (GD) with the Adam optimizer and simulated annealing (SA). Although gradient descent takes the least time, its performance is the worst. Compared to simulated annealing, our method has the advantage in both performance and time consumption. As for extremal optimization, it is reported in [27] that the largest systems take nearly 20 hours to solve, which is extremely inefficient and time-consuming. We also plot the dependence of running time on system size in Figure 3. The slope of the log-log plot of time versus $N$ is 1.46, which indicates that the empirical cost of our algorithm grows roughly as $N^{1.46}$.
5 Conclusion
In this work, we have presented a novel optimization method, Gumbel-softmax optimization (GSO), for solving combinatorial optimization problems on graphs. Our method is based on Gumbel-softmax, which allows gradients to pass through samples directly. We treat all vertices in the network as independent, so that the joint distribution factorizes into a product distribution, which enables efficient Gumbel-softmax sampling. Our experimental results show that our method performs well on all four tasks and is also competitive in running time. Compared to traditional general optimization methods, GSO can tackle large graphs easily and efficiently. Though not competitive with state-of-the-art deep-learning-based methods, GSO has the advantage of requiring neither training of a solver nor specific domain knowledge. In general, it is an efficient, general and lightweight optimization framework for solving combinatorial optimization problems on graphs.
However, there is still much room to improve the accuracy of our algorithm. In this paper, we take the naive mean field approximation as our basic assumption; however, the variables are not independent in most problems. Therefore, more sophisticated variational inference must be considered in the future. Another way to improve accuracy is to combine other techniques, such as local search, into our framework.
References
 [1] Richard M Karp. Reducibility among combinatorial problems. In Complexity of computer computations, pages 85–103. Springer, 1972.
 [2] Marc Mézard, Giorgio Parisi, and Miguel Virasoro. Spin glass theory and beyond: An Introduction to the Replica Method and Its Applications, volume 9. World Scientific Publishing Company, 1987.
 [3] Mark EJ Newman. Modularity and community structure in networks. Proceedings of the national academy of sciences, 103(23):8577–8582, 2006.
 [4] Marc Timme. Revealing network connectivity from response dynamics. Physical review letters, 98(22):224101, 2007.
 [5] Jose Casadiego, Mor Nitzan, Sarah Hallerberg, and Marc Timme. Modelfree inference of direct network interactions from nonlinear collective dynamics. Nature communications, 8(1):2192, 2017.
 [6] Scott Kirkpatrick, C Daniel Gelatt, and Mario P Vecchi. Optimization by simulated annealing. science, 220(4598):671–680, 1983.
 [7] Lawrence Davis. Handbook of genetic algorithms. 1991.
 [8] Stefan Boettcher and Allon Percus. Nature’s way of optimizing. Artificial Intelligence, 119(1–2):275–286, 2000.
 [9] Diogo V Andrade, Mauricio GC Resende, and Renato F Werneck. Fast local search for the maximum independent set problem. Journal of Heuristics, 18(4):525–547, 2012.
 [10] Elias Khalil, Hanjun Dai, Yuyu Zhang, Bistra Dilkina, and Le Song. Learning combinatorial optimization algorithms over graphs. In Advances in Neural Information Processing Systems, pages 6348–6358, 2017.
 [11] Zhuwen Li, Qifeng Chen, and Vladlen Koltun. Combinatorial optimization with graph convolutional networks and guided tree search. In Advances in Neural Information Processing Systems, pages 539–548, 2018.
 [12] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
 [13] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-softmax. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24–26, 2017, Conference Track Proceedings. OpenReview.net, 2017.

 [14] Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24–26, 2017, Conference Track Proceedings, 2017.
 [15] David Sherrington and Scott Kirkpatrick. Solvable model of a spin-glass. Physical review letters, 35(26):1792, 1975.
 [16] Santo Fortunato. Community detection in graphs. Physics reports, 486(3–5):75–174, 2010.
 [17] Mark EJ Newman. Fast algorithm for detecting community structure in networks. Physical review E, 69(6):066133, 2004.
 [18] Jordi Duch and Alex Arenas. Community detection in complex networks using extremal optimization. Physical review E, 72(2):027104, 2005.
 [19] Ulrik Brandes, Daniel Delling, Marco Gaertler, Robert Görke, Martin Hoefer, Zoran Nikoloski, and Dorothea Wagner. Maximizing modularity is hard. arXiv preprint physics/0608255, 2006.
 [20] Muzaffar M Eusuff and Kevin E Lansey. Optimization of water distribution network design using the shuffled frog leaping algorithm. Journal of Water Resources planning and management, 129(3):210–225, 2003.
 [21] DE Boyce, A Farhi, and R Weischedel. Optimal network problem: a branchandbound algorithm. Environment and Planning A, 5(4):519–533, 1973.
 [22] Martin J Wainwright, Michael I Jordan, et al. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1–2):1–305, 2008.
 [23] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In NIPSW, 2017.
 [24] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: A system for largescale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.
 [25] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3–4):229–256, 1992.
 [26] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [27] Stefan Boettcher. Extremal optimization for Sherrington-Kirkpatrick spin glasses. The European Physical Journal B - Condensed Matter and Complex Systems, 46(4):501–505, 2005.