Approximating meta-heuristics with homotopic recurrent neural networks

09/07/2017 ∙ by Alessandro Bay, et al. ∙ 0

Much combinatorial optimisation problems constitute a non-polynomial (NP) hard optimisation problem, i.e., they can not be solved in polynomial time. One such problem is finding the shortest route between two nodes on a graph. Meta-heuristic algorithms such as A^* along with mixed-integer programming (MIP) methods are often employed for these problems. Our work demonstrates that it is possible to approximate solutions generated by a meta-heuristic algorithm using a deep recurrent neural network. We compare different methodologies based on reinforcement learning (RL) and recurrent neural networks (RNN) to gauge their respective quality of approximation. We show the viability of recurrent neural network solutions on a graph that has over 300 nodes and argue that a sequence-to-sequence network rather than other recurrent networks has improved approximation quality. Additionally, we argue that homotopy continuation -- that increases chances of hitting an extremum -- further improves the estimate generated by a vanilla RNN.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 2

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

Introduction

NP (non-deterministic polynomial time) hard problems are the mainstay of some fields, from problems in graph theory (routing/scheduling, etc.) to mathematical programming (knapsack, 3d-matching, etc.). A widely studied NP-hard problem is the Travelling Salesman Problem (TSP) and its derivatives that include finding the shortest routes between two nodes of a graph. Apart from mixed-integer programming (MIP), solutions of such problems are well approximated by meta-heuristic formulations such as tabu search, simulated annealing, genetic algorithms and evolutionary optimisations. Of notable mention are the Dantzig-Fulkerson-Johnson algorithm

[Dantzig, Fulkerson, and Johnson1954], branch-and-cut algorithms [Naddef and Rinaldi2001], neural networks [Ali and Kamoun1993], etc.

The shortest-path problem is central to many real-life scenarios, from inventory delivery (courier, food, vehicles, etc.) to laying out circuitry on a printed circuit board. This problem has evolved from a discrete integer programming (bottom-up) problem to a data-based continuous optimisation problem (top-down). Some studies have used specialised reinforcement learning (RL) algorithms – such as q-learning – to approximate the optimal solutions for the shortest path problem [Boyan and Littman1994]

. Along with the resurgence of deep neural networks (DNN), techniques that merge the scalability of deep neural networks and the theoretical framework of Markov Decision Processes (MDP) have emerged (

aka. deep reinforcement learning (DRL)) [Bello et al.2016, Dai et al.2017]

. The intrinsic non-convexity of the loss function means both DRL and DNN struggle to find the global optimisers of the loss function. This becomes more of a problem as the graph size increases.

For a deep neural network, finding the shortest routes between two points can be framed as a sequence learning problem. Indeed, in this paper, we show how synthetic routes generated by an

algorithm can be approximated using recurrent neural networks. With the rise of the Seq2Seq (sequence to sequence) architecture, recurrent neural networks based on Long Short Term Memory (LSTMs), Gated Recurrent Units (GRUs) and others can be readily used as a sub-module. In this paper, we concentrate on three shortest path finding algorithms on a reasonably sized graph of more than 300 nodes. We use – (a) meta-heuristics based

algorithm [Hart, Nilsson, and Raphael1968], (b) q-learning based Q-routing algorithm [Boyan and Littman1994] and (c) a Seq2Seq recurrent neural network [Sutskever, Vinyals, and Le2014], amongst other vanilla recurrent architectures. Moving on, we argue that the value of extremums generated by the RNN could be further improved via the homotopic continuation of the neural network’s loss function.

Methods

Datasets

The graph is based on the road network of Minnesota111https://www.cs.purdue.edu/homes/dgleich/packages/matlab_bgl. Each node represents the intersections of roads while the edges represent the road that connects the two points of intersection. Specifically, the graph we considered has 376 nodes and 455 edges, as we constrained the coordinates of the nodes to be in the range for the longitude and for the latitude, instead of the full extent of the graph, i.e., a longitude of and a latitude of , with a total number of 2,642 nodes.

Algorithms

The meta-heuristics

The algorithm is a best-first search algorithm wherein it searches amongst all of the possible paths that yield the smallest cost. This cost function is made up of two parts – particularly, each iteration of the algorithm consists of first evaluating the distance travelled or time expended from the start node to the current node. The second part of the cost function is a heuristic that estimates the cost of the cheapest path from the current node to the goal. Without the heuristic part, this algorithm operationalises the Dijkstra’s algorithm [Dijkstra1959]. There are many variants of ; in our experiments, we use the vanilla with a heuristic based on the Euclidean distance. Other variants such as Anytime Repairing has been shown to give superior performance [Likhachev, Gordon, and Thrun2004]. generated paths between two randomly selected nodes are calculated. On an average, the paths are 19 hops long and follow the distribution represented by histograms in Figure 1.

Figure 1: Distribution of paths lengths. After selecting two nodes uniformly at random, we compute the shortest paths using the algorithm. The average path length is 19 hops.

De-centralised Q-routing

Q-Learning is an Off-Policy algorithm for temporal difference learning. Boyan1994 (Boyan1994) have formulated the Q-routing algorithm by building a routing table based on node distance/time (q-values). This algorithm utilises q-learning wherein the nodes that are in the neighbourhood of the current node communicate the expected future waiting time. In Q-routing, every source node has to choose the interim node that leads it to the destination node . Q-learning enables us to learn the expected travel time to for each of the possible interim node . A q-table is created for each node that is updated at discrete intervals as,

(1)

Here, is the learning rate and is the time spent at node before being sent off at time . Q-learning can thus estimate the value/cost (for Q-routing it is the estimated transit time from the current node to destination via node ) function for being in state as . A greedy policy is then executed to transact optimally.

Recurrent deep networks

We utilised a variety of recurrent neural networks for shortest route path predictions:

  • A vanilla RNN [Goodfellow, Bengio, and Courville2016]:

    (2)

    with weights , which takes as input the current node, described by a one-hot representation, and estimates the following one. The function can be a softmax

    , since we expect a probability distribution on the following node, or a

    log softmax, that gives more numerical stability during training. During the test phase, we use the predicted node as input for the next time step and compute two paths, one starting from the source and one from the destination, that form an intersection to obtain the shortest path.

  • A Seq2Seq-based model [Sutskever, Vinyals, and Le2014]: We start with the tuple [source, destination] as an input sequence, which is encoded by a vanilla RNN (Figure 2

    ). The context vector, i.e., the encoding, is then decoded by another RNN, LSTM or a GRU

    [Cho et al.2014] to obtain the shortest path connecting the source to the destination.

  • A Seq2Seq-based model with attention [Bahdanau, Cho, and Bengio2014] enables us to focus on the encoded states with varying extent.

  • A vanilla RNN, trained with homotopy continuation [Vese1999] – convolution of the loss with a Gaussian kernel, as explained in the following section.

Figure 2: Sequence-to-Sequence architecture for approximating the meta-heuristics. Here, the first two modules on the left are the encoder while the last four represent the decoded output, representing the shortest route between Holborn and Bank. The network is trained using shortest route snippets that have been generated using an algorithm. represents the context vector.

Homotopy continuation

In algebraic topology, two continuous functions in topological space that can be transformed (deformed) from one to another are known as homotopies. In the context of neural networks, such homotopies enable us to pose different instantiations of the loss function with the aim of obtaining a global minimum. Deep neural networks always have a non-convex loss with no convergence guarantees; this makes learning optimal parameters a numerical exercise. A homotopy continuation allows us to gradually deform the loss function to a non-convex one from another that has a well-defined minimiser. The minimisers from the -th sub-problem are the starting points for the subsequent -th subproblem.

Assume that represents a problem with global optimizers whilst is another function where we have no a priori knowledge of the optimizers. Then the homotopy function becomes . The path traced out by the states are governed by . If the Jacobian

is of full-rank, one can guarantee that the path between the two functions are continuously differentiable. In our treatise, we use a continuation path defined by the heat equation. In a similar vein to Mobahi2016 (Mobahi2016), we convolve the loss function with an isotropic Gaussian (the Weierstrass transform) of standard deviation

; this enables us to obtain objective functions with varying smoothness. then becomes the continuation parameter for the diffused cost function. We illustrate the basic methodology using a single dimensional function, followed by the diffusion of the RNN’s loss function.

(3)

As an illustration, we construct a cost function that has both a local and global minima as well as a saddle point, i.e., , constrained to the interval . This function is shown in Figure 3 as a solid black line, along with its diffused forms, obtained by the convolution with the Gaussian kernel (Eqn. 3) with standard deviation . As one can see, for higher values of (dashed red line), the smoothed function has only one minimum, which can be easily attained via an arbitrary optimisation method. Then, starting from this solution, we can avoid falling into the local minimum and eventually reach the global extremiser for smaller values of (other dashed lines). Repeating this until attains low values allows the diffused function to match the original, thereby obtaining the global minimum.

Figure 3: Example of a diffused cost function. We choose a cost function equal to , which has both local and global minima for . Applying the Weierstrass transform we obtain smoothed forms for the cost function, which converges to the original cost as decreases.

Diffused cost function

In this section, we illustrate the equations leading to the diffused loss function for the RNN. Notably, given a set of shortest sequences, each of length , and a loss function , we can state the minimisation problem, of the cost function w.r.t. the weights in a vanilla RNN (Eqn. 2) as,

(4)
s.t.

where is the initial hidden state.

Moreover, if we consider and as independent variables, we can write the following unconstrained Lagrangian,

(5)

where all the vectors and are collected in the columns of the matrices and , respectively, is a regularization parameter (we will use ), and is a penalty function.

Since we consider , then the natural choice for the loss is the negative log-likelihood, while for the penality function , we choose the squared Euclidean norm. Therefore, Eqn. Diffused cost function now becomes

(6)

Finally, defining the convolution of the non-linear functions and with the Gaussian kernel (Eqn. 3) as

respectively, we can derive the constrained diffused cost w.r.t. the weights of the RNN, as

(7)
s.t.

where denotes a diagonal matrix with the vector on the diagonal, is the Frobenius norm of a matrix, and stateDim and outDim

represent the number of neurons in the hidden state and of the output, respectively. Notice that we have also used the identities that convolution of

with is equal to and [Mobahi2016].

Approximation of activation functions

Another crucial component of this method is the computation of the diffused non-linearities for the considered RNN. Following [Mobahi2016] convolving with , we obtain

(8)

However, the output non-linearity that we use is the log softmax, which cannot be analytically convolved. To address this problem, our first attempt was the numerical computation of the convolved function, but in this case, we obtained numerical errors, especially on the boundaries. This makes the approximation less accurate. To sidestep this phenomenon, we consider a linear interpolation

of the central interval, obtaining the following slope, which depends on the standard deviation :

(9)

and the shift is simply the constant .

As we can see in Figure 4, some errors affect the numerical convolution, especially on the boundaries (dash-dotted lines). Thus, a linear equation with slope as in Eqn. 9 is more accurate and converges to the original logsoftmax function (solid black line) as becomes smaller (in Figure 4 we illustrate the results for ).

Figure 4: Approximation for the convolved log softmax. Since log softmax function cannot be analytically convolved with the Gaussian kernel, we approximate this operation numerically. It is apparent that the quality of the approximation is affected by the numerical errors, especially on the boundaries (dash-dotted lines). Therefore, we constrain the approximation via linear interpolation of the central interval (solid lines) at the edges.

Table 1

illustrates the diffused forms of the most popular activation functions.

function original diffused
error
tanh
sign
relu
logsoftmax
Table 1: List of diffused forms (Weierstrass transform). We report the most popular non-linear activation functions along with their diffused form. This is obtained by convolving the function with the heat kernel . This table extends the work in [Mobahi2016] by an analytic approximation of the log softmax function.

Results

For the graph of Minnesota with 376 nodes and 455 edges, we generated 3,000 shortest routes between two randomly picked nodes using the algorithm. We used these routes as the training set for the Q-routing and the RNN algorithms using a 67-33% training-test splits.

For the Q-routing algorithm, we set the learning coefficient and use 1,000 iterations for each path to update the q-table and enable the algorithm to converge. We obtain an accuracy on the shortest paths of 70% for the test data-set, and 97% of the test paths reach the destination, albeit they are not necessarily the shortest ones.

On the other hand, we choose a hidden state with 256 units for the RNN-based methods and run the training for 200 epochs updating the parameters with an Adam optimisation scheme

[Kingma and Ba2014] with parameters and , starting from a learning rate equal to 1e-3.

method shortest successful
Q-routing
RNNvanilla
RNNintersect
RNNwithDiff 99%
Seq2SeqnoAttn
Seq2SeqwithAttn
Seq2SeqintersectnoAttn
Seq2SeqintersectwithAttn 78%
Table 2: Results on the Minnesota graph. Percentage of shortest path and successful paths (that are not necessarily shortest) are shown for a RL algorithm (i.e., Q-routing) and a wide-variety of RNNs, where intersect means the computation of the shortest path as intersection of the predicted path and the one obtained from the destination to the source node. All scores are relative to an algorithm, that achieves a shortest path score of 100%.

The prediction accuracy on the test data-set is shown in Table 2. A vanilla RNN predicting the next node gives poor results; especially we notice that most of the time it cannot even predict a path from the source to the destination node. Nevertheless, on occasions when it reaches the destination, the path is always the shortest. A first attempt to improve its performance was simply to compute two paths – one from the source to the destination node and another one from the destination to the source, and intersecting the two paths to obtain a route between the source and the destination. Such a scheme improved the performance to an accuracy equal to 98% for paths linking source and destination and to 60% for shortest ones. Then, we train a vanilla RNN through an homotopy continuation method (see Eqn. Diffused cost function) varying , which gives a further improvement (i.e., 63% and 99%) and converges after only 80 epochs (instead of 200 epochs).

Alternatively, we can also alter the architecture of the RNN, getting inspiration from the recent success of the sequence-to-sequence model [Sutskever, Vinyals, and Le2014]. We involve it, encoding the [source, destination] tuple via a vanilla RNN and then decoding the context vector into the shortest path sequence. We implemented the decoder in two different ways: it can be either a vanilla RNN or a GRU, or a vanilla RNN or a GRU with attention [Bahdanau, Cho, and Bengio2014]. In particular, by embedding attention, we can outperform the accuracy of the Q-routing algorithm on the shortest paths (73%), especially if we compute two paths and intersect them (78%).

Therefore, the most accurate algorithm for finding the shortest path is the Seq2Seq model with attention, where during the test phase we evaluate the intersection between two paths: one from the source to the destination node and another from the destination to the source node. On the other hand, a vanilla RNN, when trained with diffusion, computes successful paths from source to destination, albeit they are not necessarily the shortest ones. This suggests that further improvements can be obtained by training a Seq2Seq model with diffusion.

Discussion

In this paper, we illustrate that recurrent neural networks have the fidelity in approximating routes generated from an algorithm. As the node size increases, Seq2Seq models have increased fidelity compared to vanilla recurrent neural networks or for that matter a vanilla q-learning based routing algorithm. Our work is yet another testament to the utility that deep recurrent networks have in approximating solutions of NP hard problems.

When parameter evolution is constrained to evolve according to a heat equation, it has resulted in superior posterior estimates when utilised in a Bayesian belief updating scheme [Cooray et al.2017]. Similarly, combining annealed importance sampling with Riemannian samplers have also shown promise in producing better posterior distributions [Penny and Sengupta2016]

. Homotopy continuation provides a rigorous theoretical foundation for the afore mentioned results. It is well-known that the convex envelope of a non-convex loss function can be obtained using an evolution equation, i.e., a (non-linear) partial differential equation (PDE)

[Vese1999]. The heat equation simply provides an affine solution to this non-linear PDE. One important direction for future work is, therefore, tackling the original nonlinear PDE with computationally efficient algebraic or geometric multi-grid methods [Heroux and Willenbring2012, Sundar et al.2012].

We have used Q-routing as a benchmark to learn the action-value pair that gives us the expected utility of taking a prescribed action; in the last few years, many other architectures have evolved such as the deep-Q-network [Mnih et al.2015], duelling architecture [Wang et al.2015], etc. – it remains to be seen how such deep reinforcement learning architectures cope with the problem at hand. Another vital direction to pursue is the scalability of inverse reinforcement learning algorithms [Ziebart et al.2008] wherein given the policy and the system dynamics the aim is to recover the reward function.

An important issue arising from using recurrent neural networks for computing shortest path is to memorise long sequences, such as routes that range hundreds of nodes. LSTM [Hochreiter and Schmidhuber1997] alleviated some of the problems; the other proposition has been to use the second-order geometry as has been long proposed [LeCun et al.2012]. Other efforts have gone towards using evolutionary optimisation algorithms such as CoDeepNEAT for finessing the neural network architecture [Miikkulainen et al.2017]

. Similarly, Neural Turing Machines

[Graves, Wayne, and Danihelka2014] are augmented RNNs that have a (differentiable) external memory that they can selectively read/write, enable the network to store the latent structure of the sequence. Attention networks, as has been used here, enable the RNNs to attend to snippets of their inputs (cf. in conversational modelling [Vinyals and Le2015]). Nevertheless, for the long-term dependencies that shortest paths in large graphs have, these methods are steps towards alleviating the central problem of controlling spectral radius in recurrent neural networks.

The Q-routing algorithm reduces the search-space by constraining the search to the connected neighbours of the current node. Whereas, the RNN variants have a large state-space to explore. The Seq2Seq architecture, therefore, has at least two orders of magnitude increase in its exploration space. In future, the accuracy of the RNN variants can be increased by constraining the search space to the neighbourhood of the current node.

In a real-world scenario, computing the shortest path between two nodes of a graph is also constrained by the computational time of the algorithm. Thus, the deep recurrent networks, as utilised here, are faced with two objectives – first, to reduce the prediction error and second to reduce the computational effort. In general, it is quite uncommon that an obtained solution can be obtained that minimises both objectives. In lieu, one can concentrate effort on achieving solutions on the Pareto front. Utilising, Bayesian optimisation [Brochu, Cora, and De Freitas2010] for such multi-objective optimisation problem is a path forward for research that is not only theoretically illuminating but also commercially useful.

Acknowledgments

BS is thankful to the Issac Newton Institute for Mathematical Sciences for hosting him during the “Periodic, Almost-periodic, and Random Operators” workshop.

References