Reinforcement Learning Driven Heuristic Optimization

06/16/2019 ∙ by Qingpeng Cai, et al. ∙ Google, Tsinghua University, Stanford University

Heuristic algorithms such as simulated annealing, Concorde, and METIS are effective and widely used approaches to find solutions to combinatorial optimization problems. However, they are limited by the high sample complexity required to reach a reasonable solution from a cold-start. In this paper, we introduce a novel framework to generate better initial solutions for heuristic algorithms using reinforcement learning (RL), named RLHO. We augment the ability of heuristic algorithms to greedily improve upon an existing initial solution generated by RL, and demonstrate novel results where RL is able to leverage the performance of heuristics as a learning signal to generate better initialization. We apply this framework to Proximal Policy Optimization (PPO) and Simulated Annealing (SA). We conduct a series of experiments on the well-known NP-complete bin packing problem, and show that the RLHO method outperforms our baselines. We show that on the bin packing problem, RL can learn to help heuristics perform even better, allowing us to combine the best parts of both approaches.




1. Introduction

Combinatorial optimization (Wolsey and Nemhauser, 2014) aims to find the optimal solution with the minimum cost from a finite set of candidates to discrete problems such as the bin packing problem, the traveling salesman problem, or integer programming. Combinatorial optimization has seen broad applicability in fields ranging from telecommunications network design, to task scheduling, to transportation systems planning. As many of these combinatorial optimization problems are NP-complete, optimal solutions cannot be tractably found (Ausiello et al., 2012).

Heuristic algorithms such as simulated annealing (SA) (Rutenbar, 1989; Aarts and Korst, 1988; Van Laarhoven and Aarts, 1987) are designed to search for the optimal solution by randomly perturbing candidate solutions and accepting those that satisfy some greedy criterion such as Metropolis-Hastings. Heuristics are widely used in combinatorial optimization problems such as Concorde for the traveling salesman problem, or METIS for graph partitioning (Applegate et al., 2006; Karypis and Kumar, 1999). Some heuristic algorithms like SA are theoretically guaranteed to find the optimal solution to a problem given a low enough temperature and enough perturbations (Ingber, 1993).

However, the framework for heuristic algorithms begins the solution search from a randomly initialized candidate solution. For example, in the bin packing problem, the initial solution fed into SA would be a random assignment of objects to bins, which would then be repeatedly perturbed until convergence. Starting hill climbing from a cold start is time-consuming and limits the applicability of heuristic algorithms on practical problems.

Reinforcement learning (RL) has been proposed as a technique to yield efficient solutions to combinatorial optimization problems by first learning a policy, and then using it to generate a solution to the problem. RL has seen interesting applications in real world combinatorial optimization problems (Zoph and Le, 2016; Mirhoseini et al., 2017). However, RL lacks the theoretical guarantees of algorithms like SA, which use a hill-climbing approach and are less susceptible to problems like policy collapse. By setting the greedy criterion to only accept better solutions, SA can achieve monotonically better performance, whereas RL cannot.

Thus, a natural approach is to generate an initial solution using RL and then continuously improve this solution using heuristic algorithms like SA. Furthermore, it is advantageous for RL to learn how to provide a good initialization to SA, so that the two techniques perform best in tandem.

Figure 1. The RLHO framework.

In this paper, we address these two points by introducing the Reinforcement Learning Driven Heuristic Optimization Framework (RLHO), shown in Figure 1. There are two components in this framework: the RL agent and the heuristic optimizer (HO). The RL agent generates solutions that act as initialization for HO, and HO searches for better solutions starting from the solution generated by RL. After HO finishes executing (upon convergence or after a set number of search steps), it returns the found solution and the reward to the RL agent. Our learning process is an alternating loop of (1) generating initial solutions with RL and then (2) searching for better solutions with HO. To the RL agent, HO is part of the environment.

We apply RLHO to the bin packing problem where the RL agent is modeled using Proximal Policy Optimization (PPO) (Schulman et al., 2017) and HO is simulated annealing (SA). We demonstrate that not only does combining PPO and SA yield superior performance to PPO alone, but also that PPO is actually able to learn to generate better initialization for SA. By observing the end performance of SA on a problem, PPO can generate inputs to SA that improve the performance of SA itself.

In summary, our contributions in this paper are as follows:

  • We demonstrate a novel approach to combinatorial optimization where reinforcement learning and heuristic algorithms are combined to yield superior results to reinforcement learning alone on a combinatorial optimization problem.

  • We demonstrate that reinforcement learning can be trained to enable heuristic algorithms to achieve better performance than when the two are decoupled on a combinatorial optimization problem.

1.1. Related Work

Reinforcement learning and evolutionary algorithms achieve competitive performance on MuJoCo tasks and Atari games (Salimans et al., 2017). The idea of applying evolutionary algorithms to reinforcement learning (Moriarty et al., 1999) has been widely studied. Khadka and Tumer (2018) propose a framework to apply evolutionary strategies to selectively mutate a population of reinforcement learning policies. Maheswaranathan et al. (2018) and Pourchot and Sigaud (2018) use a gradient method to enhance evolution.

Our work is different from the above as we apply deep reinforcement learning to generate better initializations for heuristic algorithms. The heuristic part in the RLHO framework only changes the solution, rather than the parameters of the policy. To our knowledge, our work is the first that does this.

2. Combining PPO and SA

2.1. Preliminary Discussion

What is the best way to combine an RL agent and a heuristic algorithm? A first approach is to allow an RL agent to generate an initial solution to a combinatorial optimization problem, then execute a heuristic algorithm to refine this initial solution until convergence, and then train the RL policy with the rewards obtained from the performance of the heuristic algorithm. This would delineate one episode. However, on large problems, heuristics take a long time to converge. Thus, in our approach, we allow the heuristic algorithm to run for only a limited number of steps in one episode.

We now introduce the RLHO algorithm.

2.2. The RLHO Algorithm

Our approach is a two-stage process, as detailed in Algorithm 1. At the start of each episode, we first run RL for $x$ steps to generate an initial solution $s_x$. Then, we run pure HO for $y$ steps starting from $s_x$. Finally, we update RL with the cost of the final solution. We repeat this process with a fresh start every time.

Our action space consists of perturbations of the currently available solution. In our bin packing problem, discussed in more detail in Section 3, the agent is first presented with a randomly initialized assignment of items to bins. The environment then presents the agent with an item $i$. Based on the current state, the agent decides which other item $j$ should swap locations with item $i$.

For the design of the reward function, we define the intermediate reward as the difference between the cost of the previous solution and the cost of the current solution, as the goal is to minimize cost.
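As a small illustration of this reward, the sketch below computes per-step rewards from a sequence of solution costs. The function name `intermediate_rewards` and the example cost trajectory are hypothetical and not taken from the paper. One consequence of this definition is that the undiscounted rewards telescope: their sum equals the total cost reduction between the first and the last solution.

```python
def intermediate_rewards(costs):
    """Reward at each step = cost of the previous solution - cost of the current one.

    `costs` lists the cost of the solution after each perturbation,
    starting with the cost of the initial solution (hypothetical input).
    """
    return [prev - curr for prev, curr in zip(costs, costs[1:])]


# Hypothetical cost trajectory: number of bins used after each action.
costs = [12, 12, 11, 12, 10]
rewards = intermediate_rewards(costs)

# The rewards telescope, so their sum is costs[0] - costs[-1].
total = sum(rewards)
```

A positive reward therefore means the action removed bins, and a negative reward means it added them.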

When the agent’s action space consists of perturbations, the MDP for the combinatorial optimization problem has an infinite horizon. We do not have a terminal step that would normally mark the terminal state of the MDP. The agent is free to continue perturbing the state forever, and thus the value of a final state is undefined. However, our agents are trained with a finite number of steps $x$, so the value of the state $s_x$ reached after $x$ steps would normally need to be estimated with a baseline such as a value function. The value function is a poor estimator here because it cannot accurately capture the additional expected performance of the agent in the limit of time: we simply do not possess such data.

To address this, a novelty in our approach is to obtain a better estimate for this value using the performance of HO. The additional optimization performed by HO gives RL an additional training signal describing how RL actions contribute to the future return provided by HO. Therefore, RL can be trained by two signals in RLHO: (1) the intermediate reward at each RL step, and (2) the discounted future reward provided by HO, conditioned on the initialization provided by RL. This approach provides RL with a training signal to generate better initializations for HO.


As shown in Equation (1), we can replace the infinite-horizon term with a stationary, tractable value $V_{HO}(s_x)$, where $s_x$ is the solution generated by RL. We obtain $V_{HO}(s_x)$ by running pure HO for $y$ steps starting from $s_x$, and then taking the difference between the cost of $s_x$ and the cost of the final solution $s_{x+y}$ as an estimate for the value of $s_x$.
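One way to read this, offered as a sketch rather than the paper's exact Equation (1): the return for RL step $t$ is the discounted sum of the remaining intermediate rewards plus a discounted terminal value supplied by HO, $G_t = \sum_{k=t}^{x-1} \gamma^{k-t} r_k + \gamma^{x-t} \, (c(s_x) - c(s_{x+y}))$, where $c$ is the cost, $s_x$ is the solution handed over by RL, and $s_{x+y}$ is the solution after $y$ HO steps. The discount factor $\gamma$, the function name `rlho_returns`, and the example numbers are assumptions made for illustration.

```python
def rlho_returns(rewards, cost_rl_solution, cost_after_ho, gamma=0.99):
    """Discounted returns for the RL steps, bootstrapped with the HO improvement.

    rewards[t] is the intermediate reward of RL step t (t = 0 .. x-1).
    The terminal value is cost(s_x) - cost(s_{x+y}): how much HO improved
    the solution handed over by RL. This exact form is an assumption.
    """
    ho_value = cost_rl_solution - cost_after_ho
    returns = [0.0] * len(rewards)
    running = ho_value  # value assigned to the state after the last RL step
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Hypothetical episode: three RL steps, then HO removes two more bins.
returns = rlho_returns([1, 0, -1], cost_rl_solution=10, cost_after_ho=8, gamma=0.5)
```

In a PPO implementation these returns would typically serve as value targets or feed an advantage estimate; how the paper wires them in is not specified here.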

  Initialize the replay buffer $D$ and the solution $s_0$ randomly
  Initialize the number of RL steps $x$ and the number of SA steps $y$ in one episode
  for each iteration do
     Roll out the RL policy for $x$ steps and store the transitions in $D$, obtaining the initial solution $s_x$ from RL
     Run HO on $s_x$ for $y$ steps to obtain $s_{x+y}$
     Get the new reward $V_{HO}(s_x)$ as the difference of the costs of $s_x$ and $s_{x+y}$
     Train RL using $D$ and $V_{HO}(s_x)$
     Reset the solution and the hyperparameters of HO
  end for
Algorithm 1 The RLHO algorithm
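
The data flow of one episode of Algorithm 1 can be sketched as follows. This is a skeleton under stated assumptions, not the authors' implementation: the RL policy is replaced by a random stand-in, HO is replaced by greedy hill climbing, the perturbation moves a single item to another bin, infeasible moves are rejected, the fresh start places one item per bin instead of a random assignment, and the policy update itself is omitted. All function names are hypothetical.

```python
import random


def cost(assign):
    """Number of bins used; assign[i] is the bin index of item i."""
    return len(set(assign))


def feasible(assign, sizes, capacity):
    """True if no bin holds more than `capacity`."""
    load = {}
    for item, b in enumerate(assign):
        load[b] = load.get(b, 0) + sizes[item]
    return all(total <= capacity for total in load.values())


def random_move(assign, sizes, capacity, rng):
    """Hypothetical perturbation: move one item to another bin if it fits."""
    item, target = rng.randrange(len(assign)), rng.randrange(len(assign))
    candidate = assign[:item] + [target] + assign[item + 1:]
    return candidate if feasible(candidate, sizes, capacity) else assign


def run_episode(sizes, capacity, x, y, rng):
    """One episode: x policy steps, then y HO steps, then the HO signal."""
    solution = list(range(len(sizes)))  # fresh start: one item per bin
    transitions = []
    for _ in range(x):  # stand-in for the RL policy (random, not learned)
        nxt = random_move(solution, sizes, capacity, rng)
        transitions.append((solution, nxt, cost(solution) - cost(nxt)))
        solution = nxt
    s_x = solution
    for _ in range(y):  # stand-in HO: accept only non-worsening moves
        nxt = random_move(solution, sizes, capacity, rng)
        if cost(nxt) <= cost(solution):
            solution = nxt
    ho_value = cost(s_x) - cost(solution)  # extra signal passed back to RL
    # A real agent would now be trained on `transitions` and `ho_value`.
    return transitions, s_x, solution, ho_value


rng = random.Random(0)
transitions, s_x, s_final, ho_value = run_episode(
    sizes=[4, 3, 3, 2, 2, 1], capacity=10, x=5, y=50, rng=rng)
```

Because the stand-in HO never accepts a worse solution, `ho_value` is non-negative in this sketch; with simulated annealing it could also be negative.
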
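  Initialize the temperature $T$ and the maximal number of steps of SA in one path, $y$
  Obtain the PPO solution $s$
  for $i = 1, \dots, y$ do
     Perturb the current solution $s$ randomly, get $s'$
     if $c(s') < c(s)$ then
        $s \leftarrow s'$
     else
        $s \leftarrow s'$ with probability $\exp(-(c(s') - c(s)) / T)$
     end if
  end for
Algorithm 2 Simulated Annealing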
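
A minimal Python sketch of simulated annealing on bin packing follows. Several details are assumptions rather than the paper's choices: the perturbation moves one random item to a random bin, infeasible packings are rejected outright, the temperature is held fixed with no cooling schedule, acceptance uses the standard Metropolis criterion, and the best solution seen so far is returned.

```python
import math
import random


def cost(assign):
    """Number of bins used; assign[i] is the bin index of item i."""
    return len(set(assign))


def fits(assign, sizes, capacity):
    """True if every bin's total load is within capacity."""
    load = {}
    for item, b in enumerate(assign):
        load[b] = load.get(b, 0) + sizes[item]
    return max(load.values()) <= capacity


def simulated_annealing(start, sizes, capacity, steps, temperature, rng):
    current, best = list(start), list(start)
    for _ in range(steps):
        item = rng.randrange(len(current))
        candidate = list(current)
        candidate[item] = rng.randrange(len(current))  # hypothetical move
        if not fits(candidate, sizes, capacity):
            continue  # assumption: infeasible packings are never accepted
        delta = cost(candidate) - cost(current)
        # Metropolis criterion: always accept non-worsening moves, accept a
        # worse solution with probability exp(-delta / temperature).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current = candidate
            if cost(current) < cost(best):
                best = list(current)
    return best


# Hypothetical instance: total size 20 and capacity 10, so at least 2 bins.
sizes = [5, 5, 4, 3, 3]
start = [0, 1, 2, 3, 4]  # one item per bin
best = simulated_annealing(start, sizes, 10, 2000, 0.5, random.Random(0))
```

In the RLHO setting, `start` would be the solution produced by the RL agent instead of a fixed assignment.
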

3. Performance Evaluation

We validate our methods on the bin packing problem. In this section we first introduce the bin packing problem, and then discuss the performance gain obtained when combining the RL part (PPO) and the heuristic optimizer (SA) in our RLHO framework. The details of SA are shown in Algorithm 2.

3.1. The Bin Packing Problem

Bin packing is a classical combinatorial optimization problem where the objective is to use the minimum number of bins to pack items of different sizes, with the constraint that the sum of the sizes of the items in one bin is bounded by the size of the bin. Let $n$ denote the number of bins and $m$ the number of items, and let $v = (v_1, \dots, v_m)$ denote the vector of the sizes of all items. Let $X \in \{0,1\}^{m \times n}$ be the 0/1 matrix that represents one assignment of items to bins (a packing), i.e., $X_{ij} = 1$ means that item $i$ is put in bin $j$. Given a packing $X$, let $c(X)$ denote the cost, the number of bins used in this solution, i.e., $c(X) = |\{ j : \sum_{i} X_{ij} > 0 \}|$.
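
To make the definition concrete, the sketch below evaluates the cost and the capacity constraint for a small 0/1 packing matrix. The function names, the instance, and the assumption that all bins share one capacity are illustrative only.

```python
def bins_used(packing):
    """Cost of a packing: number of bins (columns) holding at least one item.

    packing[i][j] == 1 means item i is placed in bin j.
    """
    return sum(1 for column in zip(*packing) if any(column))


def is_valid(packing, sizes, capacity):
    """Each item sits in exactly one bin and no bin exceeds its capacity."""
    one_bin_each = all(sum(row) == 1 for row in packing)
    loads = [sum(x * s for x, s in zip(column, sizes)) for column in zip(*packing)]
    return one_bin_each and all(load <= capacity for load in loads)


# Hypothetical instance: 3 items of sizes 4, 3, 2 and 3 bins of capacity 7.
sizes = [4, 3, 2]
packing = [
    [1, 0, 0],  # item 0 -> bin 0
    [1, 0, 0],  # item 1 -> bin 0
    [0, 0, 1],  # item 2 -> bin 2
]
used = bins_used(packing)          # bins 0 and 2 are occupied
valid = is_valid(packing, sizes, 7)
```

Here bin 0 carries a load of 7, so the same packing would violate the constraint for any capacity below 7.
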

3.2. Learning to Generate Better Initializations

We evaluate the ability of RLHO to generate better initializations for heuristic algorithms. In this set of experiments, during training, we allow RLHO to generate an initialization using RL for $x$ timesteps, and then run HO for $y$ timesteps. After the training episodes, we take the initialization generated by the RL step of RLHO and use it to initialize an HO that will run until convergence.

Table 1 and Table 2 report the average number of used bins of the best solution during training, over 5 independent trials, for two different settings of the number of HO steps. As a baseline, we also report results where random perturbations (Random) are used instead of RL to generate the initial solutions. We collect results for 10000 iterations of running RLHO and Random until convergence.

Our results show that RLHO learns better initializations for HO than Random, and the absolute performance gap grows with problem size (from 10 bins at size 100 to 20 bins at size 1000 in both tables). The training signal provided by HO performance, used to augment the value function, does help RLHO enable heuristic algorithms to perform better. Most interestingly, when the RL part of RLHO is trained using a signal from SA that is run for only 5000 steps, the initialization it generates is still effective for SA that runs until convergence, e.g., for millions of timesteps.

Problem size   RLHO    Random, then HO
100            59      69
200            128.4   141
500            347     361
1000           714     734
Table 1. Average cost (number of bins used) of the best solution found by each algorithm, first setting of the number of HO steps.

Problem size   RLHO    Random, then HO
100            59      69
200            127     141
500            344.4   359
1000           711     731
Table 2. Average cost (number of bins used) of the best solution found by each algorithm, second setting of the number of HO steps.
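
As a quick check on how the gap between RLHO and the Random baseline scales, the sketch below recomputes it from the values in Table 1 and Table 2. The helper `gaps` is hypothetical, and the relative figures are a derived view that the tables do not report directly. In absolute terms the gap grows from 10 to 20 bins as the problem size goes from 100 to 1000, while relative to the baseline it shrinks from roughly 14.5% to roughly 2.7%.

```python
# Values copied from Table 1 and Table 2: size -> (RLHO, Random then HO).
table1 = {100: (59, 69), 200: (128.4, 141), 500: (347, 361), 1000: (714, 734)}
table2 = {100: (59, 69), 200: (127, 141), 500: (344.4, 359), 1000: (711, 731)}


def gaps(table):
    """Absolute and relative (percent of baseline) bin savings per problem size."""
    out = {}
    for size, (rlho, baseline) in table.items():
        saved = round(baseline - rlho, 1)
        out[size] = (saved, round(100 * saved / baseline, 1))
    return out


gaps1 = gaps(table1)
gaps2 = gaps(table2)
```
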

3.3. Having RL and HO Work Together

Now we extend our experimental evaluation to answer the following question: can HO help RL train better? Can running HO after an RL training step help RL explore better states?

We adjust RLHO to perform alternating optimization on a combinatorial optimization problem. RL will generate a solution, which will then be optimized by HO. RL will then be trained with additional signal from HO. The same solution will then be passed back to RL for continuous optimization. This differs from our previous approach because we do not reset the solution on each episode. The greedy nature of HO will perform hill climbing, allowing RL to see more optimal states throughout training.

Problem size   RL     RLHO
50             22     22
100            50     50
200            102    101
500            283    266
1000           613    601
Table 3. Average cost of the best solution found by each algorithm (first of the two settings compared).

Problem size   RL     RLHO
50             22     22
100            50     50
200            102    101
500            283    265
1000           613    572
Table 4. Average cost of the best solution found by each algorithm (second of the two settings compared).

Figure 2. Training performance on 500 bins.

We run the two algorithms side by side to evaluate our approach. Table 3 and Table 4 show the average number of used bins of the best solution (over 5 independent runs) found by both algorithms during training. For RL, we simply keep running PPO without any SA. In RLHO, PPO learns from SA. We choose to set , and the initial temperature of SA to be . We compare the performance of the two algorithms in terms of the number of steps the RL policy performs, with the hyperparameters of the RL part of both approaches kept constant. We also evaluate our approaches on different sizes of the bin packing problem. We report results after 2000 iterations of the alternating optimization.

The convergence curves of all approaches are shown in Figure 2. We conclude that the pure RL algorithm is more sample efficient but performs worse as the RL algorithm has no additional outlet for exploration. RLHO achieves better performance because it adopts the HO to perform better exploration.

4. Conclusion

In this paper, we propose a novel Reinforcement Learning Driven Heuristic Optimization framework that applies reinforcement learning to learn better initialization for heuristic optimization algorithms. We present the RLHO learning algorithm which builds upon Proximal Policy Optimization and Simulated Annealing. Experimental results on the bin packing problem show that the RLHO learning algorithm does indeed learn better initialization for heuristic optimization, outperforming pure reinforcement learning algorithms. Our approach can be applied towards combinatorial optimization problems that have real world applications.

We hope to further evaluate our methodology on a broad range of other combinatorial optimization problems such as TSP, graph partitioning, and integer programming, with other heuristic algorithms such as evolutionary strategies to demonstrate the power of our approach. We also plan on providing a better and theoretically motivated estimator of heuristic performance to the reinforcement learning agent.