1 Introduction and Motivation
Reinforcement learning (RL) algorithms that combine neural networks and tree search have delivered outstanding successes in twoplayer games such as go, chess, shogi, and hex. One of the main strengths of algorithms like AlphaZero [13] and Expert Iteration [1] is their capacity to learn tabula rasa through selfplay. Historically, RL with selfplay has also been successfully applied to the game of Backgammon [14]. Using this strategy removes the need for training data from human experts and always provides an agent with a wellmatched adversary, which facilitates learning.
While selfplay algorithms have proven successful for twoplayer games, there has been little work on applying similar principles to singleplayer games/problems [10]. These games include several wellknown combinatorial problems that are particularly relevant to industry and represent realworld optimization challenges, such as the traveling salesman problem (TSP) and the bin packing problem (BPP).
This paper describes the Ranked Reward
(R2) algorithm and results from its application to 2D and 3D BPP formulated as a singleplayer Markov decision process (MDP). R2 uses a deep neural network to estimate a policy and a value function, as well as Monte Carlo tree search (MCTS) for policy improvement. In addition, it uses a reward ranking mechanism to build a singleplayer training curriculum that provides advantages comparable to those produced by selfplay in competitive multiagent environments.
The R2 algorithm offers a new generic method for producing approximate solutions to NPhard optimization problems. Generic optimization approaches are typically based on algorithms such as integer programming [7], that provide optimality guarantees at a high computational expense, or heuristic methods that are lighter in terms of computation but may produce unsatisfactory suboptimal solutions. The R2 algorithm outperforms heuristic approaches while scaling better than optimization solvers. We present results showing that, on 2D and 3D BPP, R2 performs better than the same deep RL algorithm without a ranked reward mechanism and also better than MCTS [5], the Lego heuristic algorithm [8]
and linear programming with barrier functions
[7]. We also analyze the reward ranking mechanism. In particular, we evaluate different ranking thresholds for deciding whether an episode/game should be considered a win or a loss and the effects on the overall learning process.In Section 2
of this paper, we summarize the current stateoftheart in deep learning for games with large search spaces. Then, in Section
3, we present a singleplayer MDP formulation of the bin packing problem. In Section 4 we describe the R2 algorithm using deep RL and tree search along with a reward ranking mechanism. Section 5 presents our experiments and results, and Section 6 discusses the implications of using different reward ranking thresholds. Finally, Section 7 summarizes current limitations of our algorithm and future research directions.2 Related Work
Combinatorial optimization problems are widely studied in computer science and mathematics. A large number of them belongs to the NPhard class of problems. For this reason, they have traditionally been solved using heuristic methods [11, 4, 6]. However, these approaches may need handcrafted adaptations when applied to new problems because of their problemspecific nature.
Deep learning algorithms potentially offer an improvement on traditional optimization methods as they have provided remarkable results on classification and regression tasks [12]. Nevertheless, their application to combinatorial optimization is not straightforward. A particular challenge is how to represent these problems in ways that allow the deployment of deep learning solutions. One way to overcome this challenge was introduced by Vinyals et al. [15] through Pointer Networks
, a neural architecture representing combinatorial optimization problems as sequencetosequence learning problems. Early Pointer Networks were trained using supervised learning methods and yielded promising results on the TSP but required datasets containing optimal solutions which can be expensive, or even impossible, to build. Using the same network architecture, but training with actorcritic methods, removed this requirement
[3].Unfortunately, the constraints inherent to the bin packing problem prohibit its representation as a sequence in the same way as the TSP. In order to get around this, Hu et al. [9] combined a heuristic approach with RL to solve a 3D version of the problem. The main role of the heuristic is to transform the output sequence produced by the RL algorithm into a feasible solution so that its reward signal can be computed. This technique outperformed previous welldesigned heuristics.
2.1 Deep Learning with Tree Search and SelfPlay
Policy iteration algorithms that combine deep neural networks and tree search in a selftraining loop, such as AlphaZero [13] and Expert Iteration [1], have exceeded human performance on several twoplayer games. These algorithms use a neural network with weights to provide a policy and/or a state value estimate for every state
of the game. The tree search uses the neural network’s output to focus on moves with both high probabilities according to the policy and highvalue estimates. The value function also removes any need for Monte Carlo rollouts when evaluating leaf nodes. Therefore, using a neural network to guide the search reduces both the breadth and the depth of the searches required, leading to a significant speedup. The tree search, in turn, helps to raise the performance of the neural network by providing improved MCTSbased policies during training.
Selfplay allows these algorithms to learn from the games played by both players. It also removes the need for potentially expensive training data, often produced by human experts. Such data may be biased towards human strategies, possibly away from better solutions. Another significant benefit of selfplay is that an agent will always face an opponent with a similar performance level. This facilitates learning by providing the agent with just the right curriculum in order for it to keep improving [2]. If the opponent is too weak, anything the agent does will result in a win and it will not learn to get better. If the opponent is too strong, anything the agent does will result in a loss and it will never know what changes in its strategy could produce an improvement. The main contribution of the R2 algorithm is a relative reward mechanism for singleplayer games, providing the benefits of selfplay in singleplayer MDPs and potentially making policy iteration algorithms with deep neural networks and tree search effective on a range of combinatorial optimization problems.
3 Bin Packing as a Markov Decision Process
The classical bin packing problem involves packing a set of items into fixedsized bins in a way that minimizes a cost function, e.g. the number of bins required. The work presented here considers a variation of the problem where the objective is to pack a set of items into a single bin while minimizing its surface, like in the work of Hu et al. [9, 8].
3.1 Bin Packing Problem
The problem involves a set of cuboid shaped items where and denote the length, width and height of item . Items can be rotated of along and axis, and denotes how the th item is rotated. The bottomleftfront corner of the th item placed inside the bin is denoted by with the bottomleftfront corner of the bin set to . The problem also includes additional constraints, complexifying the environment and reducing the number of available positions in which an item can be placed. In particular, items may not overlap and an item’s center of gravity needs physical support. A solution to this problem is a sequence of quintuplets where all items are placed inside the bin while satisfying all the constraints. The objective is then to minimize the surface of the minimal bin which contains all the items. The problem can also be reduced to 2D, where the bin is of shape and each item have only two possibilities of orientation. The objective is then to minimize the perimeter of the minimal bin.
3.2 Markov Decision Process
As opposed to Hu et al. [9, 8], which address the BPPs via sequencetosequence methods, we formulate the problem as an MDP, where state encodes all the items and their current placement if placed and action encodes the possible positions and orientations of the unplaced items. The goal of the agent is to select actions in a way that minimizes the cost. This is reflected in the design of the reward . As defined in Equation 1, all nonterminal states receive a reward of while terminal states receive a reward function of the final solution’s quality.
(1) 
where costs and denote respectively the cost of the minimal bin at terminal state and of an ideal cube (or square) whose volume (or area) equals to the sum of the volumes (or areas) of all items. The definition of the cost of a bin is:
4 The R2 Algorithm
When using selfplay in twoplayer games, a funny agent faces a perfectly suited adversary at all times because no matter how weak or strong it is, the opponent always provides just the right level of opposition for the agent to learn from [2]. The R2 algorithm reproduces the benefits of selfplay for generic singleplayer MDPs by reshaping the rewards of a single agent according to its relative performance over recent games. A detailed description is given by Algorithm 1. Below we present the ranking mechanism and network model used in detail.
4.1 Ranked Rewards
The ranked reward mechanism compares each of the agent’s solutions to its recent performance so that no matter how good it gets, it will have to surpass itself to get a positive reward. In particular, R2 uses a sizelimited buffer to record recent MDP rewards and calculate a threshold value based on a given percentile , e.g., the threshold value is the MDP reward value closest to the th percentile of the MDP rewards in the buffer. The agent’s MDP reward is reshaped to a ranked reward according to whether or not it surpasses the threshold value:
(2) 
where
is sampled from a binary random variable
such that when , equals to randomly. This way, the player is provided with samples of recent games labeled relatively to the agent’s current performance, providing information on which policies will improve its present capabilities. The random variable is used to break ties and assure constantlearning. Indeed, if we set to when , the agent does not have an incentive to beat the threshold since it can obtain positive rewards by staying at its current performance level. The ranked rewards are then used as targets for the value estimation neural network and as the value to backup for terminal nodes during MCTS.4.2 Neural Network Architecture
The input of the neural network consists of the set of feasible actions where each action consists of features describing the chosen item, id and orientation, as well as the placement location. The network architecture satisfies two critical requirements. First, it is permutation invariant, i.e. any permutation of the input set results in the same output permutation. Second, the model is able to process input sets of any size since the size of the available action space varies as the items are being placed.
Each action in the action space is fed independently into a feedforward network taking fixedsize inputs. The resulting featurespace embeddings are aggregated using pooling operations. The final output is obtained by combining these with the embeddings through further nonlinear processing to obtain the agent’s policy and its statevalue estimate.
5 Experiments and Results
To validate our approach and evaluate its effectiveness, we first considered 2D and 3D BPP with items. The items were generated by repeatedly dividing a square/cube of size . The process for generating items randomly is presented in detail in Appendix A. The training process is as follows: at each iteration, problems are generated and solved by R2 with a reward buffer of size , used to define the threshold . All MCTS instances perform simulations per move. The reshaped rewards alongside the MCTSimproved policies are stored in a dataset and used during training. The neural network is trained by gradient descent (Adam) using minibatches of size , uniformly sampled from the last games. At each iteration, steps of gradient descent are performed.Each experiment ran on a NVIDIA Tesla V100 GPU card for up to two days.
5.1 Ranked Reward with Different Thresholds
We first compared the performance of the R2 algorithm with three different percentiles: , and . We also included a version using the MDP reward directly without ranking as the target for the value function estimate. The learning curves are presented in Figure 1 for games with 10 items in 2D and 3D. R2 with an percentile of achieved the best training performances in both cases. Regardless of the threshold, R2 outperforms RankFree in both mean rewards and optimality percentages with a faster convergence. We later evaluated the different threshold on problems with 20, 30 and 50 items. On these larger problems, the threshold produced the best final performance. The details of these evaluations are given in Appendix C.
5.2 Comparison with Other Methods
To evaluate the performance of R2, we compared Rank with a set of baselines as comparison: trained neural network agent without MCTS; a supervised agent; a plain MCTS agent using MonteCarlo rollouts for statevalue estimation [5]; the Lego heuristic search algorithm [9, 8]; and a solver agent using integer programming [7] (see appendix B for further details). We used games as the test set for all the algorithms.
As shown in Figure 2a and Figure 2b, R2 always outperformed its alternatives. Especially, we can observe that the trained network performed at least as good as the pure MCTS algorithm and Lego heuristic, which proves that the architecture of the neural network is well designed and it is capable of learning good policies. With the help of specifically designed training data, the supervised agent achieved also a superior performance compared to the pure MCTS and Lego heuristic on small instances but the performance decreased quickly as the size of problems increased. Concerning the Gurobi solver, it was given five minutes, around the same amount of time that R2 algorithm used. The solver was able to find optimal solutions for problems with items, but for problems with more items, sometimes feasible solutions could barely be found. Hence the rewards vanished to zero (which are not shown in the figures). Note that the algorithm performed better in 2D with items since the total volume is fixed and the items sizes are reduced. Therefore the problems become simpler.
Furthermore, we tested the trained network with different percentiles on the same set of problems to compare the generalization ability. The results are presented in Appendix C and we found that although Rank achieved better training performance, Rank was more robust and had even a better test performance. For illustration, we provide in Figure 3 a visualization of solutions given by Lego and Rank75% in 2D and 3D.
6 Discussion
In this section, we present our analyzes of two critical facets of learning with ranked rewards. The first if the issue of constructing an effective ranking when the agent plays game instances with different level of difficulty. The second is the issue of identifying an optimal threshold and analyzing the potential benefits of, and problems with, high and low ranking thresholds.
6.1 Ranking the Performance on Games of Different Difficulty
When using selfplay in twoplayer games, the sequence of actions which lead to victory is superior to the sequence leading to a loss. This pushes the agent’s policy in the direction of the winner’s actions. However, when using ranked reward, there is no direct relationship between the sequences of actions of the agent on two different games. The agent might have performed poorly on one game because the difficulty level of that particular game was higher than most other games. This effect introduces significant amounts of noise to the reward signal.
A workaround would be to include the difficulty level in the ranking process, such that poor game outcomes still obtain good rankings for complex game instances. However, this approach would require access to the difficulty level of the games, which is unlikely to happen. A second approach would incorporate randomness, e.g. additional noises into the prior distribution of the policy neural network, into the agent’s behavior while making it play the same game multiple times. By this mechanism, we can assure comparability of the game outcomes and correlations between these and the agents’ policies. We tried the second approach on the bin packing problem but no further improvement was noticed due to the relative similarity of the difficulty level of the games^{1}^{1}1
The empirical reward distribution of the Lego heuristic corresponds to a bellshape with a small standard deviation, which suggests that different games have relatively the same difficulty level.
. The Ranked Reward algorithm as described in this paper is therefore appropriate for problems in which different problem instances have relatively the same difficulty level. For complex problems in which difficulty could vary widely from one instance to another, R2 would certainly benefit from applying the mechanism above to ensure comparability in the game outcomes.6.2 The Effects of Ranking Thresholds on Learning
Although the performances of R2 are relatively robust to the percentile used, we studied the difference of learning performances when using other percentile values. We analyzed and interpreted the evolution of the reward distribution over time for four different cases, rankfree and values of , and on 2D BPP with 10 items, as shown in Figure 4.
Figures 4a and 4b show that, although better solutions dominate, optimal solutions are not picked up convincingly in the rankfree and ranked cases. In the rankfree case, the difference between the optimal MDP reward and the nearest suboptimal MDP reward is relatively small, potentially making it difficult to distinguish between them effectively. In the ranked case, Figure 4c, all the games in the top half of the buffer receive a ranked reward of . This provides positive feedback to a significant number of suboptimal games and it is potentially this effect that makes the convergence slow in this case. In the ranked case, only the top quarter of the buffer receive a ranked reward of . This effectively picks up the optimal solutions and expels suboptimal solutions from the buffer almost completely. The ranked case, Figure 4d, also converges quickly, but less quickly than the ranked case. In this case, a smaller number of good solutions will receive positive feedback, giving the learning a much smaller reward signal initially. This could be the cause of the slower convergence rate, though optimal solutions always receive a positive reward.
The impact of the percentile on the performance follows our intuitive understanding of human learning. Setting the threshold at is equivalent to making the agent play against an opponent of the same level, as it has a predetermined chance of winning. Increasing the percentile value corresponds to improving the opponent’s level, as it makes it harder to obtain a reward of . In our context, when the percentile changes from to , the probability of winning falls to . In general, higher thresholds lead to faster learning, i.e., the proportion of highreward games increases faster. However, Figure 4d shows that, for a threshold of , a larger residue of lowreward games remains. Although this effect is stronger for 3D problems. These instabilities could explain the weaker final performance of the higher thresholds. To explain this, we can hypothesize that if the opponent is too strong, the learning process will suffer as the agent can very rarely affect the game outcome even when it plays significantly better than its current mean performance level.
7 Conclusion and Future Work
The results presented in this work show that R2 outperforms the selected alternatives both on the 2D and 3D bin packing problems with 10, 20, 30 and 50 items. In particular, the capacity of the algorithm to outperform its competitors in large instances makes it suitable for solving reallife problem instances.
By ranking the rewards obtained over recent games, R2 provides a thresholdbased relative performance metric. This enables it to reproduce the benefits of selfplay for singleplayer games, removing the requirement for training data and providing a wellsuited adversary throughout the learning process. Consequently, R2 outperforms the selected alternatives as well as its rankfree counterpart, improving on the performance of the best alternative, the Gurobi solver, by more than when using a threshold value of . As the number of items grows, the difference can reach up to . An analysis of the effects of different percentiles has shown that higher thresholds perform better up to a point after which learning becomes unstable and performance decreases.
For now, our implementation of the bin packing problem only considers problems that do not contain any spare space, i.e., square packings with no gaps. Even though this helps us to evaluate the algorithm’s performance, it introduces an undesirable bias. Future research should evaluate the algorithm on a wider range of problems, for which the optimal solution is unknown and not necessarily square.
The R2 algorithm is potentially applicable to a wide range of optimization tasks, though it has so far been used only on the bin packing. In the future, we will consider other optimization problems such as the Traveling Salesman Problem to further evaluate its effectiveness.
References
 [1] Thomas Anthony, Zheng Tian, and David Barber. Thinking fast and slow with deep learning and tree search. In Advances in Neural Information Processing Systems (NIPS) 30, pages 5360–5370. 2017.
 [2] Trapit Bansal, Jakub Pachocki, Szymon Sidor, Ilya Sutskever, and Igor Mordatch. Emergent complexity via multiagent competition. arXiv:1710.03748, 2017.
 [3] Irwan Bello, Hieu Pham, Quoc V. Le, Mohammad Norouzi, and Samy Bengio. Neural combinatorial optimization with reinforcement learning. CoRR, abs/1611.09940, 2016.
 [4] V. Boyer, M. Elkihel, and D. El Baz. Heuristics for the 0–1 multidimensional knapsack problem. European Journal of Operational Research, 199(3), 2009.
 [5] C. B. Browne, E. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1–43, 2012.
 [6] Alberto Colorni, Marco Dorigo, Francesco Maffioli, Vittorio Maniezzo, Giovanni Righini, and Marco Trubian. Heuristics from nature for hard combinatorial optimization problems. International Transactions in Operational Research, 3(1):1–21, 1996.
 [7] LLC Gurobi Optimization. Gurobi optimizer reference manual, 2018.
 [8] Haoyuan Hu, Lu Duan, Xiaodong Zhang, Yinghui Xu, and Jiangwen Wei. A multitask selected learning approach for solving new type 3D bin packing problem. arXiv:1804.06896, 2018.
 [9] Haoyuan Hu, Xiaodong Zhang, Xiaowei Yan, Longfei Wang, and Yinghui Xu. Solving a new 3D bin packing problem with deep reinforcement learning method. arXiv:1708.05930, 2017.
 [10] Thomas M. Moerland, Joost Broekens, Aske Plaat, and Catholijn M. Jonker. A0C: Alpha zero in continuous action space. arXiv:1805.09613, 2018.
 [11] César Rego, Dorabela Gamboa, Fred Glover, and Colin Osterman. Traveling salesman problem heuristics: Leading methods, implementations and latest advances. European Journal of Operational Research, 211(3):427–441, 2011.
 [12] Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015.
 [13] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy P. Lillicrap, Karen Simonyan, and Demis Hassabis. Mastering chess and shogi by selfplay with a general reinforcement learning algorithm. arXiv:1712.01815, 2017.
 [14] Gerald Tesauro. TDgammon, a selfteaching backgammon program, achieves masterlevel play. Neural Computation, 6(2):215–219, 1994.
 [15] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. In Advances in Neural Information Processing Systems (NIPS) 28, page 2692–2700, Montreal, Quebec, Canada, December 712 2015.
Appendix A Generation of Bin Packing Problems
Due to the lack of existing datasets for the bin packing problems, we generated instances artificially via splitting a cube (or square) randomly. The detailed algorithm is given in Algorithm 2. Especially, the information of optimal solution is not used in the definition of the problem and the definition of the final reward is applicable for problems without the knowledge of the optimal solution.
Appendix B Benchmark Algorithms
Details of the benchmark algorithms

Plain MCTS The plain MCTS agent used simulations per move just like R2 and executed a single Monte Carlo rollout per simulation to estimate state values.

Lego Heuristic The Lego algorithm worked sequentially by first selecting the item minimizing the wasted space in the bin, and then selecting the orientation and position of the chosen item to minimize the bin size.

Supervised Learning The BPP instances were generated with a known optimal solution for each problem, we designed a Legolike heuristic algorithm defining a corresponding optimal sequence of actions for such solution. We used the stateaction pairs to train the policyhead of the R2 neural network as a oneclass classification problem: given state .

Gurobi Solver With exactly the same constraints as reinforcement learning environments, we wrote two mathematical models for 2D and 3D problems and used Gurobi solver to solve them. Precisely, all constraints are linear and the objective is linear for 2D and quadratic for 3D. Solver were given CPUs and five minutes in total (roughly the same amount of time as R2 agents with MCTS) and it reported the best found feasible solution.
Appendix C Comparison of the Results for Different Threshold Values
Here we present the performance of the networks trained with the R2 algorithm, without doing MCTS simulations.
[width=8em,trim=l]AlgoItems  10  20  30  50 

RankFree 

Rank  
Rank  
Rank 
[width=8em,trim=l]AlgoItems  10  20  30  50 

RankFree  
Rank  
Rank  
Rank 
Comments
There are no comments yet.