Atari-fying the Vehicle Routing Problem with Stochastic Service Requests

11/14/2019 ∙ by Nicholas D. Kullman, et al. ∙ Saint Louis University ∙ University of Tours

We present a new general approach to modeling research problems as Atari-like videogames to make them amenable to recent groundbreaking solution methods from the deep reinforcement learning community. The approach is flexible, applicable to a wide range of problems. We demonstrate its application on a well known vehicle routing problem. Our preliminary results on this problem, though not transformative, show signs of success and suggest that Atari-fication may be a useful modeling approach for researchers studying problems involving sequential decision making under uncertainty.




1 Introduction

Deep reinforcement learning (RL) has seen much recent success in tasks that involve sequential decision making under uncertainty. These tasks span a variety of domains, including natural language processing (He et al., 2015), image recognition (Caicedo and Lazebnik, 2015), healthcare (Liu et al., 2017), energy (Mocanu et al., 2018), taxi dispatching (Kullman et al., 2019), and autonomous driving (Sallab et al., 2017). Perhaps the most well-known application domain is that of playing games. Deep reinforcement learning has been used to train agents capable of superhuman performance in games such as chess (Silver et al., 2018), Go (Silver et al., 2018), Doom (Lample and Chaplot, 2017), Texas Hold’em Poker (Heinrich and Silver, 2016), StarCraft II (Vinyals et al., 2019), and, of particular interest here, Atari (Mnih et al., 2015).

In the case of Atari, Mnih et al. established a single solution method — a single agent architecture — that was capable of outperforming humans on the majority of a set of 49 Atari games. The use of a single architecture to accomplish this feat is notable as these games differ in their appearance, goals, rewards, actions, etc. Given the diversity of the games on which this architecture was tested and shown to excel, one wonders whether it would perform comparably well on any game with a similar (Atari-like) format. Perhaps not, in which case it would suggest that there is something unique about the original set of Atari games that makes them susceptible to exploitation by this particular solution method. However, this seems unlikely. Rather, it seems more likely that given a game formatted similarly to those in the original set of Atari games from Mnih et al. (2015), a solution method like the agent architecture from that study would be able to conquer this new game as well.

Should this premise hold, it presents the opportunity of using this agent architecture to solve research problems, provided that those problems can be properly formatted as an Atari-like game. (We do not provide a precise definition of what makes a game Atari-like, but broadly mean that it is a two-dimensional third-player game with a total pixel count on the order of or less.) We explore this opportunity here. For shorthand, we will refer to the process of modeling a research problem as an Atari-like game as Atari-fying. Our primary contribution in this work is demonstrating the Atari-fication of a well-known problem from the field of transportation, namely, the vehicle routing problem with stochastic service requests (VRPSSR). We discuss the merit of these efforts, their generalizability, and the expected benefits and limitations of this approach, and we speculate on additional research problems for which Atari-fication may prove successful. Broadly, we hope to show that there is an opportunity to extend the groundbreaking achievements from the deep reinforcement learning community to research problems in other domains through reformulations of classical problem models.

2 Background & Related Work

We provide a brief background on deep reinforcement learning (§2.1) and vehicle routing problems (§2.2), then look at previous examples of the modeling of research problems as videogames (§2.3).

2.1 Deep Reinforcement Learning

Reinforcement learning (RL), as defined by Sutton et al., refers to the process through which an agent, sequentially interacting with some environment, learns what to do (that is, a policy that determines how to map states to actions) so as to maximize a numerical reward signal from the environment (Sutton et al., 1998). Often what the agent seeks to learn is the value of choosing a particular action $a$ from some state $s$, known as the state-action pair’s Q-value ($Q(s,a)$), equal to the immediate reward plus the expected sum of future rewards earned as a result of taking action $a$ from state $s$. With knowledge of the Q-values for all possible state-action pairs, the agent’s policy is then to choose the action with the largest Q-value. In practical problems, because the number of unique state-action pairs an agent could encounter in its environment is too large to learn and store a value for each, a functional approximation of these Q-values is learned. There are competing methods to learn this value function approximation (VFA). When deep artificial neural networks (ANNs) are used for this VFA, the method is called deep reinforcement learning or deep Q-learning, and the ANN used in this process is referred to as the deep Q network (DQN). While deep RL has a history dating back at least to Farley and Clark (1954), it has recently seen a surge in use, thanks to advances in computational performance, data availability, and subsequent methods development.
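The Q-value and the greedy policy described above can be written compactly as follows. This is the standard textbook form rather than notation taken from this paper: the discount factor $\gamma$, the time-indexed rewards $r_t$, and the policy symbol $\pi$ are assumed here for illustration.

```latex
Q(s, a) = \mathbb{E}\left[\, r_{0} + \sum_{t=1}^{\infty} \gamma^{t} r_{t} \;\middle|\; s_{0} = s,\ a_{0} = a \right],
\qquad
\pi(s) = \arg\max_{a} Q(s, a).
```

The first term $r_0$ is the immediate reward, and the sum collects the (discounted) rewards expected afterwards when the agent continues to follow $\pi$.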

Deep RL offers flexibility in how the agent perceives the state, capable of effective VFA under a variety of representations (that is, under a variety of input formats to the DQN). Often the state representation is simply a list of relevant quantities describing, e.g., position and velocity. Such is usually the case in classical RL control problems and in applications such as robotics (see, e.g., those from Brockman et al. (2016)). An alternative representation is that of an image or set of images: the agent receives a visual description of the environment from which it makes decisions. This approach has a biological precedent, as it is how humans solve many problems — we process received visual information to understand and respond to a situation. This visual state representation is what was used in Mnih et al.’s work on Atari and, as such, is the representation we adopt here.

2.2 Vehicle Routing Problems

Vehicle Routing Problems (VRPs) are a broad class of NP-hard problems that seek to determine the optimal routes for vehicles to follow to perform some task, such as the delivery of goods to a set of customers. A vast body of literature exists for VRPs, and VRPs remain an active research topic in operations research (see Toth and Vigo (2002) for a summary). VRPs come in many variants, accommodating various combinations of constraints and uncertainties, such as the VRP with a capacitated vehicle (CVRP), the VRP with customers that have time-windows (VRPTW), the VRP with an electric vehicle (E-VRP), the E-VRP with public charging stations (E-VRP-PP), the VRP with stochastic service requests (VRPSSR), with stochastic customer demands (VRPSD), with stochastic travel times (VRPSTT), with stochastic demands and time windows (VRPSDTW), etc.

Stochastic variants of the VRP are of particular interest here, because they more naturally lend themselves to Atari-fication: first, because the way that information is revealed over time in these problems is similar to the appearance of new obstacles or enemies in videogames; and second, they permit solution methods that allow for responding to this information (i.e., dynamic routing), similar to how players can react to new obstacles in videogames (e.g., dodge a bullet and return fire). Here we focus on the VRP with stochastic service requests (VRPSSR). The problem has been well studied in the literature, for example, in Gendreau et al. (1999), Bent and Van Hentenryck (2004), and Ulmer et al. (2018). Inspiration for the exact problem addressed here, described in more detail in §3, comes specifically from Ulmer et al. (2018). While the problem has been addressed with multiple solution methods, the majority are based on reoptimization, a method in which a math program is constructed and solved every time a decision is needed. In addition to the Atari-fication of this problem, our approach here is also novel in that it marks the first application of deep RL to this problem.

2.3 Research Problems as Videogames

Successful attempts at modeling research problems as videogames exist, although these games have been developed for different purposes than that proposed here. Typically, these games have aimed to crowdsource human efforts to contribute to a research problem whose scale or complexity renders it difficult to solve via traditional algorithmic approaches. Human input in these games is then either used directly as a smaller component to a larger solution (as in Eyewire (Sterling, 2012)), or it is used to guide an underlying algorithm, often by highlighting regions of the solution space in which to concentrate efforts (as in Phylo (Kawrykow et al., 2012), FoldIt (Khatib et al., 2011), and Quantum Moves (Lieberoth et al., 2015)). This is in contrast to our proposed approach in which the formulation as a videogame serves to translate the research problem so as to be amenable to solution by a different class of algorithms (i.e., deep RL instead of reoptimization or traditional VFA).

3 Atari-fying the VRPSSR

Here we demonstrate the Atari-fication of the vehicle routing problem with stochastic service requests (VRPSSR), as defined in Ulmer et al. (2018).

3.1 Problem description

The agent in this problem is responsible for dynamically routing a vehicle to serve customer service requests that arise randomly throughout some service region over the course of a work day of duration $T$. The locations of potential customer requests are known in advance, but it is not known which customers will request service or when they will do so. The problem begins with the vehicle at the depot at time $0$, and it must return to the depot before the end of the day at time $T$. The agent moves the vehicle among the customers and the depot (defined to be node $0$) via edges in some underlying graph (which varies by formulation, as discussed below). We assume known fixed travel times along the edges. At each step, beginning at time $0$, the agent either instructs the vehicle to wait at its current location, or it selects an adjacent node in the graph to which to move the vehicle. The next step occurs either when the vehicle arrives at a new node (if the agent chose to move the vehicle) or after some predefined waiting time (if the agent chose to wait). If the vehicle visits a node representing a customer that is currently requesting service, then the customer is marked as having been served, and the vehicle earns some reward. The agent’s objective is to find a policy that dynamically routes the vehicle so as to maximize the expected sum of rewards.

3.2 Formulations’ graphs

The abstraction used for the problem’s underlying graph differs between the traditional and Atari-fied formulations. In the traditional formulation, the agent moves the vehicle among the customers and depot via edges in the complete graph whose vertices are the depot and the customer locations. In the Atari-fied formulation, the agent moves the vehicle along edges in a Manhattan-style grid representation of that graph, where some nodes (intersections) represent customer locations. These representations are shown in Figure 1. While the problem definition is the same for both formulations, this difference in graph yields other consequential differences. For example, if Euclidean distances and travel times are used in the traditional formulation (as is often the case), these would be Manhattan under the Atari-fied formulation. Further, while the size of the action space (the set of actions from which the agent can choose) is on the order of the number of customers under the traditional formulation (movement to each customer location, plus the depot and the wait option), it is simply five under the Atari-fied formulation (up, down, left, right, and wait). This difference in action spaces translates to additional dynamism and flexibility when using the Atari-fied formulation. This is because the movement of the vehicle from one customer to another in the traditional formulation can effectively be preempted under the Atari-fied formulation, since, with each step of the movement between the customers, the agent can alter the vehicle’s path towards a different destination. See Figure 1 for an example.

Figure 1: Graphs for the traditional (left) and Atari-fied (right) formulations of the VRPSSR. While the vehicle in the traditional formulation must continue moving directly along the arc connecting customers A and B, under the Atari-fied formulation this decision can effectively be pre-empted, as the vehicle can choose to head towards newly-requesting customer C.
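The five-action grid dynamics and the preemption behavior described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `GridVRPSSR`, the action indices, and the clamping of moves at the grid border are hypothetical choices, while the one-minute step and the reward of 10 per served customer follow the values given in §3.4.

```python
from dataclasses import dataclass, field

# Action indices are an illustrative choice, not taken from the paper.
UP, DOWN, LEFT, RIGHT, WAIT = range(5)
MOVES = {UP: (0, 1), DOWN: (0, -1), LEFT: (-1, 0), RIGHT: (1, 0), WAIT: (0, 0)}


@dataclass
class GridVRPSSR:
    """Hypothetical grid environment with a five-action space."""
    size: int
    depot: tuple
    vehicle: tuple
    active: set = field(default_factory=set)  # customers currently requesting service
    time: int = 0
    step_minutes: int = 1            # one minute per move or wait (see Section 3.4)
    reward_per_customer: float = 10.0

    def step(self, action):
        """Apply one action, advance the clock, and return the reward earned."""
        dx, dy = MOVES[action]
        # Assumption: moves that would leave the grid are clamped at the border.
        x = min(max(self.vehicle[0] + dx, 0), self.size - 1)
        y = min(max(self.vehicle[1] + dy, 0), self.size - 1)
        self.vehicle = (x, y)
        self.time += self.step_minutes
        if self.vehicle in self.active:
            self.active.remove(self.vehicle)  # customer is now served
            return self.reward_per_customer
        return 0.0


# Usage: the vehicle heads right towards a customer at (4, 2). A new request
# appears at (3, 4) mid-trip, and the agent diverts upward to serve it first.
env = GridVRPSSR(size=5, depot=(2, 2), vehicle=(2, 2), active={(4, 2)})
rewards = [env.step(RIGHT)]
env.active.add((3, 4))
rewards += [env.step(UP), env.step(UP)]
```

Because a decision is taken at every grid step, the diversion needs no special mechanism; under the complete-graph formulation the vehicle would have been committed to the arc it started on.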

3.3 State representations

The primary difference between the traditional and Atari-fied formulations is the observation of the state that the agent receives in order to perform decision-making. In the traditional formulation, the state is a tuple consisting of the vehicle’s current location, the current time, the customers currently requesting service, and the customers that have not yet requested service. However, with Atari-fication, the agent receives a visual representation of the state, similar to what one might imagine a human operator would see on a control panel. The information about the state that is displayed in this visual state representation is the same as before: the vehicle’s location (now represented on the Manhattan-grid graph), the customers currently requesting service, and the customers that may still request service. Time may also be provided in the visual representation, perhaps (as shown at left in Figure 2) using a bar, or it may be provided to the agent simply as a scalar that accompanies the visual representation.

How this information is chosen to be displayed is up to the modeler: details may vary, such as the shapes and colors used to represent the objects in the state, and the size (in pixels) of the display. In addition, in the original Atari work (Mnih et al., 2015), the authors included in the state the four most recent game screens (frames), rather than just the most recent. This was to allow the agent to “see” the motion of certain objects in the game, such as the movement of the ball in Pong — with just one frame, the agent does not know with what speed the ball is moving or whether its movement is towards or away from them; however, this is immediately apparent with the inclusion of additional frames. Thus, the modeler may also decide the number of previous frames to include in the state so as to sufficiently capture movement in the game. A comprehensive study of the influence of these choices on the agent’s ability to learn Atari-fied problems would be valuable, but is outside the scope of this work.

The visual rendering we use in our Atari-fication of the VRPSSR is shown in Figure 2. The basis of the rendering (the playable area) is a simplified depiction of the grid graph; each pixel represents a node, bordered by its adjacent nodes in that graph. Around the playable area is a thin border, and above the top border is a bar that displays the relative remaining time before the vehicle must return to the depot. The rendering is in grayscale. The depot and the customers are represented by individual pixels in the playable area. The customers currently requesting service are nearly white, while the depot and the potential customers are shades of gray. Customers that have already been served are not included in the render. The vehicle is represented by the location of the open central pixel in a white 3x3 pixel square. The drawing order of these objects, from bottom to top, is depot, potential customers, vehicle, active customers.

Based on results from early experiments, in practice we do not use the rendering directly as the state representation. Instead, we use what are known as feature layers, which show “elements of the game… isolated from each other, whilst preserving the core visual and spatial elements of the game” (Vinyals et al., 2017). Here, we use three feature layers: one for the vehicle, one for the active customers, and one for the potential customers. Each feature layer is a pixel array with value one for the pixel(s) containing the relevant features for that layer, and zero otherwise. The stack of these three layers, together with a scalar representing the percent of the remaining time, comprises the state representation seen by our agent. A depiction of the feature layer state representation is shown in Figure 2.

Figure 2: The rendering of the VRPSSR game (left) and its corresponding feature layer representation (right).
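As a concrete illustration of the feature-layer state, the sketch below builds the three binary layers and pairs them with the remaining-time scalar. The function name `feature_layers`, the `(x, y)` coordinate convention, and the row-major `grid[y][x]` indexing are assumptions made for this example; plain nested lists are used so that the snippet has no dependencies.

```python
def feature_layers(size, vehicle, active, potential, time_remaining_frac):
    """Return ([vehicle, active, potential] layers, remaining-time scalar).

    Each layer is a size x size array holding 1 where the relevant feature
    is present and 0 elsewhere.
    """
    def layer(cells):
        grid = [[0] * size for _ in range(size)]
        for x, y in cells:
            grid[y][x] = 1  # assumed convention: row index is y, column is x
        return grid

    layers = [layer([vehicle]), layer(active), layer(potential)]
    return layers, time_remaining_frac


# Usage on a small 4x4 grid (the experiments in the text use 32x32).
layers, frac = feature_layers(
    size=4,
    vehicle=(2, 1),
    active=[(0, 0), (3, 3)],
    potential=[(1, 2)],
    time_remaining_frac=0.75,
)
```

Served customers simply never appear in any layer, which mirrors their omission from the rendering.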

3.4 Computational Experiments

We test our approach on instances randomly generated in a manner motivated by Ulmer et al. (2018). We use a service region of size 256 km² (16 km x 16 km), divided into a grid with resolution 0.25 km² (grid size 0.5 km x 0.5 km), yielding a playable area of 32x32 pixels. We assume the vehicle travels at a constant speed of 30 km/hr, yielding a time of one minute to traverse each edge in the grid graph. We use this as the default waiting time as well, and we use a workday duration of minutes. When the remaining time is less than or equal to the time it would take the agent to return to the depot, we terminate the episode. This serves to ensure the resulting policy is admissible, and it also simplifies the learning process, as the agent then need only learn to serve customers and anticipate demand. The depot is centrally located, and customers are distributed among three clusters. If we take (0,0) to be the coordinates of the lower left grid cell (in pixel count), then the first cluster is centered around (8,8), the second cluster around (8,24), and the third cluster around (24,16). When the vehicle visits a customer that is requesting service, it earns a reward of 10 units.
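The arithmetic behind the grid size and the one-minute edge time, together with the termination rule, can be checked with a short sketch. The helper names and the depot coordinates (16, 16) are assumptions for illustration; the text says only that the depot is centrally located.

```python
REGION_KM = 16.0   # side length of the service region
CELL_KM = 0.5      # side length of one grid cell
SPEED_KMH = 30.0   # constant vehicle speed

cells_per_side = int(REGION_KM / CELL_KM)     # 16 / 0.5 = 32 pixels per side
edge_minutes = CELL_KM * 60.0 / SPEED_KMH     # 0.5 km at 30 km/hr = 1 minute


def manhattan(a, b):
    """Number of grid edges between two cells."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def must_terminate(vehicle, depot, time_remaining_min):
    """Episode ends once the remaining time no longer exceeds the trip home."""
    return time_remaining_min <= manhattan(vehicle, depot) * edge_minutes


# Usage: a vehicle at the third cluster center is 8 edges from an assumed
# depot at (16, 16), so it is cut off with 8 minutes left but not with 9.
depot = (16, 16)
cut_off = must_terminate((24, 16), depot, time_remaining_min=8)
still_running = not must_terminate((24, 16), depot, time_remaining_min=9)
```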

For the customer placements and request times, we begin by producing a set of customers that are requesting service at the beginning of the episode (at time 0). The number of such customers is sampled from a Poisson distribution with mean 15. Then, for each time step in , we sample a number of customers that request service during that period according to a Poisson distribution with mean . To place a customer, we first sample a cluster in which to locate it with probabilities for the first, second, and third cluster, respectively. We then sample a grid cell from around the chosen cluster’s mean from a normal distribution with a standard deviation of . We accept the customer placement if that grid cell does not already hold a customer; otherwise we repeat the draw from the normal distribution.
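The generation procedure above can be sketched as follows. The cluster centers and the initial Poisson mean of 15 come from the text; the cluster probabilities, the standard deviation, the per-step request rate, and the horizon length are not recoverable from the text, so the values below are placeholders chosen only to make the sketch runnable. Rejecting draws that fall outside the grid is also an assumption.

```python
import math
import random

GRID = 32
CLUSTER_CENTERS = [(8, 8), (8, 24), (24, 16)]  # from the text
INITIAL_MEAN = 15                               # from the text
# Placeholder values (assumptions, not the paper's settings):
CLUSTER_PROBS = [0.5, 0.3, 0.2]
CLUSTER_STD = 2.0
STEP_MEAN = 0.1
HORIZON_STEPS = 150


def poisson(rng, mean):
    """Draw a Poisson variate with Knuth's multiplication method."""
    limit = math.exp(-mean)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def place_customer(rng, occupied):
    """Pick a cluster, then redraw from its normal until a free cell is hit."""
    cx, cy = rng.choices(CLUSTER_CENTERS, weights=CLUSTER_PROBS)[0]
    while True:
        x = round(rng.gauss(cx, CLUSTER_STD))
        y = round(rng.gauss(cy, CLUSTER_STD))
        in_grid = 0 <= x < GRID and 0 <= y < GRID  # assumption: reject off-grid draws
        if in_grid and (x, y) not in occupied:
            occupied.add((x, y))
            return (x, y)


def generate_instance(seed):
    """Return a list of (request_time, (x, y)) pairs for one episode."""
    rng = random.Random(seed)
    occupied = set()
    requests = [(0, place_customer(rng, occupied))
                for _ in range(poisson(rng, INITIAL_MEAN))]
    for t in range(1, HORIZON_STEPS + 1):
        for _ in range(poisson(rng, STEP_MEAN)):
            requests.append((t, place_customer(rng, occupied)))
    return requests


requests = generate_instance(seed=0)
```

With these placeholder rates the expected number of requests per episode is 15 + 150 x 0.1 = 30; the split between initial and later requests is assumed.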

The rendering for these instances has a playable area of 32x32 pixels surrounded by a 2px-thick border and a 2px-tall time bar across the top. This yields a total rendering size of 36x38. However, we use the feature-layer state representation as described above (see Figure 2), yielding a state that is a tuple consisting of a stack of three 32x32 pixel arrays for the vehicle, the active customers, and the potential customers, along with a scalar for the percent of time still remaining. This state is used as input to the agent’s DQN, whose details are described in §A. We leverage three common deep RL enhancements: dueling (Wang et al., 2015) and double (Van Hasselt et al., 2016) DQN architecture (D3QN) with prioritized experience replay (Schaul et al., 2015).

The results of our computational experiments are summarized in Figure 3. With the described setup, we find that the agent is able to successfully learn and improve its performance in the VRPSSR environment, eventually achieving a mean score of 125.02 when averaged over the last 5000 training episodes. This score translates to serving 12.5 customers out of an average of 30. We suspect that these results do not yet compete with more traditional methods, although proper evaluation to reach this conclusion remains to be done. Ultimately, however, these preliminary results show promise in the process of Atari-fication, at least for the VRPSSR.

Figure 3: Agent’s performance over training episodes.

4 Discussion

The Atari-fication of the VRPSSR produced results showing promise of the application of this approach to other research problems. Here we discuss its benefits and limitations, then speculate on other problems for which it may be useful.

4.1 Benefits & Limitations

We begin by highlighting the flexibility of the Atari-fication approach: any problem (not just those in vehicle routing or transportation) that can be visualized and formatted like an Atari-like game is a candidate for this solution method. Further, it is likely that Atari-fication efforts for one problem will largely extend to other problems in its class. For example, nearly all VRPs share a similar structure (typically involving a depot, a vehicle, and some set of customers), so the Atari-fied representation of one VRP can likely be used with only minor modifications for another. Consider the VRPSSR as Atari-fied here. To add time-window constraints for a customer, one could simply use a different color/pixel value for that customer, depending on the time remaining in their time window (e.g., use a pixel value of 0 for a customer if their time window has not yet opened or has already closed, 10 if it is open and more than 15 minutes remain in their window, and 20 otherwise). A repository like the VRP-REP (Mendoza et al., 2014) could be established to track and share Atari-fied formulations of research problems. This shareability reduces the upfront efforts needed to assess whether Atari-fication will be a viable solution method. In addition, much like exact commercial solvers and algorithm libraries are available for traditional VRPs and other OR problems, many libraries (e.g., Keras (Chollet et al., 2015), OpenAI Baselines (Dhariwal et al., 2017)) exist to build and execute a deep RL agent (the proposed solution method), given an Atari-fied problem representation. The approach also lends itself naturally to dynamic decision making in the context of problems that involve frequent revealings of uncertainty. Atari-fication thus offers a new approach with which to solve such problems, which are often more difficult to solve using traditional methods (we give examples of such problems in §4.2).
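The time-window encoding suggested above (pixel values 0, 10, and 20) amounts to a small mapping from the current time to a pixel value. The function name and argument names below are hypothetical; the thresholds are the ones given in the example in the text.

```python
def time_window_pixel(now, window_open, window_close, urgent_minutes=15):
    """Pixel value for a customer, given the state of their time window.

    0  -> window not yet open, or already closed
    10 -> window open with more than `urgent_minutes` remaining
    20 -> window open with `urgent_minutes` or less remaining
    """
    if now < window_open or now > window_close:
        return 0
    if window_close - now > urgent_minutes:
        return 10
    return 20


# Usage for a customer whose window runs from minute 10 to minute 60.
before_open = time_window_pixel(5, 10, 60)
plenty_of_time = time_window_pixel(20, 10, 60)
closing_soon = time_window_pixel(50, 10, 60)
after_close = time_window_pixel(61, 10, 60)
```

A customer would be drawn into its layer with this value in place of the constant 1 used for the VRPSSR, leaving the rest of the representation unchanged.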

The approach is not without limitations, however. First, it seems likely that this approach would not apply in multi-agent (e.g., multi-vehicle) contexts, since the pairing of agents’ actions to specific controllable entities on the screen would be difficult for agents to interpret. While it is possible this approach would still work, such an environment would be quite different from that of the Atari games on which the method has been tested and successfully demonstrated. Next, distance matrices may not always be preserved when converting from the traditional graph representation to the grid graph required for the Atari-fied formulation. In such cases, the solutions to the Atari-fied formulation will serve as approximations or (if properly modeled) bounds for the traditional formulation of the problem. For some applications, this may prohibit the use of Atari-fication. Lastly, we note that the proposed approach is perhaps not as radical as its name may imply. As alluded to in §3, the approach may be more generally interpreted not as “Atari-fication,” but rather as a reformulation of the state so as to be amenable to a specific class of (deep RL) solution methods. This is analogous to how, in operations research, a researcher may choose to remodel the math program for a particular problem so as to accommodate solution via, e.g., column generation or Benders decomposition. Under this interpretation, our work here simply highlights one such opportunity for problems modeled as Markov decision processes.

4.2 Additional Opportunities

Atari-fication may be applicable to many problems involving sequential decision making under uncertainty. In particular, problems that are naturally visualizable are strong candidates for Atari-fying. Several NP-complete problems that form the basis for many practical research problems have this characteristic of being visualizable, such as the traveling salesman problem (TSP) and the knapsack problem (KP). The visualizability of the TSP may be exploited to solve many stochastic and dynamic VRP variants, as we demonstrated here with the VRPSSR. One can also imagine intuitive Atari-fied formulations of VRPs involving, e.g., pickups and deliveries, time windows, or stochastic travel times. The KP and related problems in scheduling and bin packing may lend themselves to Atari-fications that resemble Tetris, where the player is responsible for arranging newly-arriving pieces (representing, e.g., jobs) in some area on the screen (representing one or more machines). Opportunities for additional constraints and uncertainties to be captured in these Atari-fications include machines’ capacities and availabilities, as well as jobs’ resource demands, objective values, and durations. Given the breadth of applications of TSP- and KP-like problems alone — arising in transportation, manufacturing, energy, and healthcare — Atari-fication may serve researchers in many fields.

5 Conclusion

We present a new general approach to modeling research problems as Atari-like videogames to make them amenable to recent groundbreaking solution methods from the deep reinforcement learning community. The approach is flexible, applicable to a wide range of problems. We demonstrate its application on a well known vehicle routing problem. Our preliminary results on this problem, though not transformative, show signs of success and suggest that Atari-fication may be a useful modeling approach for researchers studying problems involving sequential decision making under uncertainty.


The authors especially thank Clément Grodecoeur for his efforts in the development and prototyping of the VRPSSR game.


  • Bent and Van Hentenryck (2004) Russell W Bent and Pascal Van Hentenryck. Scenario-based planning for partially dynamic vehicle routing with stochastic customers. Operations Research, 52(6):977–987, 2004.
  • Brockman et al. (2016) Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym, 2016.
  • Caicedo and Lazebnik (2015) Juan C Caicedo and Svetlana Lazebnik. Active object localization with deep reinforcement learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 2488–2496, 2015.
  • Chollet et al. (2015) François Chollet et al. Keras, 2015.
  • Dhariwal et al. (2017) Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, Yuhuai Wu, and Peter Zhokhov. OpenAI Baselines, 2017.
  • Farley and Clark (1954) BWAC Farley and W Clark. Simulation of self-organizing systems by digital computer. Transactions of the IRE Professional Group on Information Theory, 4(4):76–84, 1954.
  • Gendreau et al. (1999) Michel Gendreau, Francois Guertin, Jean-Yves Potvin, and Eric Taillard. Parallel tabu search for real-time vehicle routing and dispatching. Transportation science, 33(4):381–390, 1999.
  • He et al. (2015) Ji He, Jianshu Chen, Xiaodong He, Jianfeng Gao, Lihong Li, Li Deng, and Mari Ostendorf. Deep reinforcement learning with a natural language action space. arXiv preprint arXiv:1511.04636, 2015.
  • Heinrich and Silver (2016) Johannes Heinrich and David Silver. Deep reinforcement learning from self-play in imperfect-information games. arXiv preprint arXiv:1603.01121, 2016.
  • Kawrykow et al. (2012) Alexander Kawrykow, Gary Roumanis, Alfred Kam, Daniel Kwak, Clarence Leung, Chu Wu, Eleyine Zarour, Luis Sarmenta, Mathieu Blanchette, Jérôme Waldispühl, et al. Phylo: a citizen science approach for improving multiple sequence alignment. PloS one, 7(3):e31362, 2012.
  • Khatib et al. (2011) Firas Khatib, Seth Cooper, Michael D Tyka, Kefan Xu, Ilya Makedon, Zoran Popović, and David Baker. Algorithm discovery by protein folding game players. Proceedings of the National Academy of Sciences, 108(47):18949–18953, 2011.
  • Kullman et al. (2019) Nicholas Kullman, Martin Cousineau, Jorge Mendoza, and Justin Goodson. Control of autonomous electric fleets for ridehail systems. In 7th INFORMS Transportation Science and Logistics Society Workshop, pages 192–195. INFORMS Transportation Science and Logistics Society, 2019.
  • Lample and Chaplot (2017) Guillaume Lample and Devendra Singh Chaplot. Playing FPS games with deep reinforcement learning. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
  • Lieberoth et al. (2015) Andreas Lieberoth, Mads Kock Pedersen, Andreea Catalina Marin, Tilo Planke, and Jacob Friis Sherson. Getting humans to do quantum optimization-user acquisition, engagement and early results from the citizen cyberscience game quantum moves. arXiv preprint arXiv:1506.08761, 2015.
  • Liu et al. (2017) Ying Liu, Brent Logan, Ning Liu, Zhiyuan Xu, Jian Tang, and Yangzhi Wang. Deep reinforcement learning for dynamic treatment regimes on medical registry data. In 2017 IEEE International Conference on Healthcare Informatics (ICHI), pages 380–385. IEEE, 2017.
  • Mendoza et al. (2014) Jorge E Mendoza, C Guéret, M Hoskins, H Lobit, V Pillac, T Vidal, and D Vigo. VRP-REP: the vehicle routing community repository. In Third Meeting of the EURO Working Group on Vehicle Routing and Logistics Optimization (VeRoLog). Oslo, Norway, 2014.
  • Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
  • Mocanu et al. (2018) Elena Mocanu, Decebal Constantin Mocanu, Phuong H Nguyen, Antonio Liotta, Michael E Webber, Madeleine Gibescu, and Johannes G Slootweg. On-line building energy optimization using deep reinforcement learning. IEEE Transactions on Smart Grid, 2018.
  • Sallab et al. (2017) Ahmad EL Sallab, Mohammed Abdou, Etienne Perot, and Senthil Yogamani. Deep reinforcement learning framework for autonomous driving. Electronic Imaging, 2017(19):70–76, 2017.
  • Schaul et al. (2015) Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.
  • Silver et al. (2018) David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362(6419):1140–1144, 2018.
  • Sterling (2012) Amy Sterling. Eyewire, Dec 2012. URL Accessed on November 11, 2019.
  • Sutton et al. (1998) Richard S Sutton, Andrew G Barto, et al. Introduction to reinforcement learning, volume 2. MIT press Cambridge, 1998.
  • Toth and Vigo (2002) Paolo Toth and Daniele Vigo. The vehicle routing problem. SIAM, 2002.
  • Ulmer et al. (2018) Marlin W Ulmer, Justin C Goodson, Dirk C Mattfeld, and Marco Hennig. Offline–online approximate dynamic programming for dynamic vehicle routing with stochastic requests. Transportation Science, 53(1):185–202, 2018.
  • Van Hasselt et al. (2016) Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning. In Thirtieth AAAI conference on artificial intelligence, 2016.
  • Vinyals et al. (2017) Oriol Vinyals, Stephen Gaffney, and Timo Ewalds. DeepMind and Blizzard open StarCraft II as an AI research environment, Aug 2017. URL Accessed on November 11, 2019.
  • Vinyals et al. (2019) Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature, pages 1–5, 2019.
  • Wang et al. (2015) Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Van Hasselt, Marc Lanctot, and Nando De Freitas. Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581, 2015.

Appendix A Additional Agent Details


The agent follows an $\epsilon$-greedy policy with an initial $\epsilon$ value of 1.0 (choosing actions entirely at random) and a final value of 0.1, with $\epsilon$ decayed linearly over 1 million time steps. Future Q-values are discounted using a discount factor of .
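The linear exploration schedule described above (1.0 down to 0.1 over 1 million steps) and the corresponding action choice can be sketched as below. The function names are hypothetical, and holding the rate at its final value once the decay ends is an assumption.

```python
import random


def epsilon_at(step, start=1.0, end=0.1, decay_steps=1_000_000):
    """Linearly interpolate the exploration rate, then hold it at `end`."""
    frac = min(max(step / decay_steps, 0.0), 1.0)
    return start + frac * (end - start)


def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


# Usage: halfway through the decay the agent explores 55% of the time.
halfway = epsilon_at(500_000)
greedy_action = epsilon_greedy([1.0, 3.0, 2.0], epsilon=0.0, rng=random.Random(0))
```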

Memory & training.

The agent has a memory capacity of 1 million steps. It begins training after it has observed 10k steps, at which point it undergoes training every 16 steps using (proportional) prioritized experience replay (PER) with a batch size of 32. We use PER hyperparameters $\alpha$ and $\beta$, with $\beta$ annealed to 1.0 over 600k steps.
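For readers unfamiliar with proportional PER, the sketch below shows how priorities become sampling probabilities and importance-sampling weights, following the general scheme of Schaul et al. (2015). The function names are hypothetical, and the starting values of the two hyperparameters used in the usage lines are placeholders, since the values used in the experiments are not recoverable from the text. Practical implementations typically use a sum-tree for efficient sampling; a flat list is used here for clarity.

```python
import random


def per_probabilities(priorities, alpha):
    """Sampling probability of each transition: p_i^alpha / sum_k p_k^alpha."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]


def importance_weights(probs, indices, beta):
    """Weights (N * P(i))^-beta for the sampled batch, scaled so the max is 1."""
    n = len(probs)
    weights = [(n * probs[i]) ** (-beta) for i in indices]
    top = max(weights)
    return [w / top for w in weights]


def beta_at(step, beta_start, anneal_steps=600_000):
    """Linearly anneal beta from `beta_start` to 1.0 over `anneal_steps`."""
    frac = min(step / anneal_steps, 1.0)
    return beta_start + frac * (1.0 - beta_start)


# Usage with placeholder hyperparameters (assumptions, not the paper's values).
rng = random.Random(0)
priorities = [1.0, 2.0, 1.0]
probs = per_probabilities(priorities, alpha=1.0)
batch = rng.choices(range(len(probs)), weights=probs, k=2)
weights = importance_weights(probs, batch, beta=beta_at(0, beta_start=0.4))
```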


We use a dueling double DQN (D3QN), where the primary and target networks are constructed as shown in Figure 4. We update the target network every steps. We use the RMSprop optimizer with a learning rate of


Figure 4: The agent’s dueling DQN.
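Two pieces of the D3QN setup can be illustrated without a neural network library: how a dueling head combines its value and advantage streams, and how the double DQN target uses the primary network to choose an action and the target network to evaluate it. The sketch follows the mean-subtraction aggregation of Wang et al. (2015) and the target of Van Hasselt et al. (2016) in their general form; the function names and the discount factor in the usage lines are assumptions, and the layer structure of Figure 4 is not reproduced.

```python
def dueling_q(value, advantages):
    """Combine streams as Q(s, a) = V(s) + A(s, a) - mean over a' of A(s, a')."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]


def double_dqn_target(reward, next_q_primary, next_q_target, gamma, done):
    """Target for one transition.

    The primary network selects the next action; the target network
    supplies the value of that action.
    """
    if done:
        return reward
    best = max(range(len(next_q_primary)), key=next_q_primary.__getitem__)
    return reward + gamma * next_q_target[best]


# Usage with made-up numbers (gamma = 0.5 is a placeholder for readability).
q_values = dueling_q(1.0, [0.0, 2.0, 4.0])
target = double_dqn_target(
    reward=10.0,
    next_q_primary=[1.0, 5.0, 2.0],
    next_q_target=[0.5, 0.25, 9.0],
    gamma=0.5,
    done=False,
)
```

In the usage lines the primary network prefers action 1, so the target uses the target network's estimate for action 1 (0.25) rather than its own maximum (9.0), which is the mechanism that reduces overestimation.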