1 Introduction
The recent advances of supervised deep learning techniques (LeCun et al., 2015) in computer vision, speech recognition and natural language processing have tremendously improved the performance on challenging tasks, including image processing (Krizhevsky et al., 2012), speech-based translation (Sutskever et al., 2014) and language modeling (Hinton et al., 2012). The core idea of deep learning is to use artificial neural networks to model complex hierarchical or compositional data abstractions and representations from raw input data (Bengio et al., 2013). However, we are still far from building intelligent solutions for many real-world challenges, such as autonomous driving, human-computer interaction and automated decision making, in which software agents need to consider interactions with a dynamic environment and take actions towards goals. Reinforcement learning (Bertsekas & Tsitsiklis, 1996; Powell, 2011; Sutton & Barto, 1998; Kaelbling et al., 1996) studies these problems and algorithms which learn policies to make decisions so as to maximize a reward signal from the environment. One of the promising algorithms is Q-learning (Watkins, 1989; Watkins & Dayan, 1992). Deep reinforcement learning with neural function approximation (Tsitsiklis & Roy, 1997; Riedmiller, 2005; Mnih et al., 2013, 2015), possibly a first attempt to combine deep learning and reinforcement learning, has proven effective on a few problems which classical AI approaches were unable to solve. Notable examples of deep reinforcement learning include human-level game playing (Mnih et al., 2015) and AlphaGo (Silver et al., 2016).
Despite these successes, its high demand for computational resources makes deep reinforcement learning not yet applicable to many real-world problems. For example, even for an Atari game, the deep Q-learning algorithm (also called deep Q-networks, abbreviated as DQN) needs to play up to hundreds of millions of game frames to achieve a reasonable performance (van Hasselt et al., 2015). AlphaGo trained its model using a database of game records of advanced players and, in addition, about 30 million self-played game moves (Silver et al., 2016). The sheer amount of computational resources required by current deep reinforcement learning algorithms is a major bottleneck for their applicability to real-world tasks. Moreover, in many tasks, the reward signal is sparse and delayed, thus making the convergence of learning even slower.
Here we propose optimality tightening, a new technique to accelerate deep Q-learning by fast reward propagation. While current deep Q-learning algorithms rely on a set of experience replays, they only consider a single forward step for the Bellman optimality error minimization, which becomes highly inefficient when the reward signal is sparse and delayed. To better exploit long-term high-reward strategies from past experience, we design a new algorithm to capture rewards from both forward and backward steps of the replays via a constrained optimization approach. This encourages faster reward propagation which reduces the training time of deep Q-learning.
We evaluate our proposed approach using the Arcade Learning Environment (Bellemare et al., 2013) and show that our new strategy outperforms competing techniques in both accuracy and training time on 30 out of 49 games despite being trained with significantly fewer data frames.
2 Related Work
There have been a number of approaches improving the stability, convergence and runtime of deep reinforcement learning since deep Q-learning, also known as deep Q-network (DQN), was first proposed (Mnih et al., 2013, 2015). DQN combined techniques such as deep learning, reinforcement learning and experience replays (Lin, 1992; Wawrzynski, 2009).
Nonetheless, the original DQN algorithm required millions of training steps to achieve human-level performance on Atari games. To improve the stability, recently, double Q-learning was combined with deep neural networks, with the goal to alleviate the overestimation issue observed in Q-learning (Thrun & Schwartz, 1993; van Hasselt, 2010; van Hasselt et al., 2015). The key idea is to use two Q-networks for the action selection and Q-function value calculation, respectively. The greedy action of the target is first chosen using the current Q-network parameters, then the target value is computed using a set of parameters from a previous iteration. Another notable advance is “prioritized experience replay” (Schaul et al., 2016) or “prioritized sweeping” for deep Q-learning. The idea is to increase the replay probability of experience tuples that have a high expected learning progress measured by temporal difference errors.
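The double Q-learning target described above can be sketched as follows. This is a minimal illustration, not the implementation of van Hasselt et al. (2015): the function name and the plain-list representation of Q-values are assumptions made for this example.

```python
def double_q_target(reward, next_q_current, next_q_previous, gamma, terminal):
    """Double Q-learning target for a single transition.

    next_q_current:  Q-values of the next state under the current parameters
                     (used only to select the greedy action).
    next_q_previous: Q-values of the next state under parameters from a
                     previous iteration (used only to evaluate that action).
    """
    if terminal:
        return reward  # no future value after a terminal state
    greedy_action = max(range(len(next_q_current)),
                        key=lambda a: next_q_current[a])
    return reward + gamma * next_q_previous[greedy_action]
```

With `next_q_current = [0.2, 0.9]` and `next_q_previous = [5.0, 3.0]`, the target uses the value 3.0 of the selected action rather than the maximum 5.0, which is how the overestimation is reduced.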
In addition to the aforementioned variants of Q-learning, other network architectures have been proposed. The dueling network architecture applies an extra network structure to learn the importance of states and uses advantage functions (Wang et al., 2015). A distributed version of the deep actor-critic algorithm without experience replay was introduced very recently (Mnih et al., 2016). It deploys multiple threads learning directly from current transitions. The approach is applicable to both value-based and policy-based methods, off-policy as well as on-policy methods, and in discrete as well as in continuous domains. The model-free episodic control approach evaluates state-action pairs based on episodic memory using k-nearest neighbors with hashing functions (Blundell et al., 2016). Bootstrapped deep Q-learning carries out temporally-extended (or deep) exploration, thus leading to much faster learning (Osband et al., 2016).
Our fast reward propagation differs from all of the aforementioned approaches. The key idea of our method is to propagate delayed and sparse rewards during Q-network training, and thus greatly improve the efficiency and performance. We formulate this propagation step via a constrained program. Note that our program is also different from earlier work on off-policy algorithms with eligibility traces (Munos et al., 2016; Watkins, 1989), which have been recently shown to perform poorly when used for training deep Q-networks on Atari games.
3 Background
Reinforcement learning considers agents which are able to take a sequence of actions in an environment. By taking actions and experiencing at most one scalar reward per action, their task is to learn a policy which allows them to act such that a high cumulative reward is obtained over time.
More precisely, consider an agent operating over time $t \in \{1, \ldots, T\}$. At time $t$ the agent is in an environment state $s_t$ and reacts upon it by choosing action $a_t \in \mathcal{A}$. The agent will then observe a new state $s_{t+1}$ and receive a numerical reward $r_t \in \mathbb{R}$. Throughout, we assume the set of possible actions, i.e., the set $\mathcal{A}$, to be discrete.
A well established technique to address the aforementioned reinforcement learning task is Q-learning (Watkins, 1989; Watkins & Dayan, 1992). Generally, Q-learning algorithms maintain an action-value function, often also referred to as Q-function, $Q(s, a)$. Given a state $s$, the action-value function provides a ‘value’ for each action $a \in \mathcal{A}$ which estimates the expected future reward if action $a$ is taken. The estimated future reward is computed based on the current state or a series of past states if available.
The core idea of Q-learning is the use of the Bellman equation as a characterization of the optimal future reward function $Q^*$ via a state-action-value function
$$Q^*(s_t, a) = \mathbb{E}\left[ r_t + \gamma \max_{a'} Q^*(s_{t+1}, a') \right]. \qquad (1)$$
Hereby the expectation is taken w.r.t. the distribution of state $s_{t+1}$ and reward $r_t$ obtained after taking action $a$, and $\gamma$ is a discount factor. Intuitively, reward for taking action $a$ plus best future reward should equal the best total return from the current state.
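To make the fixed-point reading of the Bellman equation in Eq. (1) concrete, the following minimal sketch repeatedly applies its right-hand side to a tabular Q-function. The three-state deterministic chain, its rewards, and all names are hypothetical and chosen only for illustration; the environment is deterministic, so the expectation reduces to a single term.

```python
# Hypothetical 3-state deterministic chain: states 0, 1, 2 (state 2 is terminal).
# Actions: 0 = stay, 1 = move right. Reward 1.0 only when entering state 2.
GAMMA = 0.9
N_STATES, N_ACTIONS, TERMINAL = 3, 2, 2


def step(state, action):
    """Deterministic transition: returns (next_state, reward)."""
    next_state = min(state + action, TERMINAL)
    reward = 1.0 if (next_state == TERMINAL and state != TERMINAL) else 0.0
    return next_state, reward


def bellman_backup(q):
    """Apply the right-hand side of the Bellman equation to every (state, action)."""
    new_q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for s in range(N_STATES):
        if s == TERMINAL:
            continue  # no future reward from the terminal state
        for a in range(N_ACTIONS):
            s_next, r = step(s, a)
            future = 0.0 if s_next == TERMINAL else max(q[s_next])
            new_q[s][a] = r + GAMMA * future
    return new_q


q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
for _ in range(50):
    q = bellman_backup(q)
```

After the iterations, `q` no longer changes under `bellman_backup`, i.e., it satisfies the equation: moving right from state 1 is worth the immediate reward, and every additional step away from the goal discounts that value by the factor `GAMMA`.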
The choice of Q-function is crucial for the success of Q-learning algorithms. While classical methods use linear Q-functions based on a set of hand-crafted features of the state, more recent approaches use nonlinear deep neural networks to automatically mine intermediate features from the state (Riedmiller, 2005; Lange & Riedmiller, 2010; Mnih et al., 2013, 2015). This change has been shown to be very effective for many applications of reinforcement learning. However, automatic mining of intermediate representations comes at a price: larger quantities of data and more computational resources are required. Even though it is sometimes straightforward to extract large amounts of data, e.g., when training on video games, for successful optimization, it is crucial that the algorithms operate on uncorrelated samples from a dataset for stability. A technique called “experience replay” (Lin, 1992; Wawrzynski, 2009) encourages this property and quickly emerged as a standard step in the well-known deep Q-learning framework (Mnih et al., 2013, 2015). Experience replays are stored as a dataset $\mathcal{D}$ which contains state-action-reward-future state tuples $(s_j, a_j, r_j, s_{j+1})$, including past observations from previous plays.
The characterization of optimality given in Eq. (1) combined with an “experience replay” dataset $\mathcal{D}$ results in the following iterative algorithmic procedure (Mnih et al., 2013, 2015): start an episode in the initial state $s_0$; sample a mini-batch of tuples $\mathcal{B} = \{(s_j, a_j, r_j, s_{j+1})\} \subseteq \mathcal{D}$; compute and fix the targets $y_j = r_j + \gamma \max_a Q_{\theta^-}(s_{j+1}, a)$ for each tuple using a recent estimate $Q_{\theta^-}$ (the maximization is only considered if $s_{j+1}$ is not a terminal state); update the Q-function by optimizing the following program w.r.t. the parameters $\theta$, typically via stochastic gradient descent:
$$\min_\theta \sum_{(s_j, a_j, r_j, s_{j+1}) \in \mathcal{B}} \left( Q_\theta(s_j, a_j) - y_j \right)^2. \qquad (2)$$
After having updated the parameters of the Q-function we perform an action simulation either choosing an action at random with a small probability $\epsilon$, or by following the strategy $\arg\max_a Q_\theta(s_t, a)$ which is currently estimated. This strategy is also called the $\epsilon$-greedy policy. We then obtain the actual reward $r_t$. Subsequently we augment the replay memory with the new tuple $(s_t, a_t, r_t, s_{t+1})$ and continue the simulation until this episode terminates or reaches an upper limit of steps, and we restart a new episode. When optimizing w.r.t. the parameter $\theta$, a recent Q-network is used to compute the target $y_j = r_j + \gamma \max_a Q_{\theta^-}(s_{j+1}, a)$. This technique is referred to as ‘semi-gradient descent,’ i.e., the dependence of the target on the parameter $\theta$ is ignored.
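The three ingredients of this procedure (greedy action selection with occasional random exploration, fixed one-step targets, and the squared cost) can be sketched in a few lines. This is a simplified illustration operating on plain lists of Q-values rather than on a neural network; the function names are assumptions made for this example.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def one_step_target(reward, next_q_values, gamma, terminal):
    """Reward plus discounted best next value under a recent Q estimate.

    The maximization is skipped when the next state is terminal.
    """
    if terminal:
        return reward
    return reward + gamma * max(next_q_values)


def squared_bellman_cost(predictions, targets):
    """Sum of squared differences between predictions and fixed targets."""
    return sum((q - y) ** 2 for q, y in zip(predictions, targets))
```

Because the targets are computed once and then treated as constants, a gradient step on `squared_bellman_cost` only differentiates through the predictions, which corresponds to the semi-gradient view described above.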
4 Fast Reward Propagation via Optimality Tightening
Investigating the cost function given in Eq. (2) more carefully, we observe that it operates on a set of short one-step sequences, each characterized by the tuple $(s_j, a_j, r_j, s_{j+1})$. Intuitively, each step encourages an update of the parameters $\theta$, such that the action-value function for the chosen action $a_j$, i.e., $Q_\theta(s_j, a_j)$, is closer to the obtained reward plus the best achievable future value, i.e., $y_j$. As we expect from the Bellman optimality equation, it is instructive to interpret this algorithm as propagating reward information from time $j+1$ backwards to time $j$.
To understand the shortcomings of this procedure consider a situation where the agent only receives a sparse and delayed reward once reaching a target in a maze. Further let $d$ characterize the shortest path from the agent's initial position to the target. For a long time, no real reward is available and the aforementioned algorithm propagates randomly initialized future rewards. Once the target is reached, real reward information is available. Due to the cost function and its property of propagating reward time-step by time-step, it is immediately apparent that it takes at least an additional $O(d)$ iterations until the observed reward impacts the initial state.
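The step-by-step propagation can be observed in a toy setting. The sketch below is an assumption-laden simplification: the maze is replaced by a single-action chain whose length equals the shortest path, values are stored in a table, and each sweep uses a frozen copy of the previous values (mimicking targets computed from a recent estimate). The function name is hypothetical.

```python
def sweeps_until_start_sees_reward(path_length, gamma=0.99):
    """Count one-step backup sweeps until the start state has a non-zero value.

    The chain has `path_length` transitions; reward 1.0 is obtained only on
    the final transition into the goal.
    """
    values = [0.0] * (path_length + 1)  # values[path_length] is the goal
    sweeps = 0
    while values[0] == 0.0:
        old = values[:]  # frozen copy, akin to targets from a recent estimate
        for s in range(path_length):
            reaches_goal = (s + 1 == path_length)
            reward = 1.0 if reaches_goal else 0.0
            future = 0.0 if reaches_goal else old[s + 1]
            values[s] = reward + gamma * future
        sweeps += 1
    return sweeps
```

In this toy chain the number of sweeps equals the path length, since each sweep moves the reward information back by exactly one state.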
In the following we propose a technique which increases the speed of propagation and achieves improved convergence for deep Q-learning. We achieve this improvement by taking advantage of longer state-action-reward sequences which are readily available in the “experience replay memory.” Not only do we propagate information from time instances in the future to our current state, but we also pass information from states several steps in the past. Even though we expect to see substantial improvements on sequences where rewards are sparse or only available at terminal states, we also demonstrate significant speedups for situations where rewards are obtained frequently. This is intuitive as the Q-function represents an estimate for any reward encountered in the future. Faster propagation of future and past rewards to a particular state is therefore desirable.
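Subsequently we discuss our technique for fast reward propagation, a new deep Q-learning algorithm that exploits longer state-transitions in experience replays by tightening the optimization via constraints. From the Bellman optimality equation we know that the following series of equalities hold for the optimal Q-function $Q^*$:
$$Q^*(s_j, a_j) = r_j + \gamma \max_a Q^*(s_{j+1}, a) = r_j + \gamma \max_a \left[ r_{j+1} + \gamma \max_{a'} \left[ r_{j+2} + \gamma \max_{\tilde{a}} Q^*(s_{j+3}, \tilde{a}) \right] \right].$$
Evaluating such a sequence exactly is not possible in a reinforcement learning setting since the enumeration of intermediate states $s_{j+i}$ requires exponential time complexity $O(|\mathcal{A}|^i)$. It is however possible to take advantage of the episodes available in the replay memory $\mathcal{D}$ by noting that the following sequence of inequalities holds for the optimal action-value function $Q^*$ (with the greedy policy), irrespective of whether the policy $\pi$ generating the sequence of actions $a_j$, $a_{j+1}$, etc., which results in rewards $r_j$, $r_{j+1}$, etc. is optimal or not:
$$Q^*(s_j, a_j) = r_j + \gamma \max_a Q^*(s_{j+1}, a) \geq \ldots \geq \sum_{i=0}^{k} \gamma^i r_{j+i} + \gamma^{k+1} \max_a Q^*(s_{j+k+1}, a) = L^*_{j,k}.$$
Note the definition of the lower bounds $L^*_{j,k}$ for sample $j$ and time horizon $k$ in the aforementioned series of inequalities.
We can also use this series of inequalities to define upper bounds. To see this note that
$$Q^*(s_{j-k-1}, a_{j-k-1}) - \sum_{i=0}^{k} \gamma^i r_{j-k-1+i} - \gamma^{k+1} Q^*(s_j, a_j) \geq 0,$$
which follows from the definition of the lower bound by dropping the maximization over the actions, and a change of indices from $j \to j-k-1$. Reformulating the inequality yields an upper bound $U^*_{j,k}$ for sample $j$ and time horizon $k$ as follows:
$$U^*_{j,k} = \gamma^{-k-1} Q^*(s_{j-k-1}, a_{j-k-1}) - \sum_{i=0}^{k} \gamma^{i-k-1} r_{j-k-1+i} \geq Q^*(s_j, a_j).$$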
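As a hedged illustration of how such bounds could be evaluated from a stored trajectory, the sketch below assumes the multi-step form obtained by unrolling the Bellman equation along the recorded rewards: a lower bound from the next k+1 rewards plus a discounted bootstrap value, and an upper bound from the value of a state k+1 steps in the past minus the rewards collected since then. The list-based inputs and function names are assumptions for this example.

```python
def lower_bound(rewards, max_q, j, k, gamma):
    """Discounted rewards r_j..r_{j+k} plus the bootstrapped value at step j+k+1.

    rewards[t] is the reward at step t; max_q[t] is the maximum over actions
    of the current Q estimate at state t (0.0 for a terminal state).
    """
    discounted = sum(gamma ** i * rewards[j + i] for i in range(k + 1))
    return discounted + gamma ** (k + 1) * max_q[j + k + 1]


def upper_bound(rewards, taken_q, j, k, gamma):
    """Value at step j-k-1 minus rewards collected since, rescaled to step j.

    taken_q[t] is the current Q estimate for the action actually taken at step t.
    """
    start = j - k - 1
    discounted = sum(gamma ** i * rewards[start + i] for i in range(k + 1))
    return (taken_q[start] - discounted) / gamma ** (k + 1)


# Hypothetical 3-step episode with a single reward at the last step.
rewards = [0.0, 0.0, 1.0]
max_q = [0.0, 0.0, 0.0, 0.0]   # untrained estimate; index 3 is terminal
taken_q = [1.0, 0.5, 0.25]     # hypothetical estimates for the taken actions
one_step = lower_bound(rewards, max_q, j=0, k=0, gamma=0.5)
three_step = lower_bound(rewards, max_q, j=0, k=2, gamma=0.5)
```

Here `one_step` carries no reward information because the estimate is still uninformative, whereas `three_step` already reflects the discounted reward observed two steps later.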
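In contrast to classical techniques which optimize the Bellman criterion given in Eq. (2), we propose to optimize the Bellman equation subject to constraints $Q_\theta(s_j, a_j) \geq L_j^{\max} = \max_{k \in \{1, \ldots, K\}} L_{j,k}$, which defines the largest lower bound, and $Q_\theta(s_j, a_j) \leq U_j^{\min} = \min_{k \in \{1, \ldots, K\}} U_{j,k}$, which specifies the smallest upper bound. Hereby, $L_{j,k}$ and $U_{j,k}$ are computed using the Q-function $Q_{\theta^-}$ with a recent estimated parameter $\theta^-$ rather than the unknown optimal Q-function $Q^*$, and the integer $K$ specifies the number of future and past time steps which are considered. Also note that the target used in the Bellman equation is obtained from $y_j = L_{j,0} = r_j + \gamma \max_a Q_{\theta^-}(s_{j+1}, a)$. In this way, we ignore the dependence of the bounds and the target on the parameter $\theta$ to stabilize the training. Taking all the aforementioned definitions into account, we propose the following program for reinforcement learning tasks:
$$\min_\theta \sum_{(s_j, a_j, r_j, s_{j+1}) \in \mathcal{B}} \left( Q_\theta(s_j, a_j) - y_j \right)^2 \quad \text{s.t.} \quad \begin{cases} Q_\theta(s_j, a_j) \geq L_j^{\max} & \forall (s_j, a_j) \in \mathcal{B} \\ Q_\theta(s_j, a_j) \leq U_j^{\min} & \forall (s_j, a_j) \in \mathcal{B}. \end{cases} \qquad (3)$$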
This program differs from the classical approach given in Eq. (2) via the constraints, which is crucial. Intuitively, the constraints encourage faster reward propagation as we show next, and result in tremendously better results as we will demonstrate empirically in Sec. 5.
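Before doing so we describe our optimization procedure for the constrained program in Eq. (3) more carefully. The cost function is generally non-convex in the parameters $\theta$, and so are the constraints. We therefore make use of a quadratic penalty method to reformulate the program into
$$\min_\theta \sum_{(s_j, a_j, r_j, s_{j+1}) \in \mathcal{B}} \left[ \left( Q_\theta(s_j, a_j) - y_j \right)^2 + \lambda \left( L_j^{\max} - Q_\theta(s_j, a_j) \right)_+^2 + \lambda \left( Q_\theta(s_j, a_j) - U_j^{\min} \right)_+^2 \right], \qquad (4)$$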
where $\lambda$ is a penalty coefficient and $(x)_+ = \max(0, x)$ is the rectifier function. Augmenting the cost function with $\lambda (L_j^{\max} - Q_\theta(s_j, a_j))_+^2$ and/or $\lambda (Q_\theta(s_j, a_j) - U_j^{\min})_+^2$ results in a penalty whenever any optimality bounding constraint gets violated. The quadratic penalty function is chosen for simplicity. The penalty coefficient $\lambda$ can be set as a large positive value or adjusted in an annealing scheme during training. In this work, we fix its value, due to time constraints. We optimize this cost function with stochastic (sub-)gradient descent using an experience replay memory from which we randomly draw samples, as well as their successors and predecessors. We emphasize that the derivatives correcting the prediction of $Q_\theta(s_j, a_j)$ not only depend on the Q-function from the immediately successive time step stored in the experience replay memory, but also on more distant time instances if constraints are violated. Our proposed formulation and the resulting optimization technique hence encourage faster reward propagation, and the number of time steps depends on the constant $K$ and the quality of the current Q-function. We summarize the proposed method in Algorithm 1.
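The quadratic-penalty cost for a single sample can be sketched as follows. This is a minimal scalar illustration under the assumption that the prediction, target and bounds are given as plain numbers; the function names and the penalty value used in the usage note are hypothetical, and the (sub-)gradient is taken with respect to the prediction only, with target and bounds held fixed.

```python
def rectifier(x):
    """Rectifier: returns x if positive, else 0."""
    return max(0.0, x)


def penalized_cost(q, target, lower_max, upper_min, penalty):
    """Squared Bellman error plus quadratic penalties for violated bounds."""
    bellman = (q - target) ** 2
    below = rectifier(lower_max - q) ** 2   # active if q < largest lower bound
    above = rectifier(q - upper_min) ** 2   # active if q > smallest upper bound
    return bellman + penalty * (below + above)


def penalized_cost_grad(q, target, lower_max, upper_min, penalty):
    """(Sub-)derivative of the cost w.r.t. the prediction q."""
    return (2.0 * (q - target)
            - 2.0 * penalty * rectifier(lower_max - q)
            + 2.0 * penalty * rectifier(q - upper_min))
```

For example, with `q = 1.0`, `target = 1.5`, `lower_max = 2.0`, `upper_min = 5.0` and `penalty = 2.0`, the lower-bound penalty is active and the derivative is dominated by the violated constraint, which pushes the prediction up more strongly than the one-step target alone would.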
The computational complexity of the proposed approach increases with the number of considered time steps $K$, since additional forward passes are required to compute the bounds $L_j^{\max}$ and $U_j^{\min}$. However, we can increase the memory size on the GPU to compute both the bounds and targets in a single forward pass if $K$ is not too large. If at all a problem, we can further alleviate this increase by randomly sampling a subset of the constraints rather than exhaustively using all of them. More informed strategies regarding the choice of constraints are possible as well since we may expect lower bounds in the more distant future to have a larger impact early in the training. In contrast, once the algorithm is almost converged we may expect lower bounds close to the considered time-step to have bigger impact.
To efficiently compute the discounted reward over multiple time steps we add a new element to the experience replay structure. Specifically, in addition to state, action, reward and next state for time-step $j$, we also store the real discounted return $R_j$ which is the discounted cumulative return achieved by the agent in its game episode. $R_j$ is computed via $R_j = \sum_{\tau=j}^{T} \gamma^{\tau-j} r_\tau$, where $T$ is the end of the episode and $\gamma$ is the discount factor. $R_j$ is then inserted in the replay memory after the termination of the current episode or after reaching the limit of steps. All in all, the structure of our experience replay memory consists of tuples of the form $(s_j, a_j, r_j, R_j, s_{j+1})$.
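Since the discounted return of a step equals its reward plus the discounted return of the following step, all returns of an episode can be computed in a single backward pass once the episode has ended. The sketch below illustrates this; the function names and the tuple layout with plain Python values are assumptions for this example.

```python
def discounted_returns(rewards, gamma):
    """Discounted cumulative return for every step of a finished episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for j in range(len(rewards) - 1, -1, -1):
        running = rewards[j] + gamma * running  # return at j from return at j+1
        returns[j] = running
    return returns


def episode_to_replay_tuples(states, actions, rewards, gamma):
    """Build (state, action, reward, return, next_state) tuples for one episode.

    `states` holds one more entry than `actions` and `rewards` (the final state).
    """
    returns = discounted_returns(rewards, gamma)
    return [(states[j], actions[j], rewards[j], returns[j], states[j + 1])
            for j in range(len(rewards))]
```

The backward recursion takes time linear in the episode length, which is why the returns are only inserted after the episode has terminated.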
We leave the questions regarding a good choice of penalty function and a good choice of the penalty coefficients to future work. At the moment we use a quadratic penalty function and a constant penalty coefficient $\lambda$ identical for both bounds. More complex penalty functions and sophisticated optimization approaches may yield even better results than the ones we report in the following.
5 Experiments
We evaluate the proposed algorithm on a set of 49 games from the Arcade Learning Environment (Bellemare et al., 2013) as suggested by Mnih et al. (2015). This environment is considered to be one of the most challenging reinforcement learning tasks because of its high-dimensional output. Moreover, the intrinsic mechanism varies tremendously for each game, making it extremely demanding to find a single, general and robust algorithm and a corresponding single hyperparameter setting which works well across all 49 games.
Following existing work (Mnih et al., 2015), our agent predicts an action based on only raw image pixels and reward information received from the environment. A deep neural network is used as the function approximator for the Q-function. The game image is resized to an $84 \times 84$ grayscale image $s_t$. The first layer is a convolutional layer with 32 filters of size $8 \times 8$ and a stride of 4; the second layer is a convolutional layer with 64 filters of size $4 \times 4$ and a stride of 2; the third layer is a convolutional layer with 64 filters of size $3 \times 3$ and a stride of 1; the next fully connected layer transforms the input to 512 units which are then transformed by another fully connected layer to an output size equal to the number of actions in each game. The rectified linear unit (ReLU) is used as the activation function for each layer. We used the hyperparameters provided by Mnih et al. (2015) for annealing $\epsilon$-greedy exploration and also applied RMSProp for gradient descent. As in previous work we combine four frames into a single step for processing. We chose the hyperparameter $K$ for GPU memory efficiency when dealing with mini-batches. In addition, we also incorporate the discounted return $R_j$ in the lower bound calculation to further stabilize the training. We use a penalty coefficient $\lambda$ which was obtained by coarsely tuning performance on the games ‘Alien,’ ‘Amidar,’ ‘Assault,’ and ‘Asterix.’ Gradients are also rescaled so that their magnitudes are comparable with or without penalty. All experiments are performed on an NVIDIA GTX TitanX 12GB graphics card.
5.1 Evaluation
In previous work (Mnih et al., 2015; van Hasselt et al., 2015; Schaul et al., 2016; Wang et al., 2015), the Q-function is trained on each game using 200 million (200M) frames or 50M training steps. We compare to those baseline results obtained after 200M frames using our proposed algorithm which ran for only 10M frames or 2.5M steps, i.e., 20 times fewer data, due to time constraints. Instead of training more than 10 days we manage to finish training in less than one day. Furthermore, for a fair comparison, we replicate the DQN results and compare the performance of the proposed algorithm after 10M frames to those obtained when training DQN on only 10M frames.
We strictly follow the evaluation procedure in (Mnih et al., 2015) which is often referred to as ‘30 no-op evaluation.’ During both training and testing, at the start of the episode, the agent always performs a random number of at most 30 no-op actions. During evaluation, our agent plays each game 30 times for up to 5 minutes, and the obtained score is averaged over these 30 runs. An $\epsilon$-greedy policy with $\epsilon = 0.05$ is used. Specifically, for each run, the game episode starts with at most 30 no-op steps, and ends with ‘death’ or after a maximum of 5 minutes of gameplay, which corresponds to 18000 frames.
Our training consists of 40 epochs, each containing 250000 frames, thus 10M frames in total. For each game, we evaluate our agent at the end of every epoch, and, following common practice (van Hasselt et al., 2015; Mnih et al., 2015), we select the best agent’s evaluation as the result of the game. Almost all hyperparameters are selected to be identical to those of Mnih et al. (2015) and Nair et al. (2015).
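To compare the performance of our algorithm to the DQN baseline, we follow the approach of Wang et al. (2015) and measure the improvement in percent using
$$\frac{\text{Score}_{\text{Agent}} - \text{Score}_{\text{Baseline}}}{\max\{\text{Score}_{\text{Human}}, \text{Score}_{\text{Baseline}}\} - \text{Score}_{\text{Random}}}. \qquad (5)$$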
We select this approach because the denominator choice of either human or baseline score prevents insignificant changes or negative scores from being interpreted as large improvements.
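The improvement measure can be written as a small helper. The sketch assumes the form implied by the description above, i.e., the score difference between agent and baseline divided by the larger of the human and baseline scores minus the random score; the function name and the example numbers are hypothetical.

```python
def improvement_percent(agent, baseline, human, random_score):
    """Improvement of the agent over the baseline, in percent.

    The denominator uses whichever of the human and baseline scores is larger,
    measured relative to the score of a random agent.
    """
    denominator = max(human, baseline) - random_score
    return 100.0 * (agent - baseline) / denominator
```

With hypothetical scores `agent = 150`, `baseline = 100`, `human = 200` and `random_score = 0`, the improvement is 25%; if the baseline already exceeds the human score, the baseline score takes over in the denominator, so small absolute changes are not reported as large improvements.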
Fig. 1 shows the improvement of our algorithm over the DQN baseline proposed by Mnih et al. (2015) and trained for 200M frames, i.e., 50M steps. Even though our agent is only trained for 10M frames, we observe that our technique outperforms the baseline significantly. In 30 out of 49 games, our algorithm exceeds the baseline using only 5% of the baseline’s training frames, sometimes drastically, e.g., in games such as ‘Atlantis,’ ‘Double Dunk,’ and ‘Krull.’ The remaining 19 games often require a long training time. Nonetheless, our algorithm still reaches a satisfactory level of performance.
Table 1: Training time, mean and median of the normalized scores.
Method  Training Time  Mean  Median

Ours (10M)  less than 1 day (1 GPU)  345.70%  105.74% 
DQN (200M)  more than 10 days (1 GPU)  241.06%  93.52% 
DDQN (200M)  more than 10 days (1 GPU)  330.3%  114.7% 
In order to further illustrate the effectiveness of our method, we compare our results with our implementation of DQN trained on 10M frames. The results are illustrated in Fig. 2. We observe a better performance on 46 out of 49 games, demonstrating in a fair way the potential of our technique.
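As suggested by van Hasselt et al. (2015), we use the following score
$$\text{Score}_{\text{Normalized}} = \frac{\text{Score}_{\text{Agent}} - \text{Score}_{\text{Random}}}{\left| \text{Score}_{\text{Human}} - \text{Score}_{\text{Random}} \right|} \qquad (6)$$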
to summarize the performance of our algorithm in a single number. We normalize the scores of our algorithm, the baseline reported by Mnih et al. (2015), and double DQN (DDQN) (van Hasselt et al., 2015), and report the training time, mean and median in Table 1. We observe our technique with 10M frames to achieve comparable scores to the DDQN method trained on 200M frames (van Hasselt et al., 2015), while it outperforms the DQN method (Mnih et al., 2015) by a large margin. We believe that our method can be readily combined with other techniques developed for DQN, such as DDQN (van Hasselt et al., 2015), prioritized experience replay (Schaul et al., 2016), dueling networks (Wang et al., 2015), and asynchronous methods (Mnih et al., 2016) to further improve the accuracy and training speed.
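The normalization can be reproduced from the raw scores listed in the supplementary material. The sketch below assumes the normalized score is the agent's score minus the random score, divided by the absolute difference between human and random scores; the function name is hypothetical, and the example values for ‘Boxing’ and ‘Freeway’ are taken from Table S1.

```python
from statistics import mean, median


def normalized_score_percent(agent, human, random_score):
    """Agent score relative to random (0%) and human (100%) performance."""
    return 100.0 * (agent - random_score) / abs(human - random_score)


# Raw scores of our method (10M frames) from Table S1: (agent, human, random).
raw = {
    "Boxing": (81.17, 4.3, 0.1),
    "Freeway": (31.3, 29.6, 0.0),
}
normalized = {game: normalized_score_percent(*scores)
              for game, scores in raw.items()}
summary = (mean(normalized.values()), median(normalized.values()))
```

The resulting values agree with the corresponding entries of Table S2 up to rounding; the mean and median over all 49 games are obtained in the same way.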
In Fig. 3 we illustrate the evolution of the score for our algorithm and the DQN approach for the 6 games ‘Frostbite,’ ‘Atlantis,’ ‘Zaxxon,’ ‘H.E.R.O,’ ‘Q*Bert,’ and ‘Chopper Command.’ We observe our method to achieve significantly higher scores very early on. Importantly our technique increases the gap between our approach and the DQN performance even during later stages of the training. We refer the reader to the supplementary material for additional results and raw scores.
6 Conclusion
In this paper we proposed a novel program for deep Q-learning which propagates promising rewards to achieve significantly faster convergence than the classical DQN. Our method significantly outperforms competing approaches even when trained on a small fraction of the data on the Atari 2600 domain. In the future, we plan to investigate the impact of penalty functions, advanced constrained optimization techniques and explore potential synergy with other techniques.
References

Bellemare et al. (2013) M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. J. of Artificial Intelligence Research, 2013.
Bengio et al. (2013) Y. Bengio, A. Courville, and P. Vincent. Representation Learning: A Review and New Perspectives. PAMI, 2013.
Bertsekas & Tsitsiklis (1996) D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.
Blundell et al. (2016) C. Blundell, B. Uria, A. Pritzel, Y. Li, A. Ruderman, J. Z. Leibo, J. Rae, D. Wierstra, and D. Hassabis. Model-Free Episodic Control. In http://arxiv.org/pdf/1606.04460v1.pdf, 2016.
Hinton et al. (2012) G. E. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-R. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 2012.
Kaelbling et al. (1996) L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: A survey. JMLR, 1996.
Krizhevsky et al. (2012) A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Proc. NIPS, 2012.
Lange & Riedmiller (2010) S. Lange and M. Riedmiller. Deep autoencoder neural networks in reinforcement learning. In Proc. Int. Jt. Conf. Neural. Netw., 2010.
LeCun et al. (2015) Y. LeCun, Y. Bengio, and G. E. Hinton. Deep learning. Nature, 2015.
Lin (1992) L.-J. Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 1992.
Mnih et al. (2013) V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with Deep Reinforcement Learning. In NIPS Deep Learning Workshop, 2013.
Mnih et al. (2015) V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 2015.
Mnih et al. (2016) V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous Methods for Deep Reinforcement Learning. In https://arxiv.org/abs/1602.01783, 2016.
Munos et al. (2016) R. Munos, T. Stepleton, A. Harutyunyan, and M. G. Bellemare. Safe and efficient off-policy reinforcement learning. In Proc. NIPS, 2016.
Nair et al. (2015) A. Nair, P. Srinivasan, S. Blackwell, C. Alcicek, R. Fearon, V. Panneershelvam, A. De Maria, M. Suleyman, C. Beattie, S. Petersen, S. Legg, V. Mnih, K. Kavukcuoglu, and D. Silver. Massively Parallel Methods for Deep Reinforcement Learning. In https://arxiv.org/abs/1507.04296, 2015.
Osband et al. (2016) I. Osband, C. Blundell, A. Pritzel, and B. Van Roy. Deep Exploration via Bootstrapped DQN. In http://arxiv.org/abs/1602.04621, 2016.
Powell (2011) W. B. Powell. Approximate Dynamic Programming. Wiley, 2011.
Riedmiller (2005) M. Riedmiller. Neural fitted Q iteration - first experiences with a data efficient neural reinforcement learning method. In Proc. ECML, 2005.
Schaul et al. (2016) T. Schaul, J. Quan, I. Antonoglou, and D. Silver. Prioritized Experience Replay. In Proc. ICLR, 2016.
Silver et al. (2016) D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 2016.
Sutskever et al. (2014) I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Proc. NIPS, 2014.
Sutton & Barto (1998) R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
Thrun & Schwartz (1993) S. Thrun and A. Schwartz. Issues in using function approximation for reinforcement learning. In Proc. Connectionist Models Summer School, 1993.
Tsitsiklis & Roy (1997) J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. 1997.
van Hasselt (2010) H. van Hasselt. Double Q-learning. In Proc. NIPS, 2010.
van Hasselt et al. (2015) H. van Hasselt, A. Guez, and D. Silver. Deep Reinforcement Learning with Double Q-learning. In https://arxiv.org/abs/1509.06461, 2015.
Wang et al. (2015) Z. Wang, T. Schaul, M. Hessel, H. van Hasselt, M. Lanctot, and N. de Freitas. Dueling Network Architectures for Deep Reinforcement Learning. In https://arxiv.org/abs/1511.06581, 2015.
Watkins (1989) C. J. C. H. Watkins. Learning from delayed rewards. PhD thesis, University of Cambridge, England, 1989.
Watkins & Dayan (1992) C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 1992.
Wawrzynski (2009) P. Wawrzynski. Real-time reinforcement learning by sequential actor-critics and experience replay. Neural Networks, 2009.
Appendix A Supplementary Material
We present our quantitative results in Table S1 and Table S2. We also illustrate the normalized score provided in Eq. (6) over the number of episodes in Fig. S1.
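Table S1: Raw scores of a random agent, a human player, DQN trained on 200M frames, and our method trained on 10M frames.
Game  Random  Human  DQN 200M  Ours 10M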

Alien  227.80  6875  3069  1864 
Amidar  5.8  1676  739.5  565.67 
Assault  222.4  1496  3359  5142.37 
Asterix  210  8503  6012  5408.33 
Asteroids  719.1  13157  1629  1481.67 
Atlantis  12850  29028  85641  316766.67 
Bank Heist  14.2  734.4  429.7  596 
Battle Zone  2360  37800  26300  30800 
Beam Rider  363.9  5775  6846  8069 
Bowling  23.1  154.8  42.4  49.3 
Boxing  0.1  4.3  71.8  81.17 
Breakout  1.7  31.8  401.2  229.79 
Centipede  2091  11963  8309  4470.06 
Chopper Command  811  9882  6687  6360 
Crazy Climber  10781  35411  114103  114146 
Demon Attack  152.1  3401  9711  5738.67 
Double Dunk  -18.6  -15.5  -18.1  -10.07
Enduro  0  309.6  301.8  672.83 
Fishing Derby  -91.7  5.5  -0.8  5.27
Freeway  0  29.6  30.3  31.3 
Frostbite  65.2  4335  328.3  3974.11 
Gopher  257.6  2321  8520  4660 
Gravitar  173  2672  306.7  346.67 
H.E.R.O  1027  25763  19950  19975 
Ice Hockey  -11.2  0.9  -1.6  -3.43
Jamesbond  29  406.7  576.7  1088.33 
Kangaroo  52  3035  6740  11716.67 
Krull  1598  2395  3805  9461.1 
KungFu Master  258.5  22736  23270  27820 
Montezuma’s Revenge  0  4376  0  23.33 
Ms. Pacman  307.3  15693  2311  1805 
Name This Game  2292  4076  7257  7314.67 
Pong  -20.7  9.3  18.9  19.4
Private Eye  24.9  69571  1788  342.37 
Q*Bert  163.9  13455  10596  12355 
River Raid  1339  13513  8316  8028.33 
Road Runner  11.5  7845  18257  29346.67 
Robotank  2.2  11.9  51.6  34.5 
Seaquest  68.4  20182  5286  4070 
Space Invaders  148  1652  1976  995 
Star Gunner  664  10250  57997  16653.95 
Tennis  -23.8  -8.9  -2.5  -1
Time Pilot  3568  5925  5947  5423.33 
Tutankham  11.4  167.6  186.7  232 
Up and Down  533.4  9082  8456  14406 
Venture  0  1188  380  286.67 
Video Pinball  16257  17298  42684  74873.2 
Wizard of Wor  563.5  4757  3393  4716.67 
Zaxxon  32.5  9173  4977  10598 
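Table S2: Normalized scores, computed via Eq. (6), for DQN trained on 200M frames and our method trained on 10M frames.
Game  DQN 200M  Ours 10M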

Alien  42.74%  24.62% 
Amidar  43.93%  33.52% 
Assault  246.27%  386.31% 
Asterix  69.96%  62.68% 
Asteroids  7.32%  6.13% 
Atlantis  449.94%  1878.60% 
Bank Heist  57.69%  80.78% 
Battle Zone  67.55%  80.25% 
Beam Rider  119.79%  142.39% 
Bowling  14.65%  19.89% 
Boxing  1707.14%  1930.24% 
Breakout  1327.24%  757.77% 
Centipede  62.99%  24.10% 
Chopper Command  64.78%  61.17% 
Crazy Climber  419.50%  419.67% 
Demon Attack  294.22%  171.95% 
Double Dunk  16.13%  275.16% 
Enduro  97.48%  217.32% 
Fishing Derby  93.52%  99.76% 
Freeway  102.36%  105.74% 
Frostbite  6.16%  91.55% 
Gopher  400.43%  213.36% 
Gravitar  5.35%  6.95% 
H.E.R.O  76.50%  76.60% 
Ice Hockey  79.34%  64.22% 
Jamesbond  145.00%  280.47% 
Kangaroo  224.20%  391.04% 
Krull  276.91%  986.59% 
KungFu Master  102.38%  122.62% 
Montezuma’s Revenge  0%  0.53% 
Ms. Pacman  13.02%  9.73% 
Name This Game  278.31%  281.54% 
Pong  132%  133.67% 
Private Eye  2.54%  0.46% 
Q*Bert  78.49%  91.73% 
River Raid  57.31%  54.95% 
Road Runner  232.92%  374.48% 
Robotank  509.28%  332.99% 
Seaquest  25.94%  19.90% 
Space Invaders  121.54%  56.31% 
Star Gunner  598.10%  166.81% 
Tennis  142.95%  153.02% 
Time Pilot  100.93%  78.72% 
Tutankham  112.23%  141.23% 
Up and Down  92.68%  162.38% 
Venture  31.99%  24.13% 
Video Pinball  2538.62%  5630.76% 
Wizard of Wor  67.47%  99.04% 
Zaxxon  54.09%  115.59% 