Introduction
Model-free deep reinforcement learning methods have recently demonstrated impressive performance on a range of complex learning tasks [Hessel et al. 2018, Lillicrap et al. 2017, Jaderberg et al. 2017]. Deep Q-Networks (DQNs) [Mnih et al. 2015], in particular, stand out as a versatile and effective tool for a wide range of applications. DQNs offer an end-to-end solution for many reinforcement learning problems by generalizing tabular Q-learning to high-dimensional input spaces. Unfortunately, DQNs still suffer from a number of important drawbacks that hinder widespread adoption. There is a general lack of theoretical understanding to guide the use of deep Q-learning algorithms. Combining non-linear function approximation with off-policy reinforcement learning is known to lead to possible divergence issues, even in very simple tasks [Boyan and Moore 1995]. Despite improvements introduced by deep Q-networks [Mnih et al. 2015], Q-learning approaches based on deep neural networks still exhibit occasional instability during the learning process, see for example [van Hasselt, Guez, and Silver 2016].

In this paper, we aim to address the instability of current deep RL approaches by augmenting model-free value learning with a predictive model. We propose a hybrid model and value learning objective, which stabilises the learning process and reduces sample complexity compared to model-free algorithms. We attribute this reduction to the richer feedback signal our method provides compared to standard deep Q-networks. In particular, our model provides feedback at every time step based on the current prediction error, which eliminates one source of sample inefficiency caused by sparse rewards. These conclusions are in accordance with previous research on linear function approximation in reinforcement learning [Parr et al. 2008, Sutton et al. 2012, Song et al. 2016].
While linear function approximation results provide a motivation for model-based stabilisation, such theories fail to generalise to deep non-linear architectures [Song et al. 2016]. To close this gap in the literature and demonstrate stability of our method, we prove convergence in the general deep RL setting with deep value function approximators. Theoretically analysing deep RL algorithms, however, is challenging due to non-linear functional dependencies and objective non-convexities prohibiting convergence to globally optimal policies. As such, the best one would hope for is a stationary point (e.g., a first-order one) that guarantees vanishing gradients. Even understanding gradient behavior for deep networks in RL can be difficult due to exploration-based policies and replay memory considerations.
To alleviate some of these problems, we map deep reinforcement learning to a mathematical framework explicitly designed for understanding the exploration/exploitation trade-off. Namely, we formalise the problem as an online learning game the agent plays against an adversary (i.e., the environment). Such a link allows us to study a more general problem combining notions of regret, reinforcement learning, and optimisation. Given such a link, we prove that the average regret vanishes as the total number of rounds T grows. This, in turn, guarantees convergence to a stationary point.
Guided by the above theoretical results, we validate that our algorithm leads to faster learning by conducting experiments on 20 Atari games. Due to the high computational and budgeting demands, we chose a benchmark that is closest to our setting. Concretely, we picked DQNs as our template algorithm to improve on both theoretically and empirically. It is worth noting that extending our theoretical results to more complex RL algorithms is an orthogonal future direction.
Background: Reinforcement Learning
We consider a discrete-time, infinite-horizon, discounted Markov Decision Process (MDP) (S, A, R, P, γ). Here S denotes the state space, A the set of possible actions, R the reward function, P the state transition function, and γ ∈ [0, 1) a discount factor. An agent, being in state s_t at time step t, executes an action a_t sampled from its policy, a conditional probability distribution π(a_t | s_t). The agent's action elicits from the environment a reward signal r_t indicating instantaneous reward, a terminal flag f_{t+1} indicating a terminal event that restarts the environment, and a transition to a successor state s_{t+1}. We assume that the state, action, and reward sets are discrete. The reward r_t is sampled from the conditional probability distribution R(r_t | s_t, a_t). Similarly, f_{t+1} ∈ {0, 1}, where a terminal event (f_{t+1} = 1) restarts the environment according to some initial state distribution. The transition to a successor state s_{t+1} is determined by the stochastic state transition function according to P(s_{t+1} | s_t, a_t).

The agent's goal is to maximise expected future cumulative discounted reward with respect to the policy π. An important quantity in RL are Q-values Q^π(s, a), defined as the expected future cumulative reward when executing action a in state s and subsequently following policy π. Q-values enable us to conveniently phrase the RL problem as maximising Q-values under the stationary state distribution induced by executing π in the environment (starting from the initial state distribution).
Deep Q-Networks:
Value-based reinforcement learning approaches identify optimal Q-values directly using parametric function approximators Q(s, a; θ), where θ represents the parameters [Watkins 1989, Busoniu et al. 2010]. Optimal Q-value estimates then correspond to an optimal greedy policy π*(s) ∈ argmax_a Q*(s, a). Deep Q-networks [Mnih et al. 2015] learn a deep neural network based Q-value approximation by performing stochastic gradient descent on the following training objective:
\[
\mathcal{L}_Q(\theta) = \mathbb{E}_{(s,a,r,f,s') \sim \mathcal{M}} \Big[ \big( r + (1 - f)\, \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta) \big)^2 \Big] \quad (1)
\]
The expectation ranges over transitions (s, a, r, f, s') sampled from an experience replay memory M (s' denotes the state at the next time step). Use of this replay memory, together with a separate target network (with different parameters θ⁻) for calculating the bootstrap values, helps stabilise the learning process.
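As an illustrative sketch of the objective in Equation (1), the bootstrapped targets and the squared TD error over a sampled minibatch can be computed as follows (function and variable names are our own, not from the paper, and the target-network Q-values are assumed to be given):

```python
def td_targets(rewards, terminals, next_q_values, gamma=0.99):
    """Bootstrapped one-step targets r + gamma * max_a' Q(s', a'; theta_minus).

    `next_q_values` holds, per transition, the target network's Q-values for
    all actions in the successor state; a terminal transition truncates the
    bootstrap term, so its target is just the reward.
    """
    targets = []
    for r, done, next_q in zip(rewards, terminals, next_q_values):
        bootstrap = 0.0 if done else gamma * max(next_q)
        targets.append(r + bootstrap)
    return targets


def q_loss(q_taken, targets):
    """Mean squared TD error of Equation (1) over a sampled minibatch,
    where `q_taken` are the online network's Q-values of the taken actions."""
    return sum((t - q) ** 2 for q, t in zip(q_taken, targets)) / len(targets)
```

In practice the gradient of this loss is taken with respect to the online parameters only; the target-network values are treated as constants.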
Background: Online Learning
In this paper, we employ a special form of regret minimisation games that we briefly review here. A regret minimisation game is a triple (K, F, T), where K is a non-empty decision set, F is the set of moves of the adversary, which contains bounded convex functions from K to the reals, and T is the total number of rounds. The game commences in rounds, where at round t the agent chooses a prediction θ_t ∈ K and the adversary a loss function L_t ∈ F. At the end of the round, the adversary reveals its choice and the agent suffers a loss L_t(θ_t). In this paper, we are concerned with the full-information case where the agent can observe the complete loss function at the end of each round. The goal of the game is for the agent to make successive predictions that minimise cumulative regret, defined as:
\[
\mathrm{Regret}_T = \sum_{t=1}^{T} \mathcal{L}_t(\theta_t) - \inf_{\theta \in \mathcal{K}} \sum_{t=1}^{T} \mathcal{L}_t(\theta) \quad (2)
\]
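As a concrete illustration of the regret in Equation (2), a minimal computation over a finite decision set follows (all names are illustrative; real decision sets in this paper are continuous parameter spaces):

```python
def cumulative_regret(loss_fns, predictions, decision_set):
    """Equation (2): accumulated loss of the online predictions minus the
    loss of the best fixed decision in hindsight.

    `loss_fns` holds one revealed loss function per round, `predictions`
    the agent's choice in each round, and `decision_set` a finite set of
    candidate fixed decisions standing in for the set K.
    """
    online_loss = sum(l(x) for l, x in zip(loss_fns, predictions))
    best_fixed = min(sum(l(x) for l in loss_fns) for x in decision_set)
    return online_loss - best_fixed
```

Sublinear growth of this quantity in the number of rounds is exactly the property the later analysis establishes.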
To use such a framework in our analysis, we require a map from RL to online learning, achieved via a generalisation of the loss function in Equation (1) as described later.
Problem Definition: Model-Based Stabilisation
Our goal is to stabilise deep RL with a predictive model component. We therefore extend model-free deep Q-networks [Mnih et al. 2015]¹ to predict the future environment in addition to Q-values. In this section, we describe our model-based DQN extension. In the next section, we prove that this indeed leads to convergence.

¹For the sake of brevity, we present our approach using the 'vanilla' DQN framework, as originally proposed by [Mnih et al. 2015]. The current state-of-the-art framework for value-based deep RL is the Rainbow framework [Hessel et al. 2018], which combines several independent DQN improvements. However, because these extensions do not use an environment model, the approach presented in this paper is orthogonal to the individual Rainbow components and could be combined with Rainbow.
To enable model-based stabilisation, we add three new action-conditioned output heads to the standard DQN architecture. These heads represent reward, termination, and next-observation predictors. The new outputs share the feature learning pipeline with the main Q-network (excluding the final two fully-connected layers). For the reward and termination prediction, we add additional fully-connected layers, whereas the next-observation prediction uses a deconvolutional structure (see appendix for the exact architecture).
The optimization objective for training the network comprises four individual loss functions that are jointly minimized with respect to the network weights θ: one for Q-value prediction and three additional loss functions. The first additional loss function is for predicting the terminal flag f, the second for predicting the instantaneous reward r, and the third for predicting the video frame s' at the next time step. All these parts are additively connected, leading to a compound loss
\[
\mathcal{L}(\theta) = \mathcal{L}_Q(\theta) + \lambda_f \mathcal{L}_f(\theta) + \lambda_r \mathcal{L}_r(\theta) + \lambda_s \mathcal{L}_s(\theta), \quad (3)
\]
where λ_f, λ_r, and λ_s are positive coefficients that balance the individual parts. The compound loss is an off-policy objective that can be trained with environment interactions obtained from any policy. The individual parts of Equation (3) can then be expressed as expectations over transitions sampled from a replay memory.
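The additive combination in Equation (3) can be sketched as follows; the coefficient names `lambda_t`, `lambda_r`, and `lambda_f` are our own labels for the balancing coefficients, and their default values are illustrative assumptions, not values from the paper:

```python
def compound_loss(l_q, l_terminal, l_reward, l_frame,
                  lambda_t=1.0, lambda_r=1.0, lambda_f=1.0):
    """Equation (3): Q-value loss plus weighted model-prediction losses.

    l_q        -- Q-value prediction loss (Equation (1))
    l_terminal -- terminal-flag prediction loss
    l_reward   -- instantaneous-reward prediction loss
    l_frame    -- next-frame prediction loss
    """
    return l_q + lambda_t * l_terminal + lambda_r * l_reward + lambda_f * l_frame
```

Because the sum is over per-transition expectations from the same replay memory, a single minibatch can be used to evaluate all four parts jointly.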
The Q-value prediction loss can be written as in Equation (1), whereas the terminal flag prediction loss is
\[
\mathcal{L}_f(\theta) = \mathbb{E}_{(s,a,f) \sim \mathcal{M}} \big[ -\ln \hat{p}_f(f \mid s, a; \theta) \big], \quad (4)
\]
where p̂_f refers to the terminal flag predictor given a specific state and action. Note that the terminal flag f is binary and p̂_f is a parametric categorical probability distribution.
Similarly, the loss for instantaneous reward prediction is
\[
\mathcal{L}_r(\theta) = \mathbb{E}_{(s,a,r) \sim \mathcal{M}} \big[ -\ln \hat{p}_r(r \mid s, a; \theta) \big], \quad (5)
\]
where p̂_r is a state-action-conditioned parametric categorical distribution over the reward signal.
The loss for predicting the video frame at the next time step can be formulated as
\[
\mathcal{L}_s(\theta) = \mathbb{E}_{(s,a,s') \sim \mathcal{M}} \big[ \, \| \hat{s}'(s, a; \theta) - s' \|_F^2 \, \big], \quad (6)
\]
where ŝ' refers to a deterministic parametric map predicting the next state observation s', given the current state s and the action a (i.e., predicting the next video frame given the last four video frames and the action in Atari), and ‖·‖_F² refers to the squared Frobenius norm between two images.
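The squared Frobenius norm in Equation (6) is simply the sum of squared pixel-wise differences between the predicted and true frames; a minimal sketch over nested lists standing in for image tensors (names are illustrative):

```python
def frame_loss(predicted, target):
    """Equation (6) for a single transition: squared Frobenius norm between
    the predicted next frame and the ground-truth next frame, i.e. the sum
    of squared element-wise differences."""
    return sum((p - t) ** 2
               for row_p, row_t in zip(predicted, target)
               for p, t in zip(row_p, row_t))
```

In the full objective this quantity is averaged over the sampled minibatch, just like the other loss parts.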
Our model-based algorithm commences in rounds, see Algorithm 1. At each time step, the agent observes the current environmental state and takes an epsilon-greedy action with respect to the current Q-values. This leads the environment to elicit a reward signal, a terminal flag, and a transition to a successor state. The experience replay is updated accordingly. Every fourth interaction with the environment, the agent samples a minibatch of experiences from the experience replay to perform a gradient update step on the loss from Equation (3). The target network is periodically updated.
Theoretical Analysis
Here, we confirm that our modelbased approach stabilises the learning process by guaranteeing convergence to a stationary point in parameter space. To derive our convergence guarantees, we map the optimization problem in Equation (3) to an equivalent online learning form.
Before constructing this map, we augment the optimisation objective with a regulariser Ω(θ). The goal of such an addition is to stabilise the learning objective and reduce the overfitting that can occur when considering deep networks. Most importantly, however, we incorporate the regulariser to avoid degenerate solutions that fail to achieve non-trivial regret guarantees (e.g., Hannan's consistency), see [Rakhlin 2009]. We do not assume a specific regulariser; however, we assume a convex form (e.g., L1 or L2 norms).
With this in mind, the learning objective on the t-th iteration can be written as:
\[
\theta_t \in \arg\min_{\theta \in \mathcal{K}} \Big[ \sum_{j=1}^{t} \mathcal{L}_j(\theta) + \Omega(\theta) \Big], \quad (7)
\]
where each per-round loss L_j denotes the compound loss of Equation (3) evaluated with the replay memory and target network of round j. In words, Equation (7) states that the goal of the agent is to choose a parameter that minimises the accumulated loss so far, while taking into account regularisation. That being said, it is customary in reinforcement learning not to consider an equally weighted summation over iterations. Practitioners typically consider a decaying moving average, which focuses more on losses recently encountered and less on historical ones. To adhere to such practices, we rewrite our objective to include a round-dependent weighting as:
\[
\theta_t \in \arg\min_{\theta \in \mathcal{K}} \Big[ \sum_{j=1}^{t} w_j^{(t)} \mathcal{L}_j(\theta) + \Omega(\theta) \Big] \quad (8)
\]
Rather than heuristically choosing the weights w_j^{(t)}, we defer to our theoretical guarantees for their optimal setting. Though Equation (8) looks as if it parts ways with the original objective, it can easily be seen as a generalisation: we can recover the original DQN-style objective by placing all weight on the most recent loss, i.e., w_j^{(t)} = 1 if j = t and 0 otherwise (note that the optimization objective in DQN-style algorithms is round-dependent, since both the experience replay and the target network change over time). Hence, if we are able to establish convergence properties for Equation (8), we can easily recover convergence for the special case.

Analysing the convergence of Equation (8) needs to be done with care, as the summation limit is round-dependent and hence needs to be considered in a streaming model. An ideal framework for understanding the theoretical properties of such problems is that of online learning, as discussed above. To see the connection between Equation (8) and online learning, note that the weighted partial sums can be folded into a sequence of surrogate per-round losses, which allows the solution of Equation (8) to be written recursively as a standard online learning update, see [Rakhlin 2009, Abbasi-Yadkori et al. 2013].
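The round-dependent weighting of Equation (8) can be illustrated with a simple exponentially decaying window over past losses; the decay parameter and function names are illustrative assumptions, not the schedule derived by the theory:

```python
def weighted_objective(losses, decay=0.9):
    """Round-dependent weighting in the spirit of Equation (8): the most
    recent loss carries weight 1 and each older loss is down-weighted by
    a further factor of `decay` (an exponentially decaying window)."""
    n = len(losses)
    return sum(decay ** (n - 1 - t) * loss for t, loss in enumerate(losses))
```

Setting `decay` close to zero recovers the DQN-style objective that considers only the most recent round, while `decay = 1` recovers the equally weighted sum of Equation (7).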
As in any standard online learning game, the goal for the agent is to determine a sequence of models that minimises accumulated regret after T rounds. From Equation (2), one can see that regret analysis compares the partial-information (online) solution to the full-information one, i.e., the solution acquired after observing all data. It is also clear that a desirable property of the algorithm is to exhibit sublinear regret as the number of rounds increases. Such a property, in turn, guarantees vanishing average regret in terms of rounds. If achieved, one can then conclude convergence for the algorithm, see [Abbasi-Yadkori et al. 2013].
Regret Bound
This section is dedicated to proving vanishing regret for our optimization algorithm. Our proof consists of three main steps. First, we approximate the original objective using a first-order Taylor expansion. In order not to have a crude approximation, we vary the expansion's operating point to be the previous round's constrained solution. Under this approximation, we proceed to bound the norm of the gradient in the general setting of convolutional/deconvolutional deep neural networks in terms of the weighting and regularization coefficients. Given this result, we schedule each of these free parameters to acquire vanishing regret. We summarize our results in the following theorem:
Theorem 1.
Assume a compact, convex, and closed decision set K, bounded losses, and Lipschitz continuous activation functions (e.g., softmax). Then, for a suitable schedule of the weighting and regularisation parameters, the regret after T rounds is sublinear in T, and hence the average regret vanishes as T grows.
The complete proof of the theorem is rather involved and as such is left to the appendix. In what comes next, we elaborate on the main steps needed to arrive at the theorem.
Bounding the gradient
To bound the gradient, we approximate the full objective at round t using a first-order Taylor expansion around the previous round's solution θ_{t-1}. Though increasing the complexity of our analysis, we believe a time-dependent operating point based on previous round solutions is crucial to keep track of round-dependent changes, thus avoiding a crude approximation of the original objective. Applying the triangle and Cauchy–Schwarz inequalities to the first-order expansion, the objective can be bounded in terms of the norm of the parameter deviation θ − θ_{t-1}, the norm of the gradient at the operating point, and the loss value itself. Using our assumptions, the first of these norms is bounded because θ belongs to the compact set K. For the remaining two quantities, we can prove the following:
Lemma 1.
Under the assumptions of Theorem 1, both the norm of the loss gradient and the loss itself are bounded by constants depending on the Lipschitz constant of the activation functions and the diameter of the decision set. Their exact values can be found in the appendix due to space constraints.
Proof Roadmap
Due to the usage of a deep convolutional/deconvolutional network to approximate Q-values, the proof of the above lemma is involved and as such is left to the supplementary material. In short, the proof consists of three main steps. First, we formally derive the output of each of the network layers in terms of input image tensors. Given this formalization, in the second step, we derive the gradient with respect to the unknown parameters in closed form. Third, incorporating these results with our assumptions and using the triangle and Cauchy–Schwarz inequalities, we arrive at the statement of the lemma.
Completing the Regret Proof
To prove the statement of Theorem 1, the remaining step is to incorporate the above bounds into the results of [Abbasi-Yadkori et al. 2013]. Namely, we can show that the regret decomposes into a sum of terms governed by the round weights and the regularisation coefficients, with constants that can be found in the appendix. We then choose optimal values for each of these free parameters to acquire sublinear regret behavior.

Among different choices for these hyperparameters, our results need to relate to common practices in deep reinforcement learning. For instance, deep RL weighs current round losses higher compared to history (e.g., an exponentially decaying window), which corresponds to a specific choice of the round weights. Setting the remaining regularisation parameters accordingly guarantees a sublinear overall regret bound. This finalises the statement of the theorem, guaranteeing vanishing average regret.
Remark: The above proof verifies that our method exhibits vanishing regret and therefore converges to a stationary point. We note that this result does not hold for the unregularized case of basic DQN learning: a counterexample can easily be constructed in which the regret of DQN grows at least linearly in the number of rounds (see appendix).
A second observation is that the regularized objective in Equation (3) can be considered as a constrained optimization objective. Intuitively, this approach shrinks the feasibility set, thereby changing the set of local optima for DQNs.
The result of vanishing regret thus guarantees convergence to a stationary point.
In the next section, we empirically verify that our algorithm exhibits lower sample complexity. Due to high computational and budgeting demands, we chose the closest algorithm to our setting to conduct experiments against. This is desirable as our results extend deep Qnetworks and as such, serve as a realistic benchmark. Please note that extending our method to other forms of deep RL is an orthogonal direction to this work that we leave for future investigation.
Table 1: Normalized game scores of the best-performing agents.

Game  DQN  Our approach
Amidar  22.3%  19.5% 
Assault  34.1%  33.3% 
Asterix  37.4%  46.4% 
Battle Zone  20.2%  83.7% 
Berzerk  11.6%  12.3% 
Chopper Command  54.9%  86.2% 
Crazy Climber  234.3%  378.7% 
Kangaroo  242.6%  224.5% 
Krull  713.2%  663.9% 
Kung Fu Master  91.1%  115.8% 
Ms Pacman  28.2%  32.4% 
Qbert  72.8%  93.8% 
Road Runner  582.2%  615.5% 
Robotank  416.8%  122.6% 
Seaquest  11.3%  13.6% 
Space Invaders  50.0%  50.5% 
Star Gunner  511.8%  185.7% 
Time Pilot  13.7%  5.8% 
Tutankham  161.1%  167.8% 
Up’n Down  66.5%  72.7% 
Median  60.7%  85.0% 
Experiments
We empirically validate our approach in the Atari domain [Bellemare et al.2013]. We compare against ordinary RL without modelbased stabilisation (DQN, [Mnih et al.2015]). Our aim is to verify that our approach exhibits stable learning and improves learning results. We show that our method leads to significantly better game play performance and sample efficiency across 20 Atari games.
As a proof of concept, in Figure 1 we visualize model predictions and compare them to ground-truth frames over a time horizon of eleven steps (44 frames), starting from an initial state reached after executing the policy for a number of steps. In the example depicted, the reward signal is perfectly predicted, and model-predicted video frames accurately match ground-truth frames. At time steps 10 and 11, a random object enters the scene from the left, which our model cannot predict because no information is available at time step 0 to foresee this event, in accordance with [Oh et al. 2015].
We next analyze the different components of the loss proposed in Equation (3). As an illustrative example, we report the losses during training on the game Kung Fu Master (see Figure 2). Clearly, the compound loss is dominated by the Qvalue loss and only mildly affected by all other loss parts.
In order to quantify game play performance, we store agent networks at regular intervals during learning and conduct an offline evaluation by averaging over multiple episodes, each of which comprises at most a fixed number of steps (but terminates earlier in case of a terminal event). In evaluation mode, agents follow an epsilon-greedy strategy with a small, fixed epsilon [Mnih et al. 2015]. The results of this evaluation are depicted in Figure 3 for five individual Atari games and for the median score over all 20 Atari games (smoothed with an exponential window). The median is taken by normalizing raw scores with respect to human and random scores according to [Mnih et al. 2015].
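The human/random score normalisation underlying the reported medians can be sketched as follows (this is the commonly used normalisation consistent with the description above; the scores in the test are made-up numbers, not results from the paper):

```python
def normalized_score(agent_score, random_score, human_score):
    """Human-normalized score: 0% corresponds to random play and 100%
    to human-level play; values above 100% mean super-human scores."""
    return 100.0 * (agent_score - random_score) / (human_score - random_score)
```

The median of these per-game percentages over all 20 games is what Table 1 and Figure 3 report.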
Our modelbased approach significantly outperforms the modelfree baseline on each individual game depicted and on average over all 20 games in the course of training. Additionally, we report normalized game scores obtained by the bestperforming agent throughout the learning process (see Table 1). Notably, our bestperforming agent outperforms the DQN baseline on 14 out of 20 individual games.
To demonstrate sample efficiency, we identify the number of time steps at which maximum DQN performance is first attained. To do so, we smooth the episodic reward obtained via offline evaluation with an exponential window. Figure 4 shows that our approach improves sample complexity over the model-free baseline on each of the five games depicted and on average over all 20 games.
Related Work
While model-free RL aims to learn a policy directly from transition samples, model-based RL attempts to be more efficient by learning about the environment in addition to the actual RL task. In general, there are four types of model-based RL.
(1) Planning approaches use the environment model to solve a virtual RL problem and act accordingly in the real environment [Wang 2009, Browne et al. 2012, Russell and Norvig 2016]. (2) DYNA-style learning [Sutton 1990, Sutton et al. 2012] augments the dataset with virtual samples for RL training. These virtual samples are generated from a learned environment model and are combined with samples from the actual environment to produce the final learning update. (3) Explicit exploration approaches use models to direct exploration, encouraging the agent to take actions that most likely lead to novel environment states [Stadie, Levine, and Abbeel 2015, Oh et al. 2015, Pathak et al. 2017, Jaderberg et al. 2017]. (4) Feature-learning approaches train a predictive model to extract features for value function approximation. Recent research has identified a relation between model learning and value function approximation [Parr et al. 2008, Song et al. 2016]; this work provides a theoretical basis for feature learning that connects the model prediction to the Bellman error for linear value function approximation.
Our work fits conceptually into the latter category. The results from [Song et al. 2016] do not, however, readily carry over to non-linear function approximators. Our work fills this gap by theoretically proving that stable deep RL can be obtained by joint value function and model learning. Practically, we demonstrate that this also speeds up the training procedure in terms of better sample complexity in Atari.
The most challenging contemporary environments for model-based RL are robotics and vision-based domains. In robotics, a range of model-based RL approaches have been successfully deployed [Deisenroth and Rasmussen 2011, Levine and Koltun 2013, Levine and Abbeel 2014, Heess et al. 2015, Gu et al. 2016, Pong et al. 2018], even for visual state spaces [Wahlström, Schön, and Deisenroth 2015, Watter et al. 2015, Finn et al. 2016, Levine et al. 2016]. However, in other vision-based domains (e.g., Atari), model-based learning has been less successful, despite plenty of model-learning approaches that demonstrably learn accurate environment models [Oh et al. 2015, Fu and Hsu 2016, Chiappa et al. 2017, Leibfried, Kushman, and Hofmann 2017, Wang, Kosson, and Mu 2017, Weber et al. 2017, Buesing et al. 2018]. One exception is the preliminary work of [Alaniz 2017], obtaining impressive results in Minecraft with Monte Carlo tree search.
Conclusion
In this work, we addressed the problem of unstable learning in deep RL. We introduced a new optimization objective and network architecture for deep value-based reinforcement learning by extending conventional deep Q-networks with a model-learning component. In our theoretical analysis, we formally show that our proposed approach converges to a stationary point in parameter space, whereas 'vanilla' DQNs can diverge. Empirically, our approach yields significant improvements on 20 Atari games in both sample complexity and overall performance when compared to model-free RL without model-based stabilisation.
References
 [Abbasi-Yadkori et al.2013] Abbasi-Yadkori, Y.; Bartlett, P. L.; Kanade, V.; Seldin, Y.; and Szepesvari, C. 2013. Online learning in Markov decision processes with adversarially chosen transition probability distributions. In Advances in Neural Information Processing Systems.
 [Alaniz2017] Alaniz, S. 2017. Deep reinforcement learning with model learning and Monte Carlo tree search in Minecraft. In Proceedings of the Multidisciplinary Conference on Reinforcement Learning and Decision Making.

 [Bellemare et al.2013] Bellemare, M. G.; Naddaf, Y.; Veness, J.; and Bowling, M. 2013. The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research 47:253–279.
 [Boyan and Moore1995] Boyan, J. A., and Moore, A. W. 1995. Generalization in reinforcement learning: Safely approximating the value function. In Advances in Neural Information Processing Systems.
 [Browne et al.2012] Browne, C.; Powley, E.; Whitehouse, D.; Lucas, S.; Cowling, P. I.; Rohlfshagen, P.; Tavener, S.; Perez, D.; Samothrakis, S.; and Colton, S. 2012. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games 4(1):1–49.
 [Buesing et al.2018] Buesing, L.; Weber, T.; Racaniere, S.; Ali Eslami, S. M.; Rezende, D.; Reichert, D. P.; Viola, F.; Besse, F.; Gregor, K.; Hassabis, D.; and Wierstra, D. 2018. Learning and querying fast generative models for reinforcement learning. In arXiv.
 [Busoniu et al.2010] Busoniu, L.; Babuska, R.; De Schutter, B.; and Ernst, D. 2010. Reinforcement Learning and Dynamic Programming using Function Approximators. CRC Press.
 [Chiappa et al.2017] Chiappa, S.; Racaniere, S.; Wierstra, D.; and Mohamed, S. 2017. Recurrent environment simulators. In Proceedings of the International Conference on Learning Representations.

 [Deisenroth and Rasmussen2011] Deisenroth, M. P., and Rasmussen, C. E. 2011. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the International Conference on Machine Learning.
 [Dosovitskiy, Springenberg, and Brox2015] Dosovitskiy, A.; Springenberg, J. T.; and Brox, T. 2015. Learning to generate chairs with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
 [Finn et al.2016] Finn, C.; Tan, X. Y.; Duan, Y.; Darrell, T.; Levine, S.; and Abbeel, P. 2016. Deep spatial autoencoders for visuomotor learning. In Proceedings of the IEEE International Conference on Robotics and Automation.
 [Fu and Hsu2016] Fu, J., and Hsu, I. 2016. Model-based reinforcement learning for playing Atari games. Technical Report, Stanford University.
 [Glorot and Bengio2010] Glorot, X., and Bengio, Y. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics.
 [Glorot, Bordes, and Bengio2011] Glorot, X.; Bordes, A.; and Bengio, Y. 2011. Deep sparse rectifier neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics.
 [Gu et al.2016] Gu, S.; Lillicrap, T.; Sutskever, I.; and Levine, S. 2016. Continuous deep Qlearning with modelbased acceleration. In Proceedings of the International Conference on Machine Learning.
 [Heess et al.2015] Heess, N.; Wayne, G.; Silver, D.; Lillicrap, T.; Tassa, Y.; and Erez, T. 2015. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems.
 [Hessel et al.2018] Hessel, M.; Modayil, J.; van Hasselt, H.; Schaul, T.; Ostrovski, G.; Dabney, W.; Horgan, D.; Piot, B.; Azar, M.; and Silver, D. 2018. Rainbow: Combining improvements in deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence.
 [Jaderberg et al.2017] Jaderberg, M.; Mnih, V.; Czarnecki, W. M.; Schaul, T.; Leibo, J. Z.; Silver, D.; and Kavukcuoglu, K. 2017. Reinforcement learning with unsupervised auxiliary tasks. In Proceedings of the International Conference on Learning Representations.
 [Kingma and Ba2015] Kingma, D. P., and Ba, J. 2015. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations.

 [Leibfried, Kushman, and Hofmann2017] Leibfried, F.; Kushman, N.; and Hofmann, K. 2017. A deep learning approach for joint video frame and reward prediction in Atari games. ICML Workshop.
 [Levine and Abbeel2014] Levine, S., and Abbeel, P. 2014. Learning neural network policies with guided policy search under unknown dynamics. In Advances in Neural Information Processing Systems.
 [Levine and Koltun2013] Levine, S., and Koltun, V. 2013. Guided policy search. In Proceedings of the International Conference on Machine Learning.
 [Levine et al.2016] Levine, S.; Finn, C.; Darrell, T.; and Abbeel, P. 2016. Endtoend training of deep visuomotor policies. Journal of Machine Learning Research 17:1–40.
 [Lillicrap et al.2017] Lillicrap, T. P.; Hunt, J. J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2017. Continuous control with deep reinforcement learning. In Proceedings of the International Conference on Learning Representations.
 [Mnih et al.2015] Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; Petersen, S.; Beattie, C.; Sadik, A.; Antonoglou, I.; King, H.; Kumaran, D.; Wierstra, D.; Legg, S.; and Hassabis, D. 2015. Humanlevel control through deep reinforcement learning. Nature 518(7540):529–533.
 [Oh et al.2015] Oh, J.; Guo, X.; Lee, H.; Lewis, R.; and Singh, S. 2015. Actionconditional video prediction using deep networks in Atari games. In Advances in Neural Information Processing Systems.

 [Parr et al.2008] Parr, R.; Li, L.; Taylor, G.; Painter-Wakefield, C.; and Littman, M. L. 2008. An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning. In Proceedings of the International Conference on Machine Learning.
 [Pathak et al.2017] Pathak, D.; Agrawal, P.; Efros, A. A.; and Darrell, T. 2017. Curiosity-driven exploration by self-supervised prediction. In Proceedings of the International Conference on Machine Learning.
 [Pong et al.2018] Pong, V.; Gu, S.; Dalal, M.; and Levine, S. 2018. Temporal difference models: Model-free deep RL for model-based control. In Proceedings of the International Conference on Learning Representations.
 [Rakhlin2009] Rakhlin, A. 2009. Lecture notes on online learning. Lecture Notes, University of California, Berkeley.
 [Russell and Norvig2016] Russell, S. J., and Norvig, P. 2016. Artificial Intelligence: A Modern Approach. Pearson Education Limited.
 [Song et al.2016] Song, Z.; Parr, R.; Liao, X.; and Carin, L. 2016. Linear feature encoding for reinforcement learning. In Advances in Neural Information Processing Systems.
 [Stadie, Levine, and Abbeel2015] Stadie, B. C.; Levine, S.; and Abbeel, P. 2015. Incentivizing exploration in reinforcement learning with deep predictive models. arXiv.
 [Sutton and Barto1998] Sutton, R. S., and Barto, A. G. 1998. Reinforcement Learning: An Introduction. MIT Press.
 [Sutton et al.2012] Sutton, R. S.; Szepesvari, C.; Geramifard, A.; and Bowling, M. P. 2012. Dyna-style planning with linear function approximation and prioritized sweeping. arXiv.
 [Sutton1990] Sutton, R. S. 1990. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the International Conference on Machine Learning.
 [van Hasselt, Guez, and Silver2016] van Hasselt, H.; Guez, A.; and Silver, D. 2016. Deep reinforcement learning with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence.
 [Wahlström, Schön, and Deisenroth2015] Wahlström, N.; Schön, T. B.; and Deisenroth, M. P. 2015. From pixels to torques: Policy learning with deep dynamical models. ICML Workshop.
 [Wang, Kosson, and Mu2017] Wang, E.; Kosson, A.; and Mu, T. 2017. Deep action conditional neural network for frame prediction in Atari games. Technical Report, Stanford University.
 [Wang2009] Wang, L. 2009. Model Predictive Control System Design and Implementation using MATLAB. Springer Science and Business Media.
 [Watkins1989] Watkins, C. J. C. H. 1989. Learning from delayed rewards. Ph.D. Dissertation, University of Cambridge.
 [Watter et al.2015] Watter, M.; Springenberg, J. T.; Boedecker, J.; and Riedmiller, M. 2015. Embed to control: A locally linear latent dynamics model for control from raw images. In Advances in Neural Information Processing Systems.
 [Weber et al.2017] Weber, T.; Racaniere, S.; Reichert, D.; Buesing, L.; Guez, A.; Rezende, D. J.; Puigdomenech Badia, A.; Vinyals, O.; Heess, N.; Li, Y.; Pascanu, R.; Battaglia, P.; Silver, D.; and Wierstra, D. 2017. Imagination-augmented agents for deep reinforcement learning. In Advances in Neural Information Processing Systems.
Appendix A Experiments
A.1 Network Architecture
Our proposed network architecture is depicted in Figure 5 and comprises two components. The first component is action-unconditioned and maps the state of the environment to Q-value estimates for each potential action the agent could take. The second component is action-conditioned and uses, in addition to the state of the environment, the action actually taken by the agent in order to predict the state at the next time step as well as the reward and the terminal flag.
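The two-component structure can be sketched as follows. This is a simplified NumPy illustration, not the actual implementation: the weight names (W_enc, W_q, W_a, and so on) and all dimensions are hypothetical, and the convolutional encoder is collapsed into a single linear map.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def two_headed_forward(state_feat, action_onehot, params):
    """Sketch of the two-component architecture: a shared state encoding
    feeds (a) an action-unconditioned Q-value head and (b) an
    action-conditioned prediction head. All weight names are hypothetical."""
    # Shared encoding of the state (stands in for the conv stack).
    h = relu(params["W_enc"] @ state_feat)
    # Head 1: Q-value estimates, one per action (no action input).
    q_values = params["W_q"] @ h
    # Head 2: gate the encoding with a learned embedding of the action
    # (element-wise multiplication, no bias, as described in the text).
    a_embed = params["W_a"] @ action_onehot
    decoding = h * a_embed
    # Predict next-state features, reward class logits, terminal logits.
    next_feat = params["W_next"] @ decoding
    reward_logits = params["W_r"] @ decoding      # ternary reward classes
    terminal_logits = params["W_t"] @ decoding    # binary terminal flag
    return q_values, next_feat, reward_logits, terminal_logits
```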
At a more detailed level, there are five information-processing stages. The first stage is an encoding that maps the state of the environment at the current time step, a three-dimensional tensor comprising the most recent video frames, to an internal compressed feature vector via a sequence of convolutional and ordinary forward connections. The second stage is the Q-value prediction that maps the internal feature vector to Q-value predictions for each possible action the agent could potentially take; the Q-value prediction path consists of two ordinary forward connections. The third stage transforms the hidden encoding into an action-conditioned decoding by integrating the action actually taken by the agent: the action is first transformed into a one-hot encoding, followed by a forward connection and an element-wise vector multiplication with the hidden encoding. Note that the two layers involved in this element-wise vector multiplication are the only two layers in the network without bias neurons. The fourth stage maps the action-conditioned decoding to a prediction for the terminal flag and the instantaneous reward via a sequence of forward connections. Both the terminal flag and the reward are categorical variables: the terminal flag is binary, and the reward is ternary because reward values from ALE are clipped to the range [-1, 1] [Mnih et al.2015]. The last stage maps the action-conditioned decoding to the video frame at the next time step using forward and deconvolutional connections [Dosovitskiy, Springenberg, and Brox2015]. The network uses linear, rectified linear [Glorot, Bordes, and Bengio2011] and softmax activations. The video frames fed into the network are grayscale images downsampled from the full RGB frames provided by ALE. Following standard literature [Mnih et al.2015, Oh et al.2015], a fixed video frame horizon is used and frame skipping is applied. Frame skipping means that, at each time step, the action executed in the environment is repeated four times and frames with repeated actions are skipped. The instantaneous rewards are accumulated over skipped frames.
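The frame-skipping procedure described above can be sketched as follows; the function `env_step` and its return signature are assumptions made for illustration only.

```python
import numpy as np

def step_with_frame_skip(env_step, action, skip=4):
    """Repeat `action` for `skip` frames, accumulating the clipped
    per-frame rewards and stopping early on a terminal event.
    Assumes env_step(action) -> (frame, reward, terminal)."""
    total_reward = 0.0
    frame, terminal = None, False
    for _ in range(skip):
        frame, reward, terminal = env_step(action)
        # Clip per-frame rewards to [-1, 1] as in [Mnih et al. 2015].
        total_reward += float(np.clip(reward, -1.0, 1.0))
        if terminal:
            break
    return frame, total_reward, terminal
```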
A.2 Training Details
The agent follows an ε-greedy policy, with ε linearly annealed from its initial to its final value over the first one million steps. Agent parameters are represented by a deep neural network with weights randomly initialized according to [Glorot and Bengio2010, Oh et al.2015]. The network is trained for a fixed number of time steps. The target network is updated at a fixed interval [Hessel et al.2018]. Training and annealing start after an initial warm-up phase. When there is a terminal event in the environment, when the agent loses a life, or when an episode exceeds a maximum length, the environment is randomly restarted [Hessel et al.2018]. Random restarting means sampling a random number of NOOP actions at the beginning of the episode. Environment interactions are stored in a replay memory holding the most recent time steps up to a fixed capacity.
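The annealed ε-greedy behaviour policy can be sketched as below. The start value, end value, and annealing horizon shown are illustrative defaults from the standard DQN setup, not the paper's exact values (which are elided in this version).

```python
import numpy as np

def epsilon_at(step, eps_start=1.0, eps_end=0.1, anneal_steps=1_000_000):
    """Linearly annealed exploration rate. eps_start and eps_end are
    illustrative defaults; the paper's exact values are not reproduced here."""
    frac = min(max(step / anneal_steps, 0.0), 1.0)
    return eps_start + frac * (eps_end - eps_start)

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a uniformly random action,
    otherwise the greedy (argmax) one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))
```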
Network parameters are updated every fourth environment step by sampling a minibatch of fixed-length trajectories from the replay memory (note that in the main paper, we omit the temporal dimension to preserve a clearer view). In practical terms, trajectory samples can enable a more robust objective for learning Q-values by creating multiple different temporally extended target values for one prediction [Sutton and Barto1998, Hessel et al.2018]. Minibatch samples are used to optimize the objective by taking a single gradient step with Adam [Kingma and Ba2015], using a gradient momentum of 0.95 and a squared gradient momentum of 0.999. Gradients are clipped when exceeding a fixed norm threshold [Leibfried, Kushman, and Hofmann2017]. In practice, both the Q-value prediction loss and the losses for the terminal-flag and reward predictions are clipped for large errors [Mnih et al.2015] and for small probability values [Leibfried, Kushman, and Hofmann2017]. Because non-zero rewards and terminal events occur less frequently than zero rewards and non-terminal events, they are weighted inversely proportionally to their relative occurrence.
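The inverse-frequency weighting of rare events mentioned in the last sentence can be sketched as follows (a minimal NumPy illustration, not the paper's implementation):

```python
import numpy as np

def inverse_frequency_weights(labels, n_classes):
    """Weight each sample inversely proportionally to the relative
    frequency of its class in the minibatch, so that rare events
    (non-zero rewards, terminal flags) are not drowned out."""
    labels = np.asarray(labels)
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    freqs = counts / len(labels)
    # Classes absent from the batch contribute no samples, so their
    # (unused) weight is set to zero to avoid division by zero.
    weights = np.where(counts > 0, 1.0 / np.maximum(freqs, 1e-12), 0.0)
    return weights[labels]
```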
Appendix B Theory
In the remainder of the appendix, we detail the mathematical notions needed for our main theorem and lemmas. We begin by providing a mathematical description of the loss function and then derive its gradients.
B.1 Loss Function Formalization
We use the following standard notation: bold capital letters (e.g. A) denote matrices, bold lower-case letters (e.g. a) denote vectors, and plain letters (e.g. a) denote scalars. Moreover, the i-th component of a vector a and the (i, j)-th entry of a matrix A are denoted a_i and A_ij, respectively. Also note that in the main paper, the loss function carries a superscript indicating rounds of game play, which is required for the regret analysis. In the remainder of this appendix, we omit this superscript for the sake of clarity. Let us now compute the gradient of the following function:
(9)  
where
(10)  
and where the individual quantities are defined below. Note that, compared to the main paper, we introduce a superscript explicitly indicating minibatch samples obtained from the replay memory for training. Using convolution, we construct one matrix of filter responses per convolutional filter applied to the input. We collect these matrices in one larger matrix, apply the activation function element-wise, and then unroll the result into one long feature vector. Using two (unknown) linear transformation matrices, we are ready to define the first function:
(11) 
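Since the symbols in this passage are elided, the following is a generic NumPy sketch of the described construction: filter responses collected into a matrix, an element-wise activation, and unrolling into one long vector. A valid convolution with stride 1 is assumed; all names are illustrative.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv_encode_unroll(image, filters):
    """Valid 2D convolution (stride 1) for a bank of filters, followed by
    an element-wise activation and unrolling of the resulting feature
    maps into one long vector, as sketched in the text."""
    H, W = image.shape
    n_f, kH, kW = filters.shape
    out_h, out_w = H - kH + 1, W - kW + 1
    # One row per filter: the flattened response map of that filter.
    M = np.empty((n_f, out_h * out_w))
    for f in range(n_f):
        for i in range(out_h):
            for j in range(out_w):
                patch = image[i:i + kH, j:j + kW]
                M[f, i * out_w + j] = np.sum(patch * filters[f])
    # Element-wise activation, then unroll the matrix to a vector.
    return relu(M).reshape(-1)
```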
Given an input vector and two (unknown) linear transformation matrices, we can define the second loss function:
(12) 
Similarly, using (unknown) transformation matrices, we can define the third loss function:
(13) 
Finally, we describe the last loss function, using (unknown) linear transformation and weight matrices:
(14) 
and where the involved matrices are formed by an appropriate conversion. It is worth mentioning that the size of the matrix is specified by the application. The deconvolution operation is:
(15) 
where the matrix is defined as follows:
(16) 
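The deconvolution equations above are elided in this version. As a generic illustration, a (one-dimensional, stride-1) deconvolution can be written as multiplication with the transpose of the corresponding convolution matrix; all names here are hypothetical.

```python
import numpy as np

def conv_matrix_1d(kernel, n_in):
    """Matrix C such that C @ x equals the valid 1D correlation of a
    length-n_in signal x with `kernel`."""
    k = len(kernel)
    n_out = n_in - k + 1
    C = np.zeros((n_out, n_in))
    for i in range(n_out):
        C[i, i:i + k] = kernel
    return C

def deconv_1d(kernel, y, n_in):
    """Transposed convolution (deconvolution): maps a short signal y
    back to length n_in via the transpose of the convolution matrix."""
    return conv_matrix_1d(kernel, n_in).T @ y
```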
Note that the full gradient of the loss function is:
(17)  
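Equation (17) is elided here, but since the overall loss is a sum of the component losses, its gradient is the sum of the component gradients. This additivity can be checked numerically with a finite-difference sketch (generic; not tied to the paper's specific losses):

```python
import numpy as np

def numerical_grad(f, w, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at w."""
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (f(w + e) - f(w - e)) / (2 * eps)
    return g
```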