Model-Based Stabilisation of Deep Reinforcement Learning

09/06/2018 · Felix Leibfried et al.

Though successful in high-dimensional domains, deep reinforcement learning exhibits high sample complexity and suffers from stability issues as reported by researchers and practitioners in the field. These problems hinder the application of such algorithms in real-world and safety-critical scenarios. In this paper, we take steps towards stable and efficient reinforcement learning by following a model-based approach that is known to reduce agent-environment interactions. Namely, our method augments deep Q-networks (DQNs) with model predictions for transitions, rewards, and termination flags. Having the model at hand, we then conduct a rigorous theoretical study of our algorithm and show, for the first time, convergence to a stationary point. En route, we provide a counter-example showing that 'vanilla' DQNs can diverge, confirming practitioners' and researchers' experiences. Our proof is novel in its own right and can be extended to other forms of deep reinforcement learning. In particular, we believe exploiting the relation between reinforcement learning (with deep function approximators) and online learning can serve as a recipe for future proofs in the domain. Finally, we validate our theoretical results in 20 games from the Atari benchmark. Our results show that following the proposed model-based learning approach not only ensures convergence but leads to a reduction in sample complexity and superior performance.




Model-free deep reinforcement learning methods have recently demonstrated impressive performance on a range of complex learning tasks [Hessel et al.2018, Lillicrap et al.2017, Jaderberg et al.2017]. Deep Q-Networks (DQNs) [Mnih et al.2015], in particular, stand out as a versatile and effective tool for a wide range of applications. DQNs offer an end-to-end solution for many reinforcement learning problems by generalizing tabular Q-learning to high-dimensional input spaces. Unfortunately, DQNs still suffer from a number of important drawbacks that hinder wide-spread adoption. There is a general lack of theoretical understanding to guide the use of deep Q-learning algorithms. Combining non-linear function approximation with off-policy reinforcement learning is known to lead to possible divergence issues, even in very simple tasks [Boyan and Moore1995]. Despite improvements introduced by deep Q-networks [Mnih et al.2015], Q-learning approaches based on deep neural networks still exhibit occasional instability during the learning process, see for example [van Hasselt, Guez, and Silver2016].

In this paper, we aim to address the instability of current deep RL approaches by augmenting model-free value learning with a predictive model. We propose a hybrid model and value learning objective, which stabilises the learning process and reduces sample complexity when compared to model-free algorithms. We attribute this reduction to the richer feedback signal provided by our method compared to standard deep Q-networks. In particular, our model provides feedback at every time step based on the current prediction error, which in turn eliminates one source of sample inefficiency caused by sparse rewards. These conclusions are also in accordance with previous research conducted on linear function approximation in reinforcement learning [Parr et al.2008, Sutton et al.2012, Song et al.2016].

While linear function approximation results provide a motivation for model-based stabilisation, such theories fail to generalise to deep non-linear architectures [Song et al.2016]. To close this gap in the literature and demonstrate stability of our method, we prove convergence in the general deep RL setting with deep value function approximators. Theoretically analysing deep RL algorithms, however, is challenging due to nonlinear functional dependencies and objective non-convexities prohibiting convergence to globally optimal policies. As such, the best one would hope for is a stationary point (e.g., a first-order one) that guarantees vanishing gradients. Even understanding gradient behavior for deep networks in RL can be difficult due to exploration-based policies and replay memory considerations.

To alleviate some of these problems, we map deep reinforcement learning to a mathematical framework explicitly designed for understanding the exploration/exploitation trade-off. Namely, we formalise the problem as an online learning game the agent plays against an adversary (i.e., the environment). Such a link allows us to study a more general problem combining notions of regret, reinforcement learning, and optimisation. Given such a link, we prove that the agent's regret grows sub-linearly in the total number of rounds T, so that the average regret vanishes as T grows. This, in turn, guarantees convergence to a stationary point.

Guided by the above theoretical results, we validate that our algorithm leads to faster learning by conducting experiments on 20 Atari games. Due to the high computational and budgeting demands, we chose a benchmark that is closest to our setting. Concretely, we picked DQNs as our template algorithm to improve on both theoretically and empirically. It is worth noting that extending our theoretical results to more complex RL algorithms is an orthogonal future direction.

Background: Reinforcement Learning

We consider a discrete-time, infinite-horizon, discounted Markov Decision Process (MDP) (S, A, r, p, γ). Here, S denotes the state space, A the set of possible actions, r the reward function, p the state transition function, and γ ∈ [0, 1) a discount factor. An agent, being in state s_t at time step t, executes an action a_t sampled from its policy π(a|s), a conditional probability distribution over actions given states. The agent's action elicits from the environment a reward signal r_t indicating instantaneous reward, a terminal flag f_t indicating a terminal event that restarts the environment, and a transition to a successor state s_{t+1}. We assume that the sets of states, actions, and rewards are discrete. The reward is sampled from the conditional probability distribution p(r_t | s_t, a_t). Similarly, f_t is sampled from p(f_t | s_t, a_t), where a terminal event (f_t = 1) restarts the environment according to some initial-state distribution p(s_0). The state transition to a successor state is determined by a stochastic state transition function according to s_{t+1} ~ p(s_{t+1} | s_t, a_t).

The agent's goal is to maximise the expected future cumulative discounted reward E[Σ_{t=0}^∞ γ^t r_t] with respect to the policy π. An important quantity in RL are the Q-values Q^π(s, a), which are defined as the expected future cumulative reward when executing action a in state s and subsequently following policy π. Q-values enable us to conveniently phrase the RL problem as maximising E_{s ~ ρ^π, a ~ π}[Q^π(s, a)], where ρ^π is the stationary state distribution obtained when executing π in the environment (starting from an initial state s_0 ~ p(s_0)).

Deep Q-Networks:

Value-based reinforcement learning approaches identify optimal Q-values directly using parametric function approximators Q_θ(s, a), where θ represents the parameters [Watkins1989, Busoniu et al.2010]. Optimal Q-value estimates then correspond to an optimal greedy policy π(s) = argmax_a Q_θ(s, a). Deep Q-networks [Mnih et al.2015] learn a deep-neural-network-based Q-value approximation by performing stochastic gradient descent on the following training objective:

L_Q(θ) = E_{(s, a, r, f, s') ~ M} [ ( r + γ (1 − f) max_{a'} Q_{θ⁻}(s', a') − Q_θ(s, a) )² ]    (1)

The expectation ranges over transitions sampled from an experience replay memory M (s' denotes the state at the next time step). Use of this replay memory, together with the use of a separate target network (with different parameters θ⁻) for calculating the bootstrap values r + γ (1 − f) max_{a'} Q_{θ⁻}(s', a'), helps stabilise the learning process.
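The bootstrap target in Equation (1) can be sketched in a few lines of Python. The function names, the toy Q-function, and the constants below are purely illustrative assumptions, not the paper's implementation:

```python
GAMMA = 0.99       # discount factor (illustrative value)
NUM_ACTIONS = 2    # size of the action set (illustrative)

def dqn_targets(batch, q_target):
    """Bootstrap targets r + gamma * max_a' Q_target(s', a') for a batch of
    (s, a, r, f, s') transitions; terminal transitions (f = 1) use the
    reward alone, since the episode restarts."""
    targets = []
    for s, a, r, f, s_next in batch:
        if f:
            targets.append(r)
        else:
            targets.append(r + GAMMA * max(q_target(s_next, ap)
                                           for ap in range(NUM_ACTIONS)))
    return targets

# toy target network Q(s, a) = s + a, and two sampled transitions
q_tgt = lambda s, a: float(s + a)
batch = [(0.0, 0, 1.0, 0, 1.0),   # non-terminal: bootstraps from s' = 1.0
         (0.5, 1, 2.0, 1, 0.0)]   # terminal: target equals the reward
print(dqn_targets(batch, q_tgt))  # [2.98, 2.0]
```

Note how the terminal flag zeroes out the bootstrap term, mirroring the (1 − f) factor in the objective.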

Background: Online Learning

In this paper, we employ a special form of regret minimisation games that we briefly review here. A regret minimisation game is a triple (Θ, F, T), where Θ is a non-empty decision set, F is the set of moves of the adversary, which contains bounded convex functions from Θ to the reals, and T is the total number of rounds. The game commences in rounds, where at round t the agent chooses a prediction θ_t ∈ Θ and the adversary a loss function f_t ∈ F. At the end of the round, the adversary reveals its choice and the agent suffers a loss f_t(θ_t). In this paper, we are concerned with the full-information case where the agent can observe the complete loss function at the end of each round. The goal of the game is for the agent to make successive predictions that minimise cumulative regret, defined as:

Reg_T = Σ_{t=1}^T f_t(θ_t) − min_{θ ∈ Θ} Σ_{t=1}^T f_t(θ)    (2)
To use such a framework in our analysis, we require a map from RL to online learning, achieved via a generalisation of the loss function in Equation (1) as described later.
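As a concrete toy instance of the regret-minimisation game above (decision set, loss functions, and numbers chosen purely for illustration), cumulative regret can be computed directly from its definition:

```python
def cumulative_regret(losses, decisions, decision_set):
    """Reg_T = sum_t f_t(theta_t) - min_theta sum_t f_t(theta), for a
    finite decision set and a list of revealed loss functions."""
    incurred = sum(f(th) for f, th in zip(losses, decisions))
    best_fixed = min(sum(f(th) for f in losses) for th in decision_set)
    return incurred - best_fixed

# two rounds over Theta = {0, 1}; the adversary reveals f_t after each round
losses = [lambda th: (th - 1) ** 2, lambda th: th ** 2]
decisions = [0, 1]                 # the agent's online predictions
print(cumulative_regret(losses, decisions, [0, 1]))  # 2 - 1 = 1
```

The second term is the loss of the best fixed decision in hindsight, which is exactly the full-information comparator discussed in the regret analysis.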

Problem Definition: Model-Based Stabilisation

Our goal is to stabilise deep RL with a predictive model component. We therefore extend model-free deep Q-networks [Mnih et al.2015] to predict the future environment in addition to Q-values. (For the sake of brevity, we present our approach using the 'vanilla' DQN framework, as originally proposed by [Mnih et al.2015]. The current state of the art for value-based deep RL is the Rainbow framework [Hessel et al.2018], which combines several independent DQN improvements. However, because these extensions do not use an environment model, the approach presented in this paper is orthogonal to the individual Rainbow components and could be combined with Rainbow.) In this section, we describe our model-based DQN extension. In the next section, we prove that this indeed leads to convergence.

To enable model-based stabilisation, we add three new action-conditioned output heads to the standard DQN architecture. These heads represent reward, termination, and next observation predictors. The new outputs share the feature learning pipeline with the main Q-network (excluding the final two fully-connected layers). For the reward and termination prediction, we add additional fully-connected layers, whereas the next-observation prediction uses a deconvolutional structure (see appendix for the exact architecture).

The optimization objective for training the network comprises four individual loss functions that are jointly minimized with respect to the network weights θ: one for Q-value prediction and three additional loss functions. The first additional loss function is for predicting the terminal flag f, the second for predicting the instantaneous reward r, and the third for predicting the video frame at the next time step s'. All these parts are additively connected, leading to a compound loss

L(θ) = L_Q(θ) + λ_f L_f(θ) + λ_r L_r(θ) + λ_s L_s(θ)    (3)

where λ_f, λ_r, and λ_s are positive coefficients to balance the individual parts. The compound loss is an off-policy objective that can be trained with environment interactions obtained from any policy. The individual parts of Equation (3) can then be expressed as expectations over transitions sampled from a replay memory.
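The additive structure of the compound loss is simple enough to state directly. The argument and coefficient names below are illustrative stand-ins for the balancing coefficients of Equation (3):

```python
def compound_loss(l_q, l_flag, l_reward, l_frame,
                  c_flag=1.0, c_reward=1.0, c_frame=1.0):
    """Compound objective: Q-value loss plus positively weighted losses for
    terminal-flag, reward, and next-frame prediction (Equation (3))."""
    return l_q + c_flag * l_flag + c_reward * l_reward + c_frame * l_frame

# with zero model coefficients the objective reduces to the plain DQN loss
print(compound_loss(1.5, 0.7, 0.2, 3.0, 0.0, 0.0, 0.0))  # 1.5
print(compound_loss(1.5, 0.7, 0.2, 3.0, 1.0, 1.0, 0.1))
```

Setting all model coefficients to zero recovers model-free DQN, which is why the approach strictly generalises the baseline.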

The Q-value prediction loss L_Q(θ) can be written as in Equation (1), whereas the terminal flag prediction loss is

L_f(θ) = E_{(s, a, f) ~ M} [ − log p_θ(f | s, a) ]    (4)

where p_θ(f | s, a) refers to the terminal flag predictor given a specific state and action. Note that the terminal flag f is binary and p_θ(f | s, a) is a parametric categorical probability distribution.

Similarly, the loss for instantaneous reward prediction is

L_r(θ) = E_{(s, a, r) ~ M} [ − log p_θ(r | s, a) ]    (5)

where p_θ(r | s, a) is a state-action-conditioned parametric categorical distribution over the reward signal.

The loss for predicting the video frame at the next time step can be formulated as

L_s(θ) = E_{(s, a, s') ~ M} [ ‖ ŝ_θ(s, a) − s' ‖_F² ]    (6)

where ŝ_θ(s, a) refers to a deterministic parametric map to predict the next state observation s', given the current state s and the action a (i.e., predict the next video frame given the last four video frames and action in Atari). ‖·‖_F² refers to the squared Frobenius norm between two images.
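The squared Frobenius norm in the frame-prediction loss is just the element-wise sum of squared pixel differences; a minimal sketch over nested-list "images" (the data is made up for illustration):

```python
def frame_loss(pred, target):
    """Squared Frobenius norm || pred - target ||_F^2 between two
    equally-sized images given as nested lists of pixel values."""
    return sum((p - t) ** 2
               for prow, trow in zip(pred, target)
               for p, t in zip(prow, trow))

pred   = [[1.0, 2.0], [3.0, 4.0]]
target = [[1.0, 0.0], [0.0, 4.0]]
print(frame_loss(pred, target))  # (2-0)^2 + (3-0)^2 = 13.0
```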

Our model-based algorithm commences in rounds, see Algorithm 1. At each time step, the agent observes the current environmental state and takes an epsilon-greedy action with respect to the current Q-values. This leads the environment to elicit a reward signal, a terminal flag, and transition to a successor state. The experience replay is updated accordingly. Every fourth interaction with the environment, the agent samples experiences from the experience replay to perform a gradient update step on the loss from Equation (3). The target network is periodically updated.

initialize replay memory M and network weights θ
initialize target network weights θ⁻ ← θ and time step t ← 0
start environment, yielding an initial state s
while t < T do
      observe s and choose eps-greedy action a from Q_θ(s, ·)
      execute a and observe r, f and s'
      add (s, a, r, f, s')-tuple to M
      if t mod 4 = 0 then
            sample a minibatch from M
            perform gradient update step on L(θ)
            end if
           set θ⁻ ← θ  if the target update period has elapsed
           set s ← s'  if f = 0   else  restart environment
           t ← t + 1
           end while
Algorithm 1 Pseudocode for our algorithm. Details can be found in the appendix.
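Algorithm 1 can be sketched as a Python training loop. The environment/agent interface below (method names, update periods, batch size) is an assumption made for illustration and does not reflect the paper's actual implementation:

```python
import random

def train(env, agent, total_steps, update_every=4, target_every=1000,
          batch_size=32):
    """Sketch of Algorithm 1: interact, store transitions, periodically
    take gradient steps on the compound loss and sync the target network."""
    memory = []                                  # replay memory M
    s = env.reset()                              # initial state
    for t in range(total_steps):
        a = agent.eps_greedy(s)                  # eps-greedy w.r.t. Q_theta
        r, f, s_next = env.step(a)               # reward, terminal flag, s'
        memory.append((s, a, r, f, s_next))
        if t % update_every == 0:
            batch = random.sample(memory, min(batch_size, len(memory)))
            agent.gradient_step(batch)           # minimise compound loss L
        if t % target_every == 0:
            agent.sync_target()                  # theta_minus <- theta
        s = env.reset() if f else s_next         # terminal event restarts
    return memory

# tiny stubs exercising the control flow
class StubEnv:
    def reset(self): return 0
    def step(self, a): return 1.0, 0, 0

class StubAgent:
    updates = syncs = 0
    def eps_greedy(self, s): return 0
    def gradient_step(self, batch): StubAgent.updates += 1
    def sync_target(self): StubAgent.syncs += 1

mem = train(StubEnv(), StubAgent(), total_steps=8)
print(len(mem), StubAgent.updates, StubAgent.syncs)  # 8 2 1
```

The stubs only verify the control flow: a gradient step every fourth interaction and a periodic target-network sync, matching the schedule described above.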

Theoretical Analysis

Here, we confirm that our model-based approach stabilises the learning process by guaranteeing convergence to a stationary point in parameter space. To derive our convergence guarantees, we map the optimization problem in Equation (3) to an equivalent online learning form.

Before constructing this map, we augment the optimisation objective with a regulariser Ω(θ). The goal of such an addition is to stabilise the learning objective and reduce the over-fitting that can occur when considering deep networks. Most importantly, however, we incorporate the regulariser to avoid degenerate solutions that fail to achieve non-trivial regret guarantees (e.g., Hannan's consistency), see [Rakhlin2009]. We do not assume a specific regulariser; however, we assume a convex form (e.g., L1 or L2 norms).

With this in mind, the learning objective on the t-th iteration can be written as:

θ_t = argmin_{θ ∈ Θ} Σ_{k=1}^{t−1} L_k(θ) + Ω(θ)    (7)

where L_k(θ) denotes the compound loss of Equation (3) evaluated with the replay memory and target network of round k.

In words, Equation (7) states that the goal of the agent is to choose a parameter that minimises the accumulated loss so far, while taking into account regularisation. That being said, it is customary in reinforcement learning not to consider an equally weighted summation over iterations. Practitioners typically consider a decaying moving average, which focuses more on losses recently encountered and less on historical ones. To adhere to such practices, we rewrite our objective to include a round-dependent weighting α_k as:


θ_t = argmin_{θ ∈ Θ} Σ_{k=1}^{t−1} α_k L_k(θ) + Ω(θ)    (8)

Rather than heuristically choosing the weights α_k, we defer to our theoretical guarantees for their optimal setting. Though Equation (8) looks as if it parts ways with the original objective, it can be easily seen as a generalisation. We can recover the original objective by choosing α_k = 1 if k = t − 1 and α_k = 0 otherwise (note that the optimization objective in DQN-style algorithms is round-dependent since both the experience replay and the target network change over time). Hence, if we are able to establish convergence properties for Equation (8), we can easily recover convergence for the special case.

Analysing the convergence of Equation (8) needs to be done with care, as the summation limit is round-dependent and hence needs to be considered in a streaming model. An ideal framework to understand the theoretical properties of such problems is that of online learning as discussed above. To see the connection between Equation (8) and online learning, we commence by defining the per-round loss f_k(θ) = α_k L_k(θ). Hence, one can write:

θ_t = argmin_{θ ∈ Θ} Σ_{k=1}^{t−1} f_k(θ) + Ω(θ)

allowing us to write the solution of Equation (8) recursively:

θ_{t+1} = argmin_{θ ∈ Θ} f_t(θ) + Σ_{k=1}^{t−1} f_k(θ) + Ω(θ)

which is a standard online learning formulation, see [Rakhlin2009, Abbasi-Yadkori et al.2013].
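The round-dependent weighting, and how indicator weights recover the per-round DQN objective while equal weights recover the accumulated objective, can be illustrated with a toy one-dimensional loss (all functions and numbers below are made up for illustration):

```python
def weighted_objective(losses, alphas, reg):
    """Build theta -> sum_k alpha_k * L_k(theta) + Omega(theta) from a list
    of per-round losses L_k, weights alpha_k, and a regulariser Omega."""
    return lambda th: sum(a * L(th) for a, L in zip(alphas, losses)) + reg(th)

losses = [lambda th: (th - 3.0) ** 2,   # round-1 loss
          lambda th: (th - 1.0) ** 2]   # round-2 loss
omega = lambda th: 0.0                  # trivial regulariser for the toy case

# indicator weights (alpha_k = 1 only for the latest round) recover the
# plain round-dependent DQN objective:
latest_only = weighted_objective(losses, [0.0, 1.0], omega)
print(latest_only(1.0))   # 0.0: minimised at the latest round's optimum

# equal weights recover the accumulated objective of Equation (7):
accumulated = weighted_objective(losses, [1.0, 1.0], omega)
print(accumulated(2.0))   # 1.0 + 1.0 = 2.0
```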

As in any standard online learning game, the goal for the agent is to determine a sequence of models θ_1, …, θ_T that minimises accumulated regret Reg_T after T rounds. From Equation (2), one can see that regret analysis compares the partial-information (online) solution to the full-information one, i.e., the solution acquired after observing all data. It is also clear that a desirable property of the algorithm is to exhibit sub-linear regret as the number of rounds increases. Such a property, in turn, guarantees vanishing average regret, i.e., Reg_T / T → 0 as T grows. If achieved, one can then conclude convergence to a stationary point for the algorithm, see [Abbasi-Yadkori et al.2013].

Regret Bound

This section is dedicated to proving vanishing regrets for our optimization algorithm. Our proof consists of three main steps. First, we approximate the original objective using a first-order Taylor expansion. In order not to have a crude approximation, we vary the expansion's operating point to be the previous round's constrained solution. Under this approximation, we proceed to bound the norm of the gradient in the general setting of convolutional/deconvolutional deep neural networks in terms of the regularization coefficients. Given this result, we schedule each of these free parameters to acquire vanishing regrets. We summarize our results in the following theorem:

Theorem 1.

Assume a compact, convex, and closed decision set Θ with bounded parameter norms, and Lipschitz continuous activation functions with Lipschitz constant L (e.g., softmax). With a suitable schedule for the round weights α_k and the regularization coefficients, for any ε > 0, the regret after T rounds grows sub-linearly in T, so that the average regret Reg_T / T vanishes.

The complete proof of the theorem is rather involved and as such is left to the appendix. In what comes next, we elaborate on the main steps needed to arrive at the theorem.

Bounding the gradient

To bound the gradient, we approximate the full objective at round t using a first-order Taylor expansion around the previous round's solution θ_{t−1}. Though increasing the complexity of our analysis, we believe a time-dependent operating point based on previous-round solutions is crucial to keep track of round-dependent changes, thus avoiding a crude approximation to the original objective. For any θ ∈ Θ, the first-order expansion of the loss is:

L_t(θ) ≈ L_t(θ_{t−1}) + ∇L_t(θ_{t−1})ᵀ (θ − θ_{t−1})

As such, |L_t(θ)| can be bounded by:

|L_t(θ)| ≤ |L_t(θ_{t−1})| + ‖∇L_t(θ_{t−1})‖ · ‖θ − θ_{t−1}‖

Hence, to bound |L_t(θ)|, we need to bound each of |L_t(θ_{t−1})|, ‖∇L_t(θ_{t−1})‖, and ‖θ − θ_{t−1}‖.

Using our assumptions, it is easy to see that ‖θ − θ_{t−1}‖ is bounded because both θ and θ_{t−1} belong to the compact set Θ. For the remaining two norms, we can prove the following:

Lemma 1.

Assume a compact, convex, and closed set Θ with bounded parameter norms, and Lipschitz continuous activation functions with Lipschitz constant L (e.g., softmax). Then the gradient norm ‖∇L_t(θ_{t−1})‖ and the loss magnitude |L_t(θ_{t−1})| are each bounded by constants that depend on the network architecture and the regularization coefficients; their exact values can be found in the appendix due to space constraints.

Figure 1: Predictions in Kung Fu Master. The top row shows the ground truth trajectories over eleven time steps (44 frames) and the bottom row shows the action-conditioned predictions of our approach. Predictions remain accurate when unrolling the network over time. In this example, the reward signal is perfectly predicted. In time steps 10 and 11, a random opponent enters the scene from the left which is not predictable when merely utilizing information from the start of the prediction at step 0.

Figure 2: Training loss in Kung Fu Master. All individual loss parts from Equation (3) are depicted as a function of training iterations. The loss of the DQN baseline (gray) consists only of the Q-value loss from Equation (1). The compound loss of our approach (black) is dominated by the Q-value loss and only mildly affected by the additional loss parts.

Proof Roadmap

Due to the usage of a deep convolutional/deconvolutional network to approximate Q-values, the proof of the above lemma is involved and as such is left to the supplementary material. In short, the proof consists of three main steps. In the first, we formally derive the output of each of the network layers in terms of input image tensors. Given this formalization, in the second step, we derive the gradient with respect to the unknown parameters in closed form. Incorporating these results with our assumptions, using the triangular inequality and Cauchy-Schwarz, we arrive at the statement of the lemma.

Completing the Regret Proof

To prove the statement of Theorem 1, the remaining step is to incorporate the above bounds into the results of [Abbasi-Yadkori et al.2013]. Namely, we can show that the regret is bounded by a sum of terms controlled by the round weights α_k and the regularization coefficients, with constants that can be found in the appendix. We then choose optimal values for each of these hyperparameters to acquire sub-linear regret behavior.

Among the different choices for these hyperparameters, our results need to relate to common practices in deep reinforcement learning. For instance, deep RL weighs current-round losses higher than historical ones (e.g., with an exponentially decaying window), which corresponds to a particular schedule for the round weights α_k. Under this schedule, and with a suitable setting of the regularization coefficients, each term of the regret bound becomes sub-linear in T. Therefore, the overall regret bound is sub-linear in T. This finalises the statement of the theorem, guaranteeing vanishing average regret, i.e., Reg_T / T → 0.

Figure 3: Individual and median game score. The plots compare the game score of our model-stabilised approach (black) to model-free DQN (gray) as a function of training iterations. The left plots report game scores on individual Atari games. The rightmost plot shows the median game score across 20 Atari games. The median plot reports human-normalized scores in order to average over games. Our approach is significantly better than DQN on each individual game and on average.

Figure 4: Individual and median sample complexity. Sample complexity is measured in terms of the number of environment interactions at which best DQN performance is attained first. Five individual Atari games are depicted on the left comparing DQN (gray) to our approach (black). The rightmost plot shows the median sample complexity over all 20 games. Our approach is more sample-efficient on each individual game depicted and on average.

Remark: The above proof verifies that our method exhibits vanishing average regret and therefore converges to a stationary point. We note that this result does not hold for the unregularized case of basic DQN learning: a counter-example can easily be constructed where the regret of DQN after T rounds becomes at least linear (see appendix).
A second observation is that the regularized objective in Equation (3) can be considered a constrained optimization objective. Intuitively, this approach shrinks the feasibility set, thereby changing the set of local optima for DQNs.

In the next section, we empirically verify that our algorithm exhibits lower sample complexity. Due to high computational and budgeting demands, we chose the closest algorithm to our setting to conduct experiments against. This is desirable as our results extend deep Q-networks and as such, serve as a realistic benchmark. Please note that extending our method to other forms of deep RL is an orthogonal direction to this work that we leave for future investigation.

Game DQN Our approach
Amidar 22.3% 19.5%
Assault 34.1% 33.3%
Asterix 37.4% 46.4%
Battle Zone 20.2% 83.7%
Berzerk 11.6% 12.3%
Chopper Command 54.9% 86.2%
Crazy Climber 234.3% 378.7%
Kangaroo 242.6% 224.5%
Krull 713.2% 663.9%
Kung Fu Master 91.1% 115.8%
Ms Pacman 28.2% 32.4%
Qbert 72.8% 93.8%
Road Runner 582.2% 615.5%
Robotank 416.8% 122.6%
Seaquest 11.3% 13.6%
Space Invaders 50.0% 50.5%
Star Gunner 511.8% 185.7%
Time Pilot -13.7% 5.8%
Tutankham 161.1% 167.8%
Up’n Down 66.5% 72.7%
Median 60.7% 85.0%
Table 1: Normalized episodic rewards in 20 Atari games.


Experiments

We empirically validate our approach in the Atari domain [Bellemare et al.2013]. We compare against ordinary RL without model-based stabilisation (DQN, [Mnih et al.2015]). Our aim is to verify that our approach exhibits stable learning and improves learning results. We show that our method leads to significantly better game-play performance and sample efficiency across 20 Atari games.

As a proof of concept, in Figure 1 we visualize model predictions and compare them to ground-truth frames over a time horizon of eleven steps (44 frames), starting from an initial state reached after executing the policy for some number of steps. In the example depicted, the reward signal is perfectly predicted, and model-predicted video frames accurately match ground-truth frames. At time steps 10 and 11, a random object enters the scene from the left, which our model cannot predict because there is no information available at time step 0 to foresee this event, in accordance with [Oh et al.2015].

We next analyze the different components of the loss proposed in Equation (3). As an illustrative example, we report the losses during training on the game Kung Fu Master (see Figure 2). Clearly, the compound loss is dominated by the Q-value loss and only mildly affected by all other loss parts.

In order to quantify game-play performance, we store agent networks at regular intervals during learning and conduct an offline evaluation by averaging over multiple episodes, each of which comprises a bounded number of steps (but terminates earlier in case of a terminal event). In evaluation mode, agents follow an epsilon-greedy strategy with a small exploration rate [Mnih et al.2015]. The results of this evaluation are depicted in Figure 3 for five individual Atari games and for the median score over all 20 Atari games (smoothed with an exponential window). The median is taken by normalizing raw scores with respect to human and random scores according to [Mnih et al.2015].
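The human-normalised median used for the rightmost plot of Figure 3 can be sketched as follows; the raw, random, and human scores below are made-up numbers:

```python
def human_normalized(score, random_score, human_score):
    """Normalise a raw game score so that 0% is random play and 100% is
    human-level play, following the convention of Mnih et al. (2015)."""
    return 100.0 * (score - random_score) / (human_score - random_score)

def median(values):
    """Median of a list (no external dependencies)."""
    v = sorted(values)
    n = len(v)
    return v[n // 2] if n % 2 else 0.5 * (v[n // 2 - 1] + v[n // 2])

# (raw, random, human) score triples for three hypothetical games
games = [(50.0, 0.0, 100.0), (30.0, 10.0, 20.0), (5.0, 0.0, 50.0)]
scores = [human_normalized(s, r, h) for s, r, h in games]
print(scores)          # [50.0, 200.0, 10.0]
print(median(scores))  # 50.0
```

Normalising before taking the median makes scores comparable across games with very different raw reward scales.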

Our model-based approach significantly outperforms the model-free baseline on each individual game depicted and on average over all 20 games in the course of training. Additionally, we report normalized game scores obtained by the best-performing agent throughout the learning process (see Table 1). Notably, our best-performing agent outperforms the DQN baseline on 14 out of 20 individual games.

To demonstrate sample efficiency, we identify the number of time steps at which maximum DQN performance is attained first. To do so, we smooth the episodic reward obtained via offline evaluation with an exponential window. Figure 4 shows that our approach improves sample complexity over the model-free baseline on each of the five games depicted and on average over all 20 games.
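The sample-complexity measure of Figure 4 amounts to smoothing the evaluation curve and finding its first crossing of a performance threshold; a minimal sketch (smoothing factor and numbers are illustrative):

```python
def ewma(values, alpha=0.5):
    """Exponentially weighted smoothing of an evaluation curve."""
    smoothed, m = [], values[0]
    for v in values:
        m = alpha * v + (1.0 - alpha) * m
        smoothed.append(m)
    return smoothed

def steps_to_reach(curve, threshold, steps_per_point):
    """Environment interactions at which the (smoothed) curve first
    attains `threshold`; None if it never does."""
    for i, v in enumerate(curve):
        if v >= threshold:
            return (i + 1) * steps_per_point
    return None

rewards = [0.0, 4.0, 6.0, 8.0, 8.0]          # offline evaluation curve
smoothed = ewma(rewards)                     # [0.0, 2.0, 4.0, 6.0, 7.0]
print(steps_to_reach(smoothed, 6.0, 100))    # 400
```

Here the threshold would be the best performance attained by the DQN baseline, so a smaller returned step count means better sample efficiency.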

Related Work

While model-free RL aims to learn a policy directly from transition samples, model-based RL attempts to be more efficient by learning about the environment in addition to the actual RL task. In general, there are four types of model-based RL.

(1) Planning approaches use the environment model to solve a virtual RL problem and act accordingly in the real environment [Wang2009, Browne et al.2012, Russell and Norvig2016]. (2) DYNA-style learning [Sutton1990, Sutton et al.2012] augments the dataset with virtual samples for RL training. These virtual samples are generated from a learned environment model and are combined with samples from the actual environment to produce the final learning update. (3) Explicit exploration approaches use models to direct exploration. They encourage the agent to take actions that most likely lead to novel environment states [Stadie, Levine, and Abbeel2015, Oh et al.2015, Pathak et al.2017, Jaderberg et al.2017]. (4) Feature-learning approaches train a predictive model to extract features for value function approximation. Recent research has identified a relation between model learning and value function approximation [Parr et al.2008, Song et al.2016]; this work provides a theoretical basis for feature learning that connects the model prediction to the Bellman error for linear value function approximation.

Our work fits conceptually into the latter category. The results from [Song et al.2016] do not, however, readily carry over to non-linear function approximators. Our work fills this gap by theoretically proving that stable deep RL can be obtained by joint value-function and model learning. Practically, we demonstrate that this also speeds up the training procedure in terms of better sample complexity in Atari.

The most challenging contemporary environments for model-based RL are robotics and vision-based domains. In robotics, there is a range of model-based RL approaches that have been successfully deployed [Deisenroth and Rasmussen2011, Levine and Koltun2013, Levine and Abbeel2014, Heess et al.2015, Gu et al.2016, Pong et al.2018], even for visual state spaces [Wahlström, Schön, and Deisenroth2015, Watter et al.2015, Finn et al.2016, Levine et al.2016]. However, in other vision-based domains (e.g. Atari), model-based learning has been less successful, despite plenty of model-learning approaches that demonstrably learn accurate environment models [Oh et al.2015, Fu and Hsu2016, Chiappa et al.2017, Leibfried, Kushman, and Hofmann2017, Wang, Kosson, and Mu2017, Weber et al.2017, Buesing et al.2018]. One exception is the preliminary work of [Alaniz2017] obtaining impressive results in Minecraft with Monte Carlo tree search.


Conclusion

In this work, we addressed the problem of unstable learning in deep RL. We introduced a new optimization objective and network architecture for deep value-based reinforcement learning by extending conventional deep Q-networks with a model-learning component. In our theoretical analysis, we formally showed that our proposed approach converges to a stationary point in parameter space, whereas 'vanilla' DQNs can diverge. Empirically, our approach yields significant improvements on 20 Atari games in both sample complexity and overall performance when compared to model-free RL without model-based stabilisation.


  • [Abbasi-Yadkori et al.2013] Abbasi-Yadkori, Y.; Bartlett, P. L.; Kanade, V.; Seldin, Y.; and Szepesvari, C. 2013. Online learning in Markov decision processes with adversarially chosen transition probability distributions. In Advances in Neural Information Processing Systems.
  • [Alaniz2017] Alaniz, S. 2017. Deep reinforcement learning with model learning and Monte Carlo tree search in Minecraft. In Proceedings of the Multidisciplinary Conference on Reinforcement Learning and Decision Making.
  • [Bellemare et al.2013] Bellemare, M. G.; Naddaf, Y.; Veness, J.; and Bowling, M. 2013. The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research.
  • [Boyan and Moore1995] Boyan, J. A., and Moore, A. W. 1995. Generalization in reinforcement learning: Safely approximating the value function. In Advances in Neural Information Processing Systems.
  • [Browne et al.2012] Browne, C.; Powley, E.; Whitehouse, D.; Lucas, S.; Cowling, P. I.; Rohlfshagen, P.; Tavener, S.; Perez, D.; Samothrakis, S.; and Colton, S. 2012. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games 4(1):1–49.
  • [Buesing et al.2018] Buesing, L.; Weber, T.; Racaniere, S.; Ali Eslami, S. M.; Rezende, D.; Reichert, D. P.; Viola, F.; Besse, F.; Gregor, K.; Hassabis, D.; and Wierstra, D. 2018. Learning and querying fast generative models for reinforcement learning. In arXiv.
  • [Busoniu et al.2010] Busoniu, L.; Babuska, R.; De Schutter, B.; and Ernst, D. 2010. Reinforcement Learning and Dynamic Programming using Function Approximators. CRC Press.
  • [Chiappa et al.2017] Chiappa, S.; Racaniere, S.; Wierstra, D.; and Mohamed, S. 2017. Recurrent environment simulators. In Proceedings of the International Conference on Learning Representations.
  • [Deisenroth and Rasmussen2011] Deisenroth, M. P., and Rasmussen, C. E. 2011. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the International Conference on Machine Learning.
  • [Dosovitskiy, Springenberg, and Brox2015] Dosovitskiy, A.; Springenberg, J. T.; and Brox, T. 2015. Learning to generate chairs with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • [Finn et al.2016] Finn, C.; Tan, X. Y.; Duan, Y.; Darrell, T.; Levine, S.; and Abbeel, P. 2016. Deep spatial autoencoders for visuomotor learning. In Proceedings of the IEEE International Conference on Robotics and Automation.
  • [Fu and Hsu2016] Fu, J., and Hsu, I. 2016. Model-based reinforcement learning for playing Atari games. Technical Report, Stanford University.
  • [Glorot and Bengio2010] Glorot, X., and Bengio, Y. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics.
  • [Glorot, Bordes, and Bengio2011] Glorot, X.; Bordes, A.; and Bengio, Y. 2011. Deep sparse rectifier neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics.
  • [Gu et al.2016] Gu, S.; Lillicrap, T.; Sutskever, I.; and Levine, S. 2016. Continuous deep Q-learning with model-based acceleration. In Proceedings of the International Conference on Machine Learning.
  • [Heess et al.2015] Heess, N.; Wayne, G.; Silver, D.; Lillicrap, T.; Tassa, Y.; and Erez, T. 2015. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems.
  • [Hessel et al.2018] Hessel, M.; Modayil, J.; van Hasselt, H.; Schaul, T.; Ostrovski, G.; Dabney, W.; Horgan, D.; Piot, B.; Azar, M.; and Silver, D. 2018. Rainbow: Combining improvements in deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence.
  • [Jaderberg et al.2017] Jaderberg, M.; Mnih, V.; Czarnecki, W. M.; Schaul, T.; Leibo, J. Z.; Silver, D.; and Kavukcuoglu, K. 2017. Reinforcement learning with unsupervised auxiliary tasks. In Proceedings of the International Conference on Learning Representations.
  • [Kingma and Ba2015] Kingma, D. P., and Ba, J. 2015. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations.
  • [Leibfried, Kushman, and Hofmann2017] Leibfried, F.; Kushman, N.; and Hofmann, K. 2017. A deep learning approach for joint video frame and reward prediction in Atari games. ICML Workshop.
  • [Levine and Abbeel2014] Levine, S., and Abbeel, P. 2014. Learning neural network policies with guided policy search under unknown dynamics. In Advances in Neural Information Processing Systems.
  • [Levine and Koltun2013] Levine, S., and Koltun, V. 2013. Guided policy search. In Proceedings of the International Conference on Machine Learning.
  • [Levine et al.2016] Levine, S.; Finn, C.; Darrell, T.; and Abbeel, P. 2016. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research 17:1–40.
  • [Lillicrap et al.2017] Lillicrap, T. P.; Hunt, J. J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2017. Continuous control with deep reinforcement learning. In Proceedings of the International Conference on Learning Representations.
  • [Mnih et al.2015] Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; Petersen, S.; Beattie, C.; Sadik, A.; Antonoglou, I.; King, H.; Kumaran, D.; Wierstra, D.; Legg, S.; and Hassabis, D. 2015. Human-level control through deep reinforcement learning. Nature 518(7540):529–533.
  • [Oh et al.2015] Oh, J.; Guo, X.; Lee, H.; Lewis, R.; and Singh, S. 2015. Action-conditional video prediction using deep networks in Atari games. In Advances in Neural Information Processing Systems.
  • [Parr et al.2008] Parr, R.; Li, L.; Taylor, G.; Painter-Wakefield, C.; and Littman, M. L. 2008. An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning. In Proceedings of the International Conference on Machine Learning.
  • [Pathak et al.2017] Pathak, D.; Agrawal, P.; Efros, A. A.; and Darrell, T. 2017. Curiosity-driven exploration by self-supervised prediction. In Proceedings of the International Conference on Machine Learning.
  • [Pong et al.2018] Pong, V.; Gu, S.; Dalal, M.; and Levine, S. 2018. Temporal difference models: Model-free deep RL for model-based control. In Proceedings of the International Conference on Learning Representations.
  • [Rakhlin2009] Rakhlin, A. 2009. Lecture notes on online learning. Lecture Notes, University of California, Berkeley.
  • [Russell and Norvig2016] Russell, S. J., and Norvig, P. 2016. Artificial Intelligence: A Modern Approach. Pearson Education Limited.
  • [Song et al.2016] Song, Z.; Parr, R.; Liao, X.; and Carin, L. 2016. Linear feature encoding for reinforcement learning. In Advances in Neural Information Processing Systems.
  • [Stadie, Levine, and Abbeel2015] Stadie, B. C.; Levine, S.; and Abbeel, P. 2015. Incentivizing exploration in reinforcement learning with deep predictive models. arXiv.
  • [Sutton and Barto1998] Sutton, R. S., and Barto, A. G. 1998. Reinforcement Learning: An Introduction. MIT Press.
  • [Sutton et al.2012] Sutton, R. S.; Szepesvari, C.; Geramifard, A.; and Bowling, M. P. 2012. Dyna-style planning with linear function approximation and prioritized sweeping. arXiv.
  • [Sutton1990] Sutton, R. S. 1990. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the International Conference on Machine Learning.
  • [van Hasselt, Guez, and Silver2016] van Hasselt, H.; Guez, A.; and Silver, D. 2016. Deep reinforcement learning with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence.
  • [Wahlström, Schön, and Deisenroth2015] Wahlström, N.; Schön, T. B.; and Deisenroth, M. P. 2015. From pixels to torques: Policy learning with deep dynamical models. ICML Workshop.
  • [Wang, Kosson, and Mu2017] Wang, E.; Kosson, A.; and Mu, T. 2017. Deep action conditional neural network for frame prediction in Atari games. Technical Report, Stanford University.
  • [Wang2009] Wang, L. 2009. Model Predictive Control System Design and Implementation using MATLAB. Springer Science and Business Media.
  • [Watkins1989] Watkins, C. J. C. H. 1989. Learning from delayed rewards. Ph.D. Dissertation, University of Cambridge.
  • [Watter et al.2015] Watter, M.; Springenberg, J. T.; Boedecker, J.; and Riedmiller, M. 2015. Embed to control: A locally linear latent dynamics model for control from raw images. In Advances in Neural Information Processing Systems.
  • [Weber et al.2017] Weber, T.; Racaniere, S.; Reichert, D.; Buesing, L.; Guez, A.; Rezende, D. J.; Puigdomenech Badia, A.; Vinyals, O.; Heess, N.; Li, Y.; Pascanu, R.; Battaglia, P.; Silver, D.; and Wierstra, D. 2017. Imagination-augmented agents for deep reinforcement learning. In Advances in Neural Information Processing Systems.

Appendix A Experiments

A.1 Network Architecture

Our proposed network architecture is depicted in Figure 5 and comprises two components. The first component is action-unconditioned and maps the state of the environment to Q-value estimates for each potential action the agent could take. The second component is action-conditioned and uses, in addition to the state of the environment, the action actually taken by the agent in order to make predictions about the state at the next time step as well as the reward and the terminal flag.

On a more detailed level, there are five information-processing stages. The first stage is an encoding that maps the state of the environment at the current time step, a three-dimensional tensor comprising the last four video frames, to an internal compressed feature vector via a sequence of convolutional and ordinary forward connections. The second stage is the Q-value prediction that maps the internal feature vector to Q-value predictions for each possible action that the agent could potentially take; the Q-value prediction path consists of two ordinary forward connections. The third stage transforms the hidden encoding into an action-conditioned decoding by integrating the action actually taken by the agent: the action is first transformed into a one-hot encoding, followed by a forward connection and an element-wise vector multiplication with the feature encoding. Note that the two layers involved in this element-wise vector multiplication are the only two layers in the network without bias neurons. The fourth stage maps the action-conditioned decoding to a prediction for the terminal flag and the instantaneous reward via a sequence of forward connections. Both the terminal flag and the reward are categorical variables: the terminal flag is binary, and the reward is ternary because reward values from ALE are clipped to lie in the range [-1, 1] [Mnih et al.2015]. The last stage maps the action-conditioned decoding to the video frame at the next time step by using forward and deconvolutional connections [Dosovitskiy, Springenberg, and Brox2015].
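The action-conditioning step of the third stage (one-hot encoding, a bias-free forward connection, then element-wise multiplication with the feature vector) can be sketched in a few lines of numpy. All dimensions and variable names below are hypothetical, chosen for illustration only:

```python
import numpy as np

def action_conditioned_decoding(h, action, W_a, n_actions):
    """Combine a hidden state encoding h with the action taken, as in the
    third information-processing stage: one-hot encode the action, apply a
    forward connection without bias, and multiply element-wise with h."""
    one_hot = np.zeros(n_actions)
    one_hot[action] = 1.0
    action_embedding = one_hot @ W_a  # bias-free forward connection
    return h * action_embedding       # element-wise vector multiplication

# Hypothetical dimensions for illustration.
rng = np.random.default_rng(0)
d, n_actions = 8, 4
h = rng.standard_normal(d)
W_a = rng.standard_normal((n_actions, d))
decoding = action_conditioned_decoding(h, 2, W_a, n_actions)
```

Because the action enters multiplicatively, each action gates the shared feature vector differently, which is why the two layers feeding the multiplication carry no bias neurons.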

The network uses linear, rectified linear [Glorot, Bordes, and Bengio2011], and softmax activations. The video frames fed into the network are grayscale images down-sampled from the full RGB frames provided by ALE. Following standard literature [Mnih et al.2015, Oh et al.2015], the video frame horizon is four and frame skipping is applied. Frame skipping means that each action executed in the environment is repeated four times, and frames with repeated actions are skipped; the instantaneous rewards are accumulated over skipped frames.
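Frame skipping with reward accumulation, as described above, can be sketched as follows; the `step` interface returning `(obs, reward, done)` is an assumption made for illustration, not the paper's actual environment API:

```python
def step_with_frame_skip(env, action, skip=4):
    """Repeat the chosen action for `skip` frames, accumulate the
    instantaneous rewards over the skipped frames, and return only the
    observation from the last frame."""
    total_reward, obs, done = 0.0, None, False
    for _ in range(skip):
        obs, reward, done = env.step(action)
        total_reward += reward
        if done:  # stop repeating once a terminal event occurs
            break
    return obs, total_reward, done
```

The agent thus observes and acts only on every fourth frame, reducing computation without discarding reward information.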

A.2 Training Details

The agent follows an ε-greedy policy with linear ε-annealing over one million steps. Agent parameters are represented by a deep neural network with weights randomly initialized according to [Glorot and Bengio2010, Oh et al.2015]. The target network is updated at regular intervals [Hessel et al.2018]. Training and ε-annealing begin after an initial warm-up period. When there is a terminal event in the environment, when the agent loses a life, or when an episode exceeds a maximum length, the environment is randomly restarted [Hessel et al.2018]. Random restarting means executing a randomly sampled number of NOOP actions at the beginning of the episode. Environment interactions are stored in a replay memory comprising the most recent time steps.
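The linear ε-annealing schedule can be sketched as below. The start and end values (1.0 and 0.1) and the warm-up offset are common defaults from the DQN literature, used here as assumptions rather than values taken from this paper:

```python
def epsilon_at(step, start=1.0, end=0.1, anneal_steps=1_000_000, warmup=0):
    """Linearly anneal the exploration rate from `start` to `end` over
    `anneal_steps` environment steps, beginning after `warmup` steps;
    epsilon is held at `end` once annealing finishes."""
    if step < warmup:
        return start
    frac = min(1.0, (step - warmup) / anneal_steps)
    return start + frac * (end - start)
```

At each environment step the agent then takes a uniformly random action with probability `epsilon_at(step)` and the greedy action otherwise.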

Network parameters are updated every fourth environment step by sampling a mini-batch of trajectories from the replay memory (note that in the main paper, we omit the temporal dimension to preserve a clearer view). In practical terms, trajectory samples enable a more robust objective for learning Q-values by creating multiple different temporally extended target values for one prediction [Sutton and Barto1998, Hessel et al.2018]. Mini-batch samples are used to optimize the objective by taking a single gradient step with Adam [Kingma and Ba2015] (gradient momentum 0.95, squared gradient momentum 0.999). Gradients are clipped when exceeding a fixed norm [Leibfried, Kushman, and Hofmann2017]. In practice, both the Q-value prediction loss and the loss for terminal-flag and reward prediction are clipped for large errors [Mnih et al.2015] and small probability values [Leibfried, Kushman, and Hofmann2017]. Because non-zero rewards and terminal events occur less frequently than zero rewards and non-terminal events, they are weighted inversely proportionally to their relative occurrence.
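The inverse-frequency weighting of rare events (non-zero rewards, terminal flags) can be implemented in several ways; the following is a minimal numpy sketch of one plausible scheme, not necessarily the paper's exact implementation:

```python
import numpy as np

def inverse_frequency_weights(labels):
    """Weight each sample inversely proportionally to the relative
    frequency of its class, so that rare classes (e.g. non-zero rewards
    or terminal events) contribute as much to the loss in aggregate as
    frequent ones. Weights are normalized so that their mean is 1."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    weight_per_class = {c: counts.sum() / n for c, n in zip(classes, counts)}
    w = np.array([weight_per_class[l] for l in labels], dtype=float)
    return w / w.mean()

# Example: one terminal event among ten transitions in a mini-batch.
w = inverse_frequency_weights([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])
```

These per-sample weights would multiply the corresponding terms of the reward and terminal-flag losses before averaging over the mini-batch.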

Figure 5: Network architecture. The base network is a deep Q-network that maps the current state, given by the last four video frames, to Q-value estimates for each potential action. This base network is extended by an action-conditioned path that enables the estimation of the current terminal flag, the instantaneous reward, and the next-time-step video frame. Network inputs are colored in red and network outputs in blue. Computational units comprise convolutional ('Conv'), ordinary forward ('FC'), and deconvolutional ('Deconv') connections that are combined with linear ('Lin'), rectified linear ('ReLU'), or softmax ('SM') activations. The multiplication operator in the figure denotes element-wise multiplication.

Appendix B Theory

In the remainder of the appendix, we detail the mathematical notions needed for our main theorem and lemmas. We begin by providing a mathematical description of the loss function and then derive its gradients.

B.1 Loss Function Formalization

We use the following standard notation: bold capital letters denote matrices, bold lower-case letters denote vectors, and scalars are denoted by non-bold letters; components of vectors and entries of matrices are written with the corresponding indices. Also note that in the main paper, the loss function carries a super-index indicating rounds of game play, which is required for the regret analysis. In the remainder of this appendix, we omit this super-index for the sake of clarity. Let us now compute the gradient of the following function:




and where the constituent weight matrices and input variables are introduced in the following. Note that compared to the main paper, we introduce a super-index explicitly indicating mini-batch samples obtained from the replay memory for training. Using convolution, we construct a collection of matrices from the input matrices and filter variables, and collect these matrices in one matrix. After applying the activation function element-wise to this matrix, we unroll it into a long vector. Using (unknown) linear transformation matrices, we are ready to define the first function:


Given an input vector and two (unknown) linear transformation matrices, we can define the second loss function:


Similarly, using (unknown) transformation matrices, we can define the third loss function:


Finally, we describe the last loss function, using (unknown) linear transformation and weight matrices:


and where the constituent matrices are formed after an appropriate conversion of the corresponding vectors. It is worth mentioning that the size of the resulting matrix is specified by the application. The deconvolution operation is:


where the matrix is defined as follows:


Note that the full gradient of the loss function is: