
Playing Atari with Deep Reinforcement Learning
We present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning. The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards. We apply our method to seven Atari 2600 games from the Arcade Learning Environment, with no adjustment of the architecture or learning algorithm. We find that it outperforms all previous approaches on six of the games and surpasses a human expert on three of them.
12/19/2013 ∙ by Volodymyr Mnih, et al.

Asynchronous Methods for Deep Reinforcement Learning
We propose a conceptually simple and lightweight framework for deep reinforcement learning that uses asynchronous gradient descent for optimization of deep neural network controllers. We present asynchronous variants of four standard reinforcement learning algorithms and show that parallel actor-learners have a stabilizing effect on training, allowing all four methods to successfully train neural network controllers. The best performing method, an asynchronous variant of actor-critic, surpasses the current state-of-the-art on the Atari domain while training for half the time on a single multi-core CPU instead of a GPU. Furthermore, we show that asynchronous actor-critic succeeds on a wide variety of continuous motor control problems as well as on a new task of navigating random 3D mazes using a visual input.
02/04/2016 ∙ by Volodymyr Mnih, et al.

Reinforcement Learning with Unsupervised Auxiliary Tasks
Deep reinforcement learning agents have achieved state-of-the-art results by directly maximising cumulative reward. However, environments contain a much wider variety of possible training signals. In this paper, we introduce an agent that also maximises many other pseudo-reward functions simultaneously by reinforcement learning. All of these tasks share a common representation that, like unsupervised learning, continues to develop in the absence of extrinsic rewards. We also introduce a novel mechanism for focusing this representation upon extrinsic rewards, so that learning can rapidly adapt to the most relevant aspects of the actual task. Our agent significantly outperforms the previous state-of-the-art on Atari, averaging 880% expert human performance, and on a challenging suite of first-person, three-dimensional Labyrinth tasks, where it achieves a mean 10× speedup in learning and averages 87% expert human performance.
11/16/2016 ∙ by Max Jaderberg, et al.

The Uncertainty Bellman Equation and Exploration
We consider the exploration/exploitation problem in reinforcement learning. For exploitation, it is well known that the Bellman equation connects the value at any timestep to the expected value at subsequent timesteps. In this paper we consider a similar uncertainty Bellman equation (UBE), which connects the uncertainty at any timestep to the expected uncertainties at subsequent timesteps, thereby extending the potential exploratory benefit of a policy beyond individual timesteps. We prove that the unique fixed point of the UBE yields an upper bound on the variance of the estimated value of any fixed policy. This bound can be much tighter than traditional count-based bonuses that compound standard deviation rather than variance. Importantly, and unlike several existing approaches to optimism, this method scales naturally to large systems with complex generalization. Substituting our UBE exploration strategy for ϵ-greedy improves DQN performance on 51 out of 57 games in the Atari suite.
09/15/2017 ∙ by Brendan O'Donoghue, et al.

Massively Parallel Methods for Deep Reinforcement Learning
We present the first massively distributed architecture for deep reinforcement learning. This architecture uses four main components: parallel actors that generate new behaviour; parallel learners that are trained from stored experience; a distributed neural network to represent the value function or behaviour policy; and a distributed store of experience. We used our architecture to implement the Deep Q-Network algorithm (DQN). Our distributed algorithm was applied to 49 Atari 2600 games from the Arcade Learning Environment, using identical hyperparameters. Our performance surpassed non-distributed DQN in 41 of the 49 games and also reduced the wall-time required to achieve these results by an order of magnitude on most games.
07/15/2015 ∙ by Arun Nair, et al.

Strategic Attentive Writer for Learning Macro-Actions
We present a novel deep recurrent neural network architecture that learns to build implicit plans in an end-to-end manner purely by interacting with an environment in a reinforcement learning setting. The network builds an internal plan, which is continuously updated upon observation of the next input from the environment. It can also partition this internal representation into contiguous subsequences by learning for how long the plan can be committed to, i.e. followed without replanning. Combining these properties, the proposed model, dubbed STRategic Attentive Writer (STRAW), can learn high-level, temporally abstracted macro-actions of varying lengths that are learnt solely from data without any prior information. These macro-actions enable both structured exploration and economic computation. We experimentally demonstrate that STRAW delivers strong improvements on several ATARI games by employing temporally extended planning strategies (e.g. Ms. Pacman and Frostbite). It is at the same time a general algorithm that can be applied to any sequence data. To that end, we also show that when trained on a text prediction task, STRAW naturally predicts frequent n-grams (instead of macro-actions), demonstrating the generality of the approach.
06/15/2016 ∙ by Alexander, et al.

Combining policy gradient and Q-learning
Policy gradient is an efficient technique for improving a policy in a reinforcement learning setting. However, vanilla online variants are on-policy only and not able to take advantage of off-policy data. In this paper we describe a new technique that combines policy gradient with off-policy Q-learning, drawing experience from a replay buffer. This is motivated by making a connection between the fixed points of the regularized policy gradient algorithm and the Q-values. This connection allows us to estimate the Q-values from the action preferences of the policy, to which we apply Q-learning updates. We refer to the new technique as 'PGQL', for policy gradient and Q-learning. We also establish an equivalency between action-value fitting techniques and actor-critic algorithms, showing that regularized policy gradient techniques can be interpreted as advantage function learning algorithms. We conclude with some numerical examples that demonstrate improved data efficiency and stability of PGQL. In particular, we tested PGQL on the full suite of Atari games and achieved performance exceeding that of both asynchronous advantage actor-critic (A3C) and Q-learning.
11/05/2016 ∙ by Brendan O'Donoghue, et al.

Using Fast Weights to Attend to the Recent Past
Until recently, research on artificial neural networks was largely restricted to systems with only two types of variable: neural activities that represent the current or recent input, and weights that learn to capture regularities among inputs, outputs and payoffs. There is no good reason for this restriction. Synapses have dynamics at many different timescales and this suggests that artificial neural networks might benefit from variables that change slower than activities but much faster than the standard weights. These "fast weights" can be used to store temporary memories of the recent past and they provide a neurally plausible way of implementing the type of attention to the past that has recently proved very helpful in sequence-to-sequence models. By using fast weights we can avoid the need to store copies of neural activity patterns.
10/20/2016 ∙ by Jimmy Ba, et al.

Learning values across many orders of magnitude
Most learning algorithms are not invariant to the scale of the function that is being approximated. We propose to adaptively normalize the targets used in learning. This is useful in value-based reinforcement learning, where the magnitude of appropriate value approximations can change over time when we update the policy of behavior. Our main motivation is prior work on learning to play Atari games, where the rewards were all clipped to a predetermined range. This clipping facilitates learning across many different games with a single learning algorithm, but a clipped reward function can result in qualitatively different behavior. Using the adaptive normalization we can remove this domain-specific heuristic without diminishing overall performance.
02/24/2016 ∙ by Hado van Hasselt, et al.

Multiple Object Recognition with Visual Attention
We present an attention-based model for recognizing multiple objects in images. The proposed model is a deep recurrent neural network trained with reinforcement learning to attend to the most relevant regions of the input image. We show that the model learns to both localize and recognize multiple objects despite being given only class labels during training. We evaluate the model on the challenging task of transcribing house number sequences from Google Street View images and show that it is both more accurate than the state-of-the-art convolutional networks and uses fewer parameters and less computation.
12/24/2014 ∙ by Jimmy Ba, et al.

Conditional Restricted Boltzmann Machines for Structured Output Prediction
Conditional Restricted Boltzmann Machines (CRBMs) are rich probabilistic models that have recently been applied to a wide range of problems, including collaborative filtering, classification, and modeling motion capture data. While much progress has been made in training non-conditional RBMs, these algorithms are not applicable to conditional models and there has been almost no work on training and generating predictions from conditional RBMs for structured output problems. We first argue that standard Contrastive Divergence-based learning may not be suitable for training CRBMs. We then identify two distinct types of structured output prediction problems and propose an improved learning algorithm for each. The first problem type is one where the output space has arbitrary structure but the set of likely output configurations is relatively small, such as in multi-label classification. The second problem is one where the output space is arbitrarily structured but where the output space variability is much greater, such as in image denoising or pixel labeling. We show that the new learning algorithms can work much better than Contrastive Divergence on both types of problems.
02/14/2012 ∙ by Volodymyr Mnih, et al.
Volodymyr Mnih
PhD in Machine Learning at the University of Toronto, Research Scientist at Google DeepMind