
An investigation of model-free planning
The field of reinforcement learning (RL) is facing increasingly challenging domains with combinatorial complexity. For an RL agent to address these challenges, it is essential that it can plan effectively. Prior work has typically utilized an explicit model of the environment, combined with a specific planning algorithm (such as tree search). More recently, a new family of methods has been proposed that learn how to plan, by providing the structure for planning via an inductive bias in the function approximator (such as a tree-structured neural network), trained end-to-end by a model-free RL algorithm. In this paper, we go even further, and demonstrate empirically that an entirely model-free approach, without special structure beyond standard neural network components such as convolutional networks and LSTMs, can learn to exhibit many of the characteristics typically associated with a model-based planner. We measure our agent's effectiveness at planning in terms of its ability to generalize across a combinatorial and irreversible state space, its data efficiency, and its ability to utilize additional thinking time. We find that our agent has many of the characteristics that one might expect to find in a planning algorithm. Furthermore, it exceeds the state-of-the-art in challenging combinatorial domains such as Sokoban and outperforms other model-free approaches that utilize strong inductive biases toward planning.
01/11/2019 ∙ by Arthur Guez, et al.

Relational recurrent neural networks
Memory-based neural networks model temporal data by leveraging an ability to remember information for long periods. It is unclear, however, whether they also have an ability to perform complex relational reasoning with the information they remember. Here, we first confirm our intuitions that standard memory architectures may struggle at tasks that heavily involve an understanding of the ways in which entities are connected, i.e., tasks involving relational reasoning. We then improve upon these deficits by using a new memory module, a Relational Memory Core (RMC), which employs multi-head dot product attention to allow memories to interact. Finally, we test the RMC on a suite of tasks that may profit from more capable relational reasoning across sequential information, and show large gains in RL domains (e.g., Mini PacMan), program evaluation, and language modeling, achieving state-of-the-art results on the WikiText-103, Project Gutenberg, and GigaWord datasets.
06/05/2018 ∙ by Adam Santoro, et al.
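The interaction mechanism the RMC abstract describes, multi-head dot product attention letting memory slots attend to one another, can be sketched in NumPy. This is a minimal illustration with random projections standing in for learned parameters, not the paper's trained module:

```python
import numpy as np

def multi_head_attention(memory, num_heads, rng):
    """One step of multi-head dot-product attention letting memory
    slots interact, as in a Relational Memory Core.

    memory: (slots, dim) matrix of memory vectors.
    Returns an updated (slots, dim) memory.
    """
    slots, dim = memory.shape
    assert dim % num_heads == 0
    head_dim = dim // num_heads
    # Random projections stand in for learned query/key/value weights.
    Wq, Wk, Wv = (rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3))
    q, k, v = memory @ Wq, memory @ Wk, memory @ Wv
    # Split into heads: (heads, slots, head_dim).
    split = lambda x: x.reshape(slots, num_heads, head_dim).transpose(1, 0, 2)
    q, k, v = split(q), split(k), split(v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)  # (heads, slots, slots)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)               # softmax over slots
    out = (weights @ v).transpose(1, 0, 2).reshape(slots, dim)  # merge heads
    return memory + out                                     # residual update
```

In the RMC this attention step is embedded inside a recurrent core; here it is shown as a single stateless update for clarity.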

Imagination-Augmented Agents for Deep Reinforcement Learning
We introduce Imagination-Augmented Agents (I2As), a novel architecture for deep reinforcement learning combining model-free and model-based aspects. In contrast to most existing model-based reinforcement learning and planning methods, which prescribe how a model should be used to arrive at a policy, I2As learn to interpret predictions from a learned environment model to construct implicit plans in arbitrary ways, by using the predictions as additional context in deep policy networks. I2As show improved data efficiency, performance, and robustness to model misspecification compared to several baselines.
07/19/2017 ∙ by Theophane Weber, et al.

Learning model-based planning from scratch
Conventional wisdom holds that model-based planning is a powerful approach to sequential decision-making. It is often very challenging in practice, however, because while a model can be used to evaluate a plan, it does not prescribe how to construct a plan. Here we introduce the "Imagination-based Planner", the first model-based, sequential decision-making agent that can learn to construct, evaluate, and execute plans. Before any action, it can perform a variable number of imagination steps, which involve proposing an imagined action and evaluating it with its model-based imagination. All imagined actions and outcomes are aggregated, iteratively, into a "plan context" which conditions future real and imagined actions. The agent can even decide how to imagine: testing out alternative imagined actions, chaining sequences of actions together, or building a more complex "imagination tree" by navigating flexibly among the previously imagined states using a learned policy. And our agent can learn to plan economically, jointly optimizing for external rewards and computational costs associated with using its imagination. We show that our architecture can learn to solve a challenging continuous control problem, and also learn elaborate planning strategies in a discrete maze-solving task. Our work opens a new direction toward learning the components of a model-based planning system and how to use them.
07/19/2017 ∙ by Razvan Pascanu, et al.
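The outer loop the abstract describes, propose an imagined action, evaluate it with the model, aggregate into a plan context, then act, can be sketched as follows. The `model` and `value_fn` interfaces and the random proposal are illustrative stand-ins, not the paper's learned modules:

```python
import random

def plan_with_imagination(state, actions, model, value_fn, num_imaginations, rng):
    """Hedged sketch of the Imagination-based Planner's outer loop:
    propose imagined actions, evaluate them with a model, and
    aggregate the results into a 'plan context' before acting."""
    plan_context = []  # aggregated (action, imagined value) pairs
    for _ in range(num_imaginations):
        action = rng.choice(actions)          # stand-in for a learned proposal policy
        imagined_next = model(state, action)  # one step of model-based imagination
        plan_context.append((action, value_fn(imagined_next)))
    # Act greedily with respect to the imagined evaluations.
    return max(plan_context, key=lambda av: av[1])[0]
```

The real agent additionally learns where to imagine from (building an imagination tree) and trades off reward against imagination cost; here the context is reduced to a flat list for brevity.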

Visual Interaction Networks
From just a glance, humans can make rich predictions about the future state of a wide range of physical systems. On the other hand, modern approaches from engineering, robotics, and graphics are often restricted to narrow domains and require direct measurements of the underlying states. We introduce the Visual Interaction Network, a general-purpose model for learning the dynamics of a physical system from raw visual observations. Our model consists of a perceptual front-end based on convolutional neural networks and a dynamics predictor based on interaction networks. Through joint training, the perceptual front-end learns to parse a dynamic visual scene into a set of factored latent object representations. The dynamics predictor learns to roll these states forward in time by computing their interactions and dynamics, producing a predicted physical trajectory of arbitrary length. We found that from just six input video frames the Visual Interaction Network can generate accurate future trajectories of hundreds of time steps on a wide range of physical systems. Our model can also be applied to scenes with invisible objects, inferring their future states from their effects on the visible objects, and can implicitly infer the unknown mass of objects. Our results demonstrate that the perceptual module and the object-based dynamics predictor module can induce factored latent representations that support accurate dynamical predictions. This work opens new opportunities for model-based decision-making and planning from raw sensory observations in complex physical environments.
06/05/2017 ∙ by Nicholas Watters, et al.
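The dynamics predictor's interaction-network step, advancing each object's state by the summed pairwise effects of the others, can be sketched numerically. The `pairwise_effect` function here is a toy stand-in for the learned relation network, and the Euler-style update is an illustrative assumption:

```python
import numpy as np

def interaction_step(states, pairwise_effect, dt=0.1):
    """Sketch of one interaction-network update as in the VIN's
    dynamics predictor: each object's state is advanced by the sum
    of pairwise effects from every other object.

    states: (num_objects, state_dim) array of factored object states.
    """
    n = len(states)
    effects = np.zeros_like(states)
    for i in range(n):
        for j in range(n):
            if i != j:
                effects[i] += pairwise_effect(states[i], states[j])
    return states + dt * effects
```

Rolling out a trajectory of arbitrary length then amounts to applying this step repeatedly to the latent object states produced by the perceptual front-end.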

Deep Reinforcement Learning in Large Discrete Action Spaces
Being able to reason in an environment with a large number of discrete actions is essential to bringing reinforcement learning to a larger class of problems. Recommender systems, industrial plants and language models are only some of the many real-world tasks involving large numbers of discrete actions for which current methods are difficult or even impossible to apply. An ability to generalize over the set of actions as well as sublinear complexity relative to the size of the set are both necessary to handle such tasks. Current approaches are not able to provide both of these, which motivates the work in this paper. Our proposed approach leverages prior information about the actions to embed them in a continuous space upon which it can generalize. Additionally, approximate nearest-neighbor methods allow for logarithmic-time lookup complexity relative to the number of actions, which is necessary for time-wise tractable training. This combined approach allows reinforcement learning methods to be applied to large-scale learning problems previously intractable with current methods. We demonstrate our algorithm's abilities on a series of tasks having up to one million actions.
12/24/2015 ∙ by Gabriel Dulac-Arnold, et al.
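The selection scheme the abstract describes, map a continuous "proto-action" to its nearest action embeddings and pick the best candidate by value, can be sketched as below. The embeddings and Q-values are toy stand-ins, and exact nearest-neighbor search is used for clarity where a real agent would use an approximate index for sublinear lookup:

```python
import numpy as np

def select_action(proto_action, action_embeddings, q_values, k):
    """Sketch of the embedding-based selection the abstract describes:
    find the k action embeddings nearest a continuous proto-action,
    then execute the candidate with the highest Q-value.

    action_embeddings: (num_actions, embed_dim) array.
    q_values: (num_actions,) array of per-action values.
    """
    # Exact k-NN for clarity; an approximate nearest-neighbor index
    # would make this logarithmic-time in the number of actions.
    dists = np.linalg.norm(action_embeddings - proto_action, axis=1)
    candidates = np.argsort(dists)[:k]
    return candidates[np.argmax(q_values[candidates])]
```

Refining the nearest neighbors by Q-value makes the choice robust to actions whose embeddings are close to the proto-action but happen to be poor in the current state.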

Attend, Infer, Repeat: Fast Scene Understanding with Generative Models
We present a framework for efficient inference in structured image models that explicitly reason about objects. We achieve this by performing probabilistic inference using a recurrent neural network that attends to scene elements and processes them one at a time. Crucially, the model itself learns to choose the appropriate number of inference steps. We use this scheme to learn to perform inference in partially specified 2D models (variable-sized variational autoencoders) and fully specified 3D models (probabilistic renderers). We show that such models learn to identify multiple objects, counting, locating and classifying the elements of a scene, without any supervision, e.g., decomposing 3D images with various numbers of objects in a single forward pass of a neural network. We further show that the networks produce accurate inferences when compared to supervised counterparts, and that their structure leads to improved generalization.
03/28/2016 ∙ by S. M. Ali Eslami, et al.

Automated Variational Inference in Probabilistic Programming
We present a new algorithm for approximate inference in probabilistic programs, based on a stochastic gradient for variational programs. This method is efficient without restrictions on the probabilistic program; it is particularly practical for distributions which are not analytically tractable, including highly structured distributions that arise in probabilistic programs. We show how to automatically derive mean-field probabilistic programs and optimize them, and demonstrate that our perspective improves inference efficiency over other algorithms.
01/07/2013 ∙ by David Wingate, et al.
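The stochastic gradient at the heart of this kind of variational inference is a score-function (REINFORCE-style) estimator: for samples z ~ q, average grad log q(z) times (log p(z) - log q(z)). A minimal sketch, with toy closed-form densities rather than a real probabilistic program:

```python
import random

def elbo_gradient_estimate(logp, logq, grad_logq, sample_q, num_samples, rng):
    """Score-function estimate of the ELBO gradient with respect to
    the variational parameters:
        grad ELBO ~= mean over z~q of grad_logq(z) * (logp(z) - logq(z)).
    All densities here are illustrative scalar stand-ins."""
    total = 0.0
    for _ in range(num_samples):
        z = sample_q(rng)
        total += grad_logq(z) * (logp(z) - logq(z))
    return total / num_samples
```

Because the estimator only needs samples from q and pointwise log-density evaluations, it applies without restrictions on the form of the program, which is what makes it practical for analytically intractable distributions.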

Learning and Querying Fast Generative Models for Reinforcement Learning
A key challenge in model-based reinforcement learning (RL) is to synthesize computationally efficient and accurate environment models. We show that carefully designed generative models that learn and operate on compact state representations, so-called state-space models, substantially reduce the computational costs for predicting outcomes of sequences of actions. Extensive experiments establish that state-space models accurately capture the dynamics of Atari games from the Arcade Learning Environment from raw pixels. The computational speedup of state-space models while maintaining high accuracy makes their application in RL feasible: We demonstrate that agents which query these models for decision making outperform strong model-free baselines on the game MS PACMAN, demonstrating the potential of using learned environment models for planning.
02/08/2018 ∙ by Lars Buesing, et al.

Learning to Search with MCTSnets
Planning problems are among the most important and well-studied problems in artificial intelligence. They are most typically solved by tree search algorithms that simulate ahead into the future, evaluate future states, and back up those evaluations to the root of a search tree. Among these algorithms, Monte-Carlo tree search (MCTS) is one of the most general, powerful and widely used. A typical implementation of MCTS uses cleverly designed rules, optimized to the particular characteristics of the domain. These rules control where the simulation traverses, what to evaluate in the states that are reached, and how to back up those evaluations. In this paper we instead learn where, what and how to search. Our architecture, which we call an MCTSnet, incorporates simulation-based search inside a neural network, by expanding, evaluating and backing up a vector embedding. The parameters of the network are trained end-to-end using gradient-based optimisation. When applied to small searches in the well-known planning problem Sokoban, the learned search algorithm significantly outperformed MCTS baselines.
02/13/2018 ∙ by Arthur Guez, et al.
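One MCTSnet-style simulation, expand a path of embeddings, evaluate the leaf, then back the result up through learned vector updates rather than scalar averaging, can be sketched as follows. The `simulate`, `evaluate`, and `backup` callables stand in for the learned sub-networks; the single-path expansion is a simplification of the tree policy:

```python
def mcts_net_simulation(root_embedding, simulate, evaluate, backup, depth):
    """Minimal sketch of one MCTSnet simulation over embeddings.

    simulate(embedding) -> child embedding   (expansion step)
    evaluate(embedding) -> leaf evaluation   (evaluation step)
    backup(node, child_summary, leaf_value) -> updated summary
    """
    path = [root_embedding]
    for _ in range(depth):             # expansion: follow the tree policy
        path.append(simulate(path[-1]))
    leaf_value = evaluate(path[-1])    # evaluation at the leaf
    summary = path[-1]
    for node in reversed(path[:-1]):   # backup: fold results toward the root
        summary = backup(node, summary, leaf_value)
    return summary                     # updated root summary
```

In the actual architecture all three steps are neural networks trained end-to-end, so the semantics of what is "backed up" is itself learned rather than fixed to visit counts and value averages as in standard MCTS.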

Woulda, Coulda, Shoulda: Counterfactually-Guided Policy Search
Learning policies on data synthesized by models can in principle quench the thirst of reinforcement learning algorithms for large amounts of real experience, which is often costly to acquire. However, simulating plausible experience de novo is a hard problem for many complex environments, often resulting in biases for model-based policy evaluation and search. Instead of de novo synthesis of data, here we assume logged, real experience and model alternative outcomes of this experience under counterfactual actions, actions that were not actually taken. Based on this, we propose the Counterfactually-Guided Policy Search (CF-GPS) algorithm for learning policies in POMDPs from off-policy experience. It leverages structural causal models for counterfactual evaluation of arbitrary policies on individual off-policy episodes. CF-GPS can improve on vanilla model-based RL algorithms by making use of available logged data to debias model predictions. In contrast to off-policy algorithms based on Importance Sampling, which reweight data, CF-GPS leverages a model to explicitly consider alternative outcomes, allowing the algorithm to make better use of experience data. We find empirically that these advantages translate into improved policy evaluation and search results on a non-trivial gridworld task. Finally, we show that CF-GPS generalizes the previously proposed Guided Policy Search and that reparameterization-based algorithms such as the Stochastic Value Gradient can be interpreted as counterfactual methods.
11/15/2018 ∙ by Lars Buesing, et al.
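The counterfactual step in a structural causal model has two parts: abduction (infer the exogenous noise consistent with the logged transition) and intervention (replay the same noise under the alternative action). A toy illustration for a single transition, where the structural function `f` and the finite noise set are illustrative assumptions:

```python
def counterfactual_outcome(f, observed_state, observed_action, observed_next,
                           noise_values, alt_action):
    """Toy illustration of counterfactual evaluation in an SCM with
    transition next = f(state, action, noise).

    noise_values: a finite set of candidate exogenous noise values,
    a stand-in for posterior inference over the noise."""
    # Abduction: recover a noise value consistent with the logged data.
    consistent = [u for u in noise_values
                  if f(observed_state, observed_action, u) == observed_next]
    assert consistent, "no noise value explains the logged transition"
    u = consistent[0]
    # Intervention: apply the counterfactual action, keeping the noise fixed.
    return f(observed_state, alt_action, u)
```

Holding the inferred noise fixed is what distinguishes this from de novo simulation: the counterfactual rollout inherits the randomness that actually occurred in the logged episode, which is the debiasing mechanism the abstract refers to.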