
An investigation of model-free planning
The field of reinforcement learning (RL) is facing increasingly challenging domains with combinatorial complexity. For an RL agent to address these challenges, it is essential that it can plan effectively. Prior work has typically utilized an explicit model of the environment, combined with a specific planning algorithm (such as tree search). More recently, a new family of methods has been proposed that learn how to plan, by providing the structure for planning via an inductive bias in the function approximator (such as a tree-structured neural network), trained end-to-end by a model-free RL algorithm. In this paper, we go even further, and demonstrate empirically that an entirely model-free approach, without special structure beyond standard neural network components such as convolutional networks and LSTMs, can learn to exhibit many of the characteristics typically associated with a model-based planner. We measure our agent's effectiveness at planning in terms of its ability to generalize across a combinatorial and irreversible state space, its data efficiency, and its ability to utilize additional thinking time. We find that our agent has many of the characteristics that one might expect to find in a planning algorithm. Furthermore, it exceeds the state-of-the-art in challenging combinatorial domains such as Sokoban and outperforms other model-free approaches that utilize strong inductive biases toward planning.
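The core idea — that extra "thinking time" can simply mean extra ticks of a standard recurrent core on the same observation — can be sketched in a few lines. This is a minimal illustration with a plain LSTM cell, not the paper's actual architecture; all names and sizes are made up for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LSTMCell:
    """A plain LSTM cell: a standard component with no planning-specific structure."""
    def __init__(self, in_dim, hid_dim, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(in_dim + hid_dim)
        self.W = rng.normal(0.0, scale, (4 * hid_dim, in_dim + hid_dim))
        self.b = np.zeros(4 * hid_dim)
        self.hid_dim = hid_dim

    def step(self, x, h, c):
        z = self.W @ np.concatenate([x, h]) + self.b
        i, f, o, g = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        return h, c

def act(cell, obs, h, c, n_ticks=3):
    """Tick the recurrent core several times on one observation before acting.

    Granting the agent additional thinking time is just a larger n_ticks."""
    for _ in range(n_ticks):
        h, c = cell.step(obs, h, c)
    return h, c
```

Under this view, no explicit model or search procedure exists; any planning-like behavior has to emerge in the recurrent state updates themselves.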
01/11/2019 ∙ by Arthur Guez, et al.

Imagination-Augmented Agents for Deep Reinforcement Learning
We introduce Imagination-Augmented Agents (I2As), a novel architecture for deep reinforcement learning combining model-free and model-based aspects. In contrast to most existing model-based reinforcement learning and planning methods, which prescribe how a model should be used to arrive at a policy, I2As learn to interpret predictions from a learned environment model to construct implicit plans in arbitrary ways, by using the predictions as additional context in deep policy networks. I2As show improved data efficiency, performance, and robustness to model misspecification compared to several baselines.
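The mechanism of using model predictions as extra context, rather than as a prescribed plan, can be sketched as follows. Everything here is illustrative: the environment model, rollout policy, and encoding are stand-ins, not the I2A components themselves.

```python
import numpy as np

rng = np.random.default_rng(0)

def env_model(state, action):
    # Stand-in for a learned environment model (illustrative only).
    return np.tanh(state + 0.1 * action)

def imagine_rollout(state, rollout_policy, length=3):
    """Imagine a short trajectory with the model and summarize it as one vector."""
    states = []
    for _ in range(length):
        action = rollout_policy(state)
        state = env_model(state, action)
        states.append(state)
    return np.concatenate(states)  # a simple rollout encoding

def i2a_features(state, rollout_policy, n_rollouts=2):
    """Concatenate model-free features with imagined-rollout encodings.

    The downstream policy network is free to interpret (or ignore) the
    model's predictions, rather than following a prescribed planning rule."""
    model_free = state  # direct, model-free path
    imagined = [imagine_rollout(state, rollout_policy) for _ in range(n_rollouts)]
    return np.concatenate([model_free] + imagined)

rollout_policy = lambda s: rng.normal(size=s.shape)
```

Because the rollout encodings enter only as additional inputs, an imperfect model degrades gracefully: the policy can learn how much weight to give the imagined paths.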
07/19/2017 ∙ by Theophane Weber, et al.

Learning model-based planning from scratch
Conventional wisdom holds that model-based planning is a powerful approach to sequential decision-making. It is often very challenging in practice, however, because while a model can be used to evaluate a plan, it does not prescribe how to construct a plan. Here we introduce the "Imagination-based Planner", the first model-based, sequential decision-making agent that can learn to construct, evaluate, and execute plans. Before any action, it can perform a variable number of imagination steps, which involve proposing an imagined action and evaluating it with its model-based imagination. All imagined actions and outcomes are aggregated, iteratively, into a "plan context" which conditions future real and imagined actions. The agent can even decide how to imagine: testing out alternative imagined actions, chaining sequences of actions together, or building a more complex "imagination tree" by navigating flexibly among the previously imagined states using a learned policy. And our agent can learn to plan economically, jointly optimizing for external rewards and computational costs associated with using its imagination. We show that our architecture can learn to solve a challenging continuous control problem, and also learn elaborate planning strategies in a discrete maze-solving task. Our work opens a new direction toward learning the components of a model-based planning system and how to use them.
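The imagine-then-act loop with an iteratively aggregated plan context can be sketched like this. The one-step model, the proposal function, and the aggregation rule below are toy stand-ins chosen for clarity, not the learned components described in the paper.

```python
import numpy as np

def model(state, action):
    # Stand-in for a learned one-step model (illustrative).
    return state + action

def plan_then_act(state, propose, n_imagine=4):
    """Run a variable number of imagination steps before committing to an action.

    Each step proposes an imagined action, evaluates it with the model,
    and folds the imagined outcome into a running plan context; the real
    action is finally proposed conditioned on that context."""
    context = np.zeros_like(state)
    for _ in range(n_imagine):
        imagined_action = propose(state, context)
        imagined_outcome = model(state, imagined_action)
        # Aggregate iteratively into the plan context (here: an exponential average).
        context = 0.5 * context + 0.5 * imagined_outcome
    real_action = propose(state, context)
    return real_action, context
```

Making `n_imagine` itself a decision of the agent is what allows it to trade external reward against the computational cost of imagining.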
07/19/2017 ∙ by Razvan Pascanu, et al.

Recurrent Environment Simulators
Models that can simulate how environments change in response to actions can be used by agents to plan and act efficiently. We improve on previous environment simulators from high-dimensional pixel observations by introducing recurrent neural networks that are able to make temporally and spatially coherent predictions for hundreds of timesteps into the future. We present an in-depth analysis of the factors affecting performance, providing the most extensive attempt to advance the understanding of the properties of these models. We address the issue of computational inefficiency with a model that does not need to generate a high-dimensional image at each timestep. We show that our approach can be used to improve exploration and is adaptable to many diverse environments, namely 10 Atari games, a 3D car racing environment, and complex 3D mazes.
04/07/2017 ∙ by Silvia Chiappa, et al.

Learning and Querying Fast Generative Models for Reinforcement Learning
A key challenge in model-based reinforcement learning (RL) is to synthesize computationally efficient and accurate environment models. We show that carefully designed generative models that learn and operate on compact state representations, so-called state-space models, substantially reduce the computational costs for predicting outcomes of sequences of actions. Extensive experiments establish that state-space models accurately capture the dynamics of Atari games from the Arcade Learning Environment from raw pixels. The computational speedup of state-space models while maintaining high accuracy makes their application in RL feasible: We demonstrate that agents which query these models for decision making outperform strong model-free baselines on the game MS PACMAN, demonstrating the potential of using learned environment models for planning.
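The computational argument is that rolling forward a compact latent state is far cheaper than generating a full observation at every step. A minimal sketch, with randomly initialized stand-ins for the learned encoder, dynamics, and reward head (dimensions and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, LATENT_DIM = 1024, 32   # e.g. raw pixels vs. a compact state

encoder = rng.normal(0.0, 0.01, (LATENT_DIM, OBS_DIM))
dynamics = rng.normal(0.0, 0.01, (LATENT_DIM, LATENT_DIM + 1))
reward_head = rng.normal(0.0, 0.01, LATENT_DIM)

def latent_rollout(obs, actions):
    """Predict rewards for an action sequence entirely in latent space.

    The observation is encoded once; no high-dimensional image is ever
    generated during the rollout, which is where the speedup comes from."""
    z = np.tanh(encoder @ obs)
    rewards = []
    for a in actions:
        z = np.tanh(dynamics @ np.concatenate([z, [a]]))
        rewards.append(reward_head @ z)
    return rewards
```

With per-step cost scaling in `LATENT_DIM` rather than `OBS_DIM`, querying the model inside an agent's decision loop becomes feasible.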
02/08/2018 ∙ by Lars Buesing, et al.

The Mechanics of n-Player Differentiable Games
The cornerstone underpinning deep learning is the guarantee that gradient descent on an objective converges to local minima. Unfortunately, this guarantee fails in settings, such as generative adversarial nets, where there are multiple interacting losses. The behavior of gradient-based methods in games is not well understood, and is becoming increasingly important as adversarial and multi-objective architectures proliferate. In this paper, we develop new techniques to understand and control the dynamics in general games. The key result is to decompose the second-order dynamics into two components. The first is related to potential games, which reduce to gradient descent on an implicit function; the second relates to Hamiltonian games, a new class of games that obey a conservation law, akin to conservation laws in classical mechanical systems. The decomposition motivates Symplectic Gradient Adjustment (SGA), a new algorithm for finding stable fixed points in general games. Basic experiments show SGA is competitive with recently proposed algorithms for finding local Nash equilibria in GANs, whilst at the same time being applicable to, and having guarantees in, much more general games.
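SGA adjusts the simultaneous gradient using the antisymmetric part of its Jacobian. A minimal sketch on a toy bilinear game (the finite-difference Jacobian and all hyperparameters are illustrative choices, not the paper's setup):

```python
import numpy as np

def simultaneous_grad(w):
    """Simultaneous gradient xi for the bilinear game
    l1(x, y) = x*y (player 1 controls x), l2(x, y) = -x*y (player 2 controls y)."""
    x, y = w
    return np.array([y, -x])

def sga_step(w, eta=0.05, lam=1.0, eps=1e-5):
    """Symplectic Gradient Adjustment: descend along xi + lam * A^T xi,
    where A is the antisymmetric part of the Jacobian of xi
    (estimated here by forward finite differences)."""
    xi = simultaneous_grad(w)
    n = len(w)
    J = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = eps
        J[:, j] = (simultaneous_grad(w + e) - xi) / eps
    A = 0.5 * (J - J.T)
    return w - eta * (xi + lam * A.T @ xi)

w = np.array([1.0, 1.0])
for _ in range(200):
    w = sga_step(w)
```

On this game plain simultaneous gradient descent spirals away from the fixed point at the origin, while the adjusted dynamics contract toward it.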
02/15/2018 ∙ by David Balduzzi, et al.

Woulda, Coulda, Shoulda: Counterfactually-Guided Policy Search
Learning policies on data synthesized by models can in principle quench the thirst of reinforcement learning algorithms for large amounts of real experience, which is often costly to acquire. However, simulating plausible experience de novo is a hard problem for many complex environments, often resulting in biases for model-based policy evaluation and search. Instead of de novo synthesis of data, here we assume logged, real experience and model alternative outcomes of this experience under counterfactual actions, actions that were not actually taken. Based on this, we propose the Counterfactually-Guided Policy Search (CF-GPS) algorithm for learning policies in POMDPs from off-policy experience. It leverages structural causal models for counterfactual evaluation of arbitrary policies on individual off-policy episodes. CF-GPS can improve on vanilla model-based RL algorithms by making use of available logged data to debias model predictions. In contrast to off-policy algorithms based on Importance Sampling which reweight data, CF-GPS leverages a model to explicitly consider alternative outcomes, allowing the algorithm to make better use of experience data. We find empirically that these advantages translate into improved policy evaluation and search results on a non-trivial gridworld task. Finally, we show that CF-GPS generalizes the previously proposed Guided Policy Search and that reparameterization-based algorithms such as the Stochastic Value Gradient can be interpreted as counterfactual methods.
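The counterfactual machinery follows the standard three-step recipe for structural causal models: abduction, intervention, prediction. A toy sketch with an additive-noise mechanism chosen purely for illustration (the real method works with learned SCMs over whole episodes):

```python
def scm(action, noise):
    """Toy structural causal model: outcome = action + noise (illustrative)."""
    return action + noise

def counterfactual(observed_action, observed_outcome, alt_action):
    """Counterfactual evaluation in three steps:
    1. Abduction: infer the noise consistent with the logged transition.
    2. Intervention: replace the logged action with the alternative one.
    3. Prediction: replay the mechanism under the inferred noise."""
    noise = observed_outcome - observed_action   # invert the toy mechanism
    return scm(alt_action, noise)
```

Because the noise is pinned down by real logged data rather than sampled from a prior, the counterfactual outcome inherits information from the actual episode, which is what debiases the model's predictions relative to de novo simulation.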
11/15/2018 ∙ by Lars Buesing, et al.

Towards a Definition of Disentangled Representations
How can intelligent agents solve a diverse set of tasks in a data-efficient manner? The disentangled representation learning approach posits that such an agent would benefit from separating out (disentangling) the underlying structure of the world into disjoint parts of its representation. However, there is no generally agreed-upon definition of disentangling, not least because it is unclear how to formalise the notion of world structure beyond toy datasets with a known ground truth generative process. Here we propose that a principled solution to characterising disentangled representations can be found by focusing on the transformation properties of the world. In particular, we suggest that those transformations that change only some properties of the underlying world state, while leaving all other properties invariant, are what gives exploitable structure to any kind of data. Similar ideas have already been successfully applied in physics, where the study of symmetry transformations has revolutionised the understanding of the world structure. By connecting symmetry transformations to vector representations using the formalism of group and representation theory we arrive at the first formal definition of disentangled representations. Our new definition is in agreement with many of the current intuitions about disentangling, while also providing principled resolutions to a number of previous points of contention. While this work focuses on formally defining disentangling, as opposed to solving the learning problem, we believe that the shift in perspective to studying data transformations can stimulate the development of better representation learning algorithms.
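The formal shape of such a definition can be sketched as follows; the notation here is illustrative and not taken verbatim from the paper. Suppose a group $G$ of symmetry transformations of world states $W$ decomposes as a direct product of subgroups, each changing one property while leaving the others invariant:

```latex
G = G_1 \times G_2 \times \dots \times G_n .
```

A representation $f : W \to Z$ is then disentangled with respect to this decomposition if the representation space splits accordingly,

```latex
Z = Z_1 \oplus Z_2 \oplus \dots \oplus Z_n ,
```

with an action of $G$ on $Z$ such that each subspace $Z_i$ is affected only by the corresponding subgroup $G_i$, and $f$ is equivariant:

```latex
f(g \cdot w) = g \cdot f(w) \qquad \text{for all } g \in G,\ w \in W .
```

This is what it means, in group-theoretic terms, for the "disjoint parts" of the representation to mirror the independent symmetries of the world.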
12/05/2018 ∙ by Irina Higgins, et al.

Differentiable Game Mechanics
Deep learning is built on the foundational guarantee that gradient descent on an objective function converges to local minima. Unfortunately, this guarantee fails in settings, such as generative adversarial nets, that exhibit multiple interacting losses. The behavior of gradient-based methods in games is not well understood, and is becoming increasingly important as adversarial and multi-objective architectures proliferate. In this paper, we develop new tools to understand and control the dynamics in n-player differentiable games. The key result is to decompose the game Jacobian into two components. The first, symmetric component, is related to potential games, which reduce to gradient descent on an implicit function. The second, antisymmetric component, relates to Hamiltonian games, a new class of games that obey a conservation law akin to conservation laws in classical mechanical systems. The decomposition motivates Symplectic Gradient Adjustment (SGA), a new algorithm for finding stable fixed points in differentiable games. Basic experiments show SGA is competitive with recently proposed algorithms for finding stable fixed points in GANs, while at the same time being applicable to, and having guarantees in, much more general cases.
05/13/2019 ∙ by Alistair Letcher, et al.
Sebastien Racanière