
Doubly Reparameterized Gradient Estimators for Monte Carlo Objectives
Deep latent variable models have become a popular model choice due to the scalable learning algorithms introduced by Kingma & Welling (2013) and Rezende et al. (2014). These approaches maximize a variational lower bound on the intractable log likelihood of the observed data. Burda et al. (2015) introduced a multi-sample variational bound, IWAE, that is at least as tight as the standard variational lower bound and becomes increasingly tight as the number of samples increases. Counterintuitively, the typical inference network gradient estimator for the IWAE bound performs poorly as the number of samples increases (Rainforth et al., 2018; Le et al., 2018). Roeder et al. (2017) propose an improved gradient estimator; however, they are unable to show that it is unbiased. We show that it is in fact biased and that the bias can be estimated efficiently with a second application of the reparameterization trick. The doubly reparameterized gradient (DReG) estimator does not degrade as the number of samples increases, resolving the previously raised issues. The same idea can be used to improve many recently introduced training techniques for latent variable models. In particular, we show that this estimator reduces the variance of the IWAE gradient, the reweighted wake-sleep (RWS) update (Bornschein & Bengio, 2014), and the jackknife variational inference (JVI) gradient (Nowozin, 2018). Finally, we show that this computationally efficient, unbiased drop-in gradient estimator translates to improved performance for all three objectives on several modeling tasks.
10/09/2018 ∙ by George Tucker, et al.
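The multi-sample bound underlying this estimator is easy to sketch. Below is a minimal numpy illustration (my own sketch, not the authors' code): the K-sample IWAE bound computed stably from log importance weights, plus the self-normalized weights whose squares the DReG estimator uses to weight the per-sample path derivatives.

```python
import numpy as np

def iwae_bound(log_w):
    """K-sample IWAE bound: log( (1/K) * sum_k exp(log_w[k]) ).

    log_w[k] = log p(x, z_k) - log q(z_k | x) for z_k ~ q(z | x);
    the max-shift makes the log-sum-exp numerically stable.
    """
    K = log_w.shape[-1]
    m = log_w.max(axis=-1, keepdims=True)
    s = np.log(np.exp(log_w - m).sum(axis=-1, keepdims=True))
    return np.squeeze(m + s, axis=-1) - np.log(K)

def normalized_weights(log_w):
    """Self-normalized importance weights, softmax(log_w); the DReG
    inference-network estimator weights each sample's path derivative
    by the square of these."""
    m = log_w.max(axis=-1, keepdims=True)
    w = np.exp(log_w - m)
    return w / w.sum(axis=-1, keepdims=True)
```

With equal log-weights the bound collapses to the single-sample ELBO, e.g. `iwae_bound(np.full(8, -1.0))` gives -1.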

Near-Optimal Representation Learning for Hierarchical Reinforcement Learning
We study the problem of representation learning in goal-conditioned hierarchical reinforcement learning. In such hierarchical structures, a higher-level controller solves tasks by iteratively communicating goals which a lower-level policy is trained to reach. Accordingly, the choice of representation, i.e. the mapping from observation space to goal space, is crucial. To study this problem, we develop a notion of the suboptimality of a representation, defined in terms of the expected reward of the optimal hierarchical policy using this representation. We derive expressions which bound the suboptimality and show how these expressions can be translated to representation learning objectives which may be optimized in practice. Results on a number of difficult continuous-control tasks show that our approach to representation learning yields qualitatively better representations as well as quantitatively better hierarchical policies, compared to existing methods (see videos at https://sites.google.com/view/representationhrl).
10/02/2018 ∙ by Ofir Nachum, et al.

Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog
Most deep reinforcement learning (RL) systems are not able to learn effectively from off-policy data, especially if they cannot explore online in the environment. These are critical shortcomings for applying RL to real-world problems where collecting data is expensive and models must be tested offline before being deployed to interact with the environment, e.g. systems that learn from human interaction. Thus, we develop a novel class of off-policy batch RL algorithms, which are able to effectively learn offline, without exploring, from a fixed batch of human interaction data. We leverage models pre-trained on data as a strong prior, and use KL-control to penalize divergence from this prior during RL training. We also use dropout-based uncertainty estimates to lower-bound the target Q-values as a more efficient alternative to Double Q-Learning. The algorithms are tested on the problem of open-domain dialog generation, a challenging reinforcement learning problem with a 20,000-dimensional action space. Using our Way Off-Policy algorithm, we can extract multiple different reward functions post hoc from collected human interaction data and learn effectively from all of these. We test the real-world generalization of these systems by deploying them live to converse with humans in an open-domain setting, and demonstrate that our algorithm achieves significant improvements over prior methods in off-policy batch RL.
06/30/2019 ∙ by Natasha Jaques, et al.
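The dropout-based lower bound on target Q-values can be sketched in a few lines. This is an illustrative numpy version under my own naming, not the paper's implementation: several dropout-perturbed evaluations of the target network stand in for an uncertainty estimate, and their elementwise minimum gives a pessimistic bootstrap target.

```python
import numpy as np

def dropout_lower_bound_target(q_samples, rewards, dones, gamma=0.99):
    """Pessimistic Q-learning target (a sketch, not the paper's code).

    q_samples: shape (n_dropout, batch), target-network Q-value
    estimates of the next state under different dropout masks.
    Taking the elementwise minimum lower-bounds the next-state value,
    a cheaper alternative to Double Q-Learning for avoiding
    overestimation when learning offline from a fixed batch.
    """
    q_lb = q_samples.min(axis=0)             # lower bound across masks
    return rewards + gamma * (1.0 - dones) * q_lb
```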

Language as an Abstraction for Hierarchical Deep Reinforcement Learning
Solving complex, temporally extended tasks is a long-standing problem in reinforcement learning (RL). We hypothesize that one critical element of solving such problems is the notion of compositionality. With the ability to learn concepts and subskills that can be composed to solve longer tasks, i.e. hierarchical RL, we can acquire temporally extended behaviors. However, acquiring effective yet general abstractions for hierarchical RL is remarkably challenging. In this paper, we propose to use language as the abstraction, as it provides unique compositional structure, enabling fast learning and combinatorial generalization, while retaining tremendous flexibility, making it suitable for a variety of problems. Our approach learns an instruction-following low-level policy and a high-level policy that can reuse abstractions across tasks, in essence permitting agents to reason using structured language. To study compositional task learning, we introduce an open-source object interaction environment built using the MuJoCo physics engine and the CLEVR engine. We find that, using our approach, agents can learn to solve diverse, temporally extended tasks such as object sorting and multi-object rearrangement, including from raw pixel observations. Our analysis finds that the compositional nature of language is critical for learning diverse subskills and systematically generalizing to new subskills, in comparison to non-compositional abstractions that use the same supervision.
06/18/2019 ∙ by Yiding Jiang, et al.

Multi-Agent Manipulation via Locomotion using Hierarchical Sim2Real
Manipulation and locomotion are closely related problems that are often studied in isolation. In this work, we study the problem of coordinating multiple mobile agents to exhibit manipulation behaviors using a reinforcement learning (RL) approach. Our method hinges on the use of hierarchical sim2real: a simulated environment is used to learn low-level goal-reaching skills, which are then used as the action space for a high-level RL controller, also trained in simulation. The full hierarchical policy is then transferred to the real world in a zero-shot fashion. The application of domain randomization during training enables the learned behaviors to generalize to real-world settings, while the use of hierarchy provides a modular paradigm for learning and transferring increasingly complex behaviors. We evaluate our method on a number of real-world tasks, including coordinated object manipulation in a multi-agent setting. See videos at https://sites.google.com/view/manipulationvialocomotion
08/13/2019 ∙ by Ofir Nachum, et al.

MuProp: Unbiased Backpropagation for Stochastic Neural Networks
Deep neural networks are powerful parametric models that can be trained efficiently using the backpropagation algorithm. Stochastic neural networks combine the power of large parametric functions with that of graphical models, which makes it possible to learn very complex distributions. However, as backpropagation is not directly applicable to stochastic networks that include discrete sampling operations within their computational graph, training such networks remains difficult. We present MuProp, an unbiased gradient estimator for stochastic networks, designed to make this task easier. MuProp improves on the likelihood-ratio estimator by reducing its variance using a control variate based on the first-order Taylor expansion of a mean-field network. Crucially, unlike prior attempts at using backpropagation for training stochastic networks, the resulting estimator is unbiased and well behaved. Our experiments on structured output prediction and discrete latent variable modeling demonstrate that MuProp yields consistently good performance across a range of difficult tasks.
11/16/2015 ∙ by Shixiang Gu, et al.
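The control-variate recipe can be demonstrated on a toy problem. The setup below is my own illustration (a single Bernoulli variable with a quadratic f, not the paper's experiments): subtract the first-order Taylor expansion of f around the mean from the likelihood-ratio estimator, then add back the exact gradient of that expansion, leaving the estimator unbiased with reduced variance.

```python
import numpy as np

def muprop_demo(theta=0.0, n=100_000, seed=0):
    """Toy MuProp-style estimator for d/dtheta E_{z~Bern(sigmoid(theta))}[f(z)]
    with f(z) = (z - 0.4)^2 (an illustrative choice).

    Returns per-sample values of the plain likelihood-ratio estimator,
    the control-variated estimator, and the exact gradient.
    """
    rng = np.random.default_rng(seed)
    f = lambda z: (z - 0.4) ** 2
    df = lambda z: 2.0 * (z - 0.4)
    mu = 1.0 / (1.0 + np.exp(-theta))          # mean of z
    z = (rng.uniform(size=n) < mu).astype(float)
    score = z - mu                             # d log p(z) / d theta
    lr = f(z) * score                          # likelihood-ratio estimator
    # Subtract the first-order Taylor expansion of f around mu as a
    # control variate, then add back its gradient in closed form.
    residual = f(z) - f(mu) - df(mu) * (z - mu)
    muprop = residual * score + df(mu) * mu * (1.0 - mu)
    exact = 0.2 * mu * (1.0 - mu)              # (f(1) - f(0)) * dmu/dtheta
    return lr, muprop, exact
```

Both estimators average to the exact gradient, but the control-variated one has strictly smaller sample variance here.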

Interpolated Policy Gradient: Merging On-Policy and Off-Policy Gradient Estimation for Deep Reinforcement Learning
Off-policy model-free deep reinforcement learning methods using previously collected data can improve sample efficiency over on-policy policy gradient techniques. On the other hand, on-policy algorithms are often more stable and easier to use. This paper examines, both theoretically and empirically, approaches to merging on- and off-policy updates for deep reinforcement learning. Theoretical results show that off-policy updates with a value function estimator can be interpolated with on-policy policy gradient updates whilst still satisfying performance bounds. Our analysis uses control variate methods to produce a family of policy gradient algorithms, with several recently proposed algorithms being special cases of this family. We then provide an empirical comparison of these techniques with the remaining algorithmic details fixed, and show how different mixings of off-policy gradient estimates with on-policy samples contribute to improvements in empirical performance. The final algorithm provides a generalization and unification of existing deep policy gradient techniques, has theoretical guarantees on the bias introduced by off-policy updates, and improves on state-of-the-art model-free deep RL methods on a number of OpenAI Gym continuous control benchmarks.
06/01/2017 ∙ by Shixiang Gu, et al.
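The core mixing step can be sketched directly; the function below is an illustrative reduction of the idea (names and the mixing parameter nu are mine, not the paper's notation), not the full algorithm with its control variates.

```python
import numpy as np

def interpolated_policy_gradient(g_on, g_off, nu):
    """Convex combination at the heart of interpolated policy gradient.

    g_on:  on-policy likelihood-ratio gradient estimate
           (unbiased, high variance);
    g_off: off-policy gradient through a fitted critic
           (low variance, possibly biased).
    nu in [0, 1] trades bias against variance, recovering pure
    on-policy policy gradient at nu = 0 and a fully critic-based
    off-policy update at nu = 1.
    """
    assert 0.0 <= nu <= 1.0
    return (1.0 - nu) * np.asarray(g_on) + nu * np.asarray(g_off)
```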

Sequence Tutor: Conservative Fine-Tuning of Sequence Generation Models with KL-control
This paper proposes a general method for improving the structure and quality of sequences generated by a recurrent neural network (RNN), while maintaining information originally learned from data, as well as sample diversity. An RNN is first pretrained on data using maximum likelihood estimation (MLE), and the probability distribution over the next token in the sequence learned by this model is treated as a prior policy. Another RNN is then trained using reinforcement learning (RL) to generate higher-quality outputs that account for domain-specific incentives while retaining proximity to the prior policy of the MLE RNN. To formalize this objective, we derive novel off-policy RL methods for RNNs from KL-control. The effectiveness of the approach is demonstrated on two applications: 1) generating novel musical melodies, and 2) computational molecular generation. For both problems, we show that the proposed method improves the desired properties and structure of the generated sequences, while maintaining information learned from data.
11/09/2016 ∙ by Natasha Jaques, et al.
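The KL-control objective reduces, per step, to a shaped reward. The sketch below is an illustrative form under my own naming and an assumed weight c, not the paper's exact derivation: the penalty's expectation under the policy is KL(pi || prior), so maximizing it keeps the fine-tuned model close to the pretrained MLE prior.

```python
def kl_control_reward(task_reward, log_pi, log_prior, c=0.1):
    """KL-control style shaped per-step reward (a hedged sketch).

    Penalizes the RL policy in proportion to
    log pi(a|s) - log p_prior(a|s): the policy earns domain-specific
    task reward but pays for drifting away from the prior learned
    from data, preserving sample diversity and learned structure.
    """
    return task_reward - c * (log_pi - log_prior)
```

When the policy matches the prior the penalty vanishes and the shaped reward equals the task reward.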

Deep Reinforcement Learning for Robotic Manipulation with Asynchronous Off-Policy Updates
Reinforcement learning holds the promise of enabling autonomous robots to learn large repertoires of behavioral skills with minimal human intervention. However, robotic applications of reinforcement learning often compromise the autonomy of the learning process in favor of achieving training times that are practical for real physical systems. This typically involves introducing hand-engineered policy representations and human-supplied demonstrations. Deep reinforcement learning alleviates this limitation by training general-purpose neural network policies, but applications of direct deep reinforcement learning algorithms have so far been restricted to simulated settings and relatively simple tasks, due to their apparent high sample complexity. In this paper, we demonstrate that a recent deep reinforcement learning algorithm based on off-policy training of deep Q-functions can scale to complex 3D manipulation tasks and can learn deep neural network policies efficiently enough to train on real physical robots. We demonstrate that the training times can be further reduced by parallelizing the algorithm across multiple robots which pool their policy updates asynchronously. Our experimental evaluation shows that our method can learn a variety of 3D manipulation skills in simulation and a complex door opening skill on real robots without any prior demonstrations or manually designed representations.
10/03/2016 ∙ by Shixiang Gu, et al.

Continuous Deep Q-Learning with Model-based Acceleration
Model-free reinforcement learning has been successfully applied to a range of challenging problems, and has recently been extended to handle large neural network policies and value functions. However, the sample complexity of model-free algorithms, particularly when using high-dimensional function approximators, tends to limit their applicability to physical systems. In this paper, we explore algorithms and representations to reduce the sample complexity of deep reinforcement learning for continuous control tasks. We propose two complementary techniques for improving the efficiency of such algorithms. First, we derive a continuous variant of the Q-learning algorithm, which we call normalized advantage functions (NAF), as an alternative to the more commonly used policy gradient and actor-critic methods. The NAF representation allows us to apply Q-learning with experience replay to continuous tasks, and substantially improves performance on a set of simulated robotic control tasks. To further improve the efficiency of our approach, we explore the use of learned models for accelerating model-free reinforcement learning. We show that iteratively refitted local linear models are especially effective for this, and demonstrate substantially faster learning on domains where such models are applicable.
03/02/2016 ∙ by Shixiang Gu, et al.
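The NAF parameterization is simple to state: the advantage is a quadratic in the action whose curvature is positive semi-definite by construction, so the greedy action is analytic. A minimal numpy sketch (illustrative names; in the actual method mu, L, and V are all network outputs conditioned on the state):

```python
import numpy as np

def naf_q_value(a, mu, L, v):
    """Normalized advantage function Q-value (a minimal sketch).

    Q(s, a) = V(s) - 0.5 * (a - mu)^T P (a - mu), with P = L @ L.T
    positive semi-definite (L lower-triangular), so Q is maximized
    exactly at a = mu(s) and the greedy action needs no search.
    """
    d = np.asarray(a) - np.asarray(mu)
    P = L @ L.T
    return v - 0.5 * d @ P @ d
```

This is what makes Q-learning with experience replay tractable in continuous action spaces: the argmax over actions, normally an inner optimization, is just mu(s).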

Categorical Reparameterization with Gumbel-Softmax
Categorical variables are a natural choice for representing discrete structure in the world. However, stochastic neural networks rarely use categorical latent variables due to the inability to backpropagate through samples. In this work, we present an efficient gradient estimator that replaces the non-differentiable sample from a categorical distribution with a differentiable sample from a novel Gumbel-Softmax distribution. This distribution has the essential property that it can be smoothly annealed into a categorical distribution. We show that our Gumbel-Softmax estimator outperforms state-of-the-art gradient estimators on structured output prediction and unsupervised generative modeling tasks with categorical latent variables, and enables large speedups on semi-supervised classification.
11/03/2016 ∙ by Eric Jang, et al.
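The sampling procedure itself fits in a few lines. Below is a minimal numpy sketch of drawing one relaxed sample (not the authors' implementation; a framework version would keep the operation inside the autodiff graph so gradients flow to the logits):

```python
import numpy as np

def gumbel_softmax_sample(logits, temperature, rng):
    """Draw one relaxed categorical sample (a sketch of the estimator).

    Adds i.i.d. Gumbel(0, 1) noise to the logits and applies a
    temperature-scaled softmax; the result lies on the probability
    simplex and, as temperature -> 0, concentrates on a one-hot
    categorical draw while remaining differentiable in the logits.
    """
    u = rng.uniform(low=1e-12, high=1.0, size=np.shape(logits))
    g = -np.log(-np.log(u))                  # Gumbel(0, 1) noise
    y = (np.asarray(logits) + g) / temperature
    y = y - y.max()                          # stabilize the softmax
    e = np.exp(y)
    return e / e.sum()
```

Annealing the temperature during training moves the relaxation from smooth (useful gradients) toward discrete (faithful samples).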
Shixiang Gu
Research Intern at Google, Ph.D. candidate and Research Assistant at University of Cambridge