
Training Language GANs from Scratch
Generative Adversarial Networks (GANs) enjoy great success at image generation, but have proven difficult to train in the domain of natural language. Challenges with gradient estimation, optimization instability, and mode collapse have led practitioners to resort to maximum likelihood pretraining, followed by small amounts of adversarial finetuning. The benefits of GAN finetuning for language generation are unclear, as the resulting models produce comparable or worse samples than traditional language models. We show it is in fact possible to train a language GAN from scratch without maximum likelihood pretraining. We combine existing techniques such as large batch sizes, dense rewards and discriminator regularization to stabilize and improve language GANs. The resulting model, ScratchGAN, performs comparably to maximum likelihood training on EMNLP2017 News and WikiText103 corpora according to quality and diversity metrics.
05/23/2019 ∙ by Cyprien de Masson d'Autume, et al.

Monte Carlo Gradient Estimation in Machine Learning
This paper is a broad and accessible survey of the methods we have at our disposal for Monte Carlo gradient estimation in machine learning and across the statistical sciences: the problem of computing the gradient of an expectation of a function with respect to parameters defining the distribution that is integrated; the problem of sensitivity analysis. In machine learning research, this gradient problem lies at the core of many learning problems, in supervised, unsupervised and reinforcement learning. We will generally seek to rewrite such gradients in a form that allows for Monte Carlo estimation, allowing them to be easily and efficiently used and analysed. We explore three strategies (the pathwise, score function, and measure-valued gradient estimators), exploring their historical developments, derivation, and underlying assumptions. We describe their use in other fields, show how they are related and can be combined, and expand on their possible generalisations. Wherever Monte Carlo gradient estimators have been derived and deployed in the past, important advances have followed. A deeper and more widely-held understanding of this problem will lead to further advances, and it is these advances that we wish to support.
06/25/2019 ∙ by Shakir Mohamed, et al.
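The two main strategies surveyed above can be illustrated on a toy problem. The sketch below is our own illustration, not code from the paper: it estimates the gradient of E_{x~N(mu,1)}[x^2] with respect to mu using both the score-function and pathwise estimators; the exact answer is 2*mu.

```python
import numpy as np

rng = np.random.default_rng(0)

# Gradient of E_{x ~ N(mu, 1)}[x^2] with respect to mu; true value is 2*mu.
mu, n = 1.5, 200_000

# Score-function (REINFORCE) estimator: f(x) * d/dmu log N(x; mu, 1)
# = x^2 * (x - mu) for a unit-variance Gaussian.
x = rng.normal(mu, 1.0, n)
score_est = np.mean(x**2 * (x - mu))

# Pathwise (reparameterization) estimator: write x = mu + eps with
# eps ~ N(0, 1), then differentiate f directly: d/dmu (mu + eps)^2 = 2*(mu + eps).
eps = rng.normal(0.0, 1.0, n)
path_est = np.mean(2.0 * (mu + eps))

print(score_est, path_est)  # both near the true gradient 2*mu = 3.0
```

The pathwise estimate typically has much lower variance here, which is one of the trade-offs the survey analyses in depth.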

Learning Implicit Generative Models with the Method of Learned Moments
We propose a method of moments (MoM) algorithm for training large-scale implicit generative models. Moment estimation in this setting encounters two problems: it is often difficult to define the millions of moments needed to learn the model parameters, and it is hard to determine which properties are useful when specifying moments. To address the first issue, we introduce a moment network, and define the moments as the network's hidden units and the gradient of the network's output with respect to its parameters. To tackle the second problem, we use asymptotic theory to highlight desiderata for moments (namely, they should minimize the asymptotic variance of estimated model parameters) and introduce an objective to learn better moments. The sequence of objectives created by this Method of Learned Moments (MoLM) can train high-quality neural image samplers. On CIFAR-10, we demonstrate that MoLM-trained generators achieve significantly higher Inception Scores and lower Fréchet Inception Distances than those trained with gradient-penalty-regularized and spectrally-normalized adversarial objectives. These generators also achieve nearly perfect Multi-Scale Structural Similarity Scores on CelebA, and can create high-quality samples of 128x128 images.
06/28/2018 ∙ by Suman Ravuri, et al.

Many Paths to Equilibrium: GANs Do Not Need to Decrease a Divergence At Every Step
Generative adversarial networks (GANs) are a family of generative models that do not minimize a single training criterion. Unlike other generative models, the data distribution is learned via a game between a generator (the generative model) and a discriminator (a teacher providing training signal) that each minimize their own cost. GANs are designed to reach a Nash equilibrium at which each player cannot reduce their cost without changing the other players' parameters. One useful approach for the theory of GANs is to show that a divergence between the training distribution and the model distribution obtains its minimum value at equilibrium. Several recent research directions have been motivated by the idea that this divergence is the primary guide for the learning process and that every step of learning should decrease the divergence. We show that this view is overly restrictive. During GAN training, the discriminator provides learning signal in situations where the gradients of the divergences between distributions would not be useful. We provide empirical counterexamples to the view of GAN training as divergence minimization. Specifically, we demonstrate that GANs are able to learn distributions in situations where the divergence minimization point of view predicts they would fail. We also show that gradient penalties motivated from the divergence minimization perspective are equally helpful when applied in other contexts in which the divergence minimization perspective does not predict they would be helpful. This contributes to a growing body of evidence that GAN training may be more usefully viewed as approaching Nash equilibria via trajectories that do not necessarily minimize a specific divergence at each step.
10/23/2017 ∙ by William Fedus, et al.

Variational Approaches for Auto-Encoding Generative Adversarial Networks
Auto-encoding generative adversarial networks (GANs) combine the standard GAN algorithm, which discriminates between real and model-generated data, with a reconstruction loss given by an auto-encoder. Such models aim to prevent mode collapse in the learned generative model by ensuring that it is grounded in all the available training data. In this paper, we develop a principle upon which auto-encoders can be combined with generative adversarial networks by exploiting the hierarchical structure of the generative model. The underlying principle shows that variational inference can be used as a basic tool for learning, but with the intractable likelihood replaced by a synthetic likelihood, and the unknown posterior distribution replaced by an implicit distribution; both synthetic likelihoods and implicit posterior distributions can be learned using discriminators. This allows us to develop a natural fusion of variational auto-encoders and generative adversarial networks, combining the best of both these methods. We describe a unified objective for optimization, discuss the constraints needed to guide learning, connect to the wide range of existing work, and use a battery of tests to systematically and quantitatively assess the performance of our method.
06/15/2017 ∙ by Mihaela Rosca, et al.

The Cramer Distance as a Solution to Biased Wasserstein Gradients
The Wasserstein probability metric has received much attention from the machine learning community. Unlike the Kullback-Leibler divergence, which strictly measures change in probability, the Wasserstein metric reflects the underlying geometry between outcomes. The value of being sensitive to this geometry has been demonstrated, among others, in ordinal regression and generative modelling. In this paper we describe three natural properties of probability divergences that reflect requirements from machine learning: sum invariance, scale sensitivity, and unbiased sample gradients. The Wasserstein metric possesses the first two properties but, unlike the Kullback-Leibler divergence, does not possess the third. We provide empirical evidence suggesting that this is a serious issue in practice. Leveraging insights from probabilistic forecasting we propose an alternative to the Wasserstein metric, the Cramér distance. We show that the Cramér distance possesses all three desired properties, combining the best of the Wasserstein and Kullback-Leibler divergences. To illustrate the relevance of the Cramér distance in practice we design a new algorithm, the Cramér Generative Adversarial Network (GAN), and show that it performs significantly better than the related Wasserstein GAN.
05/30/2017 ∙ by Marc G. Bellemare, et al.
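For samples, the (multivariate) Cramér distance coincides, up to a constant, with the energy distance, which has a simple plug-in estimator. A minimal sketch for 1-D samples; this is our own illustration, not code from the paper:

```python
import numpy as np

def energy_distance(x, y):
    """Plug-in energy distance between 1-D sample arrays x and y:
    2*E|X - Y| - E|X - X'| - E|Y - Y'|, estimated with all pairs."""
    xy = np.abs(x[:, None] - y[None, :]).mean()
    xx = np.abs(x[:, None] - x[None, :]).mean()
    yy = np.abs(y[:, None] - y[None, :]).mean()
    return 2.0 * xy - xx - yy

rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, 1000)
b = rng.normal(0.0, 1.0, 1000)  # same distribution: distance near 0
c = rng.normal(3.0, 1.0, 1000)  # shifted distribution: clearly positive
print(energy_distance(a, b), energy_distance(a, c))
```

Unlike a mini-batch Wasserstein estimate, this sample estimator yields unbiased gradients, which is the property the paper argues for.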

Recurrent Environment Simulators
Models that can simulate how environments change in response to actions can be used by agents to plan and act efficiently. We improve on previous environment simulators from high-dimensional pixel observations by introducing recurrent neural networks that are able to make temporally and spatially coherent predictions for hundreds of timesteps into the future. We present an in-depth analysis of the factors affecting performance, providing the most extensive attempt to advance the understanding of the properties of these models. We address the issue of computational inefficiency with a model that does not need to generate a high-dimensional image at each timestep. We show that our approach can be used to improve exploration and is adaptable to many diverse environments, namely 10 Atari games, a 3D car racing environment, and complex 3D mazes.
04/07/2017 ∙ by Silvia Chiappa, et al.

Generative Temporal Models with Memory
We consider the general problem of modeling temporal data with long-range dependencies, wherein new observations are fully or partially predictable based on temporally distant past observations. A sufficiently powerful temporal model should separate predictable elements of the sequence from unpredictable elements, express uncertainty about those unpredictable elements, and rapidly identify novel elements that may help to predict the future. To create such models, we introduce Generative Temporal Models augmented with external memory systems. They are developed within the variational inference framework, which provides both a practical training methodology and methods to gain insight into the models' operation. We show, on a range of problems with sparse, long-term temporal dependencies, that these models store information from early in a sequence, and reuse this stored information efficiently. This allows them to perform substantially better than existing models based on well-known recurrent neural networks, like LSTMs.
02/15/2017 ∙ by Mevlana Gemici, et al.

Normalizing Flows on Riemannian Manifolds
We consider the problem of density estimation on Riemannian manifolds. Density estimation on manifolds has many applications in fluid mechanics, optics and plasma physics, and it appears often when dealing with angular variables (such as those used in protein folding, robot limbs, and gene expression) and in general directional statistics. In spite of the multitude of algorithms available for density estimation in Euclidean spaces R^n that scale to large n (e.g. normalizing flows, kernel methods and variational approximations), most of these methods are not immediately suitable for density estimation on more general Riemannian manifolds. We revisit techniques related to homeomorphisms from differential geometry for projecting densities onto submanifolds and use them to generalize the idea of normalizing flows to more general Riemannian manifolds. The resulting algorithm is scalable, simple to implement and suitable for use with automatic differentiation. We demonstrate concrete examples of this method on the n-sphere S^n.
11/07/2016 ∙ by Mevlana C. Gemici, et al.
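The flat-space building block being generalized here is the change-of-variables formula log p_X(x) = log p_Z(f^{-1}(x)) - log|det J_f|. A minimal Euclidean sketch, our own illustration (the paper's contribution, replacing the Jacobian term with the appropriate manifold volume factor, is not shown):

```python
import numpy as np

# Change-of-variables density for the simple invertible map f(z) = a*z + b
# with base density Z ~ N(0, 1). Normalizing flows chain such maps; on a
# Riemannian manifold the |det Jacobian| term is generalized to account
# for the geometry of the embedded submanifold.

def log_prob_flow(x, a=2.0, b=1.0):
    """log p_X(x) where X = a*Z + b and Z ~ N(0, 1)."""
    z = (x - b) / a                            # inverse map f^{-1}
    log_pz = -0.5 * (z**2 + np.log(2 * np.pi)) # base log density
    return log_pz - np.log(np.abs(a))          # minus log|det Jacobian|

# Check against the exact N(b, a^2) log density.
xs = np.linspace(-3.0, 5.0, 9)
exact = -0.5 * (((xs - 1.0) / 2.0) ** 2 + np.log(2 * np.pi)) - np.log(2.0)
print(np.allclose(log_prob_flow(xs), exact))  # True
```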

Learning in Implicit Generative Models
Generative adversarial networks (GANs) provide an algorithmic framework for constructing generative models with several appealing properties: they do not require a likelihood function to be specified, only a generating procedure; they provide samples that are sharp and compelling; and they allow us to harness our knowledge of building highly accurate neural network classifiers. Here, we develop our understanding of GANs with the aim of forming a rich view of this growing area of machine learning: to build connections to the diverse set of statistical thinking on this topic, of which much can be gained by a mutual exchange of ideas. We frame GANs within the wider landscape of algorithms for learning in implicit generative models (models that only specify a stochastic procedure with which to generate data) and relate these ideas to modelling problems in related fields, such as econometrics and approximate Bayesian computation. We develop likelihood-free inference methods and highlight hypothesis testing as a principle for learning in implicit generative models, using which we are able to derive the objective function used by GANs, and many other related objectives. The testing viewpoint directs our focus to the general problem of density ratio estimation. There are four approaches for density ratio estimation, one of which is a solution using classifiers to distinguish real from generated data. Other approaches such as divergence minimisation and moment matching have also been explored in the GAN literature, and we synthesise these views to form an understanding in terms of the relationships between them and the wider literature, highlighting avenues for future exploration and cross-pollination.
10/11/2016 ∙ by Shakir Mohamed, et al.
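The classifier-based route to density ratio estimation mentioned above admits a short demonstration: with equal class priors, a binary classifier D trained to separate samples of p (label 1) from samples of q (label 0) satisfies logit D(x) = log p(x)/q(x) at its optimum. A hedged sketch, our own illustration: for two unit-variance Gaussians the true log-ratio is linear in x, so plain logistic regression fit by gradient descent is sufficient.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Real" data p = N(0, 1), "generated" data q = N(1, 1).
# True log ratio: log p(x) - log q(x) = 0.5 - x (linear in x).
xp = rng.normal(0.0, 1.0, 20_000)
xq = rng.normal(1.0, 1.0, 20_000)
x = np.concatenate([xp, xq])
y = np.concatenate([np.ones_like(xp), np.zeros_like(xq)])  # 1 = real

# Logistic regression D(x) = sigmoid(w*x + b), fit by full-batch
# gradient descent on the cross-entropy loss.
w, b = 0.0, 0.0
lr = 0.1
for _ in range(2000):
    d = 1.0 / (1.0 + np.exp(-(w * x + b)))  # D(x) = P(real | x)
    w -= lr * np.mean((d - y) * x)
    b -= lr * np.mean(d - y)

# logit D(x) = w*x + b estimates log p(x)/q(x); true slope -1, intercept 0.5.
print(w, b)
```

The density ratio p/q can then be read off as D(x)/(1 - D(x)), the construction underlying the GAN discriminator.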

Early Visual Concept Learning with Unsupervised Deep Learning
Automated discovery of early visual concepts from raw image data is a major open challenge in AI research. Addressing this problem, we propose an unsupervised approach for learning disentangled representations of the underlying factors of variation. We draw inspiration from neuroscience, and show how this can be achieved in an unsupervised generative model by applying the same learning pressures as have been suggested to act in the ventral visual stream in the brain. By enforcing redundancy reduction, encouraging statistical independence, and exposure to data with transform continuities analogous to those to which human infants are exposed, we obtain a variational autoencoder (VAE) framework capable of learning disentangled factors. Our approach makes few assumptions and works well across a wide variety of datasets. Furthermore, our solution has useful emergent properties, such as zero-shot inference and an intuitive understanding of "objectness".
06/17/2016 ∙ by Irina Higgins, et al.
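The "learning pressures" described above are commonly realised as a VAE objective whose KL term, the part that encourages redundancy reduction and statistical independence in the latent code, is given extra weight. A minimal sketch of such an objective; the weight name `beta` and the function below are our own notation for illustration, not taken from the abstract:

```python
import numpy as np

def vae_loss(recon_error, mu, log_var, beta=4.0):
    """Per-example loss: reconstruction + beta * KL(q(z|x) || N(0, I)).

    mu, log_var: parameters of a diagonal-Gaussian encoder, shape (batch, dim).
    beta > 1 strengthens the independence/redundancy-reduction pressure.
    """
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=1)
    return recon_error + beta * kl

# At the prior (mu = 0, log_var = 0) the KL term vanishes, so the loss
# reduces to the reconstruction error alone.
mu = np.zeros((2, 3))
log_var = np.zeros((2, 3))
print(vae_loss(np.array([1.0, 2.0]), mu, log_var))  # [1. 2.]
```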
Shakir Mohamed
Research Scientist at Google DeepMind since 2014; Senior Researcher in Statistical Machine Learning at DeepMind Technologies from 2013 to 2014; postdoctoral researcher in Machine Learning at the University of British Columbia, in the Laboratory for Computational Intelligence (LCI) of the Department of Computer Science, from 2011 to 2013, holding a Junior Research Fellowship from the Canadian Institute for Advanced Research (CIFAR) within the Neural Computation and Adaptive Perception (NCAP) program; Business Analyst at Nedbank from 2006 to 2007. PhD in the Machine Learning Group at the University of Cambridge as a Commonwealth Scholar to the United Kingdom and a member of St John's College.