
Bayesian Optimization in AlphaGo
During the development of AlphaGo, its many hyperparameters were tuned with Bayesian optimization multiple times. This automatic tuning process resulted in substantial improvements in playing strength. For example, prior to the match with Lee Sedol, we tuned the latest AlphaGo agent and this improved its winrate from 50 in the final match. Of course, since we tuned AlphaGo many times during its development cycle, the compounded contribution was even higher than this percentage. It is our hope that this brief case study will be of interest to Go fans, and also provide Bayesian optimization practitioners with some insights and inspiration.
12/17/2018 ∙ by Yutian Chen, et al. ∙ 128 ∙ shareread it

Sample Efficient Adaptive TexttoSpeech
We present a metalearning approach for adaptive texttospeech (TTS) with few data. During training, we learn a multispeaker model using a shared conditional WaveNet core and independent learned embeddings for each speaker. The aim of training is not to produce a neural network with fixed weights, which is then deployed as a TTS system. Instead, the aim is to produce a network that requires few data at deployment time to rapidly adapt to new speakers. We introduce and benchmark three strategies: (i) learning the speaker embedding while keeping the WaveNet core fixed, (ii) finetuning the entire architecture with stochastic gradient descent, and (iii) predicting the speaker embedding with a trained neural network encoder. The experiments show that these approaches are successful at adapting the multispeaker neural network to new speakers, obtaining stateoftheart results in both sample naturalness and voice similarity with merely a few minutes of audio data from new speakers.
09/27/2018 ∙ by Yutian Chen, et al. ∙ 2 ∙ shareread it

Learning to Learn without Gradient Descent by Gradient Descent
We learn recurrent neural network optimizers trained on simple synthetic functions by gradient descent. We show that these learned optimizers exhibit a remarkable degree of transfer in that they can be used to efficiently optimize a broad range of derivativefree blackbox functions, including Gaussian process bandits, simple control objectives, global optimization benchmarks and hyperparameter tuning tasks. Up to the training horizon, the learned optimizers learn to tradeoff exploration and exploitation, and compare favourably with heavily engineered Bayesian optimization packages for hyperparameter tuning.
11/11/2016 ∙ by Yutian Chen, et al. ∙ 0 ∙ shareread it

Herding as a Learning System with EdgeofChaos Dynamics
Herding defines a deterministic dynamical system at the edge of chaos. It generates a sequence of model states and parameters by alternating parameter perturbations with state maximizations, where the sequence of states can be interpreted as "samples" from an associated MRF model. Herding differs from maximum likelihood estimation in that the sequence of parameters does not converge to a fixed point and differs from an MCMC posterior sampling approach in that the sequence of states is generated deterministically. Herding may be interpreted as a"perturb and map" method where the parameter perturbations are generated using a deterministic nonlinear dynamical system rather than randomly from a Gumbel distribution. This chapter studies the distinct statistical characteristics of the herding algorithm and shows that the fast convergence rate of the controlled moments may be attributed to edge of chaos dynamics. The herding algorithm can also be generalized to models with latent variables and to a discriminative learning setting. The perceptron cycling theorem ensures that the fast moment matching property is preserved in the more general framework.
02/09/2016 ∙ by Yutian Chen, et al. ∙ 0 ∙ shareread it

Scalable Discrete Sampling as a MultiArmed Bandit Problem
Drawing a sample from a discrete distribution is one of the building components for Monte Carlo methods. Like other sampling algorithms, discrete sampling suffers from the high computational burden in largescale inference problems. We study the problem of sampling a discrete random variable with a high degree of dependency that is typical in largescale Bayesian inference and graphical models, and propose an efficient approximate solution with a subsampling approach. We make a novel connection between the discrete sampling and MultiArmed Bandits problems with a finite reward population and provide three algorithms with theoretical guarantees. Empirical evaluations show the robustness and efficiency of the approximate algorithms in both synthetic and realworld largescale problems.
06/30/2015 ∙ by Yutian Chen, et al. ∙ 0 ∙ shareread it

Latent Gaussian Processes for Distribution Estimation of Multivariate Categorical Data
Multivariate categorical data occur in many applications of machine learning. One of the main difficulties with these vectors of categorical variables is sparsity. The number of possible observations grows exponentially with vector length, but dataset diversity might be poor in comparison. Recent models have gained significant improvement in supervised tasks with this data. These models embed observations in a continuous space to capture similarities between them. Building on these ideas we propose a Bayesian model for the unsupervised task of distribution estimation of multivariate categorical data. We model vectors of categorical variables as generated from a nonlinear transformation of a continuous latent space. Nonlinearity captures multimodality in the distribution. The continuous representation addresses sparsity. Our model ties together many existing models, linking the linear categorical latent Gaussian model, the Gaussian process latent variable model, and Gaussian process classification. We derive inference for our model based on recent developments in sampling based variational inference. We show empirically that the model outperforms its linear and discrete counterparts in imputation tasks of sparse data.
03/07/2015 ∙ by Yarin Gal, et al. ∙ 0 ∙ shareread it

SublinearTime Approximate MCMC Transitions for Probabilistic Programs
Probabilistic programming languages can simplify the development of machine learning techniques, but only if inference is sufficiently scalable. Unfortunately, Bayesian parameter estimation for highly coupled models such as regressions and statespace models still scales poorly; each MCMC transition takes linear time in the number of observations. This paper describes a sublineartime algorithm for making MetropolisHastings (MH) updates to latent variables in probabilistic programs. The approach generalizes recently introduced approximate MH techniques: instead of subsampling data items assumed to be independent, it subsamples edges in a dynamically constructed graphical model. It thus applies to a broader class of problems and interoperates with other generalpurpose inference techniques. Empirical results, including confirmation of sublinear pertransition scaling, are presented for Bayesian logistic regression, nonlinear classification via joint Dirichlet process mixtures, and parameter estimation for stochastic volatility models (with state estimation via particle MCMC). All three applications use the same implementation, and each requires under 20 lines of probabilistic code.
11/06/2014 ∙ by Yutian Chen, et al. ∙ 0 ∙ shareread it

Bayesian Structure Learning for Markov Random Fields with a Spike and Slab Prior
In recent years a number of methods have been developed for automatically learning the (sparse) connectivity structure of Markov Random Fields. These methods are mostly based on L1regularized optimization which has a number of disadvantages such as the inability to assess model uncertainty and expensive crossvalidation to find the optimal regularization parameter. Moreover, the model's predictive performance may degrade dramatically with a suboptimal value of the regularization parameter (which is sometimes desirable to induce sparseness). We propose a fully Bayesian approach based on a "spike and slab" prior (similar to L0 regularization) that does not suffer from these shortcomings. We develop an approximate MCMC method combining Langevin dynamics and reversible jump MCMC to conduct inference in this model. Experiments show that the proposed model learns a good combination of the structure and parameter values without the need for separate hyperparameter tuning. Moreover, the model's predictive performance is much more robust than L1based methods with hyperparameter settings that induce highly sparse model structures.
08/09/2014 ∙ by Yutian Chen, et al. ∙ 0 ∙ shareread it

Austerity in MCMC Land: Cutting the MetropolisHastings Budget
Can we make Bayesian posterior MCMC sampling more efficient when faced with very large datasets? We argue that computing the likelihood for N datapoints in the MetropolisHastings (MH) test to reach a single binary decision is computationally inefficient. We introduce an approximate MH rule based on a sequential hypothesis test that allows us to accept or reject samples with high confidence using only a fraction of the data required for the exact MH rule. While this method introduces an asymptotic bias, we show that this bias can be controlled and is more than offset by a decrease in variance due to our ability to draw more samples per unit of time.
04/19/2013 ∙ by Anoop Korattikara, et al. ∙ 0 ∙ shareread it

Herded Gibbs Sampling
The Gibbs sampler is one of the most popular algorithms for inference in statistical models. In this paper, we introduce a herding variant of this algorithm, called herded Gibbs, that is entirely deterministic. We prove that herded Gibbs has an O(1/T) convergence rate for models with independent variables and for fully connected probabilistic graphical models. Herded Gibbs is shown to outperform Gibbs in the tasks of image denoising with MRFs and named entity recognition with CRFs. However, the convergence for herded Gibbs for sparsely connected probabilistic graphical models is still an open problem.
01/17/2013 ∙ by Luke Bornn, et al. ∙ 0 ∙ shareread it

SuperSamples from Kernel Herding
We extend the herding algorithm to continuous spaces by using the kernel trick. The resulting "kernel herding" algorithm is an infinite memory deterministic process that learns to approximate a PDF with a collection of samples. We show that kernel herding decreases the error of expectations of functions in the Hilbert space at a rate O(1/T) which is much faster than the usual O(1/pT) for iid random samples. We illustrate kernel herding by approximating Bayesian predictive distributions.
03/15/2012 ∙ by Yutian Chen, et al. ∙ 0 ∙ shareread it
Yutian Chen
is this you? claim profile
Senior Research Scientist at DeepMind. Earned his B.E. of Electronic Engineering from Tsinghua University in China in June 2007. Then I went to the University of California, Irvine in the United States to pursue a doctoral degree.