Zoubin Ghahramani

Zoubin Ghahramani FRS is a British-Iranian researcher and Professor of Information Engineering at the University of Cambridge. He holds joint appointments at University College London and the Alan Turing Institute, and has been a Fellow of St John’s College, Cambridge, since 2009. From 2003 to 2012, he was an Associate Research Professor at the Carnegie Mellon University School of Computer Science. He is also the Chief Scientist of Uber and the Deputy Director of the Leverhulme Centre for the Future of Intelligence.

Ghahramani was educated at the American School of Madrid in Spain and at the University of Pennsylvania, where he received a degree in cognitive science and computer science in 1990. He obtained his PhD from the Department of Brain and Cognitive Sciences at the Massachusetts Institute of Technology, supervised by Michael I. Jordan and Tomaso Poggio.

After his PhD, Ghahramani moved to the University of Toronto in 1995, where he was an ITRC Postdoctoral Fellow in the Artificial Intelligence Lab, working with Geoffrey Hinton. From 1998 to 2005, he was a faculty member at the Gatsby Computational Neuroscience Unit at University College London.

Ghahramani has made significant contributions to Bayesian machine learning, as well as to graphical models and computer science. His current research focuses on Bayesian nonparametric modeling and statistical machine learning. He has also worked on artificial intelligence, information retrieval, bioinformatics and statistics, which provide the foundations for handling uncertainty, making decisions and designing learning systems. He has published more than 200 papers, which have received over 30,000 citations. In 2014, he co-founded Geometric Intelligence with Gary Marcus, Doug Bemis and Ken Stanley. In 2016, he moved to Uber’s A.I. Labs when Uber acquired the startup. Just four months later, he became Chief Scientist, replacing Gary Marcus.

In 2015, Ghahramani was elected a Fellow of the Royal Society. His election certificate reads:

Zoubin Ghahramani is a world leader in machine learning who has significantly advanced algorithms that can learn from data. In particular, he is known for his fundamental contributions to probabilistic modeling and Bayesian nonparametric approaches to machine learning systems, and to the development of approximate algorithms for scalable learning. He is a pioneer of semi-supervised learning methods, active learning algorithms and sparse Gaussian processes. His development of novel infinite-dimensional nonparametric models, such as the infinite latent feature model, has been highly influential.

  • Efficient and Robust Machine Learning for Real-World Systems

    While machine learning is traditionally a resource-intensive task, embedded systems, autonomous navigation and the vision of the Internet of Things fuel the interest in resource-efficient approaches. These approaches require a carefully chosen trade-off between performance and resource consumption in terms of computation and energy. On top of this, it is crucial to treat uncertainty in a consistent manner in all but the simplest applications of machine learning systems. In particular, a desideratum for any real-world system is to be robust in the presence of outliers and corrupted data, as well as being 'aware' of its limits, i.e. the system should maintain and provide an uncertainty estimate over its own predictions. These complex demands are among the major challenges in current machine learning research and key to ensuring a smooth transition of machine learning technology into everyday applications. In this article, we provide an overview of the current state of the art of machine learning techniques facilitating these real-world requirements. First we provide a comprehensive review of resource efficiency in deep neural networks with a focus on techniques for model size reduction, compression and reduced precision. These techniques can be applied during training or as post-processing and are widely used to reduce both computational complexity and memory footprint. As most (practical) neural networks are limited in their ways to treat uncertainty, we contrast them with probabilistic graphical models, which readily serve these desiderata by means of probabilistic inference. In that way, we provide an extensive overview of the current state of the art of robust and efficient machine learning for real-world systems.

    12/05/2018 ∙ by Franz Pernkopf, et al. ∙ 16 share
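
The "reduced precision" idea mentioned in this abstract can be illustrated with a minimal sketch of uniform weight quantization. This is only one simple scheme chosen for illustration, not necessarily one covered by the article; the function name `quantize_uniform` is hypothetical.

```python
def quantize_uniform(weights, num_bits):
    """Round each weight to the nearest of 2**num_bits evenly spaced levels.

    Assumption: a plain min/max uniform grid, used here only as an example
    of reduced-precision storage.
    """
    lo, hi = min(weights), max(weights)
    if hi == lo:
        # All weights are identical, so nothing needs to be rounded.
        return list(weights)
    levels = 2 ** num_bits - 1          # number of intervals on the grid
    step = (hi - lo) / levels           # spacing between adjacent levels
    return [lo + round((w - lo) / step) * step for w in weights]


weights = [-1.0, -0.3, 0.2, 0.9]
quantized = quantize_uniform(weights, num_bits=2)
```

With 2 bits each weight is stored as one of at most four values, and the rounding error per weight is bounded by half the grid spacing.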

  • Antithetic and Monte Carlo kernel estimators for partial rankings

    In the modern age, rankings data is ubiquitous and it is useful for a variety of applications such as recommender systems, multi-object tracking and preference learning. However, most rankings data encountered in the real world is incomplete, which forbids the direct application of existing modelling tools for complete rankings. Our contribution is a novel way to extend kernel methods for complete rankings to partial rankings, via consistent Monte Carlo estimators of Gram matrices. These Monte Carlo kernel estimators are based on extending kernel mean embeddings to the embedding of a set of full rankings consistent with an observed partial ranking. They form a computationally tractable alternative to previous approaches for partial rankings data. We also present a novel variance reduction scheme based on an antithetic variate construction between permutations to obtain an improved estimator. An overview of the existing kernels and metrics for permutations is also provided.

    07/01/2018 ∙ by María Lomelí, et al. ∙ 2 share
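
As a rough illustration of the Monte Carlo idea in this abstract, the sketch below estimates a kernel between two partial rankings by averaging a full-ranking kernel over sampled full rankings consistent with each partial ranking. It assumes top-k partial rankings and the Kendall kernel, and uses independent samples only; the antithetic variance reduction scheme from the paper is not shown. All function names are hypothetical.

```python
import random
from itertools import combinations


def kendall_kernel(r1, r2):
    """Kendall kernel between two full rankings (lists of items, best first)."""
    pos1 = {item: i for i, item in enumerate(r1)}
    pos2 = {item: i for i, item in enumerate(r2)}
    pairs = list(combinations(r1, 2))
    score = 0
    for a, b in pairs:
        # +1 if both rankings order the pair the same way, -1 otherwise.
        agree = (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) > 0
        score += 1 if agree else -1
    return score / len(pairs)


def sample_consistent(top_k, items, rng):
    """Sample a full ranking uniformly among those starting with top_k."""
    rest = [x for x in items if x not in top_k]
    rng.shuffle(rest)
    return list(top_k) + rest


def mc_kernel(top_a, top_b, items, n_samples=200, seed=0):
    """Monte Carlo estimate of the kernel between two top-k partial rankings."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        full_a = sample_consistent(top_a, items, rng)
        full_b = sample_consistent(top_b, items, rng)
        total += kendall_kernel(full_a, full_b)
    return total / n_samples


items = list("abcde")
estimate = mc_kernel(["a", "b"], ["b", "a"], items)
```

When both partial rankings are in fact complete, every sample is the ranking itself and the estimate reduces to the exact Kendall kernel.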

  • Variational Gaussian Dropout is not Bayesian

    Gaussian multiplicative noise is commonly used as a stochastic regularisation technique in training of deterministic neural networks. A recent paper reinterpreted the technique as a specific algorithm for approximate inference in Bayesian neural networks; several extensions ensued. We show that the log-uniform prior used in all the above publications does not generally induce a proper posterior, and thus Bayesian inference in such models is ill-posed. Independent of the log-uniform prior, the correlated weight noise approximation has further issues leading to either infinite objective or high risk of overfitting. The above implies that the reported sparsity of obtained solutions cannot be explained by Bayesian or the related minimum description length arguments. We thus study the objective from a non-Bayesian perspective, provide its previously unknown analytical form which allows exact gradient evaluation, and show that the later proposed additive reparametrisation introduces minima not present in the original multiplicative parametrisation. Implications and future research directions are discussed.

    11/08/2017 ∙ by Jiri Hron, et al. ∙ 0 share
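
The Gaussian multiplicative noise that this abstract starts from can be sketched as follows. This is a minimal illustration of the regularisation technique only, assuming noise drawn from N(1, alpha) per activation; it does not reproduce the variational objective analysed in the paper, and the function name is hypothetical.

```python
import random


def gaussian_multiplicative_noise(activations, alpha, rng):
    """Multiply each activation by independent noise drawn from N(1, alpha).

    Assumption: alpha is the noise variance, so alpha = 0 leaves the
    activations unchanged.
    """
    std = alpha ** 0.5
    return [a * rng.gauss(1.0, std) for a in activations]


rng = random.Random(0)
noisy = gaussian_multiplicative_noise([1.0, -2.0, 0.5], alpha=0.25, rng=rng)
```

Because the noise has mean one, the expected value of each noisy activation equals the original activation.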

  • Interpolated Policy Gradient: Merging On-Policy and Off-Policy Gradient Estimation for Deep Reinforcement Learning

    Off-policy model-free deep reinforcement learning methods using previously collected data can improve sample efficiency over on-policy policy gradient techniques. On the other hand, on-policy algorithms are often more stable and easier to use. This paper examines, both theoretically and empirically, approaches to merging on- and off-policy updates for deep reinforcement learning. Theoretical results show that off-policy updates with a value function estimator can be interpolated with on-policy policy gradient updates whilst still satisfying performance bounds. Our analysis uses control variate methods to produce a family of policy gradient algorithms, with several recently proposed algorithms being special cases of this family. We then provide an empirical comparison of these techniques with the remaining algorithmic details fixed, and show how different mixing of off-policy gradient estimates with on-policy samples contribute to improvements in empirical performance. The final algorithm provides a generalization and unification of existing deep policy gradient techniques, has theoretical guarantees on the bias introduced by off-policy updates, and improves on the state-of-the-art model-free deep RL methods on a number of OpenAI Gym continuous control benchmarks.

    06/01/2017 ∙ by Shixiang Gu, et al. ∙ 0 share
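
The interpolation described in this abstract can be pictured, in a heavily simplified form, as a convex combination of two gradient estimates. The sketch below assumes both gradients have already been computed elsewhere; the mixing weight `nu` and the function name are illustrative, and the control variate construction from the paper is not shown.

```python
def interpolate_gradients(on_policy_grad, off_policy_grad, nu):
    """Mix an on-policy and an off-policy gradient estimate.

    Assumption: nu = 0 gives the pure on-policy estimate and nu = 1 the
    pure off-policy estimate.
    """
    if not 0.0 <= nu <= 1.0:
        raise ValueError("nu must lie in [0, 1]")
    return [
        (1.0 - nu) * g_on + nu * g_off
        for g_on, g_off in zip(on_policy_grad, off_policy_grad)
    ]


mixed = interpolate_gradients([1.0, 2.0], [3.0, 4.0], nu=0.5)
```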

  • General Latent Feature Modeling for Data Exploration Tasks

    This paper introduces a general Bayesian nonparametric latent feature model suitable to perform automatic exploratory analysis of heterogeneous datasets, where the attributes describing each object can be either discrete, continuous or mixed variables. The proposed model presents several important properties. First, it accounts for heterogeneous data while it can be inferred in linear time with respect to the number of objects and attributes. Second, its Bayesian nonparametric nature allows us to automatically infer the model complexity from the data, i.e., the number of features necessary to capture the latent structure in the data. Third, the latent features in the model are binary-valued variables, easing the interpretability of the obtained latent features in data exploration tasks.

    07/26/2017 ∙ by Isabel Valera, et al. ∙ 0 share

  • Improving Output Uncertainty Estimation and Generalization in Deep Learning via Neural Network Gaussian Processes

    We propose a simple method that combines neural networks and Gaussian processes. The proposed method can estimate the uncertainty of outputs and flexibly adjust target functions where training data exist, which are advantages of Gaussian processes. The proposed method can also achieve high generalization performance for unseen input configurations, which is an advantage of neural networks. With the proposed method, neural networks are used for the mean functions of Gaussian processes. We present a scalable stochastic inference procedure, where sparse Gaussian processes are inferred by stochastic variational inference, and the parameters of neural networks and kernels are estimated by stochastic gradient descent methods, simultaneously. We use two real-world spatio-temporal data sets to demonstrate experimentally that the proposed method achieves better uncertainty estimation and generalization performance than neural networks and Gaussian processes.

    07/19/2017 ∙ by Tomoharu Iwata, et al. ∙ 0 share
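
To make the "neural network as the mean function of a Gaussian process" idea concrete, the sketch below computes the exact GP posterior mean and variance at a test point for a one-dimensional input with an arbitrary mean function. It assumes an RBF kernel and exact inference on a handful of points; `mean_fn` stands in for a trained neural network, and the sparse stochastic variational inference from the paper is not shown. All names are hypothetical.

```python
import math


def rbf(a, b, length_scale=1.0):
    """RBF kernel with unit signal variance."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def gp_predict(xs, ys, x_star, mean_fn, noise=1e-6):
    """Posterior mean and variance at x_star for a GP with mean function mean_fn."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_star = [rbf(x_star, a) for a in xs]
    # The GP models only the residual between the targets and the mean function.
    alpha = solve(K, [y - mean_fn(x) for x, y in zip(xs, ys)])
    v = solve(K, k_star)
    mean = mean_fn(x_star) + sum(k * a for k, a in zip(k_star, alpha))
    var = rbf(x_star, x_star) - sum(k * w for k, w in zip(k_star, v))
    return mean, var


mean_fn = lambda x: 2.0 * x  # placeholder for a neural network
near = gp_predict([0.0, 1.0, 2.0], [0.5, 2.0, 3.5], 0.0, mean_fn)
far = gp_predict([0.0, 1.0, 2.0], [0.5, 2.0, 3.5], 50.0, mean_fn)
```

Near the training data the prediction follows the observed targets with low variance, while far from the data it falls back to the mean function with variance close to the prior variance.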

  • One-Shot Learning in Discriminative Neural Networks

    We consider the task of one-shot learning of visual categories. In this paper we explore a Bayesian procedure for updating a pretrained convnet to classify a novel image category for which data is limited. We decompose this convnet into a fixed feature extractor and softmax classifier. We assume that the target weights for the new task come from the same distribution as the pretrained softmax weights, which we model as a multivariate Gaussian. By using this as a prior for the new weights, we demonstrate competitive performance with state-of-the-art methods whilst also being consistent with 'normal' methods for training deep networks on large data.

    07/18/2017 ∙ by Jordan Burgess, et al. ∙ 0 share
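
The prior described in this abstract can be sketched by fitting a Gaussian to the pretrained softmax weight vectors and using its negative log density as a penalty on the new class weights. The abstract speaks of a multivariate Gaussian; the version below assumes a diagonal covariance purely to keep the example short, and all names are hypothetical.

```python
def fit_diagonal_gaussian(weight_vectors):
    """Per-dimension mean and variance of the pretrained class weight vectors."""
    n = len(weight_vectors)
    dim = len(weight_vectors[0])
    mean = [sum(w[d] for w in weight_vectors) / n for d in range(dim)]
    var = [sum((w[d] - mean[d]) ** 2 for w in weight_vectors) / n
           for d in range(dim)]
    return mean, var


def neg_log_prior(w, mean, var, eps=1e-8):
    """Negative log density of w under the fitted Gaussian, up to a constant."""
    return 0.5 * sum((wi - m) ** 2 / (v + eps)
                     for wi, m, v in zip(w, mean, var))


pretrained = [[0.0, 1.0], [2.0, 3.0]]   # toy stand-in for softmax weights
mean, var = fit_diagonal_gaussian(pretrained)
penalty = neg_log_prior([3.0, 2.0], mean, var)
```

Adding this penalty to the classification loss for the new category would correspond to a MAP estimate of the new weights under the fitted prior.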

  • Adversarial Examples, Uncertainty, and Transfer Testing Robustness in Gaussian Process Hybrid Deep Networks

    Deep neural networks (DNNs) have excellent representative power and are state-of-the-art classifiers on many tasks. However, they often do not capture their own uncertainties well, making them less robust in the real world as they overconfidently extrapolate and do not notice domain shift. Gaussian processes (GPs) with RBF kernels, on the other hand, have better calibrated uncertainties and do not overconfidently extrapolate far from data in their training set. However, GPs have poor representational power and do not perform as well as DNNs on complex domains. In this paper we show that GP hybrid deep networks, GPDNNs (GPs on top of DNNs and trained end-to-end), inherit the nice properties of both GPs and DNNs and are much more robust to adversarial examples. When extrapolating to adversarial examples and testing in domain shift settings, GPDNNs frequently output high entropy class probabilities corresponding to essentially "don't know". GPDNNs are therefore promising as deep architectures that know when they don't know.

    07/08/2017 ∙ by John Bradshaw, et al. ∙ 0 share
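
The "high entropy class probabilities" mentioned in this abstract can be measured with the Shannon entropy of the predicted distribution. The sketch below is a minimal illustration; the threshold rule in `is_uncertain` is an assumption made for the example and is not taken from the paper.

```python
import math


def predictive_entropy(probs):
    """Shannon entropy (in nats) of a vector of class probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def is_uncertain(probs, fraction=0.8):
    """Flag a prediction whose entropy exceeds a fraction of the maximum.

    Assumption: the maximum entropy for K classes is log(K), reached by the
    uniform distribution; the 0.8 default is arbitrary.
    """
    return predictive_entropy(probs) > fraction * math.log(len(probs))


confident = is_uncertain([0.97, 0.01, 0.01, 0.01])
dont_know = is_uncertain([0.25, 0.25, 0.25, 0.25])
```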

  • Lost Relatives of the Gumbel Trick

    The Gumbel trick is a method to sample from a discrete probability distribution, or to estimate its normalizing partition function. The method relies on repeatedly applying a random perturbation to the distribution in a particular way, each time solving for the most likely configuration. We derive an entire family of related methods, of which the Gumbel trick is one member, and show that the new methods have superior properties in several settings with minimal additional computational cost. In particular, for the Gumbel trick to yield computational benefits for discrete graphical models, Gumbel perturbations on all configurations are typically replaced with so-called low-rank perturbations. We show how a subfamily of our new methods adapts to this setting, proving new upper and lower bounds on the log partition function and deriving a family of sequential samplers for the Gibbs distribution. Finally, we balance the discussion by showing how the simpler analytical form of the Gumbel trick enables additional theoretical results.

    06/13/2017 ∙ by Matej Balog, et al. ∙ 0 share
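
The basic Gumbel trick described at the start of this abstract can be sketched for a small discrete distribution given by unnormalized log-potentials. Adding independent standard Gumbel noise to each log-potential and taking the argmax yields a sample from the distribution, and the mean of the perturbed maxima minus the Euler-Mascheroni constant estimates the log partition function. The new family of methods from the paper is not shown, and the function names are hypothetical.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329  # mean of a standard Gumbel variable


def sample_gumbel(rng):
    """Draw one standard Gumbel(0, 1) variate by inverse transform sampling."""
    u = rng.random()
    while u == 0.0:          # avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))


def gumbel_max(log_potentials, rng):
    """Return (argmax index, max value) of the Gumbel-perturbed log-potentials."""
    perturbed = [phi + sample_gumbel(rng) for phi in log_potentials]
    best = max(range(len(perturbed)), key=perturbed.__getitem__)
    return best, perturbed[best]


def estimate_log_partition(log_potentials, n_samples, rng):
    """Estimate log Z as the average perturbed maximum minus Euler's constant."""
    total = sum(gumbel_max(log_potentials, rng)[1] for _ in range(n_samples))
    return total / n_samples - EULER_GAMMA


rng = random.Random(0)
phi = [0.0, 1.0, 2.0]
sample_index, _ = gumbel_max(phi, rng)
log_z_estimate = estimate_log_partition(phi, 20000, rng)
```

For this toy example the exact value is log(1 + e + e^2), which the estimate can be compared against.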

  • General Latent Feature Models for Heterogeneous Datasets

    Latent feature modeling allows capturing the latent structure responsible for generating the observed properties of a set of objects. It is often used to make predictions either for new values of interest or missing information in the original data, as well as to perform exploratory data analysis. However, although there is an extensive literature on latent feature models for homogeneous datasets, where all the attributes that describe each object are of the same (continuous or discrete) nature, there is a lack of work on latent feature modeling for heterogeneous databases. In this paper, we introduce a general Bayesian nonparametric latent feature model suitable for heterogeneous datasets, where the attributes describing each object can be either discrete, continuous or mixed variables. The proposed model presents several important properties. First, it accounts for heterogeneous data while keeping the properties of conjugate models, which allow us to infer the model in linear time with respect to the number of objects and attributes. Second, its Bayesian nonparametric nature allows us to automatically infer the model complexity from the data, i.e., the number of features necessary to capture the latent structure in the data. Third, the latent features in the model are binary-valued variables, easing the interpretability of the obtained latent features in exploratory data analysis. We show the flexibility of the proposed model by solving both prediction and data analysis tasks on several real-world datasets. Moreover, a software package of the GLFM is publicly available for other researchers to use and improve.

    06/12/2017 ∙ by Isabel Valera, et al. ∙ 0 share
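
A binary latent feature matrix with an unbounded number of features, as described in this abstract, can be illustrated by sampling from an Indian buffet process prior. The abstract does not name the prior, so using the Indian buffet process here is an assumption (it is a standard choice for Bayesian nonparametric latent feature models); the likelihood for heterogeneous attributes is not shown, and the function names are hypothetical.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw a Poisson(lam) variate with Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def sample_ibp(n_objects, alpha, rng):
    """Sample a binary object-by-feature matrix from an Indian buffet process."""
    rows, counts = [], []   # counts[k] = number of objects using feature k
    for i in range(1, n_objects + 1):
        # Reuse each existing feature with probability proportional to its use.
        row = [1 if rng.random() < m / i else 0 for m in counts]
        # Introduce a Poisson-distributed number of brand-new features.
        new = sample_poisson(alpha / i, rng)
        counts = [m + z for m, z in zip(counts, row)] + [1] * new
        rows.append(row + [1] * new)
    n_features = len(counts)
    # Pad earlier rows with zeros so that the matrix is rectangular.
    return [row + [0] * (n_features - len(row)) for row in rows]


Z = sample_ibp(n_objects=10, alpha=2.0, rng=random.Random(0))
```

The number of columns of `Z` is not fixed in advance, which is how the number of latent features can be inferred from data rather than chosen by hand.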

  • Bayesian inference on random simple graphs with power law degree distributions

    We present a model for random simple graphs with a degree distribution that obeys a power law (i.e., is heavy-tailed). To attain this behavior, the edge probabilities in the graph are constructed from Bertoin-Fujita-Roynette-Yor (BFRY) random variables, which have been recently utilized in Bayesian statistics for the construction of power law models in several applications. Our construction readily extends to capture the structure of latent factors, similarly to stochastic blockmodels, while maintaining its power law degree distribution. The BFRY random variables are well approximated by gamma random variables in a variational Bayesian inference routine, which we apply to several network datasets for which power law degree distributions are a natural assumption. By learning the parameters of the BFRY distribution via probabilistic inference, we are able to automatically select the appropriate power law behavior from the data. In order to further scale our inference procedure, we adopt stochastic gradient ascent routines where the gradients are computed on minibatches (i.e., subsets) of the edges in the graph.

    02/27/2017 ∙ by Juho Lee, et al. ∙ 0 share
