Deep learning studies functions represented as compositions of other functions, $f = f_L \circ f_{L-1} \circ \dots \circ f_1$. Composition of functions is a natural way to model data generated by a hierarchical process. Each $f_i$ represents a certain part of the hierarchy, and the prior assumptions on $f_i$ reflect the corresponding prior assumptions about the data generating process. Given these prior assumptions, we can compute the posterior distributions of $f_i$ and by doing so uncover the structure of the data and explicitly estimate the uncertainties due to each function (or layer) in the composition.
The uncertainties in each $f_i$ give rise to what we call compositional uncertainty: even noiseless observed data could be generated by compositions of a multitude of different functions which are consistent with the prior (for example, see Fig. 1). An example of a problem where computing such uncertainty is important is the alignment of temporal signals [Kaiser:2018; Kazlauskaite:2019], where it is informative to not only compute a point estimate of the temporal warps aligning the signals to each other, but also to see the range and likelihood of different possible warps. Another example is transfer learning, where we assume that two hierarchical models $f = f_2 \circ f_1$ and $g = g_2 \circ f_1$ share a common part of the hierarchy ($f_1$ in this example). Having fitted $f$, we can fit $g$ to a different data set (or domain) by reusing $f_1$ from $f$ and fitting only $g_2$. In this case it is important to capture a wide distribution of possible realisations of $f_1$, such that it adequately models a common part of $f$ and $g$ (as opposed to finding a single realisation of $f_1$, only useful for explaining the data set we used for fitting $f$).
The research in deep neural networks (DNNs), however, has mostly focused on networks built from a very large number of simple functions, where the goal is to fit the entire composition to the data [Gal:2016], while the computations performed by individual functions or parts of the network are often irrelevant and not interpretable. In other words, DNNs generally do not model a hierarchical process but only the (predictive) distribution of the data. Such a model is thus much closer to a single-layer GP, and the issue of compositional uncertainty is ignored altogether.
DGPs [Damianou:2013], which are compositions of GPs, allow us to impose explicit prior assumptions on each $f_i$ by choosing the kernels, and to perform Bayesian inference to compute the posterior over each layer that is consistent with the observed data. Such posteriors would capture the compositional uncertainty, showing the range of transformations fitting the data. We note that DGPs are inherently unidentifiable, since different compositions can fit the data equally well, and we argue that this unidentifiability should be captured by an adequate Bayesian posterior. However, exact Bayesian inference in DGPs is intractable [Damianou:2013] and we have to resort to approximations. We show that typically used approximate inference schemes make strong simplifying assumptions, resulting in intermediate layers of a DGP collapsing to deterministic transformations. In certain cases (e.g. when we have weak priors on the layers or uncertainty is irrelevant for the application) such behaviour is not an issue; in general, however, it prevents us from using the power of probabilistic models to capture the uncertainty in intermediate layers. By highlighting these limitations of the current inference schemes and suggesting modifications to them, we aim to show that the assumptions made in the approximate inference scheme are central to the estimation of compositional uncertainty.
2 Issues with compositional uncertainty in DGPs
DGPs are compositions of functions $f = f_L \circ \dots \circ f_1$, where $f_i \sim \mathcal{GP}(0, k_i)$. Having observed the data set $(x, y) = \{(x_n, y_n)\}_{n=1}^{N}$, the marginal likelihood

$$p(y \mid x) = \int p(y \mid h_L) \prod_{i=1}^{L} p(h_i \mid h_{i-1}) \,\mathrm{d}h_1 \dots \mathrm{d}h_L, \qquad h_0 = x, \quad h_i = f_i(h_{i-1}), \tag{1}$$

is intractable since it requires integration of the non-linear covariance matrices in the terms $p(h_i \mid h_{i-1})$. As proposed by [Damianou:2013; Salimbeni:2017], a lower bound on this intractable integral can be estimated using variational approximations based on augmenting the GPs with inducing points $u_1, \dots, u_L$, which are treated as variational parameters. Conditioned on the inducing points, the output distribution of each layer can be computed as a GP posterior distribution, treating $u_i$ as (pseudo-)observations. Introducing a variational distribution $q(u_1, \dots, u_L)$, the marginal likelihood lower bound is computed as (see [Salimbeni:2017] for further details):

$$\mathcal{L} = \mathbb{E}_{q}\big[\log p(y \mid h_L)\big] - \mathrm{KL}\Big(q(u_1, \dots, u_L) \,\Big\|\, \textstyle\prod_{i=1}^{L} p(u_i)\Big). \tag{2}$$
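To make the layer-wise sampling concrete, here is a minimal NumPy sketch (not the authors' implementation; a toy RBF kernel, two layers, shared inducing inputs, and hypothetical inducing values are assumed) of drawing one DGP output sample conditioned on inducing points:

```python
import numpy as np

def rbf(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel matrix between 1-d input arrays a and b."""
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def sample_layer(x, z, u, rng, jitter=1e-5):
    """Draw f(x) from the GP posterior conditioned on inducing outputs u at inputs z."""
    Kzz = rbf(z, z) + jitter * np.eye(len(z))
    Kxz = rbf(x, z)
    A = Kxz @ np.linalg.inv(Kzz)                 # predictive weight matrix
    mean = A @ u                                 # posterior mean at x
    cov = rbf(x, x) - A @ Kxz.T + jitter * np.eye(len(x))
    return mean + np.linalg.cholesky(cov) @ rng.standard_normal(len(x))

rng = np.random.default_rng(0)
x = np.linspace(-2.0, 2.0, 50)                   # inputs
z = np.linspace(-2.0, 2.0, 8)                    # inducing inputs (shared across layers here)
u1 = rng.standard_normal(8)                      # hypothetical inducing outputs, layer 1
u2 = rng.standard_normal(8)                      # hypothetical inducing outputs, layer 2

h1 = sample_layer(x, z, u1, rng)                 # intermediate output h_1
y = sample_layer(h1, z, u2, rng)                 # one sample of the DGP output
```

Repeating the last two lines with fresh noise (and fresh draws of $u_1, u_2$ from the variational distribution) yields the Monte-Carlo samples used to estimate the expectation in the bound.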
2.1 Collapse of intermediate layers to deterministic transformations
The variational distribution over the inducing points is typically chosen to factorise across layers as $q(u_1, \dots, u_L) = \prod_{i=1}^{L} q(u_i)$; however, this leads to the layers of a DGP collapsing to deterministic transformations [Havasi:2018]. As demonstrated in Figs. 2 (first row) and 3, different random samples from a DGP fitting the same data look essentially the same, meaning that almost no uncertainty is captured about the transformations in the intermediate layers. At the same time, fitted DGPs corresponding to different random initialisations converge to different solutions, indicating that there are multiple compositions of these three functions fitting the data, which should be captured as part of the compositional uncertainty.
We argue that this uncertainty collapse is due to the factorisation of the variational distribution over the inducing points. The inducing points essentially define the transformation in each layer (more precisely, the output distribution of each layer is parametrised by the inducing points). Hence the variational distribution $q(u_i)$ induces a distribution over the mappings implemented by layer $i$, but under the factorised approximation these transformations are independent across layers. However, in order to fit fixed data, the layers must be dependent (e.g. if $f_1$ implements a blue transformation in Fig. 2, $f_2$ must implement a blue transformation as well, otherwise the entire composition does not fit the data), and the only way to achieve this with a factorised variational distribution is to make each factor essentially a point mass on some transformation. This is akin to subsequent layers having noisy inputs as in [Girard:2003], which leads to increased output uncertainty. Another illustration of this idea is provided in Fig. 1, where we fitted a model with correlated rotations and translations, allowing us to see the variety of possible motions of the square. A model with independent transformations, by contrast, would converge to a single possible sequence of rotations and translations.
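The dependence argument can be illustrated with a toy scalar example (our own sketch, not the DGP model itself): let the first layer scale its input by $s$ and the second by $1/s$. Every dependent pair of scales fits the identity data exactly, while independently drawn scales almost never do, so a factorised posterior is forced to concentrate on a single pair.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 5)    # "data": the composition should reproduce y = x

# Dependent layers: the second scale is tied to the first -- every draw fits.
for _ in range(3):
    s = np.exp(rng.standard_normal())
    assert np.allclose((1.0 / s) * (s * x), x)

# Independent layers: scales drawn separately -- the composition almost never
# fits, so a factorised posterior must collapse to a point mass to fit the data.
s1, s2 = np.exp(rng.standard_normal(2))
print(np.allclose(s2 * (s1 * x), x))    # almost surely False
```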
3 Modelling dependencies between layers
We discuss and compare two ways of introducing the dependencies between the inducing points in order to capture the compositional uncertainty.
Jointly Gaussian inducing points
A straightforward way to introduce dependencies between the inducing points is to define a joint Gaussian $q(u_1, \dots, u_L)$ with a full covariance matrix across layers. In this case, the expectation in Eq. (2) can be approximated numerically by drawing a sample from $q(u_1, \dots, u_L)$; then, conditioned on the sampled inducing points, drawing the DGP output sample by sequentially sampling from the GP posterior distributions in each layer (in the same way as in [Salimbeni:2017]); and finally using these sampled DGP outputs to compute a Monte-Carlo estimate of the expectation. The reparametrisation trick [Kingma:2013] for the Gaussian distribution permits computation of the gradients with respect to the variational parameters. In Figs. 2 (second row) and 4 we show examples of fits of DGPs with jointly Gaussian inducing points. In comparison to Figs. 2 (first row) and 3, this approach indeed retains more of the (compositional) posterior uncertainty about the transformations in the intermediate layers, avoiding the collapse to point estimates. However, different initialisations still result in different posteriors, highlighting the limited capacity of Gaussian distributions to capture complex (e.g. multimodal) posteriors, as previously discussed in [Havasi:2018].
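The reparametrised sampling step can be sketched as follows (a minimal NumPy illustration with hypothetical sizes, $L = 3$ layers and $M = 8$ inducing points each, and an arbitrary illustrative Cholesky factor; variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, M = 3, 8
D = n_layers * M                                     # total number of inducing outputs

# Variational parameters: joint mean and a full Cholesky factor across ALL layers,
# so that inducing points in different layers are correlated.
mu = np.zeros(D)
L_chol = np.eye(D) + 0.1 * np.tril(rng.standard_normal((D, D)), k=-1)

def sample_joint_u(mu, L_chol, rng):
    """Reparametrised sample u = mu + L @ eps, differentiable in (mu, L_chol)."""
    eps = rng.standard_normal(len(mu))
    return (mu + L_chol @ eps).reshape(n_layers, M)  # one row per layer

u = sample_joint_u(mu, L_chol, rng)                  # feed u[i] into layer i
```

Because the randomness enters only through `eps`, gradients with respect to `mu` and `L_chol` pass through the sample, which is exactly what the reparametrisation trick provides.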
Variational distributions of outputs of intermediate layers
Modelling the correlations between the inducing points directly scales poorly with both the number of inducing points and the depth of a DGP. To address these issues we propose the following variational distribution and inference scheme.
Introducing variables $h_i$ for the outputs of the intermediate layers (i.e. $h_i = f_i(h_{i-1})$, $h_0 = x$ and $h_L = f(x)$), the joint distribution of a DGP is as follows:

$$p(y, \{h_i\}, \{u_i\}) = p(y \mid h_L) \prod_{i=1}^{L} p(h_i \mid u_i, h_{i-1}) \, p(u_i), \tag{3}$$

where the terms $p(h_i \mid u_i, h_{i-1})$ are the GP posteriors given the inducing points $u_i$ at inputs $h_{i-1}$.
DGPs have a hierarchical chain structure such that the GPs in different layers are independent conditioned on the intermediate outputs $\{h_i\}$. This allows us to introduce a factorised distribution of inducing points conditioned on $\{h_i\}$. Specifically, we define

$$q(\{h_i\}, \{u_i\}) = \prod_{i=1}^{L} q(h_i) \, q(u_i \mid h_{i-1}, h_i). \tag{4}$$

The terms $q(h_i)$ are free-form Gaussian distributions. The terms $q(u_i \mid h_{i-1}, h_i)$ are the distributions of inducing points mapping the fixed input $h_{i-1}$ to $h_i$. In this case, the optimal variational distribution is available in closed form ([Titsias:2009], see Eq. (10) therein).
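For reference, the closed-form optimum has the familiar form from [Titsias:2009], written here in the notation of this section under the assumption of Gaussian noise with variance $\sigma_i^2$ in layer $i$, with $K_{zz}$ and $K_{zh}$ denoting the kernel matrices between the inducing inputs $z_i$ and the layer inputs $h_{i-1}$ (the exact matrices depend on the chosen kernels):

$$q^*(u_i \mid h_{i-1}, h_i) = \mathcal{N}\!\Big(u_i \,\Big|\, \sigma_i^{-2} K_{zz} \Sigma_i K_{zh} h_i,\; K_{zz} \Sigma_i K_{zz}\Big), \qquad \Sigma_i = \big(K_{zz} + \sigma_i^{-2} K_{zh} K_{hz}\big)^{-1}.$$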
The variational distribution in Eq. (4) leads to the following likelihood lower bound:

$$\mathcal{L} = \mathbb{E}_{q}\big[\log p(y \mid h_L)\big] + \sum_{i=1}^{L} \Big( \underbrace{\mathbb{E}_{q}\big[\log p(h_i \mid u_i, h_{i-1}) - \log q(h_i)\big]}_{(*)} - \underbrace{\mathbb{E}_{q}\big[\mathrm{KL}\big(q(u_i \mid h_{i-1}, h_i) \,\|\, p(u_i)\big)\big]}_{(**)} \Big). \tag{5}$$
The first expectation in this lower bound can be estimated by sampling the DGP outputs similarly to [Salimbeni:2017]. The expectations in terms (*) and (**) are local (restricted to a single layer), so they can either be approximated directly by three nested Monte-Carlo estimators [Rainforth:2018], or reduced to Monte-Carlo estimates over $h_{i-1}$ only (these variables appear in the kernel matrices) using the following observations:
- In (*), integrating over $u_i$ we obtain a log Gaussian density over $h_i$ plus some terms involving $h_{i-1}$ only. We can then compute a Gaussian KL-divergence over $h_i$, obtaining a function of $h_{i-1}$ only.
- In (**), we can directly compute the KL-divergence over $u_i$. It is a quadratic function of $h_i$, which can be integrated against the Gaussian density $q(h_i)$, again obtaining a function of $h_{i-1}$ only.
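The basic building block behind the first observation is a standard Gaussian expectation identity, sketched here with a generic mean map $A$ and conditional covariance $S$ for $p(h_i \mid u_i, h_{i-1})$ (both depend on $h_{i-1}$ through the kernel matrices): for a Gaussian distribution over $u_i$ with moments $m_i$ and $\Sigma_{u_i}$,

$$\mathbb{E}\big[\log \mathcal{N}(h_i \mid A u_i, S)\big] = \log \mathcal{N}(h_i \mid A m_i, S) - \tfrac{1}{2} \operatorname{tr}\!\big(S^{-1} A \Sigma_{u_i} A^{\top}\big),$$

i.e. a log Gaussian density over $h_i$ plus a trace term that does not involve $h_i$.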
Example fits of DGPs using the variational distribution in Eq. (4) are shown in Figs. 2 (third row) and 5. The posteriors are qualitatively similar to those obtained using a jointly Gaussian variational distribution, with the advantage that the number of parameters in the proposed distribution scales linearly rather than quadratically with the number of layers. However, such a variational distribution still cannot capture multimodal posteriors, motivating future work on more flexible variational distributions.
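As a rough illustration of the scaling claim (illustrative counts only: we count a mean vector plus a dense lower-triangular Cholesky factor for each Gaussian, with hypothetical sizes of $L = 5$ layers and $M = 32$ inducing points per layer):

```python
# Variational parameter counts: one joint Gaussian over all L*M inducing
# outputs versus one Gaussian per layer, as in the proposed factorisation.
def gaussian_params(d):
    """Mean vector plus lower-triangular Cholesky factor of a d-dim Gaussian."""
    return d + d * (d + 1) // 2

n_layers, M = 5, 32
joint = gaussian_params(n_layers * M)       # grows quadratically with n_layers
per_layer = n_layers * gaussian_params(M)   # grows linearly with n_layers
print(joint, per_layer)                     # -> 13040 2800
```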
-  A. Damianou and N. Lawrence. Deep Gaussian processes. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2013.
-  Y. Gal. Uncertainty in Deep Learning. PhD thesis, University of Cambridge, 2016.
-  A. Girard, C. E. Rasmussen, J. Q. Candela, and R. Murray-Smith. Gaussian process priors with uncertain inputs: application to multiple-step ahead time series forecasting. In Advances in Neural Information Processing Systems (NIPS), 2003.
-  M. Havasi, J. M. Hernández-Lobato, and J. J. Murillo-Fuentes. Inference in deep Gaussian processes using stochastic gradient Hamiltonian Monte Carlo. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
-  M. Kaiser, C. Otte, T. Runkler, and C. H. Ek. Bayesian alignments of warped multi-output Gaussian processes. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
-  I. Kazlauskaite, C. H. Ek, and N. F. D. Campbell. Gaussian process latent variable alignment learning. In International Conference on Artificial Intelligence and Statistics (AISTATS). PMLR, 2019.
-  D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations (ICLR), 2014.
-  T. Rainforth, R. Cornish, H. Yang, A. Warrington, and F. Wood. On nesting Monte Carlo estimators. In International Conference on Machine Learning (ICML), 2018.
-  H. Salimbeni and M. Deisenroth. Doubly stochastic variational inference for deep Gaussian processes. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
-  M. Titsias. Variational Learning of Inducing Variables in Sparse Gaussian Processes. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2009.