Probabilistic directed generative models are flexible tools that have recently captured the attention of the deep learning community (Uria et al., 2014; Mnih & Gregor, 2014; Kingma & Welling, 2013; Rezende et al., 2014). These models can produce samples that mimic the learned data, and they allow principled assessment of the uncertainty in the predictions. These properties are crucial to successfully addressing challenges such as uncertainty quantification or data imputation and allow the ideas of deep learning to be extended to related machine learning fields such as probabilistic numerics.
The main challenge is that exact inference on directed nonlinear probabilistic models is typically intractable due to the required marginalisation of the latent components. This has led to the development of probabilistic generative models based on neural networks (Kingma & Welling, 2013; Mnih & Gregor, 2014; Rezende et al., 2014), in which probabilistic distributions are defined for the input and output of individual layers. Efficient approximate inference methods have been developed in this context based on stochastic variational inference or stochastic back-propagation. However, a question that remains open is how to properly regularize the model parameters. Techniques such as dropout have been used to avoid over-fitting (Hinton et al., 2012). Alternatively, Bayesian inference offers a mathematically grounded framework for regularization. Blundell et al. (2015) show that Bayesian (variational) inference outperforms dropout. Kingma et al. (2015) and Gal & Ghahramani (2015) have shown that dropout itself can be reformulated in the variational inference context.
In this work, we develop a new scalable Bayesian non-parametric generative model. We focus on deep Gaussian processes (DGPs), which we augment by means of a recognition model: a multi-layer perceptron (MLP) between the latent representations of the layers of the DGP. This allows us to simplify the inference and to avoid the challenge of initializing variational parameters. In addition, although DGPs have so far been applied only to small-scale data, we show how these models can be scaled by means of a new formulation of the lower bound that allows most of the computation to be distributed.
The main contributions of this work are: i) a novel extension to DGPs by means of a recognition model that we call Variational Auto-Encoded deep Gaussian process (VAE-DGP), ii) a derivation of the distributed variational lower bound of the model and iii) a demonstration of the utility of the model on several mainstream deep learning datasets.
2 Deep Gaussian Processes
Gaussian processes provide flexible, non-parametric, probabilistic approaches to function estimation. However, their tractability comes at a price: they can only represent a restricted class of functions. Indeed, even though sophisticated definitions and combinations of covariance functions can lead to powerful models (Durrande et al., 2011; Gönen & Alpaydin, 2011; Hensman et al., 2013; Duvenaud et al., 2013; Wilson & Adams, 2013), the assumption of a joint normal distribution over instantiations of the latent function remains; this limits the applicability of the models. One line of recent research addressing this limitation has focused on function composition (Snelson et al., 2004; Calandra et al., 2014). Inspired by deep neural networks, a deep Gaussian process instead employs process composition (Lawrence & Moore, 2007; Damianou et al., 2011; Lázaro-Gredilla, 2012; Damianou & Lawrence, 2013; Hensman & Lawrence, 2014).
A deep GP is a deep directed graphical model that consists of multiple layers of latent variables and employs Gaussian processes to govern the mapping between consecutive layers (Lawrence & Moore, 2007; Damianou, 2015). Observed outputs are placed in the bottom layer and observed inputs (if any) are placed in the top layer, as illustrated in Figure 1. More formally, consider a set of data with datapoints and dimensions. A deep GP then defines layers of latent variables, through the following nested noise model definition:
where the functions are drawn from Gaussian processes with covariance functions , i.e. . In the unsupervised case, the top hidden layer is assigned a unit Gaussian as a fairly uninformative prior which also provides soft regularization, i.e. . In the supervised learning scenario, the inputs of the top hidden layer are observed and govern its hidden outputs.
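The nested construction described above can be illustrated by drawing a sample from a deep GP prior in the unsupervised case. The following is only a minimal sketch under stated assumptions, not the implementation used in this work: the function names (`eq_kernel`, `sample_gp_layer`, `sample_deep_gp_prior`), the shared unit kernel hyper-parameters, the noise variance and the jitter term are all hypothetical choices made for illustration.

```python
import numpy as np

def eq_kernel(A, B, variance=1.0, lengthscale=1.0):
    """Exponentiated quadratic covariance between the rows of A and B."""
    sq_dist = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Clamp tiny negative values caused by floating-point cancellation.
    return variance * np.exp(-0.5 * np.maximum(sq_dist, 0.0) / lengthscale**2)

def sample_gp_layer(inputs, out_dim, noise_var, rng, jitter=1e-6):
    """Draw out_dim independent GP functions at `inputs`, then add Gaussian noise."""
    n = inputs.shape[0]
    K = eq_kernel(inputs, inputs) + jitter * np.eye(n)  # jitter: numerical assumption
    chol = np.linalg.cholesky(K)
    f = chol @ rng.standard_normal((n, out_dim))
    return f + np.sqrt(noise_var) * rng.standard_normal((n, out_dim))

def sample_deep_gp_prior(n, layer_dims, noise_var=1e-3, seed=0):
    """layer_dims lists dimensionalities from the top hidden layer down to the data."""
    rng = np.random.default_rng(seed)
    # Unsupervised case: unit Gaussian prior on the top hidden layer.
    layers = [rng.standard_normal((n, layer_dims[0]))]
    for out_dim in layer_dims[1:]:
        # Each layer is a noisy GP mapping of the layer above it.
        layers.append(sample_gp_layer(layers[-1], out_dim, noise_var, rng))
    return layers

# Example: a 2D top layer, a 3D middle layer and 5D "observations".
layers = sample_deep_gp_prior(n=20, layer_dims=[2, 3, 5])
```

The last element of `layers` plays the role of the observed outputs; each earlier element is one hidden layer, ordered from the top of the hierarchy downwards.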
The expressive power of a deep GP is significantly greater than that of a standard GP, because the successive warping of latent variables through the hierarchy allows for modeling non-stationarities and sophisticated, non-parametric functional “features” (see Figure 2). Similarly to how a GP is the limit of an infinitely wide neural network, a deep GP is the limit where the parametric function composition of a deep neural network turns into a process composition. Specifically, a deep neural network can be written as:
where and are parameter matrices and denotes an activation function. By non-parametrically treating the stacked function composition as process composition we obtain the deep GP definition of Equation 2.
2.1 Variational Inference
In a standard GP model, inference is performed by analytically integrating out the latent function . In the DGP case, the latent variables have to additionally be integrated out, to obtain the marginal likelihood of DGPs over the observed data:
The above marginal likelihood and the following derivation are aimed at unsupervised learning problems; however, it is straightforward to extend the formulation to the supervised scenario by assuming observed . Bayesian inference in DGPs involves optimizing the model hyper-parameters with respect to the marginal likelihood and inferring the posterior distributions of latent variables for training/testing data. Exact inference in DGPs is intractable due to the intractable integral in (4). Approximate inference techniques such as variational inference and expectation propagation (EP) have been developed (Damianou & Lawrence, 2013; Bui et al., 2015). By taking a variational approach, i.e. assuming a variational posterior distribution of latent variables, , a lower bound of the log marginal distribution can be derived as
where and , are known as the free energies of the individual layers. denotes the entropy of the variational distribution and denotes the Kullback-Leibler divergence between and . According to the model definition, both and are Gaussian processes. The variational distribution of is typically parameterized as a Gaussian distribution.
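When the variational distributions are Gaussian, the entropy and Kullback-Leibler terms of such a bound have standard closed forms. The sketch below shows them for a factorised (diagonal) Gaussian and a unit Gaussian prior on the top layer. It is an illustration under assumptions rather than the code used in this work: the function names and the array layout (one row per datapoint, one column per latent dimension) are hypothetical.

```python
import numpy as np

def gaussian_entropy(var):
    """Entropy of a factorised Gaussian with per-entry variances `var`."""
    var = np.asarray(var, dtype=float)
    return 0.5 * np.sum(np.log(2.0 * np.pi * np.e * var))

def kl_to_unit_gaussian(mean, var):
    """KL( N(mean, diag(var)) || N(0, I) ), summed over all entries."""
    mean = np.asarray(mean, dtype=float)
    var = np.asarray(var, dtype=float)
    return 0.5 * np.sum(var + mean**2 - 1.0 - np.log(var))

# Example: 4 datapoints with a 2D top layer.
mean = np.zeros((4, 2))
var = np.ones((4, 2))
kl = kl_to_unit_gaussian(mean, var)   # the posterior equals the prior here
entropy = gaussian_entropy(var)
```

Both quantities are sums over datapoints and latent dimensions, so they are cheap compared with the free energy terms.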
3 Variational Auto-Encoded Model
Damianou & Lawrence (2013) provide a tractable variational inference method for DGPs by deriving a closed-form lower bound of the marginal likelihood. While successfully demonstrating the strengths of DGPs, the experiments that they show are limited to very small scales (hundreds of datapoints). The limitation on scalability is mostly due to the computationally expensive covariance matrix inversion and the large number of variational parameters (growing linearly with the size of the data).
To scale up DGPs to handle large datasets, we propose a new deep generative model, by augmenting the DGP with a variationally auto-encoded inference mechanism. We refer to this inference mechanism as a recognition model (see Figure 3). A recognition model provides us with a mechanism for constraining the variational posterior distributions of latent variables. Instead of representing variational posteriors as individual variational parameters, which become a big burden to optimization, we define them as a transformation of observed data. This allows us to reduce the number of parameters for optimization (which no longer grows linearly with the size of the data) and to perform fast inference at test time. A similar constraint mechanism has been referred to as a “back-constraint” in the GP literature. Lawrence & Quiñonero Candela (2006) constrained the latent inputs of a GP with a parametric model to enforce local distance preservation in the inputs; Ek et al. (2008) followed the same approach for constraining the latent space with information from additional views of the data. Our formulation differs from the above in that we instead constrain a whole latent posterior distribution through the variational parameters. Damianou & Lawrence (2015) also constrained the posterior, but this was achieved using a direct specific parameterization for that distribution, making this back-constraint grow with the number of inputs. Another difference from the previous approaches is that we consider deep hierarchies of latent spaces and, consequently, of recognition models. Our constraint mechanism is more similar to that of other variationally auto-encoded models, such as those of Salakhutdinov & Hinton (2008), Snoek et al. (2012a), Kingma & Welling (2013), Mnih & Gregor (2014) and Rezende et al. (2014). The main differences are that we have a Bayesian non-parametric generative model and a closed-form variational lower bound. This enables us to be Bayesian when inferring the generative distribution and avoids sampling from variational posterior distributions.
Specifically, for the observed layer, the posterior mean of the variational distribution is defined as a transformation of the observed data:
where the transformation function is parameterized by a multi-layer perceptron (MLP). Similarly, for the hidden layers, the posterior mean is defined as a transformation of the posterior mean from the lower layer:
Note that all the transformation functions are deterministic; therefore, the posterior means of all the hidden layers can be viewed as direct transformations of the observed data, i.e. . We use the hyperbolic tangent activation function for all the MLPs. The posterior variances are assumed to be diagonal and the same across all the datapoints.
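The chaining of recognition models described above can be sketched as follows. This is a minimal illustration under assumptions, not the implementation used in this work: the function names (`init_mlp`, `mlp_forward`, `posterior_means`), the weight initialisation scale, the linear output layer and the toy dimensionalities are hypothetical; only the hyperbolic tangent hidden units and the bottom-up chaining of posterior means follow the text.

```python
import numpy as np

def init_mlp(sizes, rng, scale=0.1):
    """Random weights and zero biases for a fully connected network."""
    return [
        (scale * rng.standard_normal((n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]

def mlp_forward(params, x):
    """tanh hidden units; the linear output layer is an assumption."""
    h = x
    for W, b in params[:-1]:
        h = np.tanh(h @ W + b)
    W, b = params[-1]
    return h @ W + b

def posterior_means(recognition_mlps, Y):
    """Chain the MLPs upwards: data -> mean of layer 1 -> mean of layer 2 -> ..."""
    means, h = [], Y
    for params in recognition_mlps:
        h = mlp_forward(params, h)
        means.append(h)
    return means

# Example: 8D toy data, a 4D first hidden layer and a 2D top layer,
# each MLP with two hidden layers of widths 500 and 300.
rng = np.random.default_rng(0)
mlps = [init_mlp([8, 500, 300, 4], rng), init_mlp([4, 500, 300, 2], rng)]
Y = rng.standard_normal((6, 8))
means = posterior_means(mlps, Y)
```

Because every mapping is deterministic, the mean of any hidden layer is a function of the observed data alone, and the MLP weights are the only quantities that need to be optimized for the posterior means.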
The closed-form variational lower bound allows us to apply sophisticated gradient optimization methods such as L-BFGS. It avoids the problem of initializing and optimizing a large number of variational parameters. The initialization of variational parameters is converted into the initialization of neural network parameters, which has been well studied in the deep learning literature. Furthermore, with the reparameterization, the variational parameters are moved coherently during optimization through the changes of the neural network mapping. This helps the model avoid local optima and approach better solutions. Figure 3(a) shows an example of the learned 2D latent space of a one-layer (shallow) DGP and a VAE-DGP from the same initialization. Clearly, the recognition model in VAE-DGP helps move the datapoints to a better solution. Note that the recognition model serves as a (deterministic) reparameterization of variational parameters. Therefore, the parameters of the MLP are the variational parameters of our model. Because the model is automatically “regularized” by Bayesian inference, an overly complicated recognition model will not cause the generative model to overfit. This allows us to freely choose a powerful enough recognition model (see Fig. 3(b) for an example). Note that the training and test log-likelihoods shown in Fig. 3(b) are not directly comparable: the training log-likelihood is the lower bound in Equation 5 divided by the size of the data, whereas the test log-likelihood is an approximation: .
Computationally, the recognition model re-parameterization removes the linear growth of the number of variational parameters with the size of the data. Based on this formulation, we develop a distributed variational inference approach, which is described in detail in the following section.
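A small worked count makes the saving concrete. The numbers below are illustrative assumptions: 784 is the number of pixels of a 28 x 28 MNIST image and 500-300 are the MLP widths used in the experiments, but the latent dimensionality of 30 and the helper names are hypothetical, and the shared diagonal variances are left out of both counts.

```python
def mlp_param_count(sizes):
    """Weights plus biases of a fully connected network with the given layer sizes."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))

def free_mean_count(n_data, latent_dim):
    """Number of posterior means when each datapoint has its own free parameters."""
    return n_data * latent_dim

# 60,000 training images, a hypothetical 30D first hidden layer.
per_datapoint = free_mean_count(60_000, 30)          # grows linearly with the data
recognition = mlp_param_count([784, 500, 300, 30])   # independent of the data size
```

Under these assumptions the free parameterization needs 1,800,000 posterior means, whereas the recognition model has 551,830 parameters regardless of how many datapoints are added.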
4 Distributed Variational Inference
The exact evaluation of the variational lower bound in Equation (5) is still intractable due to the expectation in the free energy terms. A variational approximation technique developed for the Bayesian Gaussian process latent variable model (BGPLVM) (Titsias & Lawrence, 2010) can be applied to obtain a lower bound of these free energy terms. Taking the observed layer as an example, by introducing noise-free observations , a set of auxiliary variables, namely inducing variables , and a set of variational parameters, namely inducing inputs , the conditional distribution is reformulated as
where each row of represents an inducing variable which is associated with the inducing input at the same row of . Assuming a particular form of the variational distribution of and : , the free energy of the observed layer can be lower bounded by
As shown by Titsias & Lawrence (2010), this lower bound can be formulated in closed form for kernels such as the linear and exponentiated quadratic kernels. For other kernels, it can be computed approximately using techniques such as Gaussian quadrature. Note that the optimal value of can be derived in closed form by setting its gradient to zero; therefore, the only variational parameters that we need to optimize for the observed layer are and .
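To illustrate why the exponentiated quadratic kernel admits closed-form expectations, the sketch below computes the expected cross-covariance between a Gaussian-distributed latent input and a set of inducing inputs, which is one of the kernel expectations that such bounds require. It is a minimal illustration under assumptions: the function names, the shared lengthscale and the array shapes are hypothetical, and the result is the standard Gaussian integral rather than code taken from this work.

```python
import numpy as np

def eq_kernel(X, Z, variance=1.0, lengthscale=1.0):
    """Exponentiated quadratic kernel between the rows of X and Z."""
    diff = X[:, None, :] - Z[None, :, :]
    return variance * np.exp(-0.5 * np.sum(diff**2, axis=2) / lengthscale**2)

def expected_eq_kernel(mu, S, Z, variance=1.0, lengthscale=1.0):
    """E[k(x_n, z_m)] for x_n ~ N(mu_n, diag(S_n)), for every pair (n, m)."""
    ell2 = lengthscale**2
    diff = mu[:, None, :] - Z[None, :, :]                    # N x M x Q
    denom = ell2 + S[:, None, :]                             # N x 1 x Q
    log_scale = -0.5 * np.sum(np.log1p(S / ell2), axis=1)    # N
    exponent = -0.5 * np.sum(diff**2 / denom, axis=2)        # N x M
    return variance * np.exp(log_scale[:, None] + exponent)

# Example: 4 latent points with variance 0.5 per dimension, 3 inducing inputs.
rng = np.random.default_rng(0)
mu = rng.standard_normal((4, 2))
S = np.full((4, 2), 0.5)
Z = rng.standard_normal((3, 2))
psi1 = expected_eq_kernel(mu, S, Z)
```

With zero variance the expression reduces to the ordinary kernel evaluation, and a Monte Carlo average over samples of the latent input can serve as a sanity check for the closed form.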
For the hidden layers, the variational posterior distributions are slightly different, because the posterior of the inducing variables depends on the output variable of that layer. For the -th hidden layer, the variational posterior distribution is, therefore, defined as . Similar to the observed layer, a lower bound of the free energy can be derived as:
The computation of the lower bounds of free energy terms is expensive. This limits the scalability of the original DGP. Fortunately, with the introduced auxiliary variables and the recognition model, most of the computation is distributable in a data-parallel fashion. We exploit this fact and derive a distributed formulation of the lower bound. This allows us to scale up our inference method to large datasets. Specifically, the lower bound of the free energy consists of a few terms (explained below) that depend on the size of the data: , , and . All of them can be formulated as a sum of intermediate results from individual datapoints:
where , are the covariance matrices of and respectively, is the cross-covariance matrix between and , and , and , and . This enables data-parallelism by distributing the computation that depends on individual datapoints and only collecting the intermediate results that do not scale with the size of the data. Gal et al. (2014) and Dai et al. (2014) exploit a similar formulation for distributing the computation of BGPLVM; however, in their formulations, the gradients of variational parameters that depend on individual datapoints have to be collected centrally. Such collection severely limits the scalability of the model.
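The data-parallel pattern described above can be sketched as a map step that computes partial statistics on each chunk of data and a reduce step that sums them. The sketch is a simplification under stated assumptions and not the implementation used in this work: the latent inputs are treated as fixed points (so the kernel expectations reduce to plain kernel evaluations), a serial list comprehension stands in for the workers, and the function names and dictionary keys are hypothetical.

```python
import numpy as np

def eq_kernel(X, Z, variance=1.0, lengthscale=1.0):
    """Exponentiated quadratic kernel between the rows of X and Z."""
    diff = X[:, None, :] - Z[None, :, :]
    return variance * np.exp(-0.5 * np.sum(diff**2, axis=2) / lengthscale**2)

def partial_statistics(X_chunk, Y_chunk, Z, variance=1.0):
    """Per-worker contribution; every output has a size independent of the chunk length."""
    K_fu = eq_kernel(X_chunk, Z, variance=variance)   # n_chunk x M
    return {
        "yy": float(np.sum(Y_chunk**2)),              # trace of Y^T Y
        "kff": variance * X_chunk.shape[0],           # trace of K_ff for this kernel
        "Kuf_Y": K_fu.T @ Y_chunk,                    # M x D
        "Kuf_Kfu": K_fu.T @ K_fu,                     # M x M
    }

def reduce_statistics(parts):
    """Sum the partial results; only these small objects need to be communicated."""
    return {key: sum(part[key] for part in parts) for key in parts[0]}

# Example: 40 datapoints split over 4 hypothetical workers, 5 inducing inputs.
rng = np.random.default_rng(1)
X = rng.standard_normal((40, 2))
Y = rng.standard_normal((40, 3))
Z = rng.standard_normal((5, 2))
chunks = zip(np.array_split(X, 4), np.array_split(Y, 4))
total = reduce_statistics([partial_statistics(Xc, Yc, Z) for Xc, Yc in chunks])
```

The reduced statistics are identical to those computed on the full dataset in one pass, while each worker only ever touches its own datapoints.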
For hidden layers, the free energy terms are slightly different. Their data-dependent terms additionally involve the expectation with respect to the variational distribution of output variables: , , and . The first term can be naturally reformulated as a sum across datapoints:
For the second term, we can rewrite , where , and . This enables us to formulate it into a distributable form:
With the above formulations, we obtain a distributable variational lower bound. For optimization, the gradients of all the model and variational parameters can be derived with respect to the lower bound. As the variational distributions are computed according to the recognition model, the gradients of are back-propagated (through the recognition model), which allows the gradients of its parameters to be computed.
5 Experiments
As a probabilistic generative model, VAE-DGP is applicable to a range of different tasks such as data generation, data imputation, etc. In this section we evaluate our model on a variety of problems and compare it with the alternatives in the literature.
5.1 Unsupervised Learning
|Model||Test log-likelihood|
|Stacked CAE||121 ± 1.6|
|Deep GSN||214 ± 1.1|
|Adversarial nets||225 ± 2|
We first apply our model to the combination of Frey faces and Yale faces (Frey-Yale). The Frey faces dataset contains 1956 frames taken from a video clip. The Yale faces dataset contains 2414 images, which are resized to . We take the last 200 frames from the Frey faces and 300 images randomly from the Yale faces as the test set and use the rest for training. The intensities of the original gray-scale images are normalized to . The applied VAE-DGP has two hidden layers (a 2D top hidden layer and a 20D middle hidden layer). The exponentiated quadratic kernel is used for all the layers with 100 inducing points. All the MLPs in the recognition model have two hidden layers with widths (500-300). As a generative model, we can draw samples from the learned model by sampling first from the prior distribution of the top hidden layer (a 2D unit Gaussian distribution in this case) and then layer-wise downwards. The generated images are shown in Figure 4(a).
To evaluate our model's ability to learn the data distribution, we train the VAE-DGP on MNIST (LeCun et al., 1998). We use the whole training set for learning, which consists of 60,000 images. The intensities of the original gray-scale images are normalized to . We train our model with three different model settings (one, two and three hidden layers). The trained models are evaluated by the log-likelihood of the test set, which consists of 10,000 images. (As a non-parametric model, the test log-likelihood of VAE-DGP is formulated as , where is the test data and is the training data. As the true test log-likelihood is intractable, we approximate it as .) The results are shown in Table 1 along with some baseline performances taken from the literature. The numbers in parentheses indicate the dimensionality of the hidden layers from top to bottom. The exponentiated quadratic kernel is used for all the layers with 300 inducing points. All the MLPs in the recognition model have two hidden layers with widths (500-300). All our models are trained as a whole from a randomly initialized recognition model.
5.2 Data Imputation
We demonstrate the model’s ability to impute missing data by showing the model half of each image in the test set and using the learned VAE-DGP to impute the other half. This is a challenging problem because there might be ambiguities in the answers. For instance, when shown the right half of a digit “8”, the answers “3” and “8” are both reasonable. We show the imputation performance for the test images in Frey-Yale and MNIST in Fig. 4(b) and Fig. 6 respectively. We also apply VAE-DGP to the Street View House Numbers (SVHN) dataset (Netzer et al., 2011). We use three hidden layers with the dimensionality of the latent spaces from top to bottom being (5-30-500). The top two hidden layers use the exponentiated quadratic kernel and the observed layer uses the linear kernel with 500 inducing points. The learned model is used for imputing the images in the test set (see Fig. 4(c)).
|Lin. Reg.||917.31 ± 53.76|
|Lin. Reg.||1865.76 ± 23.36|
5.3 Supervised Learning and Bayesian Optimization
In this section we consider two supervised learning problem instances: regression and Bayesian optimization (BO) (Osborne, 2010; Snoek et al., 2012b). We demonstrate the utility of VAE-DGP in these settings by evaluating its performance in terms of predictive accuracy and predictive uncertainty quantification. For these experiments we use a VAE-DGP with one hidden layer (and one observed input layer) and exponentiated quadratic covariance functions. Furthermore, we incorporate the deep GP modification of Duvenaud et al. (2014) so that the observed input layer has an additional connection to the output layer. Duvenaud et al. (2014) showed that this modification increases the general stability of the method. Since the sample size of the data considered for supervised learning is relatively small, we do not use the recognition model to back-constrain the variational distributions.
In the regression experiments we use the Abalone dataset ( -dimensional outputs and dimensional inputs) from UCI and the Creep dataset ( -dimensional outputs and dimensional inputs) from Cole et al. (2000). A typical split for this data is to use (Abalone) and (Creep) instances for training. We used inducing inputs for each layer and performed 4 runs with different random splits. We summarize the results in Table 2.
Next, we show how the VAE-DGP can be used in the context of probabilistic numerics, in particular for Bayesian optimization (BO) (Osborne, 2010; Snoek et al., 2012b). In BO, the goal is to find for where a limited number of evaluations are available. Typically, a GP is used to fit the available data, as a surrogate model. The GP is iteratively updated with new function evaluations and used to build an acquisition function able to guide the collection of new observations of . This is done by balancing exploration (regions with large uncertainty) and exploitation (regions with a low mean). In BO, the model is a crucial element of the process: it should be able to express complex classes of functions and to provide coherent estimates of the function uncertainty. In this experiment we use the non-stationary Branin function (see http://www.sfu.ca/~ssurjano/optimization.html for details; the default domain is used in the experiments) to compare the performance of standard GPs and the VAE-DGP in the context of BO. We used the popular expected improvement (Jones et al., 1998) acquisition function and we ran 10 replicates of the experiment using different initializations, each starting the optimization with 3 points randomly selected from the function’s domain . In each replicate we iteratively collected 20 evaluations of the function. In the VAE-DGP we used 30 inducing points. Figure 7 shows that using the VAE-DGP as a surrogate model results in a gain, especially in the first steps of the optimization. This is due to the ability of VAE-DGP to deal with the non-stationary components of the function and to model a much richer class of distributions (e.g. multi-modal) in the output layer (as opposed to the standard GP, which assumes joint Gaussianity in the outputs).
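For reference, the expected improvement acquisition has a well-known closed form when the surrogate's prediction at a candidate point is Gaussian. The sketch below implements that form for minimisation. It is an illustration under assumptions: the function name and arguments are hypothetical, and how the predictive distribution of the deep model is summarised before the acquisition is evaluated is not specified here.

```python
import math
import numpy as np

def expected_improvement(mean, std, best_observed):
    """Expected improvement (minimisation) under a Gaussian prediction N(mean, std^2)."""
    mean = np.asarray(mean, dtype=float)
    std = np.asarray(std, dtype=float)
    improvement = best_observed - mean
    safe_std = np.where(std > 0, std, 1.0)          # avoid division by zero
    z = improvement / safe_std
    erf = np.vectorize(math.erf, otypes=[float])
    cdf = 0.5 * (1.0 + erf(z / math.sqrt(2.0)))     # standard normal CDF
    pdf = np.exp(-0.5 * z**2) / math.sqrt(2.0 * math.pi)
    ei = improvement * cdf + std * pdf
    # With no predictive uncertainty the improvement is deterministic.
    return np.where(std > 0, ei, np.maximum(improvement, 0.0))

# Example: three candidates with the same predicted mean but growing uncertainty.
ei = expected_improvement(np.zeros(3), np.array([0.5, 1.0, 2.0]), best_observed=0.0)
```

The first term rewards candidates whose predicted mean is below the best value observed so far (exploitation), and the second term rewards predictive uncertainty (exploration).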
6 Conclusion
We have proposed a new deep non-parametric generative model. Although general enough to be used in supervised and unsupervised problems, we especially highlighted its usefulness in the latter scenario, a case which is known to be a major challenge for current deep machine learning approaches. Our model is based on a deep Gaussian process, which we extended with a layer-wise parameterization through multilayer perceptrons, significantly simplifying optimization. Additionally, we developed a new formulation of the lower bound that allows for distributed computations. Overall, our approach is able to perform Bayesian inference using large datasets and compete with current alternatives. Future developments include regularizing the perceptron weights, reformulating the current setup for multi-view problems and incorporating convolutional structures into the objective function.
Acknowledgement. The authors acknowledge the financial support of RADIANT (EU FP7-HEALTH Project Ref 305626), BBSRC Project No BB/K011197/1 and WYSIWYD (EU FP7-ICT Project Ref 612139).
- Bengio et al. (2013) Bengio, Yoshua, Mesnil, Grégoire, Dauphin, Yann, and Rifai, Salah. Better Mixing via Deep Representations. In International Conference on Machine Learning, 2013.
- Bengio et al. (2014) Bengio, Yoshua, Laufer, Eric, Alain, Guillaume, and Yosinski, Jason. Deep Generative Stochastic Networks Trainable by Backprop. In International Conference on Machine Learning, 2014.
- Blundell et al. (2015) Blundell, Charles, Cornebise, Julien, Kavukcuoglu, Koray, and Wierstra, Daan. Weight Uncertainty in Neural Networks. In International Conference on Machine Learning, 2015.
- Bui et al. (2015) Bui, Thang D., Hernández-Lobato, José Miguel, Li, Yingzhen, Hernández-Lobato, Daniel, and Turner, Richard E. Training Deep Gaussian Processes using Stochastic Expectation Propagation and Probabilistic Backpropagation. In Workshop on Advances in Approximate Bayesian Inference, NIPS, 2015.
- Calandra et al. (2014) Calandra, Roberto, Peters, Jan, Rasmussen, Carl Edward, and Deisenroth, Marc Peter. Manifold Gaussian processes for regression. Technical report, 2014.
- Cole et al. (2000) Cole, D, Martin-Moran, C, Sheard, AG, Bhadeshia, HKDH, and MacKay, DJC. Modelling creep rupture strength of ferritic steel welds. Science and Technology of Welding & Joining, 5(2):81–89, 2000.
- Dai et al. (2014) Dai, Zhenwen, Damianou, Andreas, Hensman, James, and Lawrence, Neil. Gaussian process models with parallelization and GPU acceleration, 2014.
- Damianou (2015) Damianou, Andreas. Deep Gaussian processes and variational propagation of uncertainty. PhD Thesis, University of Sheffield, 2015.
- Damianou & Lawrence (2015) Damianou, Andreas and Lawrence, Neil. Semi-described and semi-supervised learning with Gaussian processes. In 31st Conference on Uncertainty in Artificial Intelligence (UAI), 2015.
- Damianou & Lawrence (2013) Damianou, Andreas and Lawrence, Neil D. Deep Gaussian processes. In Carvalho, Carlos and Ravikumar, Pradeep (eds.), Proceedings of the Sixteenth International Workshop on Artificial Intelligence and Statistics, volume 31, pp. 207–215, AZ, USA, 4 2013. JMLR W&CP 31.
- Damianou et al. (2011) Damianou, Andreas, Titsias, Michalis K., and Lawrence, Neil D. Variational Gaussian process dynamical systems. In Bartlett, Peter, Pereira, Fernando, Williams, Chris, and Lafferty, John (eds.), Advances in Neural Information Processing Systems, volume 24, Cambridge, MA, 2011. MIT Press.
- Dasgupta & McAllester (2013) Dasgupta, Sanjoy and McAllester, David (eds.). Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, volume 28 of JMLR Proceedings, 2013. JMLR.org.
- Durrande et al. (2011) Durrande, N, Ginsbourger, D, and Roustant, O. Additive kernels for Gaussian process modeling. Technical report, 2011.
- Duvenaud et al. (2013) Duvenaud, David, Lloyd, James Robert, Grosse, Roger, Tenenbaum, Joshua B., and Ghahramani, Zoubin. Structure discovery in nonparametric regression through compositional kernel search. In Dasgupta & McAllester (2013), pp. 1166–1174.
- Duvenaud et al. (2014) Duvenaud, David, Rippel, Oren, Adams, Ryan, and Ghahramani, Zoubin. Avoiding pathologies in very deep networks. In Kaski, Sami and Corander, Jukka (eds.), Proceedings of the Seventeenth International Workshop on Artificial Intelligence and Statistics, volume 33, Iceland, 2014. JMLR W&CP 33.
- Ek et al. (2008) Ek, Carl Henrik, Rihan, Jon, Torr, Philip, Rogez, Gregory, and Lawrence, Neil D. Ambiguity modeling in latent spaces. In Popescu-Belis, Andrei and Stiefelhagen, Rainer (eds.), Machine Learning for Multimodal Interaction (MLMI 2008), LNCS, pp. 62–73. Springer-Verlag, 28–30 June 2008.
- Gal & Ghahramani (2015) Gal, Yarin and Ghahramani, Zoubin. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. arXiv:1506.02142, 2015.
- Gal et al. (2014) Gal, Yarin, van der Wilk, Mark, and Rasmussen, Carl E. Distributed Variational Inference in Sparse Gaussian Process Regression and Latent Variable Models. In Advances in Neural Information Processing Systems, 2014.
- Gönen & Alpaydin (2011) Gönen, Mehmet and Alpaydin, Ethem. Multiple kernel learning algorithms. Journal of Machine Learning Research, 12:2211–2268, Jul 2011.
- Goodfellow et al. (2014) Goodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. Generative Adversarial Networks. In Advances in Neural Information Processing Systems, 2014.
- Hensman & Lawrence (2014) Hensman, James and Lawrence, Neil D. Nested variational compression in deep Gaussian processes. Technical report, University of Sheffield, 2014.
- Hensman et al. (2013) Hensman, James, Lawrence, Neil D., and Rattray, Magnus. Hierarchical Bayesian modelling of gene expression time series across irregularly sampled replicates and clusters. BMC Bioinformatics, 14(252), 2013. doi: 10.1186/1471-2105-14-252.
- Hinton et al. (2012) Hinton, Geoffrey E., Srivastava, Nitish, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan R. Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580, 2012.
- Jones et al. (1998) Jones, Donald R., Schonlau, Matthias, and Welch, William J. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455–492, 1998.
- Kingma & Welling (2013) Kingma, Diederik P and Welling, Max. Auto-Encoding Variational Bayes. In ICLR, 2013.
- Kingma et al. (2015) Kingma, Diederik P., Salimans, Tim, and Welling, Max. Variational Dropout and the Local Reparameterization Trick. In Advances in Neural Information Processing Systems, 2015.
- Lawrence & Moore (2007) Lawrence, Neil D. and Moore, Andrew J. Hierarchical Gaussian process latent variable models. In Ghahramani, Zoubin (ed.), Proceedings of the International Conference in Machine Learning, volume 24, pp. 481–488. Omnipress, 2007. ISBN 1-59593-793-3.
- Lawrence & Quiñonero Candela (2006) Lawrence, Neil D. and Quiñonero Candela, Joaquin. Local distance preservation in the GP-LVM through back constraints. In Cohen, William and Moore, Andrew (eds.), Proceedings of the International Conference in Machine Learning, volume 23, pp. 513–520. Omnipress, 2006. ISBN 1-59593-383-2. doi: 10.1145/1143844.1143909.
- Lázaro-Gredilla (2012) Lázaro-Gredilla, Miguel. Bayesian warped Gaussian processes. In Bartlett, Peter L., Pereira, Fernando C. N., Burges, Christopher J. C., Bottou, Léon, and Weinberger, Kilian Q. (eds.), Advances in Neural Information Processing Systems, volume 25, Cambridge, MA, 2012.
- LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.
- Li et al. (2015) Li, Yujia, Swersky, Kevin, and Zemel, Richard. Generative Moment Matching Networks. In International Conference on Machine Learning, 2015.
- Mnih & Gregor (2014) Mnih, A. and Gregor, K. Neural Variational Inference and Learning in Belief Networks. In International Conference on Machine Learning, 2014.
- Netzer et al. (2011) Netzer, Yuval, Wang, Tao, Coates, Adam, Bissacco, Alessandro, Wu, Bo, and Ng, Andrew Y. Reading Digits in Natural Images with Unsupervised Feature Learning. NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
- Osborne (2010) Osborne, Michael. Bayesian Gaussian Processes for Sequential Prediction, Optimisation and Quadrature. PhD thesis, University of Oxford, 2010.
- Rezende et al. (2014) Rezende, D J, Mohamed, S, and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, 2014.
- Salakhutdinov & Hinton (2008) Salakhutdinov, Ruslan and Hinton, Geoffrey. Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes. In Advances in Neural Information Processing Systems, volume 20, 2008.
- Snelson et al. (2004) Snelson, Edward, Rasmussen, Carl Edward, and Ghahramani, Zoubin. Warped Gaussian processes. In Thrun, Sebastian, Saul, Lawrence, and Schölkopf, Bernhard (eds.), Advances in Neural Information Processing Systems, volume 16, Cambridge, MA, 2004. MIT Press.
- Snoek et al. (2012a) Snoek, Jasper, Adams, Ryan P., and Larochelle, Hugo. Nonparametric Guidance of Autoencoder Representations using Label Information. Journal of Machine Learning Research, 13:2567–2588, 2012a.
- Snoek et al. (2012b) Snoek, Jasper, Larochelle, Hugo, and Adams, Ryan P. Practical Bayesian optimization of machine learning algorithms, pp. 2951–2959. 2012b.
- Titsias & Lawrence (2010) Titsias, Michalis K. and Lawrence, Neil D. Bayesian Gaussian process latent variable model. In Teh, Yee Whye and Titterington, D. Michael (eds.), Proceedings of the Thirteenth International Workshop on Artificial Intelligence and Statistics, volume 9, pp. 844–851, Chia Laguna Resort, Sardinia, Italy, 13-16 May 2010. JMLR W&CP 9.
- Uria et al. (2014) Uria, Benigno, Murray, Iain, and Larochelle, Hugo. A Deep and Tractable Density Estimator. In International Conference on Machine Learning, 2014.
- Wilson & Adams (2013) Wilson, Andrew Gordon and Adams, Ryan Prescott. Gaussian process kernels for pattern discovery and extrapolation. In Dasgupta & McAllester (2013), pp. 1067–1075.