1 Introduction
Probabilistic directed generative models are flexible tools that have recently captured the attention of the deep learning community (Uria et al., 2014; Mnih & Gregor, 2014; Kingma & Welling, 2013; Rezende et al., 2014). These models can produce samples that mimic the learned data, and they allow principled assessment of the uncertainty in their predictions. These properties are crucial for addressing challenges such as uncertainty quantification and data imputation, and they allow the ideas of deep learning to be extended to related machine learning fields such as probabilistic numerics.
The main challenge is that exact inference in directed nonlinear probabilistic models is typically intractable because of the required marginalisation of the latent components. This has led to the development of probabilistic generative models based on neural networks (Kingma & Welling, 2013; Mnih & Gregor, 2014; Rezende et al., 2014), in which probability distributions are defined for the input and output of individual layers. Efficient approximate inference methods have been developed in this context based on stochastic variational inference or stochastic backpropagation. However, a question that remains open is how to properly regularize the model parameters. Techniques such as dropout have been used to avoid overfitting (Hinton et al., 2012). Alternatively, Bayesian inference offers a mathematically grounded framework for regularization.
Blundell et al. (2015) show that Bayesian (variational) inference outperforms dropout. Kingma et al. (2015) and Gal & Ghahramani (2015) have shown that dropout itself can be reformulated in the variational inference context.
In this work, we develop a new scalable Bayesian nonparametric generative model. We focus on deep Gaussian processes (DGPs), which we augment by means of a recognition model: a multilayer perceptron (MLP) between the latent representations of the layers of the DGP. This allows us to simplify inference and to avoid the challenge of initializing variational parameters. In addition, although DGPs have so far been applied only to small-scale data, we show how these models can be scaled by means of a new formulation of the lower bound that allows most of the computation to be distributed.
The main contributions of this work are: i) a novel extension to DGPs by means of a recognition model that we call Variational AutoEncoded deep Gaussian process (VAEDGP), ii) a derivation of the distributed variational lower bound of the model and iii) a demonstration of the utility of the model on several mainstream deep learning datasets.
2 Deep Gaussian Processes
Gaussian processes provide flexible, nonparametric, probabilistic approaches to function estimation. However, their tractability comes at a price: they can only represent a restricted class of functions. Indeed, even though sophisticated definitions and combinations of covariance functions can lead to powerful models (Durrande et al., 2011; Gönen & Alpaydin, 2011; Hensman et al., 2013; Duvenaud et al., 2013; Wilson & Adams, 2013), the assumption of a joint normal distribution over instantiations of the latent function remains; this limits the applicability of the models. One line of recent research addressing this limitation has focused on function composition (Snelson et al., 2004; Calandra et al., 2014). Inspired by deep neural networks, a deep Gaussian process instead employs process composition (Lawrence & Moore, 2007; Damianou et al., 2011; Lázaro-Gredilla, 2012; Damianou & Lawrence, 2013; Hensman & Lawrence, 2014).
A deep GP is a deep directed graphical model that consists of multiple layers of latent variables and employs Gaussian processes to govern the mapping between consecutive layers (Lawrence & Moore, 2007; Damianou, 2015). Observed outputs are placed in the bottom layer and observed inputs (if any) are placed in the top layer, as illustrated in Figure 1. More formally, consider a set of data $\mathbf{Y} \in \mathbb{R}^{N \times D}$ with $N$ datapoints and $D$ dimensions. A deep GP then defines $L$ layers of latent variables, $\{\mathbf{X}_l\}_{l=1}^{L}$, through the following nested noise model definition:
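(1)  $\mathbf{Y} = f_1(\mathbf{X}_1) + \epsilon_1, \quad \epsilon_1 \sim \mathcal{N}(\mathbf{0}, \sigma_1^2 \mathbf{I})$
(2)  $\mathbf{X}_{l-1} = f_l(\mathbf{X}_l) + \epsilon_l, \quad \epsilon_l \sim \mathcal{N}(\mathbf{0}, \sigma_l^2 \mathbf{I}), \quad l = 2, \dots, L$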
where the functions $f_l$ are drawn from Gaussian processes with covariance functions $k_l$, i.e. $f_l(\mathbf{x}) \sim \mathcal{GP}(0, k_l(\mathbf{x}, \mathbf{x}'))$. In the unsupervised case, the top hidden layer is assigned a unit Gaussian as a fairly uninformative prior, which also provides soft regularization, i.e. $\mathbf{X}_L \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. In the supervised learning scenario, the inputs of the top hidden layer are observed and govern its hidden outputs.
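The nested noise model above can be read as a recipe for sampling. The following is a minimal sketch, not the authors' implementation: it draws one sample from a deep GP prior in the unsupervised case, starting from a unit Gaussian top layer and passing each layer through a GP with an exponentiated quadratic kernel. The function names, the layer sizes (2, 3, 5), the noise level, the unit kernel hyperparameters and the jitter term are all assumptions made for illustration.

```python
import numpy as np


def eq_kernel(a, b, variance=1.0, lengthscale=1.0):
    """Exponentiated quadratic covariance between the rows of a and b."""
    sq_dist = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by rounding before exponentiating.
    return variance * np.exp(-0.5 * np.maximum(sq_dist, 0.0) / lengthscale**2)


def sample_gp_layer(inputs, out_dim, noise_std, rng, jitter=1e-6):
    """Draw out_dim independent GP functions at `inputs`, then add noise."""
    n = inputs.shape[0]
    cov = eq_kernel(inputs, inputs) + jitter * np.eye(n)  # jitter for stability
    chol = np.linalg.cholesky(cov)
    f = chol @ rng.standard_normal((n, out_dim))
    return f + noise_std * rng.standard_normal((n, out_dim))


def sample_deep_gp(n, layer_dims, noise_std=0.01, seed=0):
    """Sample top-down: unit Gaussian top layer, then one GP per layer."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, layer_dims[0]))  # prior of the top hidden layer
    layers = [x]
    for dim in layer_dims[1:]:
        x = sample_gp_layer(x, dim, noise_std, rng)
        layers.append(x)
    return layers


# Hypothetical sizes: 2D top layer, 3D middle layer, 5D observed layer.
layers = sample_deep_gp(n=50, layer_dims=[2, 3, 5])
```

Each element of `layers` holds one layer, ordered from the top hidden layer down to the observed layer.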
The expressive power of a deep GP is significantly greater than that of a standard GP, because the successive warping of latent variables through the hierarchy allows for modeling nonstationarities and sophisticated, nonparametric functional “features” (see Figure 2). Similarly to how a GP is the limit of an infinitely wide neural network, a deep GP is the limit where the parametric function composition of a deep neural network turns into a process composition. Specifically, a deep neural network can be written as:
(3) 
where and are parameter matrices and denotes an activation function. By nonparametrically treating the stacked function composition as process composition we obtain the deep GP definition of Equation 2.
2.1 Variational Inference
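In a standard GP model, inference is performed by analytically integrating out the latent function $f$. In the DGP case, the latent variables $\{\mathbf{X}_l\}_{l=1}^{L}$ additionally have to be integrated out to obtain the marginal likelihood of the DGP over the observed data:
(4)  $p(\mathbf{Y}) = \int p(\mathbf{Y}|\mathbf{X}_1) \prod_{l=2}^{L} p(\mathbf{X}_{l-1}|\mathbf{X}_l) \, p(\mathbf{X}_L) \, \mathrm{d}\mathbf{X}_1 \cdots \mathrm{d}\mathbf{X}_L$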
The above marginal likelihood and the following derivation target unsupervised learning problems; however, it is straightforward to extend the formulation to the supervised scenario by assuming that the top-layer inputs are observed. Bayesian inference in DGPs involves optimizing the model hyperparameters with respect to the marginal likelihood and inferring the posterior distributions of the latent variables for training and test data. Exact inference in DGPs is intractable because of the integral in (4). Approximate inference techniques such as variational inference and expectation propagation (EP) have been developed (Damianou & Lawrence, 2013; Bui et al., 2015). By taking a variational approach, i.e. assuming a variational posterior distribution of the latent variables, $q(\{\mathbf{X}_l\}_{l=1}^{L}) = \prod_{l=1}^{L} q(\mathbf{X}_l)$, a lower bound of the log marginal likelihood can be derived as
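(5)  $\log p(\mathbf{Y}) \geq \mathcal{L} = \sum_{l=1}^{L} \mathcal{F}_l + \sum_{l=1}^{L-1} H(q(\mathbf{X}_l)) - \mathrm{KL}\left(q(\mathbf{X}_L) \,\|\, p(\mathbf{X}_L)\right)$
where $\mathcal{F}_1 = \langle \log p(\mathbf{Y}|\mathbf{X}_1) \rangle_{q(\mathbf{X}_1)}$ and $\mathcal{F}_l = \langle \log p(\mathbf{X}_{l-1}|\mathbf{X}_l) \rangle_{q(\mathbf{X}_{l-1}) q(\mathbf{X}_l)}$, $l = 2, \dots, L$, are known as the free energies of the individual layers. $H(q(\mathbf{X}_l))$ denotes the entropy of the variational distribution $q(\mathbf{X}_l)$, and $\mathrm{KL}(q(\mathbf{X}_L) \,\|\, p(\mathbf{X}_L))$ denotes the Kullback-Leibler divergence between $q(\mathbf{X}_L)$ and $p(\mathbf{X}_L)$. According to the model definition, both $p(\mathbf{Y}|\mathbf{X}_1)$ and $p(\mathbf{X}_{l-1}|\mathbf{X}_l)$ are Gaussian processes. The variational distribution of $\mathbf{X}_l$ is typically parameterized as a Gaussian distribution, $q(\mathbf{X}_l) = \prod_{n=1}^{N} \mathcal{N}(\mathbf{x}_l^{(n)} | \boldsymbol{\mu}_l^{(n)}, \boldsymbol{\Sigma}_l^{(n)})$.
3 Variational AutoEncoded Model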
Damianou & Lawrence (2013) provide a tractable variational inference method for DGPs by deriving a closed-form lower bound of the marginal likelihood. While successfully demonstrating the strengths of DGPs, their experiments are limited to very small scales (hundreds of datapoints). The limited scalability is mostly due to the computationally expensive covariance matrix inversion and the large number of variational parameters (growing linearly with the size of the data).
To scale up DGPs to handle large datasets, we propose a new deep generative model by augmenting the DGP with a variationally autoencoded inference mechanism. We refer to this inference mechanism as a recognition model (see Figure 3). A recognition model provides us with a mechanism for constraining the variational posterior distributions of the latent variables. Instead of representing the variational posteriors with individual variational parameters, which become a big burden for optimization, we define them as a transformation of the observed data. This allows us to reduce the number of parameters to optimize (which no longer grows linearly with the size of the data) and to perform fast inference at test time. A similar constraint mechanism has been referred to as a “back-constraint” in the GP literature. Lawrence & Quiñonero Candela (2006) constrained the latent inputs of a GP with a parametric model to enforce local distance preservation in the inputs; Ek et al. (2008) followed the same approach for constraining the latent space with information from additional views of the data. Our formulation differs from the above in that we constrain a whole latent posterior distribution through the variational parameters. Damianou & Lawrence (2015) also constrained the posterior, but this was achieved using a direct, specific parameterization for that distribution, making the back-constraint grow with the number of inputs. Another difference from the previous approaches is that we consider deep hierarchies of latent spaces and, consequently, of recognition models. Our constraint mechanism is more similar to that of other variationally autoencoded models (Salakhutdinov & Hinton, 2008; Snoek et al., 2012a; Kingma & Welling, 2013; Mnih & Gregor, 2014; Rezende et al., 2014). The main differences are that our work has a Bayesian nonparametric generative model and a closed-form variational lower bound. This enables us to be Bayesian when inferring the generative distribution and avoids sampling from the variational posterior distributions.
Specifically, for the observed layer, the posterior mean of the variational distribution is defined as a transformation of the observed data:
(6)  $\boldsymbol{\mu}_1^{(n)} = g_1(\mathbf{y}^{(n)})$
where the transformation function $g_1$ is parameterized by a multilayer perceptron (MLP). Similarly, for the hidden layers, the posterior mean is defined as a transformation of the posterior mean from the lower layer:
(7)  $\boldsymbol{\mu}_l^{(n)} = g_l(\boldsymbol{\mu}_{l-1}^{(n)})$
Note that all the transformation functions are deterministic; therefore, the posterior means of all the hidden layers can be viewed as direct transformations of the observed data, i.e. $\boldsymbol{\mu}_l^{(n)} = g_l(\dots g_1(\mathbf{y}^{(n)}))$. We use the hyperbolic tangent activation function for all the MLPs. The posterior variances $\boldsymbol{\Sigma}_l^{(n)}$ are assumed to be diagonal and the same across all the datapoints.
The closed-form variational lower bound allows us to apply sophisticated gradient optimization methods such as L-BFGS. It avoids the problem of initializing and optimizing a large number of variational parameters. The initialization of the variational parameters is converted into the initialization of neural network parameters, which has been well studied in the deep learning literature. Furthermore, with the reparameterization, the variational parameters move coherently during optimization through changes in the neural network mapping. This helps the model avoid local optima and approach better solutions. Figure 3(a) shows an example of the learned 2D latent space of a one-layer (shallow) DGP and VAEDGP from the same initialization. Clearly, the recognition model in VAEDGP helps move the datapoints to a better solution. Note that the recognition model serves as a (deterministic) reparameterization of the variational parameters. Therefore, the parameters of the MLPs are the variational parameters of our model. Because the model is automatically “regularized” by Bayesian inference, an overly complicated recognition model will not cause the generative model to overfit. This allows us to freely choose a sufficiently powerful recognition model (see Fig. 3(b) for an example).
Footnote 3: The training and test log-likelihoods shown are not directly comparable. The training log-likelihood shown is the lower bound in Equation 5 divided by the size of the data. The test log-likelihood shown is an approximation.
Computationally, the recognition model reparameterization removes the linear growth of the number of variational parameters with the size of the data. Based on this formulation, we develop a distributed variational inference approach, which is described in detail in the following section.
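A minimal sketch of the reparameterization described in this section, under stated assumptions: one small tanh MLP per layer maps the data (or the posterior means of the layer below) to posterior means, and a single diagonal variance is shared by all datapoints. The helper names, the hidden width of 20, the latent sizes and the random initialization are hypothetical choices, not the settings used in the experiments. The sketch also evaluates the KL term between the top-layer posterior and a unit Gaussian prior, which is available in closed form for Gaussians.

```python
import numpy as np


def init_mlp(sizes, rng):
    """Random weights and zero biases for a fully connected network."""
    return [
        (rng.standard_normal((m, n)) / np.sqrt(m), np.zeros(n))
        for m, n in zip(sizes[:-1], sizes[1:])
    ]


def mlp_forward(params, x):
    """tanh hidden layers followed by a linear output layer."""
    for w, b in params[:-1]:
        x = np.tanh(x @ w + b)
    w, b = params[-1]
    return x @ w + b


def posterior_means(recognition, y):
    """Chain the MLPs bottom-up: data -> means of layer 1 -> means of layer 2."""
    means, h = [], y
    for params in recognition:
        h = mlp_forward(params, h)
        means.append(h)
    return means


def kl_to_unit_gaussian(mean, var):
    """KL(q || N(0, I)) summed over datapoints; `var` is shared by all rows."""
    n = mean.shape[0]
    return 0.5 * (np.sum(mean**2) + n * np.sum(var - np.log(var) - 1.0))


def count_params(recognition):
    return sum(w.size + b.size for params in recognition for w, b in params)


rng = np.random.default_rng(0)
data_dim, latent_dims = 16, [5, 2]  # observed dimension, then latents bottom to top
recognition, in_dim = [], data_dim
for q in latent_dims:
    recognition.append(init_mlp([in_dim, 20, q], rng))
    in_dim = q

y_small = rng.standard_normal((10, data_dim))
y_large = rng.standard_normal((1000, data_dim))
means_small = posterior_means(recognition, y_small)
means_large = posterior_means(recognition, y_large)

top_var = np.full(latent_dims[-1], 0.1)  # diagonal variance shared by all datapoints
kl = kl_to_unit_gaussian(means_large[-1], top_var)
```

The value of `count_params(recognition)` is the same whether the model is applied to `y_small` or `y_large`: the number of variational parameters is set by the MLP sizes and not by the number of datapoints.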
4 Distributed Variational Inference
The exact evaluation of the variational lower bound in Equation (5) is still intractable because of the expectations in the free energy terms. A variational approximation technique developed for the Bayesian Gaussian process latent variable model (BGPLVM) (Titsias & Lawrence, 2010) can be applied to obtain a lower bound of these free energy terms. Taking the observed layer as an example, by introducing noise-free observations $\mathbf{F}_1$, a set of auxiliary variables, namely the inducing variables $\mathbf{U}_1$, and a set of variational parameters, namely the inducing inputs $\mathbf{Z}_1$, the conditional distribution $p(\mathbf{Y}|\mathbf{X}_1)$ is reformulated as
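(8)  $p(\mathbf{Y}|\mathbf{X}_1) = \int p(\mathbf{Y}|\mathbf{F}_1) \, p(\mathbf{F}_1|\mathbf{U}_1, \mathbf{X}_1) \, p(\mathbf{U}_1) \, \mathrm{d}\mathbf{F}_1 \, \mathrm{d}\mathbf{U}_1$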
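where each row of $\mathbf{U}_1$ represents an inducing variable, which is associated with the inducing input in the same row of $\mathbf{Z}_1$. Assuming a particular form of the variational distribution of $\mathbf{F}_1$ and $\mathbf{U}_1$, $q(\mathbf{F}_1, \mathbf{U}_1 | \mathbf{X}_1) = p(\mathbf{F}_1 | \mathbf{U}_1, \mathbf{X}_1) \, q(\mathbf{U}_1)$, the free energy of the observed layer can be lower bounded by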
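(9)  $\mathcal{F}_1 \geq \left\langle \log p(\mathbf{Y}|\mathbf{F}_1) \right\rangle_{p(\mathbf{F}_1|\mathbf{U}_1, \mathbf{X}_1) \, q(\mathbf{U}_1) \, q(\mathbf{X}_1)} - \mathrm{KL}\left(q(\mathbf{U}_1) \,\|\, p(\mathbf{U}_1)\right)$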
As shown by Titsias & Lawrence (2010), this lower bound can be formulated in closed form for kernels such as the linear and exponentiated quadratic kernels. For other kernels, it can be computed approximately using techniques such as Gaussian quadrature. Note that the optimal variational distribution of the inducing variables, $q(\mathbf{U}_1)$, can be derived in closed form by setting its gradient to zero; therefore, the only variational parameters that we need to optimize for the observed layer are the variational distribution $q(\mathbf{X}_1)$ and the inducing inputs $\mathbf{Z}_1$.
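To illustrate the two routes mentioned above, the sketch below computes the expectation of an exponentiated quadratic kernel under a one-dimensional Gaussian distribution over its input, once with the known closed-form expression and once with Gauss-Hermite quadrature. This expectation is one ingredient of bounds of this type; the function names and numerical values are assumptions chosen for illustration, and the one-dimensional setting is a simplification.

```python
import numpy as np


def eq_kernel_1d(x, z, variance, lengthscale):
    """Exponentiated quadratic kernel between scalars (or arrays) x and z."""
    return variance * np.exp(-0.5 * (x - z) ** 2 / lengthscale**2)


def expected_kernel_closed_form(mu, s, z, variance, lengthscale):
    """E[k(x, z)] for x ~ N(mu, s), using the Gaussian convolution identity."""
    scale = lengthscale**2 + s
    return (
        variance
        * np.sqrt(lengthscale**2 / scale)
        * np.exp(-0.5 * (mu - z) ** 2 / scale)
    )


def expected_kernel_quadrature(mu, s, z, variance, lengthscale, order=40):
    """Same expectation by Gauss-Hermite quadrature; usable with other kernels."""
    nodes, weights = np.polynomial.hermite.hermgauss(order)
    x = mu + np.sqrt(2.0 * s) * nodes  # change of variables for N(mu, s)
    values = eq_kernel_1d(x, z, variance, lengthscale)
    return np.sum(weights * values) / np.sqrt(np.pi)


# Hypothetical values: posterior mean 0.3, variance 0.5, inducing input -0.2.
exact = expected_kernel_closed_form(0.3, 0.5, -0.2, 1.5, 0.8)
approx = expected_kernel_quadrature(0.3, 0.5, -0.2, 1.5, 0.8)
```

The quadrature route only needs pointwise kernel evaluations, which is why it remains available when no closed form is known, at the cost of an approximation whose accuracy depends on the quadrature order.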
For the hidden layers, the variational posterior distributions are slightly different, because the posterior of the inducing variables depends on the output variable of that layer. For the $l$-th hidden layer, the variational posterior distribution is therefore defined as $q(\mathbf{F}_l, \mathbf{U}_l | \mathbf{X}_{l-1}, \mathbf{X}_l) = p(\mathbf{F}_l | \mathbf{U}_l, \mathbf{X}_l) \, q(\mathbf{U}_l | \mathbf{X}_{l-1})$. Similarly to the observed layer, a lower bound of the free energy can be derived as:
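(10)  $\mathcal{F}_l \geq \left\langle \log p(\mathbf{X}_{l-1}|\mathbf{F}_l) - \mathrm{KL}\left(q(\mathbf{U}_l|\mathbf{X}_{l-1}) \,\|\, p(\mathbf{U}_l)\right) \right\rangle_{p(\mathbf{F}_l|\mathbf{U}_l, \mathbf{X}_l) \, q(\mathbf{U}_l|\mathbf{X}_{l-1}) \, q(\mathbf{X}_{l-1}) \, q(\mathbf{X}_l)}$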
With Equations (5), (9) and (10), a closed-form variational lower bound of the log marginal likelihood is defined.
The computation of the lower bounds of the free energy terms is expensive, which limits the scalability of the original DGP. Fortunately, with the introduced auxiliary variables and the recognition model, most of the computation can be distributed in a data-parallel fashion. We exploit this fact and derive a distributed formulation of the lower bound, which allows us to scale up our inference method to large data. Specifically, the lower bound of the free energy consists of a few terms (explained below) that depend on the size of data: , , and . All of them can be formulated as a sum of intermediate results from individual datapoints:
where , are the covariance matrices of and respectively, is the cross-covariance matrix between and , and , and , and . This enables data parallelism by distributing the computation that depends on individual datapoints and only collecting the intermediate results that do not scale with the size of the data. Gal et al. (2014) and Dai et al. (2014) exploit a similar formulation for distributing the computation of BGPLVM; however, in their formulations, the gradients of variational parameters that depend on individual datapoints have to be collected centrally. Such collection severely limits the scalability of the model.
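The data-parallel pattern described above can be sketched as follows. This is a simplified illustration and not the bound of this paper: it uses deterministic inputs in place of expectations under the variational distribution, and the statistic names (`yy`, `kuf_y`, `kuf_kfu`), the chunking into four workers and all sizes are hypothetical. It only shows that statistics of this kind decompose into per-datapoint sums, so each worker can return results whose size depends on the number of inducing points and output dimensions but not on the number of datapoints.

```python
import numpy as np


def eq_kernel(a, b, variance=1.0, lengthscale=1.0):
    """Exponentiated quadratic covariance between the rows of a and b."""
    sq_dist = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return variance * np.exp(-0.5 * np.maximum(sq_dist, 0.0) / lengthscale**2)


def partial_statistics(x_chunk, y_chunk, z):
    """Per-worker sums; output sizes depend on M and D, not on the chunk size."""
    k_fu = eq_kernel(x_chunk, z)  # shape (n_chunk, M)
    return {
        "yy": np.sum(y_chunk**2),   # contribution to Tr(Y Y^T)
        "kuf_y": k_fu.T @ y_chunk,  # shape (M, D)
        "kuf_kfu": k_fu.T @ k_fu,   # shape (M, M)
    }


def combine(parts):
    """Central step: add up the fixed-size intermediate results."""
    total = {key: 0.0 for key in parts[0]}
    for part in parts:
        for key, value in part.items():
            total[key] = total[key] + value
    return total


rng = np.random.default_rng(1)
n, q, d, m = 600, 2, 4, 15  # datapoints, latent dims, output dims, inducing points
x = rng.standard_normal((n, q))
y = rng.standard_normal((n, d))
z = rng.standard_normal((m, q))

# Pretend each index chunk lives on a separate worker.
chunks = np.array_split(np.arange(n), 4)
distributed = combine([partial_statistics(x[i], y[i], z) for i in chunks])
reference = partial_statistics(x, y, z)  # single-machine computation
```

In this sketch `distributed` and `reference` agree up to floating-point rounding, and only objects of size M x D and M x M cross the worker boundary.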
For hidden layers, the free energy terms are slightly different. Their data-dependent terms additionally involve the expectation with respect to the variational distribution of the output variables: , , and . The first term can be naturally reformulated as a sum across datapoints:
(11) 
For the second term, we can rewrite , where , and . This enables us to formulate it into a distributable form:
(12) 
With the above formulations, we obtain a distributable variational lower bound. For optimization, the gradients of the lower bound with respect to all the model and variational parameters can be derived. As the variational distributions are computed by the recognition model, their gradients are backpropagated through the recognition model, which allows the gradients of its parameters to be computed.
5 Experiments
As a probabilistic generative model, VAEDGP is applicable to a range of different tasks such as data generation and data imputation. In this section we evaluate our model on a variety of problems and compare it with the alternatives in the literature.
5.1 Unsupervised Learning
Table 1: Log-likelihood of the MNIST test set.
Model             MNIST
DBN               138 ± 2
Stacked CAE       121 ± 1.6
Deep GSN          214 ± 1.1
Adversarial nets  225 ± 2
GMMN+AE           282 ± 2
VAEDGP (5)        301.67
VAEDGP (10-50)    674.86
VAEDGP (5-20-50)  723.65
We first apply our model to the combination of Frey faces and Yale faces (FreyYale). The Frey faces dataset contains 1956 frames taken from a video clip. The Yale faces dataset contains 2414 images, which are resized to . We take the last 200 frames from the Frey faces and 300 randomly chosen images from the Yale faces as the test set and use the rest for training. The intensities of the original grayscale images are normalized to . The VAEDGP applied here has two hidden layers (a 2D top hidden layer and a 20D middle hidden layer). The exponentiated quadratic kernel is used for all the layers, with 100 inducing points. All the MLPs in the recognition model have two hidden layers with widths (500-300). As the model is generative, we can draw samples from it by first sampling from the prior distribution of the top hidden layer (a 2D unit Gaussian distribution in this case) and then sampling layer-wise downwards. The generated images are shown in Figure 4(a).
To evaluate the ability of our model to learn the data distribution, we train the VAEDGP on MNIST (LeCun et al., 1998). We use the whole training set for learning, which consists of 60,000 images. The intensities of the original grayscale images are normalized to . We train our model with three different model settings (one, two and three hidden layers). The trained models are evaluated by the log-likelihood of the test set, which consists of 10,000 images. The results are shown in Table 1 along with some baseline performances taken from the literature. The numbers in parentheses indicate the dimensionality of the hidden layers from top to bottom. The exponentiated quadratic kernel is used for all the layers, with 300 inducing points. All the MLPs in the recognition model have two hidden layers with widths (500-300). All our models are trained as a whole from a randomly initialized recognition model.
Footnote 4: As a nonparametric model, the test log-likelihood of VAEDGP is formulated as $\log p(\mathbf{Y}_* | \mathbf{Y})$, where $\mathbf{Y}_*$ is the test data and $\mathbf{Y}$ is the training data. As the true test log-likelihood is intractable, we approximate it.
5.2 Data Imputation
We demonstrate the model’s ability to impute missing data by showing it half of each image in the test set and using the learned VAEDGP to impute the other half. This is a challenging problem because the answer may be ambiguous. For instance, given the right half of a digit “8”, the answers “3” and “8” are both reasonable. We show the imputation performance for the test images in FreyYale and MNIST in Fig. 4(b) and Fig. 6 respectively. We also apply VAEDGP to the street view house number dataset (SVHN) (Netzer et al., 2011). We use three hidden layers, with latent space dimensionalities from top to bottom of (5-30-500). The top two hidden layers use the exponentiated quadratic kernel and the observed layer uses the linear kernel, with 500 inducing points. The learned model is used for imputing the images in the test set (see Fig. 4(c)).
Table 2: Regression results on the Abalone and Creep datasets.
Model      Abalone          Creep
VAEDGP
GP         888.96 ± 78.22   602.11 ± 29.59
Lin. Reg.  917.31 ± 53.76   1865.76 ± 23.36
5.3 Supervised Learning and Bayesian Optimization
In this section we consider two supervised learning problem instances: regression and Bayesian optimization (BO) (Osborne, 2010; Snoek et al., 2012b). We demonstrate the utility of VAEDGP in these settings by evaluating its performance in terms of predictive accuracy and predictive uncertainty quantification. For these experiments we use a VAEDGP with one hidden layer (and one observed input layer) and exponentiated quadratic covariance functions. Furthermore, we incorporate the deep GP modification of Duvenaud et al. (2014), so that the observed input layer has an additional connection to the output layer. Duvenaud et al. (2014) showed that this modification increases the general stability of the method. Since the sample size of the data considered for supervised learning is relatively small, we do not use the recognition model to back-constrain the variational distributions.
In the regression experiments we use the Abalone dataset ( dimensional outputs and dimensional inputs) from UCI and the Creep dataset ( dimensional outputs and dimensional inputs) from Cole et al. (2000). A typical split for this data is to use (Abalone) and (Creep) instances for training. We used inducing inputs for each layer and performed 4 runs with different random splits. We summarize the results in Table 2.
Next, we show how the VAEDGP can be used in the context of probabilistic numerics, in particular for Bayesian optimization (BO) (Osborne, 2010; Snoek et al., 2012b). In BO, the goal is to find the minimum of a function for which only a limited number of evaluations are available. Typically, a GP is fitted to the available data as a surrogate model. The GP is iteratively updated with new function evaluations and used to build an acquisition function that guides the collection of new observations of the function. This is done by balancing exploration (regions with large uncertainty) and exploitation (regions with a low mean). In BO, the model is a crucial element of the process: it should be able to express complex classes of functions and to provide coherent estimates of the function uncertainty. In this experiment we use the nonstationary Branin function to compare the performance of standard GPs and the VAEDGP in the context of BO. We used the popular expected improvement acquisition function (Jones et al., 1998) and ran 10 replicates of the experiment using different initializations, each kicking off the optimization with 3 points randomly selected from the function’s domain. In each replicate we iteratively collected 20 evaluations of the function. In the VAEDGP we used 30 inducing points. Figure 7 shows that using the VAEDGP as a surrogate model results in a gain, especially in the first steps of the optimization. This is due to the ability of VAEDGP to deal with the nonstationary components of the function and to model a much richer class of distributions (e.g. multimodal) in the output layer (as opposed to the standard GP, which assumes joint Gaussianity in the outputs).
Footnote 5: See http://www.sfu.ca/~ssurjano/optimization.html for details. The default domain is in the experiments.
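For reference, expected improvement has a closed form when the surrogate's predictive distribution at a candidate point is Gaussian. The sketch below implements it for minimization; it is independent of the surrogate, which only needs to supply a predictive mean and standard deviation. The function names and the four candidate points are made up for illustration, and for a deep GP surrogate the Gaussian predictive assumption would itself be an approximation.

```python
import math

import numpy as np


def normal_pdf(z):
    """Standard normal density."""
    return np.exp(-0.5 * z**2) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal cumulative distribution, via the error function."""
    return 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))


def expected_improvement(mean, std, best):
    """Expected improvement for minimization.

    mean, std: surrogate predictive mean and standard deviation per candidate.
    best: lowest function value observed so far.
    """
    mean = np.asarray(mean, dtype=float)
    std = np.asarray(std, dtype=float)
    safe_std = np.maximum(std, 1e-12)  # avoid division by zero
    z = (best - mean) / safe_std
    ei = (best - mean) * normal_cdf(z) + safe_std * normal_pdf(z)
    # With no predictive uncertainty, the improvement is deterministic.
    return np.where(std > 0.0, ei, np.maximum(best - mean, 0.0))


# Hypothetical surrogate predictions at four candidate points.
mean = np.array([0.5, 0.2, 0.9, 0.2])
std = np.array([0.1, 0.1, 0.5, 0.0])
ei = expected_improvement(mean, std, best=0.3)
next_index = int(np.argmax(ei))  # candidate to evaluate next
```

The two terms of the formula correspond to the trade-off described above: the first rewards a low predicted mean (exploitation) and the second rewards large predictive uncertainty (exploration).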
6 Conclusion
We have proposed a new deep nonparametric generative model. Although it is general enough to be used in supervised and unsupervised problems, we especially highlighted its usefulness in the latter scenario, which is known to be a major challenge for current deep machine learning approaches. Our model is based on a deep Gaussian process, which we extended with a layer-wise parameterization through multilayer perceptrons, significantly simplifying optimization. Additionally, we developed a new formulation of the lower bound that allows for distributed computations. Overall, our approach is able to perform Bayesian inference using large datasets and compete with current alternatives. Future developments include regularizing the perceptron weights, reformulating the current setup for multiview problems, and incorporating convolutional structures into the objective function.
Acknowledgements. The authors are grateful for the financial support of RADIANT (EU FP7-HEALTH Project Ref 305626), BBSRC Project No BB/K011197/1 and WYSIWYD (EU FP7-ICT Project Ref 612139).
References
 Bengio et al. (2013) Bengio, Yoshua, Mesnil, Grégoire, Dauphin, Yann, and Rifai, Salah. Better Mixing via Deep Representations. In International Conference on Machine Learning, 2013.
 Bengio et al. (2014) Bengio, Yoshua, Laufer, Eric, Alain, Guillaume, and Yosinski, Jason. Deep Generative Stochastic Networks Trainable by Backprop. In International Conference on Machine Learning, 2014.
 Blundell et al. (2015) Blundell, Charles, Cornebise, Julien, Kavukcuoglu, Koray, and Wierstra, Daan. Weight Uncertainty in Neural Networks. In International Conference on Machine Learning, 2015.

 Bui et al. (2015) Bui, Thang D., Hernández-Lobato, José Miguel, Li, Yingzhen, Hernández-Lobato, Daniel, and Turner, Richard E. Training Deep Gaussian Processes using Stochastic Expectation Propagation and Probabilistic Backpropagation. In Workshop on Advances in Approximate Bayesian Inference, NIPS, 2015.
 Calandra et al. (2014) Calandra, Roberto, Peters, Jan, Rasmussen, Carl Edward, and Deisenroth, Marc Peter. Manifold Gaussian processes for regression. Technical report, 2014.
 Cole et al. (2000) Cole, D, Martin-Moran, C, Sheard, AG, Bhadeshia, HKDH, and MacKay, DJC. Modelling creep rupture strength of ferritic steel welds. Science and Technology of Welding & Joining, 5(2):81–89, 2000.
 Dai et al. (2014) Dai, Zhenwen, Damianou, Andreas, Hensman, James, and Lawrence, Neil. Gaussian process models with parallelization and GPU acceleration, 2014.
 Damianou (2015) Damianou, Andreas. Deep Gaussian processes and variational propagation of uncertainty. PhD Thesis, University of Sheffield, 2015.

 Damianou & Lawrence (2015) Damianou, Andreas and Lawrence, Neil. Semi-described and semi-supervised learning with Gaussian processes. In 31st Conference on Uncertainty in Artificial Intelligence (UAI), 2015.
 Damianou & Lawrence (2013) Damianou, Andreas and Lawrence, Neil D. Deep Gaussian processes. In Carvalho, Carlos and Ravikumar, Pradeep (eds.), Proceedings of the Sixteenth International Workshop on Artificial Intelligence and Statistics, volume 31, pp. 207–215, AZ, USA, 4 2013. JMLR W&CP 31.
 Damianou et al. (2011) Damianou, Andreas, Titsias, Michalis K., and Lawrence, Neil D. Variational Gaussian process dynamical systems. In Bartlett, Peter, Peirrera, Fernando, Williams, Chris, and Lafferty, John (eds.), Advances in Neural Information Processing Systems, volume 24, Cambridge, MA, 2011. MIT Press.
 Dasgupta & McAllester (2013) Dasgupta, Sanjoy and McAllester, David (eds.). Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, volume 28 of JMLR Proceedings, 2013. JMLR.org.
 Durrande et al. (2011) Durrande, N, Ginsbourger, D, and Roustant, O. Additive kernels for Gaussian process modeling. Technical report, 2011.
 Duvenaud et al. (2013) Duvenaud, David, Lloyd, James Robert, Grosse, Roger, Tenenbaum, Joshua B., and Ghahramani, Zoubin. Structure discovery in nonparametric regression through compositional kernel search. In Dasgupta & McAllester (2013), pp. 1166–1174.
 Duvenaud et al. (2014) Duvenaud, David, Rippel, Oren, Adams, Ryan, and Ghahramani, Zoubin. Avoiding pathologies in very deep networks. In Kaski, Sami and Corander, Jukka (eds.), Proceedings of the Seventeenth International Workshop on Artificial Intelligence and Statistics, volume 33, Iceland, 2014. JMLR W&CP 33.
 Ek et al. (2008) Ek, Carl Henrik, Rihan, Jon, Torr, Philip, Rogez, Gregory, and Lawrence, Neil D. Ambiguity modeling in latent spaces. In Popescu-Belis, Andrei and Stiefelhagen, Rainer (eds.), Machine Learning for Multimodal Interaction (MLMI 2008), LNCS, pp. 62–73. Springer-Verlag, 28–30 June 2008.
 Gal & Ghahramani (2015) Gal, Yarin and Ghahramani, Zoubin. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. arXiv:1506.02142, 2015.
 Gal et al. (2014) Gal, Yarin, van der Wilk, Mark, and Rasmussen, Carl E. Distributed Variational Inference in Sparse Gaussian Process Regression and Latent Variable Models. In Advances in Neural Information Processing Systems, 2014.
 Gönen & Alpaydin (2011) Gönen, Mehmet and Alpaydin, Ethem. Multiple kernel learning algorithms. Journal of Machine Learning Research, 12:2211–2268, Jul 2011.
 Goodfellow et al. (2014) Goodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. Generative Adversarial Networks. In Advances in Neural Information Processing Systems, 2014.
 Hensman & Lawrence (2014) Hensman, James and Lawrence, Neil D. Nested variational compression in deep Gaussian processes. Technical report, University of Sheffield, 2014.
 Hensman et al. (2013) Hensman, James, Lawrence, Neil D., and Rattray, Magnus. Hierarchical Bayesian modelling of gene expression time series across irregularly sampled replicates and clusters. BMC Bioinformatics, 14(252), 2013. doi: 10.1186/1471-2105-14-252.
 Hinton et al. (2012) Hinton, Geoffrey E., Srivastava, Nitish, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan R. Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580, 2012.
 Jones et al. (1998) Jones, Donald R., Schonlau, Matthias, and Welch, William J. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455–492, 1998.
 Kingma & Welling (2013) Kingma, Diederik P and Welling, Max. Auto-Encoding Variational Bayes. In ICLR, 2013.
 Kingma et al. (2015) Kingma, Diederik P., Salimans, Tim, and Welling, Max. Variational Dropout and the Local Reparameterization Trick. In Advances in Neural Information Processing Systems, 2015.
 Lawrence & Moore (2007) Lawrence, Neil D. and Moore, Andrew J. Hierarchical Gaussian process latent variable models. In Ghahramani, Zoubin (ed.), Proceedings of the International Conference in Machine Learning, volume 24, pp. 481–488. Omnipress, 2007. ISBN 1595937933.
 Lawrence & Quiñonero Candela (2006) Lawrence, Neil D. and Quiñonero Candela, Joaquin. Local distance preservation in the GPLVM through back constraints. In Cohen, William and Moore, Andrew (eds.), Proceedings of the International Conference in Machine Learning, volume 23, pp. 513–520. Omnipress, 2006. ISBN 1595933832. doi: 10.1145/1143844.1143909.
 Lázaro-Gredilla (2012) Lázaro-Gredilla, Miguel. Bayesian warped Gaussian processes. In Bartlett, Peter L., Pereira, Fernando C. N., Burges, Christopher J. C., Bottou, Léon, and Weinberger, Kilian Q. (eds.), Advances in Neural Information Processing Systems, volume 25, Cambridge, MA, 2012.
 LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.

 Li et al. (2015) Li, Yujia, Swersky, Kevin, and Zemel, Richard. Generative Moment Matching Networks. In International Conference on Machine Learning, 2015.
 Mnih & Gregor (2014) Mnih, A. and Gregor, K. Neural Variational Inference and Learning in Belief Networks. In International Conference on Machine Learning, 2014.
 Netzer et al. (2011) Netzer, Yuval, Wang, Tao, Coates, Adam, Bissacco, Alessandro, Wu, Bo, and Ng, Andrew Y. Reading Digits in Natural Images with Unsupervised Feature Learning. NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
 Osborne (2010) Osborne, Michael. Bayesian Gaussian Processes for Sequential Prediction, Optimisation and Quadrature. PhD thesis, University of Oxford, 2010.
 Rezende et al. (2014) Rezende, D J, Mohamed, S, and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, 2014.
 Salakhutdinov & Hinton (2008) Salakhutdinov, Ruslan and Hinton, Geoffrey. Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes. In Advances in Neural Information Processing Systems, volume 20, 2008.
 Snelson et al. (2004) Snelson, Edward, Rasmussen, Carl Edward, and Ghahramani, Zoubin. Warped Gaussian processes. In Thrun, Sebastian, Saul, Lawrence, and Schölkopf, Bernhard (eds.), Advances in Neural Information Processing Systems, volume 16, Cambridge, MA, 2004. MIT Press.

 Snoek et al. (2012a) Snoek, Jasper, Adams, Ryan P., and Larochelle, Hugo. Nonparametric Guidance of Autoencoder Representations using Label Information. Journal of Machine Learning Research, 13:2567–2588, 2012a.
 Snoek et al. (2012b) Snoek, Jasper, Larochelle, Hugo, and Adams, Ryan P. Practical Bayesian optimization of machine learning algorithms, pp. 2951–2959. 2012b.
 Titsias & Lawrence (2010) Titsias, Michalis K. and Lawrence, Neil D. Bayesian Gaussian process latent variable model. In Teh, Yee Whye and Titterington, D. Michael (eds.), Proceedings of the Thirteenth International Workshop on Artificial Intelligence and Statistics, volume 9, pp. 844–851, Chia Laguna Resort, Sardinia, Italy, 13-16 May 2010. JMLR W&CP 9.
 Uria et al. (2014) Uria, Benigno, Murray, Iain, and Larochelle, Hugo. A Deep and Tractable Density Estimator. In International Conference on Machine Learning, 2014.
 Wilson & Adams (2013) Wilson, Andrew Gordon and Adams, Ryan Prescott. Gaussian process kernels for pattern discovery and extrapolation. In Dasgupta & McAllester (2013), pp. 1067–1075.