1 Introduction
Neural networks (NNs) have achieved stateoftheart results in a wide variety of supervised learning tasks, such as image recognition
(Krizhevsky et al., 2012; Simonyan and Zisserman, 2014; Russakovsky et al., 2015), speech recognition (Seide et al., 2011; Hinton et al., 2012; Dahl et al., 2012) and machine translation (Bahdanau et al., 2014; Cho et al., 2014). However, NNs have a major drawback in that output uncertainty is not well estimated. NNs give point estimates of outputs at test inputs.Estimating the uncertainty of the output is important in various situations. First, the uncertainty can be used for rejecting the results. In realworld applications such as medical diagnosis, we should avoid automatic decision making with the difficult examples, and ask human experts or conduct other examinations to achieve high reliability. Second, the uncertainty can be used to calculate risk. In some domains, it is important to be able to estimate the probability of critical issues occurring, for example with selfdriving cars or nuclear power plant systems. Third, the uncertainty can be used for the inputs of other machine learning tasks. For example, uncertainty of speech recognition results helps in terms of improving machine translation performance in automatic speech translation systems
(Ney, 1999). The uncertainty would also be helpful for active learning
(Krause and Guestrin, 2007)(Blundell et al., 2015).We propose a simple method that makes it possible for NNs to estimate output uncertainty. With the proposed method, NNs are used for the mean functions of Gaussian processes (GPs) (Rasmussen and Williams, 2006)
. GPs are used as prior distributions over smooth nonlinear functions, and the uncertainty of the output can be estimated with Bayesian inference. GPs perform well in various regression and classification tasks
(Williams and Rasmussen, 1996; Barber and Williams, 1997; NaishGuzman and Holden, 2007; Nickisch and Rasmussen, 2008).Combining NNs and GPs gives us another advantage. GPs exploit local generalization, where generalization is achieved by local interpolation between neighbors
(Bengio et al., 2013). Therefore, GPs can adjust target functions rapidly in the presense of training data, but fail to generalize in regions where there are no training data. On the other hand, NNs have good generalization capability for unseen input configurations by learning multiple levels of distributed representations, but require a huge number of training data. Since GPs and NNs achieve generalization in different ways, the proposed method can improve generalization performance by adopting both of their advantages.
Zero mean functions are usually used since GPs with zero mean functions and some specific kernels can approximate an arbitrary continuous function given enough training data (Micchelli et al., 2006). However, GPs with zero mean functions predict zero outputs far from training samples. Figure 1(a) shows the predictions of GPs with zero mean functions and RBF kernels. When trained with two samples, the prediction values are close to the true values if there are training samples, but far from the true values if there are none. On the other hand, when GPs with appropriate nonzero mean functions are used as in Figure 1(b), the prediction approximates the true values even when there are no training samples. Figure 1 shows that GPs rapidly adjust the prediction when there are training data regardless of the mean function values.
The proposed method gives NNs more flexibility via GPs. In general, the risk of overfitting increases as the model flexibility increases. However, since the proposed method is based on Bayesian inference, where nonlinear functions with GP priors are integrated out, the proposed method can help alleviate overfitting.
To retain the high generalization capability of NNs with the proposed method, large training data are required. The computational complexity of the exact inference of GPs is cubic in the number of training samples, which is prohibitive for large data. We present a scalable stochastic inference procedure for the proposed method, where sparse GPs are inferred by stochastic variational inference (Hensman et al., 2013), and NN parameters and kernel parameters are estimated by stochastic gradient descent methods, simultaneously. By using stochastic optimization, the parameters are updated efficiently without analyzing all the data at each iteration, where a noisy estimate of the gradient of the objective function is used. The inference algorithm also enables us to handle massive data even when they cannot be stored in a memory.
no training  trained w/ two samples  no training  trained w/ two samples 
(a) GP with zero mean functions  (b) GP with nonzero mean functions 
True values (red), mean function values (green) and prediction values (blue) provided by GPs with zero mean functions (a) and GPs with nonzero mean functions (b). The blue area is the 95% confidence interval of the prediction, and the red points indicate the training samples.
The remainder of this paper is organized as follows. In Section 2, we outline related work. In Section 3, we propose a simple approach for combining NNs and GPs, and describe its scalable inference procedure. In Section 4, we demonstrate the effectiveness of the proposed method by using two realworld spatiotemporal data sets in terms of uncertainty estimate and point estimate performance. Finally, we present concluding remarks and a discussion of future work in Section 5.
2 Related work
Bayesian NNs are the most common way of introducing uncertainty into NNs, where distributions over the NN parameters are inferred. A number of Bayesian NN methods have been proposed including Laplace approximation (MacKay, 1992), Hamiltonian Monte Carlo (Neal, 1995), variational inference (Hinton and Van Camp, 1993; Graves, 2011; Blundell et al., 2015; Louizos and Welling, 2016; Sun et al., 2017), expectation propagation (Jylänki et al., 2014)
, stochastic backpropagation
(HernándezLobato and Adams, 2015), and dropout (Kingma et al., 2015; Gal and Ghahramani, 2016) methods. Our proposed method gives the output uncertainty of NNs with a different approach, where we conduct point estimation for the NN parameters, but the NNs are combined with GPs. Therefore, the proposed method incorporates the high generalization performance of NNs and the high flexibility of GPs, and can handle largescale data by using scalable NN stochastic optimization and GP stochastic variational inference.Although zero mean functions are often used for GPs, nonzero mean functions, such as polynomial functions (Blight and Ott, 1975), have also been used. When the mean functions are linear in the parameters, the parameters can be integrated out, which leads to another GP (O’Hagan, 1978). However, scalable inference algorithms for GPs with flexible nonlinear mean functions like NNs have not been proposed.
NNs and GPs are closely related. NNs with a hidden layer converge to GPs in the limit of an infinite number of hidden units (Neal, 1995). A number of methods combining NNs and GPs have been proposed. Deep GPs (Damianou and Lawrence, 2013) use GPs for each layer in NNs, where local generalization is exploited since their inputs are kernel values. GP regression networks (Wilson et al., 2012) combine the structural properties of NNs with the nonparametric flexibility of GPs for accommodating input dependent signal and noise correlations. Manifold GPs (Calandra et al., 2016) and deep NN based GPs (Huang et al., 2015) use NNs for transforming the input features of GPs. Deep kernel learning (Wilson et al., 2016) uses NNs to learn kernels for GPs. The proposed method is different from these methods since it incorporates the outputs of NNs into GPs.
3 Proposed method
Suppose that we have a set of input and output pairs, , where is the th input, and is its output. Output is assumed to be generated by a nonlinear function with Gaussian noise. Let
be the vector of function values on the observed inputs,
. Then, the probability of the output is given by(1) 
where is the observation precision parameter. For the nonlinear function, we assume a GP model,
(2) 
where is the mean function with parameters , and is the kernel function with kernel parameters . We use a NN for the mean function, and we call (2) NeuGaP, which is a simple and new model that fills a gap between the GP and NN literatures. By integrating out the nonlinear function , the likelihood is given by
(3) 
where , is the covariance matrix defined by the kernel function , and is the vector of the output values of the NN on the observed inputs. The parameters in GPs are usually estimated by maximizing the marginal likelihood (3). However, the exact inference is infeasible for large data since the computational complexity is due to the inversion of the covariance matrix.
To reduce the computational complexity while keeping the desirable properties of GPs, we employ sparse GPs (Snelson and Ghahramani, 2006; QuiñoneroCandela and Rasmussen, 2005; Hensman et al., 2013). With a sparse GP, inducing inputs , , and their outputs , , are introduced. The basic idea behind sparse inducing point methods is that when the number of inducing points , computation can be reduced in . The inducing outputs are assumed to be generated by the nonlinear function of NeuGaP (2) taking the inducing inputs as inputs. By marginalizing out the nonlinear function, the probability of the inducing outputs is given by
(4) 
where is the vector of the NN output values on the inducing inputs, is the covariance matrix evaluated between all the inducing inputs, . The output values at the observed inputs are assumed to be conditionally independent of each other given the inducing outputs , then we have
(5) 
where
(6) 
Here, is the dimensional column vector of the covariance function evaluated between observed and inducing inputs, . Equation (5
) is obtained in the same way as the derivation of the predictive mean and variance of test data points in standard GPs.
The lower bound of the log marginal likelihood of the sparse GP to be maximized is
(7) 
where is the variational distribution of the inducing points, and Jensen’s inequality is applied (Titsias, 2009). The log likelihood of the observed output given the inducing points is as follows,
(8) 
where Jensen’s inequality is applied, and the lower bound of is decomposed into terms for each training sample. By using (7) and (8), the lower bound of is given by
(9) 
where
(10) 
and is the KL divergence between two Gaussians, which is calculated by
(11) 
The NN parameters and kernel parameters are updated efficiently by maximizing the lower bound (9) using stochastic gradient descent methods. The parameters in the variational distribution, and , are updated efficiently by using stochastic variational inference (Hoffman et al., 2013). We altenately iterate the stochastic gradient descent and stochastic variational inference for each minibatch of training data.
With stochastic variational inference, the parameters of variational distributions are updated based on the natural gradients (Amari, 1998), which are computed by multiplying the gradients by the inverse of the Fisher information matrix. The natural gradients provide faster convergence than standard gradients by taking account of the information geometry of the parameters. In the exponential family, the natural gradients with respect to natural parameters correspond to the gradients with respect to expectation parameters (Hensman et al., 2012). The natural parameters of Gaussian are and . Its expectation parameters are and . We take a step in the natural gradient direction by employing , where is the natural gradient of the objective function with respect to the natural parameter, is the Fisher information, and is the step length at iteration . The update rules for the proposed model are given by
(12) 
(13) 
We can use minibatches instead of a single training sample to update the natural parameters.
The output distribution given test input is calculated by
(14) 
where is the covariance function column vector evaluated between test input and inducing inputs, and .
4 Experiments
Data
We evaluated our proposed method by using two realworld spatiotemporal data sets. The first data set is the Comprehensive Climate (CC) data set ^{1}^{1}1http://wwwbcf.usc.edu/~liu32/data/NA199020002Monthly.csv, which consists of monthly climate reports for North America (Bahadori et al., 2014; Lozano et al., 2009). We used 19 variables for 1990, such as month, latitude, longitude, carbon dioxide and temperature, which were interpolated on a degree grid with 125 locations. The second data set is the U.S. Historical Climatology Network (USHCN) data set ^{2}^{2}2https://www.ncdc.noaa.gov/oa/climate/research/ushcn/, which consists of monthly climate reports at 1218 locations in U.S. for 1990. We used the following seven variables: month, latitude, longitude, elevation, precipitation, minimum temperature, and maximum temperature.
The task was to estimate the distribution of a variable given the values of the other variables as inputs; there were 19 tasks in CC data, and seven tasks in USHCN data. We evaluated the performance in terms of test log likelihoods. We also used mean squared errors to evaluate point estimate performance. We randomly selected some locations as test data. The remaining data points were randomly split into 90% training data and 10% validation data. With CC data, we used 20%, 50% and 80% of locations as test data, and their training data sizes were 1081, 657 and 271, respectively. With USHCN data, we used 50%, 90% and 95% of locations as test data, and their training data sizes were 6597, 1358 and 609, respectively.
Comparing Methods
We compared the proposed method with GPs and NNs. The GPs were sparse GPs inferred by stochastic variational inference. The GPs correspond to the proposed method with a zero mean function. With the proposed method and GPs, we used the following RBF kernels for the kernel function,
, and 100 inducing points. We set the step size at epoch
as for the stochastic variational inference. With the NNs, we used threelayer feedforward NNs with five hidden units, and we optimized the NN parameters and precision parameter by maximizing the following likelihood, , by using Adam (Kingma and Ba, 2014). The proposed method used NNs with the same structure for the mean function, where the NN parameters were first optimized by maximizing the likelihood, and then variational, kernel and NN parameters were estimated by maximizing the variational lower bound (9) using the stochastic variational inference and Adam. The locations of the inducing inputs were initialized by means results. For all the methods, we set the minibatch size at 64, and used early stoping based on the likelihood on a validation set.Results
Tables 1 and 2 show the test log likelihoods with different missing value rates with CC and USHCN data, respectively. The proposed method achieved the highest average likelihoods with both data sets. The NN performed poorly when many values were missing (Table 1, 80% missing). On the other hand, since a GP is a nonparametric Bayesian method, where the effective model complexity is automatically adjusted depending on the number of training samples, the GPs performed better than the NNs with the many missing value data. When the number of missing values was small (Table 1, 20% missing), the NN performed better than the GP. The proposed method achieved the best performance with different missing value rates by combining the advantages of NNs and GPs. Tables 3 and 4 show the test mean squared errors with different missing value rates. The proposed method achieved the lowest average errors with both data sets. This result indicates that combining NNs and GPs also helps to improve the generalization performance. Table 5 shows the computational time in seconds.
Figure 2 shows the prediction with its confidence interval obtained by the proposed method, GP and NN. The NN gives fixed confidence intervals at all test points, and some true values are located outside the confidence intervals. On the other hand, the proposed method flexibly changes confidence intervals depending on the test points. The confidence intervals with the GP differ across different test points as with the proposed method. However, they are wider than those of the proposed method, since the mean functions are fixed at zero.
Missing  20%  50%  80%  

Method  Proposed  GP  NN  Proposed  GP  NN  Proposed  GP  NN 
MON 

LAT 

LON 

CO2 

CH4 

CO 

H2 

WET 

CLD 

VAP 

PRE 

FRS 

DTR 

TMN 

TMP 

TMX 

GLO 

ETR 

ETRN 

Average 
Test log likelihoods provided by the proposed method, GP, and NN with CC data. The bottom row shows the values averaged over all variables. Values in a bold typeface are statistically better (at the 5% level) than those in normal typeface as indicated by a paired ttest.
Missing  50%  90%  95%  

Method  Proposed  GP  NN  Proposed  GP  NN  Proposed  GP  NN 
MON 

LAT 

LON 

ELE 

PRE 

TMIN 

TMAX 

Average 
Missing  20%  50%  80%  

Method  Proposed  GP  NN  Proposed  GP  NN  Proposed  GP  NN 
MON 

LAT 

LON 

CO2 

CH4 

CO 

H2 

WET 

CLD 

VAP 

PRE 

FRS 

DTR 

TMN 

TMP 

TMX 

GLO 

ETR 

ETRN 

Average 
Missing  50%  90%  95%  

Method  Proposed  GP  NN  Proposed  GP  NN  Proposed  GP  NN 
MON 

LAT 

LON 

ELE 

PRE 

TMIN 

TMAX 

Average 
Data  CC  USHCN  

Missing  20%  50%  80%  50%  90%  95% 
Proposed  
GP  
NN 
(a) Proposed  (b) GP  (c) NN 
5 Conclusion
In this paper, we proposed a simple method for combining neural networks and Gaussian processes. With the proposed method, neural networks are used as the mean function of Gaussian processes. We present a scalable learning procedure based on stochastic gradient descent and stochastic variational inference. With experiments using two realworld spatiotemporal data sets, we demonstrated that the proposed method achieved better uncertainty estimation and generalization performance than neural networks and Gaussian processes. There are several avenues that can be pursed as future work. In our experiments, we used feedforward neural networks. We would like to use other types of neural networks, such as convolutional and recurrent neural networks. Moreover, we plan to analyze the sensitivity with respect to the structure of the neural networks, the number of inducing points and the choice of kernels. Finally, the mean function of neural networks could be inferred using Bayesian methods.
References
 Amari [1998] S.I. Amari. Natural gradient works efficiently in learning. Neural computation, 10(2):251–276, 1998.

Bahadori et al. [2014]
M. T. Bahadori, Q. R. Yu, and Y. Liu.
Fast multivariate spatiotemporal analysis via low rank tensor learning.
In Advances in Neural Information Processing Systems, pages 3491–3499, 2014.  Bahdanau et al. [2014] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
 Barber and Williams [1997] D. Barber and C. K. Williams. Gaussian processes for Bayesian classification via hybrid Monte Carlo. In Advances in Neural Information Processing Systems, pages 340–346, 1997.
 Bengio et al. [2013] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.
 Blight and Ott [1975] B. Blight and L. Ott. A Bayesian approach to model inadequacy for polynomial regression. Biometrika, 62(1):79–88, 1975.
 Blundell et al. [2015] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight uncertainty in neural network. In Proceedings of the 32nd International Conference on Machine Learning, pages 1613–1622, 2015.
 Calandra et al. [2016] R. Calandra, J. Peters, C. E. Rasmussen, and M. P. Deisenroth. Manifold Gaussian processes for regression. In Proceedings of the International Joint Conference on Neural Networks, pages 3338–3345. IEEE, 2016.

Cho et al. [2014]
K. Cho, B. v. M. C. Gulcehre, D. Bahdanau, F. B. H. Schwenk, and Y. Bengio.
Learning phrase representations using RNN encoder–decoder for
statistical machine translation.
In
Proceedings of the Conference on Empirical Methods in Natural Language Processing
, pages 1724–1734, 2014.  Dahl et al. [2012] G. E. Dahl, D. Yu, L. Deng, and A. Acero. Contextdependent pretrained deep neural networks for largevocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):30–42, 2012.
 Damianou and Lawrence [2013] A. Damianou and N. Lawrence. Deep Gaussian processes. In Artificial Intelligence and Statistics, pages 207–215, 2013.

Gal and Ghahramani [2016]
Y. Gal and Z. Ghahramani.
Dropout as a Bayesian approximation: Representing model uncertainty in deep learning.
In Proceedings of The 33rd International Conference on Machine Learning, pages 1050–1059, 2016.  Graves [2011] A. Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pages 2348–2356, 2011.
 Hensman et al. [2012] J. Hensman, M. Rattray, and N. D. Lawrence. Fast variational inference in the conjugate exponential family. In Advances in Neural Information Processing Systems, pages 2888–2896, 2012.
 Hensman et al. [2013] J. Hensman, N. Fusi, and N. D. Lawrence. Gaussian processes for big data. In Proceedings of the TwentyNinth Conference on Uncertainty in Artificial Intelligence, pages 282–290. AUAI Press, 2013.
 HernándezLobato and Adams [2015] J. M. HernándezLobato and R. Adams. Probabilistic backpropagation for scalable learning of Bayesian neural networks. In ICML, pages 1861–1869, 2015.
 Hinton et al. [2012] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.R. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.

Hinton and Van Camp [1993]
G. E. Hinton and D. Van Camp.
Keeping the neural networks simple by minimizing the description
length of the weights.
In
Proceedings of the sixth annual conference on Computational learning theory
, pages 5–13. ACM, 1993.  Hoffman et al. [2013] M. D. Hoffman, D. M. Blei, C. Wang, and J. W. Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303–1347, 2013.
 Huang et al. [2015] W. Huang, D. Zhao, F. Sun, H. Liu, and E. Chang. Scalable Gaussian process regression using deep neural networks. In Proceedings of the 24th International Joint Conference on Artificial Intelligence, pages 3576–3582, 2015.
 Jylänki et al. [2014] P. Jylänki, A. Nummenmaa, and A. Vehtari. Expectation propagation for neural networks with sparsitypromoting priors. Journal of Machine Learning Research, 15(1):1849–1901, 2014.
 Kingma and Ba [2014] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Kingma et al. [2015] D. P. Kingma, T. Salimans, and M. Welling. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pages 2575–2583, 2015.
 Krause and Guestrin [2007] A. Krause and C. Guestrin. Nonmyopic active learning of Gaussian processes: an explorationexploitation approach. In Proceedings of the 24th International Conference on Machine Learning, pages 449–456. ACM, 2007.
 Krizhevsky et al. [2012] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
 Louizos and Welling [2016] C. Louizos and M. Welling. Structured and efficient variational deep learning with matrix gaussian posteriors. In Proceedings of the 33rd International Conference on International Conference on Machine Learning, pages 1708–1716, 2016.
 Lozano et al. [2009] A. C. Lozano, H. Li, A. NiculescuMizil, Y. Liu, C. Perlich, J. Hosking, and N. Abe. Spatialtemporal causal modeling for climate change attribution. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 587–596. ACM, 2009.
 MacKay [1992] D. J. MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, 1992.
 Micchelli et al. [2006] C. A. Micchelli, Y. Xu, and H. Zhang. Universal kernels. Journal of Machine Learning Research, 7(Dec):2651–2667, 2006.
 NaishGuzman and Holden [2007] A. NaishGuzman and S. B. Holden. The generalized FITC approximation. In Advances in Neural Information Processing Systems, pages 1057–1064, 2007.
 Neal [1995] R. M. Neal. Bayesian learning for neural networks. PhD thesis, University of Toronto, 1995.
 Ney [1999] H. Ney. Speech translation: Coupling of recognition and translation. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages 517–520. IEEE, 1999.
 Nickisch and Rasmussen [2008] H. Nickisch and C. E. Rasmussen. Approximations for binary Gaussian process classification. Journal of Machine Learning Research, 9(Oct):2035–2078, 2008.
 O’Hagan [1978] A. O’Hagan. Curve fitting and optimal design for prediction. Journal of the Royal Statistical Society. Series B, 1:1–42, 1978.
 QuiñoneroCandela and Rasmussen [2005] J. QuiñoneroCandela and C. E. Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6(Dec):1939–1959, 2005.
 Rasmussen and Williams [2006] C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

Russakovsky et al. [2015]
O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang,
A. Karpathy, A. Khosla, M. Bernstein, et al.
Imagenet large scale visual recognition challenge.
International Journal of Computer Vision
, 115(3):211–252, 2015.  Seide et al. [2011] F. Seide, G. Li, and D. Yu. Conversational speech transcription using contextdependent deep neural networks. In Interspeech, pages 437–440, 2011.
 Simonyan and Zisserman [2014] K. Simonyan and A. Zisserman. Very deep convolutional networks for largescale image recognition. CoRR, abs/1409.1556, 2014.
 Snelson and Ghahramani [2006] E. Snelson and Z. Ghahramani. Sparse Gaussian processes using pseudoinputs. In Advances in Neural Information Processing Systems, volume 18, pages 1257–1264, 2006.
 Sun et al. [2017] S. Sun, C. Chen, and L. Carin. Learning structured weight uncertainty in Bayesian neural networks. In Artificial Intelligence and Statistics, pages 1283–1292, 2017.
 Titsias [2009] M. K. Titsias. Variational learning of inducing variables in sparse gaussian processes. In AISTATS, volume 5, pages 567–574, 2009.
 Williams and Rasmussen [1996] C. K. Williams and C. E. Rasmussen. Gaussian processes for regression. In Advances in Neural Information Processing Systems, pages 514–520, 1996.
 Wilson et al. [2012] A. G. Wilson, Z. Ghahramani, and D. A. Knowles. Gaussian process regression networks. In Proceedings of the 29th International Conference on Machine Learning, pages 599–606, 2012.
 Wilson et al. [2016] A. G. Wilson, Z. Hu, R. Salakhutdinov, and E. P. Xing. Deep kernel learning. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pages 370–378, 2016.