1 Introduction
For centuries, scientists and engineers have worked on creating mathematical abstractions of real-world systems. This principled modeling approach provides a powerful toolbox for deriving white-box models that we can use to understand and analyze physical systems. However, as the complexity of physical systems grows, deriving detailed principled models becomes an expensive and tedious task that requires highly experienced scientists and engineers. Moreover, incorrect assumptions lead to inaccurate models that are unable to represent the real system.
Data-driven black-box models provide an appealing alternative modeling approach that requires little to no domain knowledge. These models are fit to data extracted from the real system, minimizing the problems derived from incorrect assumptions. However, using data-driven models while completely ignoring domain knowledge may lead to models that do not generalize well and are hard to understand. Completely black-box approaches ignore the structure of the problem, wasting resources [1] and making the model less explainable [2].
Gray-box models combine domain knowledge and data, as both provide important and complementary information about the system. Domain knowledge can be used to construct a set of basic assumptions about the system, giving the data-driven model a baseline to build upon. Data can be used to fill the gaps in knowledge and to model complex relations that were not considered by the domain expert.
In this paper, we explore an approach for embedding domain knowledge into a data-driven model in order to improve generalization and interpretability. The presented gray-box model, which we call the EVGP (Explicit Variational Gaussian Process), is a scalable approximation of a sparse Gaussian Process (GP) that uses domain knowledge to define the prior distribution of the GP. In this paper, domain knowledge is extracted from physics-based knowledge; however, the EVGP can be applied to any domain.
The work in this paper has three cornerstones (Fig. 2): 1) Gaussian processes are used to learn complex nonlinear behavior from data and to model uncertainty, 2) partial domain knowledge is used as a prior in order to improve inductive bias, 3) Variational Inference is used to find an approximation that scales well to large datasets. Inductive bias refers to the assumptions made by the model when making predictions over inputs that have not been observed. The presented approach provides uncertainty estimations, which are fundamental in order to avoid the risk associated with overconfidence in unexplored areas [3] and to warn the user of possible incorrect estimations [4]. The work in this paper is highly applicable when: 1) modeling physical systems with uncertainty estimations, 2) partial domain knowledge of the system is available, 3) large quantities of data are available. The aim is to help the engineer and take advantage of available knowledge without requiring the derivation of complex and detailed models. Instead, an engineer only has to provide simple, partially formed models, and the EVGP takes care of filling the gaps in knowledge. We show how the EVGP model can be used to learn system dynamics using basic physics laws as prior knowledge.
2 Variational GP using explicit features
The EVGP model is designed to solve regression problems under uncertainty. Given a dataset composed of input/output pairs of samples , we would like to obtain a predictive distribution that estimates the value of the output for a given input . The EVGP model approximates using Variational Inference. The EVGP is defined as a distribution where are a set of parameters with prior distribution .
In the following sections we describe in detail: A) the EVGP model, B) the variational loss function used to train the model, C) the predictive distribution that approximates .
2.1 Model Definition
The EVGP model takes the following form:
(1) 
where is a Gaussian process with kernel , is the observation noise and is the denoised prediction for the input . Figure 2 offers a visual representation of the model. The following is the description of the main components of the EVGP model:

Domain knowledge is embedded in the explicit function , parameterized by . The function describes a set of explicit features (hence the name of our method) provided by the domain expert.
is modeled using a normal distribution with a prior that is also extracted from domain knowledge. In this paper,
is derived from partially defined Newtonian mechanics. 
The Gaussian Process adds the ability to learn complex nonlinear relations that is unable to capture.
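To make the structure above concrete, the following sketch draws a sample from a prior of this form: an explicit feature function with a Gaussian prior over its weights, plus a zero-mean GP residual and observation noise. All names (`h`, `rbf`, `beta_mean`, `beta_cov`) and the specific feature choice are illustrative assumptions, not the paper's exact definitions.

```python
import numpy as np

def h(x):
    # Explicit features from domain knowledge; here a linear trend [1, x].
    return np.stack([np.ones_like(x), x], axis=-1)

def rbf(a, b, lengthscale=1.0, variance=1.0):
    # Squared-exponential kernel for the GP residual.
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 50)
beta_mean, beta_cov = np.array([0.0, 2.0]), np.eye(2)  # prior over beta

beta = rng.multivariate_normal(beta_mean, beta_cov)    # sample the weights
g = rng.multivariate_normal(np.zeros(len(x)),
                            rbf(x, x) + 1e-8 * np.eye(len(x)))  # GP residual
y = h(x) @ beta + g + 0.1 * rng.standard_normal(len(x))         # noisy draw
```

Each prior draw follows the global shape dictated by the explicit features, with the GP term adding nonlinear deviations around it.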
Given a dataset , the exact predictive distribution for the model in Eq. (1) is described in [5]. For the rest of the paper, we refer to the exact distribution as EGP. Computing the EGP predictive distribution has a large computational cost and does not scale well to large datasets. To alleviate this problem, sparse approximation methods [6] use a small set of inducing points instead of the entire dataset to approximate the predictive distribution.
In order to construct a sparse approximation for the model in Eq. (1), we use a set of inducing points as parameters that will be learned from data. Given and a set of test points , the prior distribution of the model can be expressed as follows:
where denotes the data matrix, where each row represents an individual sample. The rows of represent the value of the function applied to the real samples . The rows of represent the value of the function applied to the inducing (learned) points
. Using the conditional rule for multivariate Gaussian distributions, we obtain the definition of the denoised sparse EVGP model:
(2) 
where are the parameters of our model, , and . Equation (2) defines our scalable EVGP model. In the following sections we give prior distributions to the parameters
and perform approximate Bayesian inference. Note that Eq. (2) is also conditioned on , however we do not indicate this explicitly in order to improve readability.
2.2 Variational Loss
In this section we present the loss function that we use to fit our model. In this paper we follow a Variational Bayesian approach (see Appendix A for a brief overview). Given a training dataset and a prior distribution , we wish to approximate the posterior distribution . The posterior of is approximated using a variational distribution parameterized by . For the EVGP parameters , we use the following variational posterior distributions:
The prior distributions for are also defined as multivariate normal distributions:
These prior distributions represent our prior knowledge, i.e. our knowledge before looking at the data.
Given the training dataset , the parameters of are learned by minimizing the negative Evidence Lower Bound (ELBO). For the EVGP, the negative ELBO takes the following form:
(3) 
where , and . The term is the KL-divergence between the posterior and prior distributions for the parameters:
A detailed derivation of the variational loss in Eq. (3) is presented in Appendix B. The negative ELBO (Eq. 3) serves as our loss function to learn the parameters given the training dataset . In order to scale to very large datasets, the ELBO is optimized using minibatches (see Appendix C). In our case, the parameters of the variational approximation are: .
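The KL term in the loss has a closed form when both distributions are multivariate Gaussians, as is the case for the priors and variational posteriors above. A sketch (the function name is ours):

```python
import numpy as np

def kl_mvn(mu_q, cov_q, mu_p, cov_p):
    # KL( N(mu_q, cov_q) || N(mu_p, cov_p) ), closed form for Gaussians.
    k = len(mu_q)
    cov_p_inv = np.linalg.inv(cov_p)
    diff = mu_p - mu_q
    return 0.5 * (np.trace(cov_p_inv @ cov_q)   # trace term
                  + diff @ cov_p_inv @ diff     # mean-shift term
                  - k
                  + np.log(np.linalg.det(cov_p) / np.linalg.det(cov_q)))
```

The divergence is zero when posterior and prior coincide, and grows as the posterior moves away from the prior, which is exactly how it regularizes the ELBO.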
2.3 Predictive distribution
After learning the parameters , we would like to provide estimations using the approximated variational distribution . Given a set of test points , the estimated denoised predictive distribution is computed as an expectation of Eq. (2) w.r.t. :
Note that is just a denoised version of . Eq. (4) approximates using the learned distribution (see Appendix B). The result is equivalent to [7] with the addition of for the mean and for the covariance. These additional terms include the information provided by the prior function with the parameters and that were learned from data.
The approximated predictive distribution with observation noise is the following:
where . In the next section, we show how Eq. (4) can be used to model system dynamics and predict the next state of a physical system given the control input and current state.
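As a rough illustration of how inducing points enter the prediction, the following is a generic Nyström-style sketch of a sparse GP predictive mean with an explicit mean function. It is a simplification under our own assumptions (it omits the learned covariances), not the exact EVGP expressions:

```python
import numpy as np

def rbf(a, b, ls=1.0):
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ls) ** 2)

def h(x):
    # Hypothetical explicit features: a linear trend [1, x].
    return np.stack([np.ones_like(x), x], axis=-1)

def sparse_predictive_mean(x_star, z, u_mean, beta_mean):
    # z: inducing inputs; u_mean: mean of the inducing outputs.
    k_mm = rbf(z, z) + 1e-8 * np.eye(len(z))   # jitter for numerical stability
    k_sm = rbf(x_star, z)
    # GP correction on top of the explicit mean h(x)^T beta:
    gp_part = k_sm @ np.linalg.solve(k_mm, u_mean - h(z) @ beta_mean)
    return h(x_star) @ beta_mean + gp_part
```

If the inducing outputs agree exactly with the explicit mean, the correction vanishes and the prediction reduces to the expert-provided feature model.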
3 Embedding physicsbased knowledge
In this paper, we apply the EVGP model to learn the dynamics of a physical system. The state of the physical system can be modeled as follows:
(5)  
where is the control input and is the measured output of the system. The symbol denotes concatenation and is the input to the EVGP model. For example, in the case of a mechatronic system: are the forces applied by the actuators (e.g. electric motors); is the position and velocity of the joints; is the output from the sensors.
We assume independent EVGP models for each output in Eq. (5). The function in Eq. (5) is modeled using the EVGP model from Eq. (1) and Eq. (4). In the following sections we present how simple Newtonian mechanics can be used to define useful priors for the EVGP model.
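A minimal sketch of how Eq. (5) is used for prediction: the state and control input are concatenated into the model input, and one independent model is evaluated per output dimension. `models` is a stand-in for the trained per-output EVGP predictive means:

```python
import numpy as np

def rollout(x0, controls, models):
    # Predict a trajectory by feeding each predicted state back as input.
    x = np.asarray(x0, dtype=float)
    states = []
    for u in controls:
        xu = np.concatenate([x, np.atleast_1d(u)])  # concatenated model input
        x = np.array([m(xu) for m in models])       # one independent model per output
        states.append(x)
    return np.stack(states)
```

In practice each entry of `models` would be the predictive mean of one EVGP; here any callable works.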
3.1 Priors from simple Newtonian dynamics
Figure (a) shows a simple example of a single rigid-body link. The simplest model that we can use for this system comes from Newton's second law , where is the torque applied to the system, is the moment of inertia, and is the angle of the pendulum. Using the Euler discretization method, we obtain the following state-space representation that serves as the prior for our EVGP model:
(6)
We refer to this prior as IF (inertia+force). is the discretization time and is the state of the system. The IF prior in Eq. (6) does not include gravitational effects. Gravity pulls the link to the downward position with a force proportional to . Hence, a prior that considers gravitational forces can be constructed by including :
(7) 
We call this prior IFG (inertia+force+gravity). We purposely did not define and . One of the advantages of the presented approach is that the algorithm can learn these parameters from data if they are not available. If the user does not know the value of and , a prior with a large standard deviation can be provided for these parameters (large ). Although parameters like and are easy to compute for a simple pendulum, for more complex systems they may be hard and tedious to obtain. Our objective is to take advantage of domain knowledge and allow the model to fill in the gaps in knowledge.
3.2 Simplified priors for Acrobot and Cartpole
In addition to the pendulum, we consider the Acrobot and Cartpole systems in our analysis. For these systems, we consider much simpler priors than the exact dynamic models derived from multibody dynamics. We use the same principles shown in the previous section in order to get simple priors for the Acrobot and Cartpole.
Figure (c) shows a diagram of the Acrobot system. A simple prior for this system can be constructed by using the prior in Eq. (6) for each of the links of the Acrobot:
(8) 
where , and . In this case, the input only drives the second link. The idea behind these priors is to construct an intuitive and simple model from "noisy" Newtonian dynamics. These priors are extremely simple, as they do not consider friction or Coriolis/centrifugal forces. However, they provide important information about the mechanics of the system. These priors convey the following information:

The position should increase proportionally to the velocity by a factor of .

The position should stay the same if the velocity is zero.

The velocity should stay the same if no external forces are applied.

For the IFG prior, gravity pulls the links to the downward position with a force proportional to the sine of the angle w.r.t. the horizontal plane. Gravity has no effect when the links are completely down/up.
The objective of these priors is to demonstrate how extremely simplified priors, extracted with simple physics, can be used to improve the performance of data-driven models. The IF and IFG priors for the Cartpole (Figure (b)) are constructed using the same principles:
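The single-link IF and IFG constructions described above can be sketched as follows. The discretization time, the sign of the gravity term, and the parameter names are assumptions for illustration, since the paper's exact equations are not reproduced here:

```python
import numpy as np

DT = 0.01  # discretization time (assumed value)

def if_prior(theta, omega, tau, inv_inertia):
    # IF prior: Euler-discretized Newton's second law.
    # position integrates velocity; velocity integrates torque / inertia.
    return np.array([theta + DT * omega,
                     omega + DT * tau * inv_inertia])

def ifg_prior(theta, omega, tau, inv_inertia, grav_coeff):
    # IFG prior: adds a gravity torque proportional to sin(theta).
    nxt = if_prior(theta, omega, tau, inv_inertia)
    nxt[1] += DT * grav_coeff * np.sin(theta)
    return nxt
```

These sketches reproduce the bullet points above: with zero velocity the position is unchanged, with zero force the velocity is unchanged, and gravity vanishes when the link is straight down/up.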
4 Experiments
In order to evaluate the performance of the presented model, we performed experiments on a set of simulated systems: Pendulum, Cartpole and Acrobot. We also performed qualitative tests on a toy dataset to visualize the behavior of the EVGP model.
We used Drake [8] to simulate the Pendulum, Cartpole and Acrobot systems and to obtain the control/sensor data used to train and evaluate the EVGP models. We used the squared exponential function for the covariance kernels, since all the experiments involve continuous systems. The EVGP model was implemented using TensorFlow, and the minimization of the negative ELBO loss was done using the ADAM optimizer [9]. The experiments were run on a computer with a single GPU (Nvidia Quadro P5000) and an Intel(R) Xeon(R) CPU (E3-1505M at 3.00GHz).
4.1 Experiments on Toy Dataset
The toy dataset is intended to illustrate the behavior of the EVGP model and to visualize the qualitative differences between several GP models. We use a modified version of the toy dataset used in [10, 7]. The dataset (obtained from http://www.gatsby.ucl.ac.uk/~snelson/SPGP_dist.tgz) is modified as follows:
The modification is intended to provide a global linear trend to the data (see Figure 11). Figure 11 shows the distribution learned using different versions of a Gaussian Process. Figures (a) and (c) show the exact posterior distributions for a GP and an EGP [5] model, respectively. Figures (b) and (d) show the variational approximations obtained with a VGP [11] and an EVGP model. The standard deviation (black line) is used to visualize the uncertainty. The figures show how the uncertainty grows as we move away from the training dataset.
The original dataset is composed of 200 samples. Figure 11 shows that the variational approximations are able to successfully approximate their exact counterparts with as few inducing points as m=10. The positions of the inducing points are shown with green crosses.
In this case, the prior knowledge that we provide to the EVGP is a simple linear function . Figure (d) shows how we can use the prior to control the global shape of the function. The figure shows how the EGP and EVGP models use the prior knowledge to fit the global (linear) behavior of the data while using the kernels to model the local nonlinear behavior.
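The qualitative effect described here, with the explicit prior controlling the global shape while the kernel models local behavior, can be reproduced with a small exact-GP sketch (all data and hyperparameters below are our own toy choices):

```python
import numpy as np

def rbf(a, b, ls=0.5, var=1.0):
    # Squared-exponential kernel, as used for the covariances in our experiments.
    return var * np.exp(-0.5 * ((a[:, None] - b[None, :]) / ls) ** 2)

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(-3.0, 3.0, 40))                         # toy inputs
y = 0.8 * x + np.sin(3.0 * x) + 0.1 * rng.standard_normal(40)   # trend + wiggle

def gp_mean(x_star, mean_fn):
    # Exact GP posterior mean with an explicit mean function mean_fn.
    K = rbf(x, x) + 0.01 * np.eye(len(x))   # 0.01 = observation noise variance
    return mean_fn(x_star) + rbf(x_star, x) @ np.linalg.solve(K, y - mean_fn(x))

x_star = np.array([6.0])                    # far outside the training data
zero_mean_pred = gp_mean(x_star, lambda t: np.zeros_like(t))
linear_mean_pred = gp_mean(x_star, lambda t: 0.8 * t)
```

Far from the data the kernel term vanishes, so the zero-mean model reverts toward 0 while the linear-mean model keeps following the global trend.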
4.2 Learning system dynamics
We evaluated the performance of the EVGP model in learning system dynamics using data obtained from simulations of the Pendulum, Cartpole and Acrobot systems. Concretely, we evaluated the accuracy of the EVGP model with IF and IFG priors in predicting the next state of the system given the current control inputs and state.
Data: to evaluate the difference in generalization, we sampled two different datasets for each system: one for training and one for testing. The datasets were sampled by simulating the system using random initial states and random control inputs drawn from uniform and normal distributions, respectively. Table 2 shows the values of the scales () that were used to sample the trajectories. These values were chosen in order to cover at least the range () for the angles of the systems. In Table 2, refers to the number of sampled trajectories and refers to the total number of samples. All trajectories were sampled for 100 time steps and .
System    | State scales       | Input scale | Trajectories | Samples
----------|--------------------|-------------|--------------|--------
Pendulum  | [, 0.5]            | 1.0         | 48           | 4800
Cartpole  | [1.0, , 0.5, 0.5]  | 100.0       | 48           | 4800
Acrobot   | [, 1.0, 0.5, 0.5]  | 0.5         | 93           | 9300
Model     | Pendulum | Cartpole | Acrobot
----------|----------|----------|--------
VGP       | 40       | 100      | 250
RES-VGP   | 40       | 100      | 250
RES-DBNN  | [15, 15] | [50, 50] | [60, 60]
EVGP-IF   | 10       | 60       | 150
EVGP-IFG  | 10       | 60       | 150
Baseline: we compare the EVGP model with a standard VGP model, a residual VGP (RES-VGP), and a residual Deep Bayesian Neural Network (RES-DBNN). The VGP model is based on [11] and uses a zero-mean prior. The residual VGP and DBNN models assume the system can be approximated as . Approximating residuals is a common approach used to simplify the work for GP and DBNN models [12, 13, 14]. The RES-DBNN model is trained using the local reparameterization trick presented in [15] with Normal variational distributions for the weights. For this set of experiments, we did not consider the exact GP and EGP models given the large number of samples in the training datasets.

Model | Space | Time
------|-------|-----
VGP   |       |
DBNN  |       |
EVGP  |       |
Table 2 shows the number of inducing points (m) used for the VGP and EVGP models. The table also shows the number of hidden units for the DBNN model, where [15, 15] means a two-layer network with 15 units in each layer. We used the LeakyReLU activation function.
Table 3 shows the space and time complexity of the models considered in this paper. In this table, is the number of inducing points, is the number of outputs, is the number of hidden layers, and is the number of hidden units in each layer. To simplify the analysis, we assume all hidden layers of the DBNN have the same number of hidden units. The table shows that the complexity of the VGP and EVGP models is governed by the matrix inversion . Because we assume completely independent VGP and EVGP models for each output of the system, their complexity also depends on the number of outputs. The complexity of the DBNN model is governed by the matrix-vector product between the weight matrices and the hidden activation vectors. All models have constant space complexity w.r.t. the training dataset size. Furthermore, all models have linear time complexity w.r.t. if we assume that training requires visiting each sample in at least once.
Metrics: for comparison, we used two metrics: 1) prediction error, 2) containing ratios (CR). Prediction error is computed as the difference between sampled values () and the expected estimated output ():
(9) 
where is the number of samples in the respective dataset. The expected output () for the EVGP model is equal to in Eq. (4). For the DBNN model, the expectation is estimated using Monte Carlo.
The containing ratios (CR) are the percentage of values covered by the estimated distribution . We consider containing ratios for one, two and three standard deviations (CR1, CR2, and CR3 respectively).
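A sketch of the two metrics. Since the exact norm in Eq. (9) is not reproduced here, the prediction error is written as an RMSE, which is our assumption:

```python
import numpy as np

def prediction_error(y_true, y_pred):
    # Mean prediction error, written as an RMSE (assumed form of Eq. (9)).
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def containing_ratio(y_true, mean, std, k):
    # Fraction of samples inside mean +/- k * std (k = 1, 2, 3 gives CR1-CR3).
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(mean))
                   <= k * np.asarray(std))
```

A well-calibrated Gaussian predictive distribution should give containing ratios near 0.68, 0.95 and 0.997 for k = 1, 2, 3.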
Results: Table 4 shows the prediction error and CR scores obtained on the testing dataset. EVGP-IF and EVGP-IFG refer to the use of an IF or IFG prior, respectively. We observe a considerable improvement in the testing error and CR3 scores when using EVGP models. We also see a progressive improvement in the testing error when using more detailed priors: the IFG prior results in lower prediction errors than the IF prior, and the residual models have lower error than the standard VGP model. The EVGP-IFG model provided the estimations with the lowest prediction error and CR3 scores close to 100%.
System   | Model    | CR1   | CR2   | CR3   | Error | # Param.
---------|----------|-------|-------|-------|-------|---------
Pendulum | VGP      | 0.327 | 0.576 | 0.716 | 0.484 | 407
         | RES-VGP  | 0.471 | 0.762 | 0.880 | 0.216 | 407
         | RES-DBNN | 0.515 | 0.846 | 0.955 | 0.081 | 634
         | EVGP-IF  | 0.745 | 0.962 | 0.987 | 0.075 | 119
         | EVGP-IFG | 0.704 | 0.957 | 0.996 | 0.064 | 123
Cartpole | VGP      | 0.591 | 0.856 | 0.913 | 0.265 | 2813
         | RES-VGP  | 0.592 | 0.888 | 0.946 | 0.131 | 2813
         | RES-DBNN | 0.440 | 0.827 | 0.921 | 0.041 | 6008
         | EVGP-IF  | 0.843 | 0.961 | 0.983 | 0.023 | 1733
         | EVGP-IFG | 0.901 | 0.979 | 0.993 | 0.010 | 1741
Acrobot  | VGP      | 0.625 | 0.818 | 0.895 | 0.812 | 7013
         | RES-VGP  | 0.690 | 0.850 | 0.912 | 0.684 | 7013
         | RES-DBNN | 0.555 | 0.803 | 0.897 | 0.500 | 8408
         | EVGP-IF  | 0.793 | 0.910 | 0.956 | 0.135 | 4253
         | EVGP-IFG | 0.737 | 0.899 | 0.952 | 0.121 | 4269
Table 4 also shows that the EVGP model required the fewest parameters. This translates into lower training and inference times and a lower memory cost.
Figure 14 compares the prediction error on the test dataset as the number of training samples increases. For this experiment, we kept the testing dataset fixed while samples were progressively aggregated into the training dataset. The figure shows the mean, max and min values obtained over four independent runs. Figure (a) shows a comparison that includes all models. As expected, the prediction error is reduced as the size of the training dataset increases. The figure shows that the EVGP provides the most accurate predictions while requiring fewer samples.
Figure (a) also shows how the performance of VGP and RES-VGP plateaus, struggling to take advantage of larger datasets. Although the RES-DBNN performs poorly with small training datasets, its high capacity allows it to take advantage of large datasets and improve accuracy, reducing the performance gap w.r.t. the EVGP models as more data becomes available. Thanks to the lower computational time cost of the RES-DBNN (see Table 3), this model can use a larger set of parameters without incurring excessive training times.
Figure (b) shows a scaled version that only considers the EVGP model with different priors. This figure shows that the IFG prior provides more accurate predictions than the IF prior. In the case of the pendulum, the IFG prior provides a highly accurate model of the system, requiring only a small number of training samples. Figure (b) also shows that, as training data is aggregated, the accuracy gap between the IF and IFG priors shrinks.
The priors that we use are extremely simple: they ignore friction and Coriolis/centrifugal effects. Nonetheless, we observe a considerable performance improvement after providing our data-driven model with basic information through the IF and IFG priors.
Understanding the learned model: one of the advantages of incorporating domain knowledge is that the learned model is easy to interpret by the domain expert. For example, in the case of the Acrobot, the value of the parameter can be visualized to understand and debug what the model has learned. Figure 17 shows the value of learned with the IF (Fig. (a)) and IFG (Fig. (b)) priors.
We observe in Figure 17 that the learned parameters follow a structure similar to the one given by the prior (see Eq. 8). In our experiments, we did not enforce the sparse structure from the priors, i.e. parameters that are zero in the prior are allowed to take nonzero values in the posterior.
Figure (a) shows that, when using the IF prior, the EVGP model compensates for the missing gravity information by assigning negative values to and . The reason for this behavior is that for small , however this approximation is no longer valid for large . When using IFG priors (Fig. (b)), we observe that the model no longer assigns negative values to and . The reason is that IFG provides the values of and , which help to model the effect of gravity more accurately.
5 Related work
Incorporating prior scientific knowledge in machine-learning algorithms is an ongoing effort that has recently gained increased attention. Convolutional neural networks (CNNs) have been used in modeling and simulation applications such as forecasting of sea surface temperature [16] and efficient simulation of the Navier-Stokes equations [17]. Gaussian Processes (GPs) provide a general-purpose nonparametric model that has been used for system identification and control under uncertainty [12, 18]. Previous work has explored using GPs to include partial model information [19]. In [20], a GP model is used to correct a highly simplified physics model and improve accuracy. Our work is based on the GP model with explicit features presented in [5]. Variations of this model are commonly used in the calibration of computer models [21, 22]. Despite the advantages of GP models for modeling complex nonlinear relations with uncertainty, GPs are computationally expensive. A large body of research has focused on improving the computational requirements of GPs. Sparse GP approximation methods are some of the most common approaches for reducing GP computation cost [6]. Bayesian approximation techniques such as Variational Inference provide a rich toolset for dealing with large quantities of data and highly complex models [23, 7]. Variational approximations of a sparse GP have been explored in [7, 11]. In [3], a variational GP model is presented for nonlinear state-space models. In [13, 14], Deep Bayesian Neural Networks (DBNNs) are proposed as an alternative to GPs in order to improve scalability in reinforcement learning problems. Given the popularity of GP models and Variational Inference, there is an increased interest in developing automated variational techniques for these types of models [24, 25].
6 Conclusion
In this paper, we presented the EVGP model, a variational approximation of a Gaussian Process which uses domain knowledge to define the mean function of the prior. We compared the performance of the EVGP model against purely data-driven approaches and demonstrated improved accuracy and interpretability after incorporating simple priors derived from Newtonian mechanics. The EVGP also provided higher accuracy with smaller training datasets. The priors provided a rough but simple approximation of the mechanics, informing the EVGP of important structure of the real system.
References

(1) C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
(2) A. Adadi and M. Berrada, "Peeking inside the black-box: A survey on explainable artificial intelligence (XAI)," IEEE Access, vol. 6, pp. 52138–52160, 2018.
(3) R. Frigola, Y. Chen, and C. E. Rasmussen, "Variational Gaussian process state-space models," in Advances in Neural Information Processing Systems, 2014, pp. 3680–3688.
(4) D. Marino, K. Amarasinghe, M. Anderson, N. Yancey, Q. Nguyen, K. Kenney, and M. Manic, "Data driven decision support for reliable biomass feedstock preprocessing," in Resilience Week (RWS), 2017. IEEE, 2017, pp. 97–102.
(5) C. E. Rasmussen, "Gaussian processes in machine learning," in Advanced Lectures on Machine Learning. Springer, 2004, pp. 27–29.
(6) J. Quiñonero-Candela and C. E. Rasmussen, "A unifying view of sparse approximate Gaussian process regression," Journal of Machine Learning Research, vol. 6, pp. 1939–1959, 2005.
(7) M. Titsias, "Variational learning of inducing variables in sparse Gaussian processes," in Artificial Intelligence and Statistics, 2009, pp. 567–574.
(8) R. Tedrake and the Drake Development Team, "Drake: A planning, control, and analysis toolbox for nonlinear dynamical systems," 2016. [Online]. Available: https://drake.mit.edu
(9) D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
(10) E. Snelson and Z. Ghahramani, "Sparse Gaussian processes using pseudo-inputs," in Advances in Neural Information Processing Systems, 2006, pp. 1257–1264.
(11) J. Hensman, N. Fusi, and N. D. Lawrence, "Gaussian processes for big data," arXiv preprint arXiv:1309.6835, 2013.
(12) M. Deisenroth and C. E. Rasmussen, "PILCO: A model-based and data-efficient approach to policy search," in Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 465–472.
(13) Y. Gal, R. T. McAllister, and C. E. Rasmussen, "Improving PILCO with Bayesian neural network dynamics models," in Data-Efficient Machine Learning Workshop, 2016.
(14) D. Marino and M. Manic, "Modeling and planning under uncertainty using deep neural networks," IEEE Transactions on Industrial Informatics, pp. 1–1, 2019.
(15) D. P. Kingma, T. Salimans, and M. Welling, "Variational dropout and the local reparameterization trick," in Advances in Neural Information Processing Systems, 2015, pp. 2575–2583.
(16) E. de Bezenac, A. Pajot, and P. Gallinari, "Deep learning for physical processes: Incorporating prior scientific knowledge," arXiv preprint arXiv:1711.07970, 2017.
(17) J. Tompson, K. Schlachter, P. Sprechmann, and K. Perlin, "Accelerating Eulerian fluid simulation with convolutional networks," in Proceedings of the 34th International Conference on Machine Learning, Volume 70. JMLR.org, 2017, pp. 3424–3433.
(18) H. Bijl, T. B. Schön, J.-W. van Wingerden, and M. Verhaegen, "System identification through online sparse Gaussian process regression with input noise," IFAC Journal of Systems and Control, vol. 2, pp. 1–11, 2017.
(19) J. Hall, C. Rasmussen, and J. Maciejowski, "Modelling and control of nonlinear systems using Gaussian processes with partial model information," in 2012 IEEE 51st Conference on Decision and Control (CDC). IEEE, 2012, pp. 5266–5271.
(20) F. M. Gray and M. Schmidt, "A hybrid approach to thermal building modelling using a combination of Gaussian processes and grey-box models," Energy and Buildings, vol. 165, pp. 56–63, 2018.
(21) M. C. Kennedy and A. O'Hagan, "Bayesian calibration of computer models," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 63, no. 3, pp. 425–464, 2001.
(22) W. Li, S. Chen, Z. Jiang, D. W. Apley, Z. Lu, and W. Chen, "Integrating Bayesian calibration, bias correction, and machine learning for the 2014 Sandia verification and validation challenge problem," Journal of Verification, Validation and Uncertainty Quantification, vol. 1, no. 1, p. 011004, 2016.
(23) D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," arXiv preprint arXiv:1312.6114, 2013.
(24) T. V. Nguyen and E. V. Bonilla, "Automated variational inference for Gaussian process models," in Advances in Neural Information Processing Systems, 2014, pp. 1404–1412.
(25) A. Kucukelbir, D. Tran, R. Ranganath, A. Gelman, and D. M. Blei, "Automatic differentiation variational inference," The Journal of Machine Learning Research, vol. 18, no. 1, pp. 430–474, 2017.
(26) Y. Gal and Z. Ghahramani, "Dropout as a Bayesian approximation: Representing model uncertainty in deep learning," in International Conference on Machine Learning, 2016, pp. 1050–1059.
Appendix A Variational Inference
In a Bayesian learning approach, we model the parameters of our model using probability distributions. The parameters are given a prior distribution that represents our prior knowledge about the model before looking at the data. Given a dataset , we would like to obtain the posterior distribution of our parameters following Bayes' rule:
Variational Inference (VI) provides a tool for approximating the posterior using a variational distribution parameterized by . In other words, with VI we find the value of such that . The parameters of the distribution are found by maximizing the Evidence Lower Bound (ELBO) between the approximate and real distributions:
Maximizing the ELBO is equivalent to minimizing the KL divergence between the variational distribution and the real distribution.
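A minimal sketch of estimating the ELBO for a one-dimensional Gaussian q(w) with a standard normal prior, using the reparameterization trick; `log_lik` and all other names are illustrative stand-ins, not the paper's model:

```python
import numpy as np

def elbo_estimate(mu, log_sigma, log_lik, n_samples=256, rng=None):
    # Monte Carlo ELBO: E_q[log p(D|w)] - KL( q(w) || N(0, 1) ).
    rng = rng if rng is not None else np.random.default_rng(0)
    sigma = np.exp(log_sigma)
    eps = rng.standard_normal(n_samples)
    w = mu + sigma * eps                              # reparameterization trick
    # KL( N(mu, sigma^2) || N(0, 1) ) in closed form:
    kl = 0.5 * (mu ** 2 + sigma ** 2 - 1.0) - log_sigma
    return np.mean(log_lik(w)) - kl
```

With a flat likelihood the ELBO reduces to the negative KL term, so it is maximized when q matches the prior.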
Having obtained the variational approximation , we can approximate the predictive distribution as follows:
which is the approximated variational predictive distribution (see section 2.3):
Appendix B ELBO
Given the training dataset , the parameters of are learned by minimizing the negative Evidence Lower Bound (ELBO). For the EVGP, the negative ELBO takes the following form:
(10) 
where denotes the KL divergence between the variational posterior and the prior. Note that the inner expectation in Eq. (10) is taken w.r.t. , presented in Eq. (2). Following a similar approach to [11], we apply Jensen's inequality to the inner expectation of Eq. (10):
where , . This allows us to express the ELBO in a way that simplifies the computation of the expectations w.r.t. the parameters . Now, the variational loss in Eq. (3) can be obtained by simply computing the expectation w.r.t. the model parameters:
where , and . The value of the KL-divergence is simply the sum of the divergences for both parameters:
Appendix C Minibatch optimization
In order to make the model scalable to very large datasets, the ELBO can be optimized using minibatches. Following [26], and assuming the samples are i.i.d., the loss for a minibatch composed of samples can be expressed as follows:
where is the total number of samples in the training dataset.
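A sketch of the resulting minibatch objective: the data-fit term of a batch of size m is rescaled by n/m so that its expectation matches the full-dataset sum, while the KL term is added once. All names are illustrative:

```python
def minibatch_loss(batch_nll, n_total, batch_size, kl_term):
    # Rescale the batch negative log-likelihood to the full-dataset scale,
    # then add the (dataset-independent) KL regularizer once.
    return (n_total / batch_size) * batch_nll + kl_term
```

When the batch is the whole dataset (m = n) the scaling factor is 1 and the full-batch loss is recovered.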