For centuries, scientists and engineers have worked on creating mathematical abstractions of real-world systems. This principled modeling approach provides a powerful toolbox for deriving white-box models that we can use to understand and analyze physical systems. However, as the complexity of physical systems grows, deriving detailed principled models becomes an expensive and tedious task that requires highly experienced scientists and engineers. Moreover, incorrect assumptions lead to inaccurate models that are unable to represent the real system.
Data-driven black-box models provide an appealing alternative modeling approach that requires little to no domain knowledge. These models are fit to data extracted from the real system, minimizing the problems derived from incorrect assumptions. However, using data-driven models while completely ignoring domain knowledge may lead to models that do not generalize well and are hard to understand. Completely black-box approaches ignore the structure of the problem, wasting resources szegedy2015going and making the model less explainable adadi2018peeking .
Gray-box models combine domain knowledge and data as both provide important and complementary information about the system. Domain knowledge can be used to construct a set of basic assumptions about the system, giving the data-driven model a baseline to build upon. Data can be used to fill the gaps in knowledge and model complex relations that were not considered by the domain expert.
In this paper, we explore an approach for embedding domain knowledge into a data-driven model in order to improve generalization and interpretability. The presented gray-box model, which we call the EVGP (Explicit Variational Gaussian Process), is a scalable approximation of a sparse Gaussian Process (GP) that uses domain knowledge to define the prior distribution of the GP. In this paper, domain knowledge is extracted from physics; however, the EVGP can be applied to any domain.
The work in this paper has three cornerstones (Fig. 2): 1) Gaussian processes are used to learn complex non-linear behavior from data and to model uncertainty, 2) partial domain knowledge is used as a prior in order to improve the inductive bias, 3) Variational Inference is used to find an approximation that scales well to large datasets. Inductive bias refers to the assumptions made by the model when making predictions over inputs that have not been observed. The presented approach provides uncertainty estimations, which are fundamental to avoid the risk associated with overconfidence in unexplored areas frigola2014variational and to warn the user of possibly incorrect estimations marino2017data .
The work in this paper is highly applicable when: 1) modeling physical systems with uncertainty estimations, 2) partial domain knowledge of the system is available, and 3) large quantities of data are available. The aim is to help the engineer take advantage of available knowledge without requiring the derivation of complex and detailed models. Instead, an engineer only has to provide simple, partially formed models, and the EVGP takes care of filling the gaps in knowledge. We show how the EVGP model can be used to learn system dynamics using basic physics laws as prior knowledge.
2 Variational GP using explicit features
The EVGP model is designed to solve regression problems under uncertainty. Given a dataset composed of input/output pairs of samples , we would like to obtain a predictive distribution that estimates the value of the output for a given input . The EVGP model approximates using Variational Inference. The EVGP is defined as a distribution where are a set of parameters with prior distribution .
In the following sections we describe in detail: A) the EVGP model, B) the variational loss function used to train the model, and C) the predictive distribution that approximates .
2.1 Model Definition
The EVGP model takes the following form:
where is a Gaussian process with kernel , is the observation noise and is the denoised prediction for the input . Figure 2 offers a visual representation of the model. The following is the description of the main components of the EVGP model:
Domain knowledge is embedded in the explicit function , parameterized by . The function describes a set of explicit features (hence the name of our method) provided by the domain expert.
is modeled using a normal distribution with a prior that is also extracted from domain knowledge. In this paper, the prior is derived from partially defined Newtonian mechanics.
The Gaussian Process adds the ability to learn complex non-linear relations that is unable to capture.
Given a dataset , the exact predictive distribution for the model in Eq. (1) is described in rasmussen2004gaussian . For the rest of the paper, we refer to the exact distribution as EGP. Computing the EGP predictive distribution has a large computational cost and does not scale well for large datasets. To alleviate this problem, sparse approximation methods quinonero2005unifying use a small set of inducing points instead of the entire dataset to approximate the predictive distribution.
In order to construct a sparse approximation for the model in Eq. (1),
we use a set of inducing points as parameters that
will be learned from data.
Given and a set of
test points , the prior distribution of the model can be expressed as follows:
where denotes the data matrix, where each row represents an individual sample. The rows of represent the value of the function applied to the real samples . The rows of represent the value of the function applied to the inducing (learned) points
. Using the conditional rule for multivariate Gaussian distributions, we obtain the definition of the denoised sparse EVGP model:
where are the parameters of our model, , and . Equation (2) defines our scalable EVGP model. In the following sections we give prior distributions to the parameters
and perform approximate Bayesian inference. Note that Eq. (2) is also conditioned on , however we do not indicate this explicitly in order to improve readability.
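As a concrete sketch of this conditioning step, the following numpy code computes the mean and covariance of a sparse GP conditional with a squared-exponential kernel. The explicit-feature term of the EVGP is omitted here, and all function names and default hyperparameters are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel matrix between row-wise point sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def sparse_conditional(Xs, Z, u, jitter=1e-6):
    """Mean and covariance of the GP at test points Xs, conditioned on
    inducing values u placed at the inducing points Z."""
    Kss = rbf_kernel(Xs, Xs)
    Ksz = rbf_kernel(Xs, Z)
    Kzz = rbf_kernel(Z, Z) + jitter * np.eye(len(Z))  # jitter for stability
    A = np.linalg.solve(Kzz, Ksz.T).T                 # K_sz K_zz^{-1}
    mean = A @ u
    cov = Kss - A @ Ksz.T
    return mean, cov
```

Conditioning at the inducing points themselves reproduces the inducing values, which is a quick sanity check for an implementation like this.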
2.2 Variational Loss
In this section we present the loss function that we use to fit our model. In this paper we follow a Variational Bayesian approach (see Appendix A for a brief overview). Given a training dataset and a prior distribution , we wish to approximate the posterior distribution . The posterior of is approximated using a variational distribution parameterized by . For the EVGP parameters , we use the following variational posterior distributions:
The prior-distributions for are also defined as multivariate normal distributions:
these prior distributions represent our prior knowledge, i.e. our knowledge before looking at the data.
Given the training dataset , the parameters of are learned by minimizing the negative Evidence Lower Bound (ELBO). For the EVGP, the negative ELBO takes the following form:
where , and . The term is the KL-divergence between the posterior and prior distributions for the parameters:
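For the multivariate normal priors and posteriors used here, this KL term has a standard closed form; a minimal numpy sketch (the function name and argument conventions are our own, not the paper's):

```python
import numpy as np

def kl_mvn(mu_q, Sq, mu_p, Sp):
    """KL( N(mu_q, Sq) || N(mu_p, Sp) ) between full-covariance Gaussians."""
    k = len(mu_q)
    Sp_inv = np.linalg.inv(Sp)
    diff = mu_p - mu_q
    return 0.5 * (np.trace(Sp_inv @ Sq)          # trace term
                  + diff @ Sp_inv @ diff          # mean-difference term
                  - k                             # dimensionality
                  + np.log(np.linalg.det(Sp) / np.linalg.det(Sq)))
```

The divergence is zero when posterior and prior coincide, and grows as the variational posterior moves away from the prior, which is exactly the regularizing role it plays in the negative ELBO.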
A detailed derivation of the variational loss in Eq. (3) is presented in Appendix B. The negative ELBO (Eq. 3) serves as our loss function to learn the parameters given the training dataset . In order to scale to very large datasets, the ELBO is optimized using mini-batches (see Appendix C). In our case, the parameters of the variational approximation are: .
2.3 Predictive distribution
After learning the parameters , we would like to provide
estimations using the approximated variational distribution .
Given a set of test points , the estimated denoised predictive distribution
is computed as an expectation of Eq. (2) w.r.t. :
Note that is just a denoised version of . Eq. (4) approximates , using the learned distribution (see Appendix B). The result is equivalent to titsias2009variational with the addition of for the mean and for the covariance. These additional terms include the information provided by the prior function with the parameters and that were learned from data.
The approximated predictive distribution with observation noise is the following:
where . In the next section, we show how Eq. (4) can be used to model system dynamics and predict the next state of a physical system given the control input and current state.
3 Embedding physics-based knowledge
In this paper, we apply the EVGP model to learn the dynamics of a physical system. The state of the physical system can be modeled as follows:
where is the control input and is the measured output of the system. The symbol denotes concatenation and is the input to the EVGP model. For example, in the case of a mechatronic system: are the forces applied by the actuators (e.g. electric motors); is the position and velocity of the joints; is the output from the sensors.
We assume independent EVGP models for each output in Eq. (5). The function in Eq. (5) is modeled using the EVGP model from Eq. (1) and Eq. (4). In the following sections we present how simple Newtonian mechanics can be used to define useful priors for the EVGP model.
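A minimal sketch of this one-step prediction scheme, assuming one independent model per output dimension; the `models` list and its callable interface are illustrative assumptions, not the paper's API:

```python
import numpy as np

def next_state(models, x, u):
    """One-step prediction with an independent model per state dimension.
    Each model maps the concatenated [x, u] to one output coordinate."""
    z = np.concatenate([x, u])          # EVGP input: current state and control
    return np.array([m(z) for m in models])
```

In the paper's setting each entry of `models` would be a trained EVGP posterior mean; here any callables work, which makes the decomposition into independent per-output models explicit.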
3.1 Priors from simple Newtonian dynamics
Figure (a) shows a simple example of a single rigid-body link. The simplest model that we can use for this system comes from Newton’s second law , where is the torque applied to the system,
is the moment of inertia, and is the angle of the pendulum. Using the Euler discretization method, we obtain the following state-space representation that serves as the prior for our EVGP model:
We refer to this prior as IF (inertia+force). is the discretization time and is the state of the system. The IF prior in Eq. (6) does not include gravitational effects. Gravity pulls the link to the downward position with a force proportional to . Hence, a prior that considers gravitational forces can be constructed by including :
we call this prior IFG (inertia+force+gravity). We purposely did not define and . One of the advantages of the presented approach is that the algorithm can learn these parameters from data if they are not available. If the user does not know the value of and
, a prior with a large standard deviation can be provided for these parameters. Although parameters like and are easy to compute for a simple pendulum, for more complex systems they may be hard and tedious to obtain. Our objective is to take advantage of domain knowledge and allow the model to fill in the gaps in knowledge.
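The IF and IFG priors above can be sketched as explicit feature functions. The discretization step and parameter names below are illustrative assumptions (the paper leaves them as learnable quantities), not values from the paper:

```python
import numpy as np

DT = 0.02  # discretization time step; an assumed value for illustration

def if_prior(x, u, inv_inertia):
    """IF (inertia+force) prior: Euler step of Newton's second law.
    x = [theta, omega], u = applied torque, inv_inertia = 1/I (learnable)."""
    theta, omega = x
    return np.array([theta + DT * omega,
                     omega + DT * inv_inertia * u])

def ifg_prior(x, u, inv_inertia, g_coef):
    """IFG prior: IF plus a gravity term proportional to sin(theta).
    The sign convention for gravity is an assumption of this sketch."""
    theta, omega = x
    return np.array([theta + DT * omega,
                     omega + DT * (inv_inertia * u - g_coef * np.sin(theta))])
```

Note how the IF prior leaves the state unchanged when velocity and input are zero, which is one of the properties the priors are meant to convey.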
3.2 Simplified priors for Acrobot and Cartpole
In addition to the pendulum, we consider the Acrobot and Cartpole systems in our analysis. For these systems, we consider much simpler priors than the exact dynamic models derived from multi-body dynamics. We use the same principles shown in the previous section in order to get simple priors for the Acrobot and Cartpole.
where , and . In this case, the input only drives the second link. The idea behind these priors is to construct an intuitive and simple model from "noisy" Newtonian dynamics. These priors are extremely simple, as they do not consider friction or Coriolis/centrifugal forces. However, they provide important information about the mechanics of the system. These priors convey the following information:
The position should increase proportional to the velocity by a factor of .
The position should stay the same if the velocity is zero.
The velocity should stay the same if no external forces are applied.
For the IFG prior, gravity pulls the links to the downward position proportional to the sine of the angle w.r.t. the horizontal plane. Gravity has no effect when the links are completely down/up.
The objective with these priors is to demonstrate how extremely simplified priors, extracted with simple physics, can be used to improve the performance of data-driven models. The IF and IFG priors for the Cartpole (Figure (b)) are constructed using the same principles:
4 Experiments
In order to evaluate the performance of the presented model, we performed experiments on a set of simulated systems: Pendulum, Cartpole and Acrobot. We also performed qualitative tests on a toy dataset to visualize the behavior of the EVGP model.
We used Drake drake to simulate the Pendulum, Cartpole and Acrobot systems and obtain the control/sensor data used to train and evaluate the EVGP models. We used the squared exponential function for the covariance kernels, since all the experiments involve continuous systems. The EVGP model was implemented using TensorFlow, and the minimization of the negative ELBO loss was done using the ADAM optimizer kingma2014adam . The experiments were run on a computer with a single GPU (Nvidia Quadro P5000) and an Intel(R) Xeon(R) CPU (E3-1505M at 3.00GHz).
4.1 Experiments on Toy Dataset
The toy dataset is intended to illustrate the behavior of the EVGP model and to visualize the qualitative differences between several GP models. We use a modified version of the toy dataset used in snelson2006sparse titsias2009variational . The dataset (obtained from http://www.gatsby.ucl.ac.uk/~snelson/SPGP_dist.tgz) is modified as follows:
The modification is intended to provide a global linear trend to the data (see Figure 11). Figure 11 shows the distribution learned using different versions of a Gaussian Process. Figures (a) and (c) show the exact posterior distributions for a GP and an EGP rasmussen2004gaussian model, respectively. Figures (b) and (d) show the variational approximations obtained with a VGP hensman2013gaussian and an EVGP model. The standard deviation (black line) is used to visualize the uncertainty. The figures show how the uncertainty grows as we move away from the training dataset.
The original dataset is composed of 200 samples. Figure 11 shows that the variational approximations successfully approximate their exact counterparts with as few inducing points as m=10. The positions of the inducing points are shown with green crosses.
In this case, the prior knowledge that we provide to the EVGP is a simple linear function . Figure (d) shows how the prior can be used to control the global shape of the function. The figure shows how the EGP and EVGP models use the prior knowledge to fit the global behavior of the data (linear) while using the kernels to model the local non-linear behavior.
4.2 Learning system dynamics
We evaluated the performance of the EVGP model in learning system dynamics using data obtained from simulations of the Pendulum, Cartpole and Acrobot systems. Concretely, we evaluated the accuracy of the EVGP model with IF and IFG priors in predicting the next state of the system given the current control inputs and state.
Data: to evaluate the difference in generalization, we sampled two different datasets for each system: one for training and one for testing. The datasets were sampled by simulating the system using random initial states and random control inputs drawn from uniform and normal distributions, respectively. Table 2 shows the values of the scales () that were used to sample the trajectories. These values were chosen in order to cover at least the range () for the angles on the systems. In Table 2, refers to the number of sampled trajectories, refers to the total number of samples. All trajectories were sampled for 100 time steps and .
| System | State scales | Input scale | Trajectories | Total samples |
| Cartpole | [1.0, , 0.5, 0.5] | 100.0 | 48 | 4800 |
| Acrobot | [, 1.0, 0.5, 0.5] | 0.5 | 93 | 9300 |
| RES-DBNN | [15, 15] | [50, 50] | [60, 60] |
We compare the EVGP model with a standard VGP model, a residual VGP (RES-VGP), and a residual Deep Bayesian Neural Network (RES-DBNN). The VGP model is based on hensman2013gaussian and uses a zero-mean prior. The residual VGP and DBNN models assume the system can be approximated as . Approximating residuals is a common approach used to simplify the work for GP and DBNN models deisenroth2011pilco gal2016improving marino2019modeling . The RES-DBNN model is trained using the local reparameterization trick presented in kingma2015variational with Normal variational distributions for the weights. For this set of experiments, we did not consider the exact GP and EGP models given the large number of samples in the training datasets.
Table 2 also shows the number of inducing points (m) used for the VGP and EVGP models, as well as the number of hidden units for the DBNN model, where [15, 15] means a two-layer network with 15 units in each layer. We used the LeakyReLU activation function.
Table 3 shows the space and time complexity of the models considered in this paper. In this table, is the number of inducing points, is the number of outputs, is the number of hidden layers, and is the number of hidden units in each layer. To simplify the analysis, we assume all hidden layers of the DBNN have the same number of hidden units. The table shows that the complexity of the VGP and EVGP models are governed by the matrix inversion . Because we assume completely independent VGP and EVGP models for each output of the system, their complexity also depends on the number of outputs
. The complexity of the DBNN model is governed by the matrix-vector products between the weight matrices and the hidden activation vectors. All models have constant space complexity w.r.t. the training dataset size. Furthermore, all models have linear time complexity w.r.t. if we assume that training requires visiting each sample at least once.
Metrics: for comparison, we used two metrics: 1) prediction error, 2) containing ratios (CR). Prediction error is computed as the difference between sampled values () and the expected estimated output ():
where is the number of samples in the respective dataset. The expected output () for the EVGP model is equal to in Eq. (4). For the DBNN model, the expectation is estimated using Monte-Carlo.
The containing ratios (CR) are the percentage of values covered by the estimated distribution . We consider containing ratios for one, two and three standard deviations (CR-1, CR-2, and CR-3 respectively).
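A sketch of the two metrics, under the assumption that the prediction error is a root-mean-square difference (the paper's exact normalization is not fully specified here) and that the predictive distribution is summarized by a per-sample mean and standard deviation:

```python
import numpy as np

def prediction_error(y, mean):
    """Assumed reading of the prediction error: RMSE between targets and
    the expected predictive output."""
    return np.sqrt(np.mean((y - mean) ** 2))

def containing_ratio(y, mean, std, k=1):
    """Fraction of targets that fall within k predictive standard
    deviations of the predictive mean (CR-1, CR-2, CR-3 for k=1,2,3)."""
    return np.mean(np.abs(y - mean) <= k * std)
```

A well-calibrated Gaussian predictive distribution should give CR-1, CR-2 and CR-3 near 68%, 95% and 99.7%, which is why CR-3 scores close to 100% indicate good uncertainty estimates.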
Results: Table 4 shows the prediction error and CR scores obtained on the testing dataset. EVGP-IF and EVGP-IFG refer to the use of an IF or IFG prior, respectively. We observe a considerable improvement in the testing error and CR-3 scores when using EVGP models. We also see a progressive improvement in the testing error when using more detailed priors: the IFG prior results in lower prediction errors than the IF prior. Also, the residual models have a lower error than the standard VGP model. The EVGP-IFG model provided the estimations with the lowest prediction error and CR-3 scores close to 100%.
Table 4 also shows that the EVGP model required the lowest number of parameters. This translates into lower training and inference times with lower memory cost.
Figure 14 shows a comparison of the prediction error on the test dataset as we increase the number of training samples. For this experiment, we kept the testing dataset fixed while samples were progressively aggregated into the training dataset. The figure shows the mean, max and min values obtained over four independent runs. Figure (a) shows a comparison that includes all models. As expected, the prediction error is reduced as we increase the size of the training dataset. The figure shows that the EVGP provides the most accurate predictions while requiring fewer samples.
Figure (a) also shows how the performance of VGP and RES-VGP plateaus, struggling to take advantage of larger datasets. Although the RES-DBNN performs poorly with small training datasets, its high capacity allows it to take advantage of large datasets and improve accuracy, reducing the performance gap w.r.t. the EVGP models as more data becomes available. Thanks to its lower computational time-cost (see Table 3), the RES-DBNN can use a larger set of parameters without incurring excessive training times.
Figure (b) shows a scaled version that only considers the EVGP model with different priors. This figure shows that the IFG prior provides more accurate predictions than the IF prior. In the case of the pendulum, the IFG prior provides a highly accurate model of the system, requiring only a small number of training samples. Figure (b) also shows that, as training data is aggregated, the accuracy gap between the IF and IFG priors is reduced.
The priors that we use are extremely simple: they ignore friction and Coriolis/centrifugal effects. Nonetheless, we observe a considerable performance improvement after providing our data-driven model with basic information through the IF and IFG priors.
Understanding the learned model: one of the advantages of incorporating domain knowledge is that the learned model is easy to interpret by the domain expert. For example, in the case of the Acrobot, the value of the parameter can be visualized to understand and debug what the model has learned. Figure 17 shows the value of learned with the IF (Fig. (a)) and IFG (Fig. (b)) priors.
We observe in Figure 17 that the learned parameters follow a similar structure given by the prior (see Eq. 8). In our experiments, we did not enforce the sparse structure from the priors, i.e. zero parameters in the prior are allowed to have non-zero values in the posterior.
Figure (a) shows that when using the IF prior, the EVGP model compensates for the missing gravity information by assigning negative values to and . The reason for this behavior is that sin(θ) ≈ θ for small θ; this approximation is no longer valid for large θ. When using IFG priors (Fig. (b)), we observe that the model no longer assigns negative values to and . The reason is that IFG provides the values of and , which help to model the effect of gravity more accurately.
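This small-angle reasoning, which explains why a linear term can stand in for gravity only near the upright/downward positions, can be checked numerically:

```python
import numpy as np

# Small-angle approximation: sin(theta) ~ theta near zero, but the
# approximation degrades for large angles, so a linear gravity term fails there.
small_gap = abs(np.sin(0.1) - 0.1)   # tiny discrepancy near zero
large_gap = abs(np.sin(2.0) - 2.0)   # large discrepancy for a large angle
```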
5 Related work
Incorporating prior scientific knowledge into machine-learning algorithms is an ongoing effort that has recently gained increased attention. Convolutional neural networks (CNNs) have been used in modeling and simulation applications such as forecasting of sea surface temperature de2017deep and efficient simulation of the Navier-Stokes equations tompson2017accelerating
. Gaussian Processes (GPs) provide a general-purpose non-parametric model that has been used for system identification and control under uncertainty deisenroth2011pilco bijl2017system . Previous work has explored using GPs to include partial model information hall2012modelling . In gray2018hybrid , a GP model is used to correct a highly simplified physics model and improve accuracy. Our work is based on the GP model with explicit features presented in rasmussen2004gaussian . Variations of this model are commonly used in the calibration of computer models kennedy2001bayesian li2016integrating .
Despite the advantages of GP models for modeling complex non-linear relations with uncertainty, GPs are computationally expensive. A large body of research has focused on improving the computational requirements of GPs. Sparse GP approximation methods are some of the most common approaches for reducing GP computation cost quinonero2005unifying . Bayesian approximation techniques such as Variational Inference provide a rich toolset for dealing with large quantities of data and highly complex models kingma2013auto titsias2009variational . Variational approximations of a sparse GP have been explored in titsias2009variational hensman2013gaussian . In frigola2014variational , a variational GP model is presented for nonlinear state-space models. In gal2016improving marino2019modeling
, Deep Bayesian Neural Networks (DBNNs) are proposed as an alternative to GPs in order to improve scalability in reinforcement learning problems. Given the popularity of GP models and Variational Inference, there is increased interest in developing automated variational techniques for these types of models nguyen2014automated kucukelbir2017automatic .
6 Conclusion
In this paper, we presented the EVGP model, a variational approximation of a Gaussian Process which uses domain knowledge to define the mean function of the prior. We compared the performance of the EVGP model against purely data-driven approaches and demonstrated improved accuracy and interpretability after incorporating simple priors derived from Newtonian mechanics. The EVGP also provided higher accuracy with smaller training datasets. The priors provided a rough but simple approximation of the mechanics, informing the EVGP of important structure in the real system.
- (1) C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
- (2) A. Adadi and M. Berrada, “Peeking inside the black-box: A survey on explainable artificial intelligence (XAI),” IEEE Access, vol. 6, pp. 52 138–52 160, 2018.
- (3) R. Frigola, Y. Chen, and C. E. Rasmussen, “Variational gaussian process state-space models,” in Advances in Neural Information Processing Systems, 2014, pp. 3680–3688.
- (4) D. Marino, K. Amarasinghe, M. Anderson, N. Yancey, Q. Nguyen, K. Kenney, and M. Manic, “Data driven decision support for reliable biomass feedstock preprocessing,” in Resilience Week (RWS), 2017. IEEE, 2017, pp. 97–102.
- (5) C. E. Rasmussen, “Gaussian processes in machine learning,” in Advanced lectures on machine learning. Springer, 2004, pp. 27–29.
- (6) J. Quiñonero-Candela and C. E. Rasmussen, “A unifying view of sparse approximate gaussian process regression,” Journal of Machine Learning Research, vol. 6, no. Dec, pp. 1939–1959, 2005.
- (7) M. Titsias, “Variational learning of inducing variables in sparse gaussian processes,” in Artificial Intelligence and Statistics, 2009, pp. 567–574.
- (8) R. Tedrake and the Drake Development Team, “Drake: A planning, control, and analysis toolbox for nonlinear dynamical systems,” 2016. [Online]. Available: https://drake.mit.edu
- (9) D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
- (10) E. Snelson and Z. Ghahramani, “Sparse gaussian processes using pseudo-inputs,” in Advances in neural information processing systems, 2006, pp. 1257–1264.
- (11) J. Hensman, N. Fusi, and N. D. Lawrence, “Gaussian processes for big data,” arXiv preprint arXiv:1309.6835, 2013.
- (12) M. Deisenroth and C. E. Rasmussen, “Pilco: A model-based and data-efficient approach to policy search,” in Proceedings of the 28th International Conference on machine learning (ICML-11), 2011, pp. 465–472.
- (13) Y. Gal, R. T. McAllister, and C. E. Rasmussen, “Improving pilco with bayesian neural network dynamics models,” in Data-Efficient Machine Learning workshop, 2016.
- (14) D. Marino and M. Manic, “Modeling and planning under uncertainty using deep neural networks,” IEEE Transactions on Industrial Informatics, pp. 1–1, 2019.
- (15) D. P. Kingma, T. Salimans, and M. Welling, “Variational dropout and the local reparameterization trick,” in Advances in Neural Information Processing Systems, 2015, pp. 2575–2583.
- (16) E. de Bezenac, A. Pajot, and P. Gallinari, “Deep learning for physical processes: Incorporating prior scientific knowledge,” arXiv preprint arXiv:1711.07970, 2017.
- (17) J. Tompson, K. Schlachter, P. Sprechmann, and K. Perlin, “Accelerating eulerian fluid simulation with convolutional networks,” in Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org, 2017, pp. 3424–3433.
- (18) H. Bijl, T. B. Schön, J.-W. van Wingerden, and M. Verhaegen, “System identification through online sparse gaussian process regression with input noise,” IFAC Journal of Systems and Control, vol. 2, pp. 1–11, 2017.
- (19) J. Hall, C. Rasmussen, and J. Maciejowski, “Modelling and control of nonlinear systems using gaussian processes with partial model information,” in 2012 IEEE 51st IEEE Conference on Decision and Control (CDC). IEEE, 2012, pp. 5266–5271.
- (20) F. M. Gray and M. Schmidt, “A hybrid approach to thermal building modelling using a combination of gaussian processes and grey-box models,” Energy and Buildings, vol. 165, pp. 56–63, 2018.
- (21) M. C. Kennedy and A. O’Hagan, “Bayesian calibration of computer models,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 63, no. 3, pp. 425–464, 2001.
- (22) W. Li, S. Chen, Z. Jiang, D. W. Apley, Z. Lu, and W. Chen, “Integrating bayesian calibration, bias correction, and machine learning for the 2014 sandia verification and validation challenge problem,” Journal of Verification, Validation and Uncertainty Quantification, vol. 1, no. 1, p. 011004, 2016.
- (23) D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.
- (24) T. V. Nguyen and E. V. Bonilla, “Automated variational inference for gaussian process models,” in Advances in Neural Information Processing Systems, 2014, pp. 1404–1412.
- (25) A. Kucukelbir, D. Tran, R. Ranganath, A. Gelman, and D. M. Blei, “Automatic differentiation variational inference,” The Journal of Machine Learning Research, vol. 18, no. 1, pp. 430–474, 2017.
- (26) Y. Gal and Z. Ghahramani, “Dropout as a bayesian approximation: Representing model uncertainty in deep learning,” in International Conference on Machine Learning, 2016, pp. 1050–1059.
Appendix A Variational Inference
In a Bayesian learning approach, we model the parameters of our model
using probability distributions. The parameters are given a prior distribution that represents our prior knowledge about the model before looking at the data. Given a dataset , we would like to obtain the posterior distribution of our parameters following Bayes’ rule:
Variational Inference (VI) provides a tool for approximating the posterior using a variational distribution parameterized by . In other words, with VI we find the value of such that . The parameters of the distribution are found by maximizing the Evidence Lower Bound (ELBO) between the approximate and real distributions:
Maximizing the ELBO is equivalent to minimizing the KL divergence between the variational distribution and the real distribution.
Having obtained the variational approximation , we can approximate the predictive distribution as follows:
which is the approximated variational predictive distribution (see Section 2.3):
Appendix B ELBO
Given the training dataset , the parameters of are learned by minimizing the negative Evidence Lower Bound (ELBO). For the EVGP, the negative ELBO takes the following form:
where denotes the KL divergence between the variational posterior and the prior . Note that the inner expectation in Eq. (10) is taken w.r.t. , presented in Eq. (2). Following an approach similar to hensman2013gaussian , we apply Jensen’s inequality to the inner expectation of Eq. (10):
where , . This allows us to express the ELBO in a way that simplifies the computation of the expectations w.r.t. the parameters . Now, the variational loss in Eq. (3) can be obtained by simply computing the expectation w.r.t. the model parameters:
where , and . The value of the KL-divergence is simply the sum of the divergence for both parameters:
Appendix C Mini-batch optimization
In order to make the model scalable to very large datasets, the ELBO can be optimized using mini-batches. Following gal2016dropout , and assuming the samples are i.i.d., the loss for a mini-batch composed of samples can be expressed as follows:
where is the total number of samples in the training dataset.
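A minimal sketch of this rescaling, assuming the batch likelihood term is summed over the mini-batch (the function and argument names are our own):

```python
def minibatch_negative_elbo(batch_nll, kl, n_total, n_batch):
    """Mini-batch estimate of the negative ELBO: the batch likelihood term
    is rescaled by N/M so its expectation matches the full-data loss, while
    the KL term is counted once per update."""
    return (n_total / n_batch) * batch_nll + kl
```

Rescaling by N/M keeps the likelihood and KL terms correctly balanced regardless of batch size; without it, small batches would let the KL term dominate and over-regularize the posterior.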