1 Introduction
Due to their nonparametric nature, Gaussian processes (GPs) (Rasmussen & Williams, 2006) are a powerful, flexible and principled approach to nonlinear probabilistic regression. In the past years, GPs have had a substantial impact in various research areas, including reinforcement learning (Rasmussen et al., 2003; Park et al., 2011), preference learning (Chu & Ghahramani, 2005) and parameter optimization (Rana et al., 2017). However, a limitation of the GP model is that learning scales cubically, i.e. $\mathcal{O}(N^3)$, in the number of observations $N$, and has a memory consumption of $\mathcal{O}(N^2 + ND)$, where $D$ is the number of dimensions, making it impractical for large datasets.
Several approaches to overcome the computational complexity of GPs have been proposed, which can be categorized into two main strategies. The first strategy employs sparse approximations of GPs (Williams & Seeger, 2000; Candela & Rasmussen, 2005; Hensman et al., 2013; Gal & Turner, 2015; Bauer et al., 2016), reducing the inference cost to cubic (time) and quadratic (memory) in the number of so-called inducing points, rather than in the number of samples. By doing so, sparse approximations allow GPs to scale up to reasonably large datasets, comprising millions of samples, while typically using only a few hundred inducing points.
As an alternative to sparse approximations, the second strategy distributes the computations over local models, or hierarchies thereof (Shen et al., 2005; Ng & Deisenroth, 2014; Cao & Fleet, 2014; Deisenroth & Ng, 2015). These approaches are usually related to the Bayesian Committee Machine (BCM) (Tresp, 2000) or the Product of Experts (PoE) (Ng & Deisenroth, 2014) approach. However, as discussed by Deisenroth and Ng (Deisenroth & Ng, 2015), these approaches have the shortcoming that either the predictive variance vanishes with an increasing number of local models, or the posterior mean suffers from weak experts.
In this paper, we propose a natural and expressive framework which falls into the latter strategy of hierarchically composing local GP models. In particular, we combine GPs with sum-product networks (SPNs), an expressive class of probabilistic models allowing exact and efficient inference. SPNs represent probability distributions by recursively utilizing factorizations (product nodes) and mixtures (sum nodes) according to a directed acyclic graph (DAG). The base of this recursion is reached at the leaves of this DAG, which represent user-specified input distributions, each defined only over a sub-scope of the involved random variables. SPNs typically allow reducing the mechanism of inference to the corresponding mechanisms at the leaves. For example, marginalization in SPNs is performed by computing the corresponding marginalization tasks at the leaves and evaluating the inner nodes (sums and products) as usual. Thus, marginalization in SPNs runs in time linear in the network size, plus the inference costs at the leaves. So far, SPNs have mainly been applied to density estimation tasks akin to graphical models, typically using Gaussian or categorical distributions as leaves.
However, the crucial insight exploited in this paper is that SPNs are a sound language for any probabilistic model used as leaves. In particular, nothing prevents us from using GPs as the leaves of the SPN, immediately yielding an “SPN over GPs”, or in other words a deep, hierarchically structured mixture of local GP experts. It is easy to see that this model represents a prior over functions, which has, to the best of our knowledge, not been considered before.
We show that, in contrast to prior work, our approach does not suffer from vanishing predictive variances, allows for exact posterior inference and is flexible enough to be used for non-stationary data. Further, our model is able to mix over different kernel functions weighted by their plausibility, reducing the effect of weak experts.
The rest of the paper is structured as follows. After reviewing related work and background on SPNs and GPs, Section 3 introduces our SPN-GP model, presents a generic structure learning approach for SPN-GPs, discusses hyperparameter learning and shows how to perform exact posterior inference. We assess the performance of our approach qualitatively and quantitatively in Section 4. Finally, we discuss and conclude our work in Section 5.
2 Related Work
2.1 Sum-Product Networks
A sum-product network (SPN) (Darwiche, 2003; Poon & Domingos, 2011) over a set of random variables (RVs) $\mathbf{X}$ is a rooted DAG over three types of nodes, namely sums (denoted as $\mathsf{S}$), products (denoted as $\mathsf{P}$) and leaf distributions (denoted as $\mathsf{L}$). A generic node is denoted as $\mathsf{N}$ and the overall SPN as $\mathcal{S}$. The leaves are distributions over some subset of random variables $\mathbf{Y} \subseteq \mathbf{X}$, pre-specified by the user, where often univariate parametrized distributions (e.g. Gaussians (Poon & Domingos, 2011; Peharz et al., 2014)) are assumed. This, however, is no requirement, as in fact any kind of leaf distribution might be used in the SPN framework. In particular, we will use GPs as SPN leaves in this paper.
An internal node $\mathsf{N}$ computes either a weighted sum or a product, i.e. $\mathsf{S}(x) = \sum_{\mathsf{C} \in \mathbf{ch}(\mathsf{S})} w_{\mathsf{S},\mathsf{C}}\, \mathsf{C}(x)$ or $\mathsf{P}(x) = \prod_{\mathsf{C} \in \mathbf{ch}(\mathsf{P})} \mathsf{C}(x)$, where $\mathbf{ch}(\mathsf{N})$ denotes the children of node $\mathsf{N}$. Note that each edge $(\mathsf{S}, \mathsf{C})$ emanating from a sum node has a non-negative weight $w_{\mathsf{S},\mathsf{C}}$, where w.l.o.g. we assume (Peharz et al., 2015) that $\sum_{\mathsf{C} \in \mathbf{ch}(\mathsf{S})} w_{\mathsf{S},\mathsf{C}} = 1$.
We require SPNs to be complete and decomposable, two conditions which are naturally expressed via the scope of nodes. For a leaf $\mathsf{L}$, the scope is defined as the set of random variables the leaf is a distribution over, i.e. $\mathbf{sc}(\mathsf{L}) \subseteq \mathbf{X}$. For an internal node $\mathsf{N}$ (sum or product), the scope is defined as $\mathbf{sc}(\mathsf{N}) = \bigcup_{\mathsf{C} \in \mathbf{ch}(\mathsf{N})} \mathbf{sc}(\mathsf{C})$. An SPN is complete if, for each sum node, all children of the sum node have an identical scope. An SPN is decomposable if, for each product node, all children have non-overlapping scopes. These two conditions ensure that sum nodes are proper mixture distributions (with their children as components) and product nodes are factorized distributions (assuming independence among their children). Thus, SPNs can be seen as hierarchically structured mixture models, recursively using mixtures (sum nodes) and factorizations (product nodes), where the recursion base is reached at the leaf distributions. Each node in an SPN represents a distribution over its respective scope, but typically the distribution at the root of the SPN is used as the model. By this construction, SPNs are both a flexible modelling language and allow exact and efficient inference. Due to decomposability and completeness, marginalization tasks reduce to the corresponding marginalization tasks at the leaves, over their respective smaller scopes. Conditioning can be tackled likewise.
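To make these conditions concrete, the following is a minimal sketch of SPN evaluation with Gaussian leaves (not the paper's implementation; all class and variable names are ours). It shows how completeness and decomposability are checked, and how marginalization reduces to the leaves: a marginalized-out leaf simply evaluates to 1.

```python
import math

class Leaf:
    """Univariate Gaussian leaf over a single variable index."""
    def __init__(self, var, mu, sigma):
        self.scope, self.mu, self.sigma = {var}, mu, sigma
    def log_p(self, x):
        v = next(iter(self.scope))
        if x[v] is None:            # marginalized variable: leaf integrates to 1
            return 0.0
        d = (x[v] - self.mu) / self.sigma
        return -0.5 * d * d - math.log(self.sigma * math.sqrt(2 * math.pi))

class Sum:
    """Mixture node; completeness: all children share the same scope."""
    def __init__(self, weights, children):
        assert all(c.scope == children[0].scope for c in children)
        self.scope, self.weights, self.children = children[0].scope, weights, children
    def log_p(self, x):
        return math.log(sum(w * math.exp(c.log_p(x))
                            for w, c in zip(self.weights, self.children)))

class Product:
    """Factorization node; decomposability: children have disjoint scopes."""
    def __init__(self, children):
        scopes = [c.scope for c in children]
        assert sum(len(s) for s in scopes) == len(set().union(*scopes))
        self.scope, self.children = set().union(*scopes), children
    def log_p(self, x):
        return sum(c.log_p(x) for c in self.children)

# p(x0, x1) = 0.3 N(x0|0,1) N(x1|0,1) + 0.7 N(x0|2,1) N(x1|2,1)
spn = Sum([0.3, 0.7], [Product([Leaf(0, 0.0, 1.0), Leaf(1, 0.0, 1.0)]),
                       Product([Leaf(0, 2.0, 1.0), Leaf(1, 2.0, 1.0)])])
joint = spn.log_p({0: 1.0, 1: 1.0})
marg  = spn.log_p({0: 1.0, 1: None})   # p(x0 = 1), x1 marginalized at the leaves
```

The marginal is obtained in a single upward pass, i.e. in time linear in the network size, as discussed above.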
Various ways of learning the parameters of SPNs, i.e. the weights of the sum nodes and the parameters of the leaf distributions, have been proposed. By interpreting the SPN as a latent variable model, the parameters can be learned using expectation-maximization (Poon & Domingos, 2011; Peharz et al., 2017), Bayesian learning (Rashwan et al., 2016; Zhao et al., 2016a; Trapp et al., 2016) or a concave-convex procedure (Zhao et al., 2016b). Furthermore, Gens and Domingos (Gens & Domingos, 2012) trained SPNs discriminatively using gradient descent. Subsequently, Trapp et al. (Trapp et al., 2017) introduced a safe semi-supervised learning scheme for discriminative and generative parameter learning, providing guarantees for the performance in the semi-supervised case. A major challenge in SPNs is learning a valid network structure. Besides hand-crafted approaches (Poon & Domingos, 2011; Peharz et al., 2014), several structure learners have been proposed which try to find good structures in a data-driven way (Gens & Domingos, 2013; Peharz et al., 2013; Rooshenas & Lowd, 2014; Vergari et al., 2015; Adel et al., 2015; Trapp et al., 2016; Molina et al., 2018).

2.2 Gaussian Processes
A Gaussian process (GP) is defined as a collection of random variables $\{f(x)\}_{x \in \mathcal{X}}$, where any finite subset is Gaussian distributed, and where any two overlapping finite subsets are marginally consistent (Rasmussen & Williams, 2006). In this way, GPs can naturally be interpreted as distributions over functions $f\colon \mathcal{X} \to \mathbb{R}$, where in this paper we assume $\mathcal{X} \subseteq \mathbb{R}^D$. A GP is uniquely specified by a mean function $m(x)$ and a covariance function $k(x, x')$. Given $N$ pairs of inputs $x_n$ and outputs $f_n = f(x_n)$, the joint distribution of $\mathbf{f} = (f_1, \dots, f_N)^\top$
is given as a joint Gaussian:

$$\mathbf{f} \mid \mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \mathbf{K}), \tag{1}$$

where the rows of $\mathbf{X}$ are given by $x_n^\top$, and mean and covariance are given as

$$\mu_n = m(x_n), \qquad K_{ij} = k(x_i, x_j). \tag{2}$$
For regression, we assume $y = f(x) + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma_\epsilon^2)$, where $\epsilon$ is i.i.d. Gaussian noise. Let $\theta = \{\sigma_\epsilon^2, \boldsymbol{\psi}\}$ denote the parameters of the GP, for which $\sigma_\epsilon^2$ is the variance of the noise model and $\boldsymbol{\psi}$ are kernel-function-specific hyperparameters, e.g. in the case of a squared exponential kernel function consisting of the variance $\sigma_f^2$ of the latent function and the respective lengthscales $\ell_1, \dots, \ell_D$. Given hyperparameters, a training set $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^N$ and a test point $x_*$, our main interest is in the posterior predictive distribution of the corresponding function value $f_* = f(x_*)$. First, we consider the joint distribution of the training samples and the function values at the test locations,
i.e.

$$\begin{bmatrix} \mathbf{y} \\ f_* \end{bmatrix} \sim \mathcal{N}\left( \mathbf{0}, \begin{bmatrix} \mathbf{K} + \sigma_\epsilon^2 \mathbf{I} & \mathbf{k}_* \\ \mathbf{k}_*^\top & k_{**} \end{bmatrix} \right).$$

Therefore, for a single data point we arrive at the posterior predictive with mean and variance given by

$$\mathbb{E}[f_*] = \mathbf{k}_*^\top (\mathbf{K} + \sigma_\epsilon^2 \mathbf{I})^{-1} \mathbf{y}, \tag{3}$$

$$\mathbb{V}[f_*] = k_{**} - \mathbf{k}_*^\top (\mathbf{K} + \sigma_\epsilon^2 \mathbf{I})^{-1} \mathbf{k}_*, \tag{4}$$

where $k_{**} = k(x_*, x_*)$, and we use $\mathbf{k}_* = k(\mathbf{X}, x_*)$ as a shorthand notation to denote the vector of covariances between the test point and the training data.
As the performance of a GP depends on the selected hyperparameters, these are typically optimized by maximizing the (log) marginal likelihood, which is given by

$$\log p(\mathbf{y} \mid \mathbf{X}, \theta) = -\tfrac{1}{2} \mathbf{y}^\top (\mathbf{K} + \sigma_\epsilon^2 \mathbf{I})^{-1} \mathbf{y} - \tfrac{1}{2} \log \left| \mathbf{K} + \sigma_\epsilon^2 \mathbf{I} \right| - \tfrac{N}{2} \log 2\pi. \tag{5}$$

The main computational burdens in Equations 3–5 are the inversion of $\mathbf{K} + \sigma_\epsilon^2 \mathbf{I}$ and, in (5), additionally the computation of the determinant. In a naive implementation using Gaussian elimination, all three computations scale in $\mathcal{O}(N^3)$. These computational burdens, together with the memory requirements of $\mathcal{O}(N^2)$, make GPs impractical for reasonably large datasets.
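As a concrete illustration, Equations 3–5 fit in a few lines of numpy. The sketch below uses our own function and variable names, assumes a zero mean function and a squared exponential kernel, and replaces the explicit inverse with a Cholesky factorization, which is the standard numerically stable way to realize the $\mathcal{O}(N^3)$ solve.

```python
import numpy as np

def sq_exp(A, B, sf2=1.0, ell=1.0):
    """Squared exponential kernel between row-stacked inputs A (N,D) and B (M,D)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sf2 * np.exp(-0.5 * d2 / ell**2)

def gp_posterior(X, y, Xs, sn2=0.1):
    """Posterior mean/variance at test inputs Xs and log marginal likelihood."""
    K = sq_exp(X, X) + sn2 * np.eye(len(X))        # K + sigma_eps^2 I
    L = np.linalg.cholesky(K)                      # the O(N^3) step
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    Ks = sq_exp(X, Xs)                             # k_* vectors, one per test point
    mean = Ks.T @ alpha                            # Eq. (3)
    v = np.linalg.solve(L, Ks)
    var = np.diag(sq_exp(Xs, Xs)) - (v * v).sum(0) # Eq. (4)
    lml = (-0.5 * y @ alpha - np.log(np.diag(L)).sum()
           - 0.5 * len(X) * np.log(2 * np.pi))     # Eq. (5)
    return mean, var, lml
```

With a small noise variance the posterior mean nearly interpolates the training targets, while the log marginal likelihood provides the objective for hyperparameter optimization.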
3 Deep Mixtures of Gaussian Processes
One common strategy to reduce the inference cost in GP models is to perform the computations with independent experts, where each GP expert is responsible only for a small subset of the data points. First, let us consider the simple case of a single input $x \in \mathbb{R}$ and a single output $y \in \mathbb{R}$. A simple yet effective approach to reduce the inference complexity is to select a splitting point $s$ and define two local GPs, where one GP is an expert for the data points with $x \le s$ and the other for those with $x > s$. Thus, we have split an infinite collection of RVs into two collections, one indexed by $(-\infty, s]$ and the other by $(s, \infty)$. This can also be understood as a fixed gating network (Rasmussen & Ghahramani, 2001), which assigns all data points with $x \le s$ to one GP expert and all other points to the other GP expert. Instead of considering a single split point, we can generalize this idea to several split points $s_1, \dots, s_K$, each yielding a pair of GPs with a different independence assumption. Finally, by taking a mixture over the different splitting positions, we obtain a mixture of GP experts which replaces the unconditional independence with a conditional independence.
Incorporating the outlined idea into SPNs results in a natural definition of SPNs with GP leaves. More precisely, an SPN-GP (SPN with GP leaves) is a hierarchically structured mixture model which recursively combines mixtures over sub-SPN-GPs or GPs with subdivisions of the input space at product nodes. First, let us consider an SPN consisting of a single sum node and $K$ product nodes, each splitting the infinite collection of input RVs at a different split point $s_k$ into sub-collections. Further, let all weights of the network be uniform, i.e. $w_k = 1/K$ with $k = 1, \dots, K$. Therefore, the prior over latent functions of this SPN-GP is

$$p(\mathcal{F} \mid \theta) = \sum_{k=1}^{K} \frac{1}{K} \, p\big(f^{(k,1)} \mid \theta^{(k,1)}\big) \, p\big(f^{(k,2)} \mid \theta^{(k,2)}\big), \tag{6}$$

where $\mathcal{F} = \{f^{(k,j)}\}$ is the set of latent functions of the GP experts and $\theta = \{\theta^{(k,j)}\}$ denotes their respective hyperparameters. Clearly, the resulting model is a mixture of GP experts.
More generally, we can express the prior over latent functions of any SPN-GP in terms of a mixture over induced trees (Zhao et al., 2016a), i.e.

$$p(\mathcal{F} \mid \theta) = \sum_{\mathcal{T} \in \mathfrak{T}} \Bigg[ \prod_{(\mathsf{S},\mathsf{C}) \in \mathcal{T}} w_{\mathsf{S},\mathsf{C}} \Bigg] \prod_{\mathsf{L} \in \mathcal{T}} p\big(f_{\mathsf{L}} \mid \theta_{\mathsf{L}}\big), \tag{7}$$

where $\mathfrak{T}$ denotes the set of induced trees of the network. This is an exponentially large mixture, efficiently represented by the hierarchical structure of the SPN. For better distinguishability from product nodes which, as in classical SPNs, partition the output variables, we call product nodes that split a collection of input RVs at split points split nodes throughout the rest of the paper.
An illustration of an SPN-GP with split nodes, product nodes and different types of kernel functions is shown in Figure 1. As shown in the illustration, the SPN-GP model is a very flexible probabilistic regression model which allows one to: (1) mix over different block-diagonal covariance representations, (2) mix over GPs with different kernel functions, (3) hierarchically subdivide the input space, and (4) partition the dependent variables subject to the covariates. Further, this model allows accounting for structure in both the input and the output space.
As an SPN-GP model is a deep structured mixture model over GP experts, the predictive distribution for an unseen data point $x_*$ is a mixture over induced trees,

$$p(f_* \mid x_*, \theta) = \sum_{\mathcal{T} \in \mathfrak{T}} p(\mathcal{T}) \, \mathcal{N}\big(f_* \mid \mu_{\mathcal{T}}(x_*), \sigma^2_{\mathcal{T}}(x_*)\big), \tag{8}$$

and the computation of its mean and variance is given by

$$\mu(x_*) = \sum_{\mathcal{T} \in \mathfrak{T}} p(\mathcal{T}) \, \mu_{\mathcal{T}}(x_*), \tag{9}$$

$$\sigma^2(x_*) = \sum_{\mathcal{T} \in \mathfrak{T}} p(\mathcal{T}) \big( \sigma^2_{\mathcal{T}}(x_*) + \mu^2_{\mathcal{T}}(x_*) \big) - \mu^2(x_*), \tag{10}$$

where $p(\mathcal{T})$ is the product of the sum weights along tree $\mathcal{T}$, and $\mu_{\mathcal{T}}(x_*)$ and $\sigma^2_{\mathcal{T}}(x_*)$ are the predictive moments of the GP expert of the induced tree $\mathcal{T}$ responsible for $x_*$.
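The moment computations above are the standard law-of-total-variance formulas for a mixture. A minimal sketch (function name is ours): given the tree weights and the per-expert predictive moments at a test point, the mixture variance adds the spread of the expert means to the weighted expert variances.

```python
import numpy as np

def mixture_moments(weights, means, variances):
    """Mean and variance of a mixture from component weights/means/variances."""
    w = np.asarray(weights, dtype=float)
    mu_k = np.asarray(means, dtype=float)
    var_k = np.asarray(variances, dtype=float)
    mu = (w * mu_k).sum()                          # weighted mean of expert means
    var = (w * (var_k + mu_k**2)).sum() - mu**2    # law of total variance
    return mu, var
```

Note that the variance never collapses just because many experts are mixed: disagreement between expert means strictly increases it, which is why the predictive variance does not vanish.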
3.1 Exact Posterior Inference
In the Bayesian setting we wish to update the prior distribution over functions defined by an SPN-GP based on an observed dataset $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^N$. Under the usual i.i.d. assumption, the posterior can be written as

$$p(\mathcal{F} \mid \mathcal{D}, \theta) \propto p(\mathcal{F} \mid \theta) \prod_{n=1}^{N} p\big(y_n \mid f(x_n)\big). \tag{11}$$
Here $p(\mathcal{F} \mid \theta)$ is an SPN-GP, i.e. it can either be i) a mixture over SPN-GPs / GPs (sum node), ii) a product over SPN-GPs / GPs (product node), or iii) a GP (leaf). In the first case we can write (11) as

$$\sum_{\mathsf{C}} w_{\mathsf{C}} \Bigg[ p_{\mathsf{C}}(\mathcal{F} \mid \theta) \prod_{n \in \delta(\mathsf{S})} p\big(y_n \mid f(x_n)\big) \Bigg], \tag{12}$$
i.e. we can “pull” the likelihood terms over the sum, where $\delta(\mathsf{S})$ is the set of data point indices which are assigned to node $\mathsf{S}$. In case ii), we can write (11) as

$$\prod_{\mathsf{C}} \Bigg[ p_{\mathsf{C}}(\mathcal{F} \mid \theta) \prod_{n \in \delta(\mathsf{C})} p\big(y_n \mid f(x_n)\big) \Bigg], \tag{13}$$

where $\delta(\mathsf{C}) \subseteq \delta(\mathsf{S})$, with $x_n$ falling into the sub-region of child $\mathsf{C}$ for all $n \in \delta(\mathsf{C})$. Inductively repeating this argument for all internal nodes, we see that we obtain an unnormalized posterior by multiplying each GP leaf with its local likelihood terms. We can therefore efficiently perform exact posterior updates in SPN-GPs by application of posterior inference at the leaves and renormalization of the SPN (Peharz et al., 2015).
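The resulting recursion can be sketched as follows. This is our own pseudo-implementation, not the paper's code: node objects are assumed to carry a `kind` attribute, and `condition_on_local_data` / `log_marginal_lik` stand in for the exact GP conditioning and the GP marginal likelihood at a leaf. Each leaf is conditioned on its local data, each sum node rescales its weights by the children's local evidence, and the returned value is the node's log normalizer.

```python
import math

def posterior_update(node):
    """Update `node` in place; return its local log marginal likelihood."""
    if node.kind == "leaf":                       # GP leaf: exact conditioning
        node.condition_on_local_data()            # standard GP posterior update
        return node.log_marginal_lik()
    if node.kind == "product":                    # split node: children factorize
        return sum(posterior_update(c) for c in node.children)
    # sum node: rescale each weight by its child's evidence, then renormalize
    log_z = [math.log(w) + posterior_update(c)
             for w, c in zip(node.weights, node.children)]
    log_max = max(log_z)
    z = [math.exp(l - log_max) for l in log_z]    # stable exponentiation
    total = sum(z)
    node.weights = [zi / total for zi in z]       # posterior mixture weights
    return log_max + math.log(total)              # log marginal likelihood of node
```

A single bottom-up pass therefore yields the exact posterior, with all cubic-cost work confined to the (small) leaves.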
Figure 2: [Best viewed in colour] (left) Density region of an SPN-GP, coloured in grey, in comparison to a traditional GP, indicated in red (the solid line shows the mean and the dotted lines the 95% confidence interval). As shown, the subdivision of the input space can result in discontinuities in the density region. (right) However, these discontinuities can be drastically reduced by allowing training points at region boundaries to flow across the boundary. Note that the SPN-GP is able to select optimal kernel functions depending on the input.
3.2 Hyperparameter Optimization
As for GPs, we can find optimal hyperparameters by maximizing the marginal likelihood of an SPN-GP. Due to the decomposability of valid SPN-GPs, cf. the previous section, we can pull the integration over latent functions down to the leaf nodes of the network, i.e.

$$p(\mathbf{y} \mid \mathbf{X}, \theta) = \sum_{\mathcal{T} \in \mathfrak{T}} p(\mathcal{T}) \prod_{\mathsf{L} \in \mathcal{T}} \int p\big(\mathbf{y}_{\mathsf{L}} \mid \mathbf{f}_{\mathsf{L}}\big) \, p\big(\mathbf{f}_{\mathsf{L}} \mid \mathbf{X}_{\mathsf{L}}, \theta_{\mathsf{L}}\big) \, d\mathbf{f}_{\mathsf{L}}, \tag{14}$$

where $\mathbf{X}_{\mathsf{L}}$ and $\mathbf{y}_{\mathsf{L}}$ are the observed points and observed targets at leaf node $\mathsf{L}$. Therefore, to find appropriate hyperparameters for the GP experts in an SPN-GP, we can maximize the marginal likelihood of each expert independently, or maximize the marginal likelihood of the SPN-GP jointly if hyperparameters are tied.
3.3 Structure Learning
To learn the structure of SPN-GPs we extend the approach introduced by Poon and Domingos (Poon & Domingos, 2011) to construct network structures for different regression scenarios. Our algorithm first constructs a region graph (Poon & Domingos, 2011; Dennis & Ventura, 2012) which hierarchically splits the input space into small regions. The recursive subdivision stops if a constructed region contains fewer than a minimum number of points or if a region cannot be split into sub-regions anymore. Our approach constructs axis-aligned partitions at each level of the recursion, with the dimension in question being randomly selected. Starting from the region graph, each leaf region is equipped with multiple experts, each of which has an individual kernel function. These experts are combined in the network structure, and combinations of these experts, i.e. induced trees, are weighted according to their plausibility. Therefore, the proposed model naturally is a weighted mixture over local experts, with each component having multiple local experts with different kernel functions. We refer to the appendix for further details on the structure learning algorithm. Note that for uniformly distributed observations and equally sized sub-regions under each split node, SPN-GPs constructed with our structure learning algorithm reduce the computational complexity of GPs, i.e. $\mathcal{O}(N^3)$, to the cost of evaluating local GP experts on exponentially (in the depth) smaller subsets of the data, governed by the number of splitting schemes, the number of kernel functions and the depth of the network, i.e. the number of consecutive split nodes.

4 Experiments
To evaluate the effectiveness and investigate the properties of SPN-GPs for probabilistic regression, we conducted a series of qualitative and quantitative evaluations. The first two datasets, i.e. the motorcycle dataset and a synthetic dataset, are used to analyze the capacities of and challenges arising in the SPN-GP model. We then used three datasets from the UCI machine learning repository (Dheeru & Karra Taniskidou, 2017) to quantitatively evaluate the performance of SPN-GPs against linear models and traditional GPs, and additionally conducted an evaluation on the Kin40K dataset to assess the performance of SPN-GPs against other models which leverage hierarchies of local experts.
4.1 Qualitative Evaluation
To analyze the capacities of and the challenges arising in SPN-GPs, we qualitatively evaluated the SPN-GP model on the motorcycle dataset (Silverman, 1985) and on a synthetically generated time series with non-stationarities. The synthetic dataset was generated using the function demo_epinf described in (Vanhatalo et al., 2013) and has been used in prior work to assess the performance of GP models on non-stationary series, e.g. (Heinonen et al., 2016). We used a fixed minimum sub-region size, linear and squared exponential kernel functions for the motorcycle dataset, and linear and Matérn kernel functions for the synthetic dataset. In both cases we optimized the hyperparameters, while for the motorcycle dataset we fixed the noise variance. For comparison, we used a traditional GP with a squared exponential kernel function and optimized its hyperparameters, as in the case of the SPN-GP, by maximizing the marginal log-likelihood.
Figure 2 shows the density region of the SPN-GP in grey and illustrates the GP in red (mean and 95% confidence interval). We can see that the independence assumptions made by the SPN-GP result in discontinuities in the density region. In particular, one region boundary results in strong discontinuities due to the enforced independence assumption and the change of kernel function. However, as illustrated in Figure 2 (right-hand side), those discontinuities can be drastically reduced by accounting for data points lying close to a region boundary during posterior inference at the leaves. This approach lets us account for dependencies of points spatially close to the boundary. Note that care has to be taken to ensure the validity of the SPN-GP, e.g. updates of the weights have to exclude the additional boundary points. The approach is similar to one recently used in a hierarchical model (Ng & Deisenroth, 2014). We can observe that in both cases the SPN-GP is able to weight the different kernel functions according to their plausibility for each region independently. In particular, in regions with mostly linear dependencies, the linear kernel function was selected as the most plausible kernel, while in other regions the network prefers a squared exponential kernel function.
Figure 3 shows the density region of the SPN-GP and the mean and 95% confidence interval of a GP for non-stationary data with conditional heteroscedastic noise. Even for more complex functions, the SPN-GP is able to match the general shape of the traditional GP. Further, we can see in Figure 3 (right-hand side) that the SPN-GP model is able to adjust the noise variance parameter depending on the input (black line), approximately matching the true noise-generating function (blue line). Even though the SPN-GP model is not explicitly designed for non-stationary time series, the subdivision of the input space allows the model to naturally account for input-dependent hyperparameters. We believe that this is an exciting side effect of the SPN-GP model and illustrates its flexibility.

4.2 Quantitative Evaluation
To quantitatively evaluate SPN-GPs we computed the root mean squared error (RMSE) on different UCI datasets and on the Kin40K dataset. On the UCI datasets, we assessed the performance of the SPN-GP model against the mean prediction (Mean), linear least squares (LLS), ridge regression (Ridge) and a Gaussian process (GP) for those datasets where the computation is feasible on a MacBook Pro with 8 GB RAM. For the GP model, we used the best performing kernel function (out of linear, Matérn and squared exponential kernels) with fixed hyperparameters. We tried to optimize the hyperparameters of the GP but obtained worse results than with hand-picked parameters. We suspect that the decrease in performance is due to overfitting of the GP on the training examples. Therefore, for a fair comparison, we did not optimize the hyperparameters of either the GP or the SPN-GP model. Note that the hyperparameters have been selected in favour of the traditional GP model. We equipped the SPN-GP model with linear, Matérn and squared exponential kernel functions and learned the structure with a recursive subdivision of the input space into four equally sized sub-regions. We additionally evaluated the effect of overlapping regions, as discussed in Section 4.1, denoted as SPN-GP (overlap).

The resulting RMSE values and standard errors computed over five independent runs are listed in Table 1. On the Concrete dataset the SPN-GP model achieves competitive results compared to a GP regressor, while on the Energy dataset our model outperforms traditional GPs when overlapping regions are used. Further, we found a positive effect of reducing the discontinuities, cf. SPN-GP (overlap) in Table 1, on two out of three datasets, indicating that the performance of the SPN-GP sometimes suffers from the independence assumptions made in the model architecture.

Method            | Energy | Concrete | CCPP
Mean              |        |          |
LLS               |        |          |
Ridge             |        |          |
GP                |        |          |
SPN-GP            |        |          |
SPN-GP (overlap)  |        |          |
Additionally, we assessed the approximation error on the Kin40K dataset as described in (Deisenroth & Ng, 2015). Each data point in the Kin40K dataset represents the forward dynamics of an 8-link all-revolute robot arm, and the dependent variable is the distance of the end-effector from a target. We followed the approach described in (Deisenroth & Ng, 2015), used the hyperparameters of a squared exponential kernel (with ARD) estimated using a full GP, and compared the RMSE for different numbers of data points per expert. As our structure learning algorithm does not guarantee sub-regions with an equal number of training points, we report the average number of training points per expert for SPN-GPs. The performance of SPN-GPs compared to state-of-the-art approaches is shown in Figure 4. Note that the model architecture used in this experiment is comparable to prior work, i.e. only a single kernel function, but mixing over split positions, is used, and it is not more expressive than a full GP. As shown in Figure 4, the RMSE of SPN-GPs rises only slightly with a declining number of training points per expert, indicating that our approach does not suffer from vanishing predictive variances or the effect of weak experts.
5 Discussion and Conclusion
In this work, we have introduced sum-product networks with Gaussian process leaves (SPN-GPs) as an efficient and effective probabilistic nonlinear regression model. In our model, the inference cost is effectively reduced in a divide-and-conquer fashion by recursively subdividing the input space into small sub-regions and distributing the inference onto local experts. We showed that SPN-GPs have a variety of appealing properties, such as (1) computational and memory efficiency, (2) efficient and exact posterior inference, (3) the ability to weight different kernel functions according to their plausibility, and (4) the ability to naturally account for non-stationarities in the data.
We found that the SPN-GP model is on par with or outperforms traditional GP models. Moreover, we demonstrated that our model does not suffer from vanishing predictive variances or the effect of weak experts. In addition, we assessed the properties and challenges of our model qualitatively. We found that SPN-GPs naturally account for non-stationarities in time series and are able to effectively infer the appropriate kernel function depending on the input. Our experiments indicate that the SPN-GP model is a very promising model for nonlinear probabilistic regression, and not only in large-scale data domains. Further, we believe that this work will lead to interesting new approaches for regression in high-dimensional data regimes, e.g. hyperparameter optimization. In future work, we will investigate the application of SPN-GPs to large-scale real-world problems and work on data-driven approaches for structure learning. Further, we plan to investigate the effects of parameter tying for regularization in SPN-GP models.
Acknowledgments
This research is partially funded by the Austrian Science Fund (FWF): P 27803N15.
References

Adel, T., Balduzzi, D., and Ghodsi, A. Learning the structure of sum-product networks via an SVD-based algorithm. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), 2015.
Bauer, M., van der Wilk, M., and Rasmussen, C. E. Understanding probabilistic sparse Gaussian process approximations. In Proceedings of Advances in Neural Information Processing Systems (NIPS), pp. 1525–1533, 2016.
Candela, J. Q. and Rasmussen, C. E. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1939–1959, 2005.
Cao, Y. and Fleet, D. J. Generalized product of experts for automatic and principled fusion of Gaussian process predictions. CoRR, abs/1410.7827, 2014.
Chu, W. and Ghahramani, Z. Preference learning with Gaussian processes. In Proceedings of the International Conference on Machine Learning (ICML), pp. 137–144, 2005.
Darwiche, A. A differential approach to inference in Bayesian networks. Journal of the ACM, 50(3):280–305, 2003.
Deisenroth, M. P. and Ng, J. W. Distributed Gaussian processes. In Proceedings of the International Conference on Machine Learning (ICML), pp. 1481–1490, 2015.
Dennis, A. W. and Ventura, D. Learning the architecture of sum-product networks using clustering on variables. In Proceedings of Advances in Neural Information Processing Systems (NIPS), pp. 2042–2050, 2012.
Dheeru, D. and Karra Taniskidou, E. UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences, 2017. URL http://archive.ics.uci.edu/ml.
Gal, Y. and Turner, R. Improving the Gaussian process sparse spectrum approximation by representing uncertainty in frequency inputs. In Proceedings of the International Conference on Machine Learning (ICML), pp. 655–664, 2015.
Gens, R. and Domingos, P. Discriminative learning of sum-product networks. In Proceedings of Advances in Neural Information Processing Systems (NIPS), pp. 3248–3256, 2012.
Gens, R. and Domingos, P. Learning the structure of sum-product networks. In Proceedings of the International Conference on Machine Learning (ICML), pp. 873–880, 2013.
Heinonen, M., Mannerström, H., Rousu, J., Kaski, S., and Lähdesmäki, H. Non-stationary Gaussian process regression with Hamiltonian Monte Carlo. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 732–740, 2016.
Hensman, J., Fusi, N., and Lawrence, N. D. Gaussian processes for big data. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), 2013.
Molina, A., Vergari, A., Mauro, N. D., Natarajan, S., Esposito, F., and Kersting, K. Mixed sum-product networks: A deep architecture for hybrid domains. In Proceedings of the Conference on Artificial Intelligence (AAAI), 2018.
Ng, J. W. and Deisenroth, M. P. Hierarchical mixture-of-experts model for large-scale Gaussian process regression. CoRR, abs/1412.3078, 2014.
Park, M., Horwitz, G., and Pillow, J. W. Active learning of neural response functions with Gaussian processes. In Proceedings of Advances in Neural Information Processing Systems (NIPS), pp. 2043–2051, 2011.
Peharz, R., Geiger, B., and Pernkopf, F. Greedy part-wise learning of sum-product networks. In Proceedings of ECML/PKDD, pp. 612–627. Springer Berlin, 2013.
Peharz, R., Kapeller, G., Mowlaee, P., and Pernkopf, F. Modeling speech with sum-product networks: Application to bandwidth extension. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3699–3703, 2014.
Peharz, R., Tschiatschek, S., Pernkopf, F., and Domingos, P. M. On theoretical properties of sum-product networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2015.
Peharz, R., Gens, R., Pernkopf, F., and Domingos, P. On the latent variable interpretation in sum-product networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 39(10):2030–2044, 2017.
Poon, H. and Domingos, P. M. Sum-product networks: A new deep architecture. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), pp. 337–346, 2011.
Rana, S., Li, C., Gupta, S. K., Nguyen, V., and Venkatesh, S. High dimensional Bayesian optimization with elastic Gaussian process. In Proceedings of the International Conference on Machine Learning (ICML), pp. 2883–2891, 2017.
Rashwan, A., Zhao, H., and Poupart, P. Online and distributed Bayesian moment matching for parameter learning in sum-product networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 1469–1477, 2016.
Rasmussen, C. E. and Ghahramani, Z. Infinite mixtures of Gaussian process experts. In Proceedings of Advances in Neural Information Processing Systems (NIPS), pp. 881–888, 2001.
Rasmussen, C. E. and Williams, C. K. I. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. MIT Press, 2006. ISBN 026218253X.
Rasmussen, C. E., Kuss, M., et al. Gaussian processes in reinforcement learning. In Proceedings of Advances in Neural Information Processing Systems (NIPS), pp. 751–758, 2003.
Rooshenas, A. and Lowd, D. Learning sum-product networks with direct and indirect variable interactions. In Proceedings of the International Conference on Machine Learning (ICML), pp. 710–718, 2014.
Shen, Y., Ng, A. Y., and Seeger, M. W. Fast Gaussian process regression using kd-trees. In Proceedings of Advances in Neural Information Processing Systems (NIPS), pp. 1225–1232, 2005.
Silverman, B. W. Some aspects of the spline smoothing approach to non-parametric regression curve fitting. Journal of the Royal Statistical Society, 47(1):1–52, 1985.
Trapp, M., Peharz, R., Skowron, M., Madl, T., Pernkopf, F., and Trappl, R. Structure inference in sum-product networks using infinite sum-product trees. In NIPS Workshop on Practical Bayesian Nonparametrics, 2016.
Trapp, M., Madl, T., Peharz, R., Pernkopf, F., and Trappl, R. Safe semi-supervised learning of sum-product networks. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), 2017.
Tresp, V. A Bayesian committee machine. Neural Computation, 12(11):2719–2741, 2000.
Vanhatalo, J., Riihimäki, J., Hartikainen, J., Jylänki, P., Tolvanen, V., and Vehtari, A. GPstuff: Bayesian modeling with Gaussian processes. Journal of Machine Learning Research, 14(1):1175–1179, 2013.
Vergari, A., Mauro, N. D., and Esposito, F. Simplifying, regularizing and strengthening sum-product network structure learning. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML/PKDD), pp. 343–358, 2015.
Williams, C. K. I. and Seeger, M. W. Using the Nyström method to speed up kernel machines. In Proceedings of Advances in Neural Information Processing Systems (NIPS), pp. 682–688, 2000.
Zhao, H., Adel, T., Gordon, G. J., and Amos, B. Collapsed variational inference for sum-product networks. In Proceedings of the International Conference on Machine Learning (ICML), pp. 1310–1318, 2016a.
Zhao, H., Poupart, P., and Gordon, G. J. A unified approach for learning the parameters of sum-product networks. In Proceedings of Advances in Neural Information Processing Systems (NIPS), pp. 433–441, 2016b.
Appendix A Structure Learning
To learn the structure of SPN-GPs we extended the approach by Poon and Domingos (Poon & Domingos, 2011) to construct network structures for different regression scenarios. Our Algorithm 2 first constructs a region graph (Poon & Domingos, 2011; Dennis & Ventura, 2012) using Algorithm 1, which hierarchically splits the input space into small regions. In particular, our algorithm starts from the root region and selects an axis at random. This axis is then used to partition the region into disjoint sub-regions. Subsequently, each sub-region is recursively processed in the same way as the root region. The recursive subdivision stops if a constructed region contains fewer than a minimum number of points or if a region cannot be split into sub-regions.
After constructing the region graph, Algorithm 2 equips each leaf region with a set of GP experts, each using a different kernel function, and equips the partitions with split nodes and the internal regions with sum nodes, respectively. Note that it is possible to mix over different partitions of the input space by constructing partitions with different offsets under the root region. However, this approach can lead to an increase in computational complexity, and the construction therefore has to be done carefully.
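A minimal sketch of the recursive region construction (our own names; not Algorithm 1 verbatim): pick a random axis, divide the region's extent along that axis into equal parts, and recurse until a region holds fewer than a minimum number of points or cannot be split further.

```python
import random

def build_regions(points, min_points=8, n_splits=2, rng=random.Random(0)):
    """points: list of D-dim tuples. Returns the list of leaf regions."""
    if len(points) < min_points:
        return [points]                           # region is small enough
    axis = rng.randrange(len(points[0]))          # random axis-aligned split
    lo = min(p[axis] for p in points)
    hi = max(p[axis] for p in points)
    if lo == hi:                                  # region cannot be split further
        return [points]
    width = (hi - lo) / n_splits                  # equally sized sub-regions
    parts = [[] for _ in range(n_splits)]
    for p in points:
        idx = min(int((p[axis] - lo) / width), n_splits - 1)
        parts[idx].append(p)
    leaves = []
    for part in parts:
        if part:                                  # recurse into non-empty parts
            leaves.extend(build_regions(part, min_points, n_splits, rng))
    return leaves
```

Each returned leaf region would then be equipped with one GP expert per kernel function, as described above.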
Appendix B Quantitative Evaluation
To quantitatively evaluate SPN-GPs we evaluated the performance of our model on the UCI datasets listed in Table 2.

Data set | Samples | Input dimensions | Targets
Energy   | 768     | 8                | 2
Concrete | 1030    | 8                | 1
CCPP     | 9568    | 4                | 1