1 Introduction
In the era of machine learning (ML) and uncertainty quantification (UQ), it is not surprising to see the boundary between the two getting progressively blurred. Cross-fertilization between the two disciplines is nowadays the norm rather than the exception, and for good reasons. Physics-informed neural networks are reaching unprecedented approximation power in UQ applications (see, e.g., Raissi et al. (2019); Pang et al. (2019)), while sparse polynomial chaos expansions are used as denoising regressors in Torre et al. (2019a), as high-dimensional regression tools in Lataniotis et al. (2020), and on real-world experimental data in Abbiati et al. (2021). UQ-born Gaussian process modeling (Santner et al., 2003) is now a staple tool in ML (Rasmussen and Williams, 2006), while support vector machines (Vapnik, 2013) found their way into rare event estimation (Bourinet, 2016; Moustapha et al., 2018). More generally, the adoption of both surrogate models and ML is becoming mainstream in applied sciences and engineering, with applications in entirely different fields. A few examples from the recent literature include: macroeconomics (Harenberg et al., 2019), wind turbine modeling (Slot et al., 2020), nuclear engineering (Radaideh and Kozlowski, 2020), smart grid engineering (Wang et al., 2020), crash test simulations (Moustapha et al., 2018), and dam stability assessment (Guo and Dias, 2020). This list could be extended arbitrarily, as could the rich literature on these topics, but such a review lies outside the scope of the current paper.
A common aspect across all of these works is the use of efficient and accurate functional approximation tools. Regardless of the specific technique, the general concept is straightforward: given a finite set of input realizations and their corresponding model responses, known as the training set (ML) or experimental design (UQ), a suitable parametric function is calibrated such that it accurately approximates the underlying (possibly unknown) input-output map. For the sake of consistency, and with a slight bias towards UQ, we will refer to this process as surrogate modeling, acknowledging that it is also known as emulation, metamodeling, reduced order or response surface modeling, or sometimes simply regression. A variety of methods is available in the surrogate modeling literature, which we cluster here into two classes:

Localized surrogates: this class includes interpolants (e.g. Gaussian process modeling (Santner et al., 2003), spline interpolation (Reinsch, 1967), sparse grids (Bungartz and Griebel, 2004)), but also local regression methods (e.g. Gaussian process regression (Rasmussen and Williams, 2006), multivariate moving averages (Lowry et al., 1992) or support vector machines (Vapnik, 2013)). These techniques rely on the availability of local information, e.g. through kernels on pointwise distance measures or through support vectors, to provide predictions that are more accurate close to the points in the training set. They therefore tend to perform better in interpolation than in extrapolation tasks.
Global surrogates: these provide global approximations without capitalizing on locally available information. Examples in this class include spectral methods (e.g. polynomial chaos expansions (PCE) (Xiu and Karniadakis, 2002; Blatman and Sudret, 2011) and Poincaré expansions (Roustant et al., 2017)), linear regression methods (e.g. compressive sensing (Donoho et al., 2006; Lüthen et al., 2020), generalized linear models (Nelder and Wedderburn, 1972)), and artificial neural networks (ANNs) (Goodfellow et al., 2016). These techniques tend to achieve better global accuracy (e.g. in terms of generalization error), thus offering some degree of extrapolation capability, but worse local accuracy than their localized counterparts.
Each of the two classes has advantages and disadvantages, but both tend to perform well on models that show homogeneous complexity throughout the input parameter space. Some models of practical engineering relevance, however, can show highly localized behavior in different regions of the parameter space. Common examples include likelihood functions used in Bayesian inference (Nagel and Sudret, 2016), crash test simulations (Serna and Bucher, 2009), snap-through models (Hrinda, 2010), and discontinuous models in general. Different approaches with varying degrees of complexity have been proposed in the UQ and ML literature to address this kind of behavior. Examples include regression trees (Chipman et al., 2010; Breiman, 2017), multivariate adaptive regression splines (MARS, Friedman (1991)), various combinations of Kriging and PCE (PC-Kriging, Schöbi et al. (2015); Kersaudy et al. (2015)), multi-resolution/multi-element polynomial chaos expansions (Le Maître et al., 2004; Wan and Karniadakis, 2006; Foo et al., 2008) and deep neural networks (Goodfellow et al., 2016), among others. Such methods can be broadly classified into two macro-families:
global approximations with local refinements (e.g. PC-Kriging), or domain-decomposition-based methods (regression trees, MARS, multi-element polynomial chaos expansions). The class of global approximations with local refinements relies on efficiently combining global surrogates (e.g. spectral decompositions such as polynomial chaos expansions, or global regression models) with local interpolation techniques (e.g. Gaussian processes or splines), to provide surrogates with acceptable extrapolation capabilities and good local accuracy. The class of domain-decomposition-based methods relies instead on the idea of partitioning the input parameter space into (often disjoint) subdomains, followed by the use of regression-based surrogates in each subdomain. This divide-and-conquer approach is particularly effective in reducing the complexity of the computational model in each subdomain, hence allowing relatively simple techniques to be used as local approximants. A prime example of this class of methods is given by regression trees (Chipman et al., 2010), where the local surrogates are as simple as constant values.

A common trait of most surrogate models used in a UQ context is that they rely on some form of regularity of the underlying computational model (e.g. smoothness or symmetry) to achieve an efficient representation based on an experimental design of relatively small size. It is therefore not surprising that they often show limited scalability with both the number of input dimensions (the well-known curse of dimensionality) and the size of the experimental design. Indeed, most local surrogates and interpolants rely on either kernel or clustering methods, neither of which scales linearly with the number of dimensions. Moreover, their training requires the solution of complex optimization problems that often have at least as many parameters as input dimensions (Rasmussen and Williams, 2006; Vapnik, 2013). Global regression methods, on the other hand, require the optimization and storage of a large number of parameters or coefficients, which also rarely scales linearly with the dimension for non-trivial models.
To step further into scalability considerations, the number of available samples in the experimental design deserves some discussion. Historically, UQ-based surrogate modeling has taken a parsimonious approach, focusing on small but informative experimental designs, because of the high computational costs associated with engineering models and because of their typically smooth behavior. ML, on the other hand, has seen its expansion in the era of big data, focusing on large experimental designs, often with noisy data and highly non-smooth behavior. Although the gap is closing over time, a no-man's-land between the two still exists: computational models whose complexity is too high for classical surrogate modeling (e.g. extremely non-linear, or highly localized behavior), but which are expensive enough to allow only for experimental designs of moderate size, regardless of the input dimension.
It is with this class of problems in mind that we propose a new surrogate modeling technique, namely stochastic spectral embedding (SSE), that combines global spectral representations and adaptive domain decompositions. We demonstrate that SSE can efficiently approximate models with varying degrees of complexity across the input space, while maintaining favorable scaling properties with both the input dimension and the size of the experimental design.
The paper is organized as follows: we first describe the general rationale and the details of the algorithm in Section 2. Then, in Section 3 we tackle the issue of constructing an SSE from an experimental design, in a regression context. In Section 4, we choose a reference spectral decomposition technique (polynomial chaos expansions, PCE) and we apply SSE to two highly complex analytical functions to showcase its capability to adapt to models with nonhomogeneous complexity, and its scalability to high dimensions and large experimental designs. Finally, we also tackle two models of engineering complexity that are known to be challenging for classical surrogate modeling methods. We present concluding remarks in Section 5 and discuss extensions of the algorithm that could further improve its performance.
2 Stochastic spectral embedding: rationale and main algorithm
As the name suggests, stochastic spectral embedding (SSE) is a combination of two main ingredients: a surrogate model based on stochastic spectral representations, and some form of embedding, i.e. the sequential construction of expansions on subdomains of the full input space. In other words, SSE consists in iteratively refining a spectral surrogate model by embedding additional surrogate models in subdomains of the parent expansion. In a sense, SSE can be seen as an extension of regression trees (Breiman, 2017; Chipman et al., 2010) to a much wider class of regression models, with the addition of a strong stochastic component due to the use of spectral representations.
2.1 Spectral expansions
We will consider herein the Hilbert space of random variables of the form $Y = \mathcal{M}(X)$ with finite second moments ($\mathbb{E}[Y^2] < +\infty$), where $X$ is an $M$-dimensional random vector with joint distribution $f_X$. Let the space be equipped with the inner product:

$\langle g, h \rangle = \int_{\mathcal{D}_X} g(x)\, h(x)\, f_X(x)\, \mathrm{d}x$ (1)
where $\mathcal{D}_X$ is the support of $X$. Then, every such $Y$ admits a spectral representation of the form:

$Y = \mathcal{M}(X) = \sum_{j=0}^{\infty} a_j\, \psi_j(X)$ (2)

where the $a_j$ are real coefficients, and the $\psi_j$'s form a countably infinite orthonormal basis of the space:

$\langle \psi_j, \psi_k \rangle = \delta_{jk}$ (3)
where $\delta_{jk}$ is the Kronecker delta. For notational simplicity, the inner product subscript is omitted hereinafter.
Spectral decompositions of the form of Eq. (2) have a property that is particularly important for surrogate modeling: due to the orthonormality of the basis in Eq. (3), their (finite) second moment is given by:

$\mathbb{E}[Y^2] = \sum_{j=0}^{\infty} a_j^2$ (4)

The convergence of the sum in Eq. (4) implies that the coefficients, when sorted by decreasing absolute value, must decay rapidly. This property is sometimes referred to as compressibility, because it essentially means that most of the information on the model variability is contained in a finite set of coefficients/basis elements. This allows one to truncate the spectral decomposition in Eq. (2) even if in principle it has an infinite number of terms. The truncated version of Eq. (2) is given by:

$Y \approx \tilde{\mathcal{M}}(X) = \sum_{j \in \mathcal{A}} a_j\, \psi_j(X)$ (5)

where $\mathcal{A} \subset \mathbb{N}$ is a finite truncation set (typically related to the complexity of the basis functions, e.g., maximum frequency in Fourier expansions, or maximum polynomial degree in PCE).
The rapid decay of the coefficients of spectral expansions is the main reason why many powerful surrogate modeling techniques belonging to the so-called class of compressive sensing (Donoho et al., 2006; Candès and Wakin, 2008) have proven to be very effective in various recent applications (Blatman and Sudret, 2011; Torre et al., 2019a; Lüthen et al., 2020). Compressive sensing uses sparse regression tools to identify the best truncation set based on the available information in the experimental design.
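As an illustration of the sparse-regression idea behind compressive sensing, the following Python sketch uses orthogonal matching pursuit, a simple greedy solver, as a stand-in for the LARS-type algorithms cited above; the toy data, names and sparsity level are illustrative assumptions, not the implementation used in this paper.

```python
import numpy as np

def omp(Psi, y, n_nonzero):
    """Orthogonal matching pursuit: greedily select basis functions and
    re-fit their coefficients by least squares.
    Psi: (N, P) matrix of basis evaluations psi_j(x_i); y: model responses."""
    residual = y.copy()
    active = []
    coeffs = np.zeros(Psi.shape[1])
    c = np.zeros(0)
    for _ in range(n_nonzero):
        # pick the basis function most correlated with the current residual
        j = int(np.argmax(np.abs(Psi.T @ residual)))
        if j not in active:
            active.append(j)
        # re-fit all active coefficients by least squares
        c, *_ = np.linalg.lstsq(Psi[:, active], y, rcond=None)
        residual = y - Psi[:, active] @ c
    coeffs[active] = c
    return coeffs

# toy problem: the response depends on only 2 of 10 candidate basis functions
rng = np.random.default_rng(1)
Psi = rng.standard_normal((50, 10))
a_true = np.zeros(10)
a_true[2], a_true[7] = 1.5, -0.8
y = Psi @ a_true
a_hat = omp(Psi, y, n_nonzero=2)
```

In this noiseless toy setting the greedy search recovers the sparse support exactly; production solvers (LARS, LASSO, subspace pursuit) add model-selection criteria on top of the same principle.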
Because of the truncation introduced in Eq. (5), the expansion is in general not exact, hence we define the residual $R$ as:

$R(X) = \mathcal{M}(X) - \tilde{\mathcal{M}}(X)$ (6)

Due to the convergent behavior of the truncated expansion, $\mathbb{E}[R^2] \to 0$ as the truncation set $\mathcal{A}$ grows. By definition, spectral expansions belong to the class of global representations, i.e. the basis functions in Eq. (5) have support on the entire domain $\mathcal{D}_X$. Therefore, highly localized models, or models with inhomogeneous behavior throughout the input domain, tend to require an extremely large number of terms in the truncated expansion to achieve satisfactory approximation accuracy (cf. the Gibbs phenomenon). As an example, the number of terms in the well-established polynomial chaos expansion (Xiu and Karniadakis, 2002; Blatman and Sudret, 2011) can grow very fast when the underlying model has strongly localized behavior, because a high polynomial degree is required for an accurate representation.
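This effect can be illustrated numerically. The following Python sketch (an illustrative aside, not one of the paper's experiments) projects a smooth function and a sharply localized one onto Legendre polynomials orthonormal with respect to a uniform input, and compares the relative L2 truncation error at a fixed degree:

```python
import numpy as np

def legendre_l2_error(f, degree, n_quad=400):
    """Relative L2 error of a degree-`degree` truncated expansion of f onto
    Legendre polynomials orthonormal w.r.t. the uniform PDF on [-1, 1]."""
    x, w = np.polynomial.legendre.leggauss(n_quad)
    fx = f(x)
    approx = np.zeros_like(x)
    for j in range(degree + 1):
        # psi_j: Legendre polynomial rescaled to unit variance under U(-1, 1)
        psi = np.polynomial.legendre.Legendre.basis(j)(x) * np.sqrt(2 * j + 1)
        a_j = 0.5 * np.sum(w * fx * psi)   # a_j = E[f(X) psi_j(X)]; PDF = 1/2
        approx += a_j * psi
    return np.sqrt(np.sum(w * (fx - approx) ** 2) / np.sum(w * fx ** 2))

smooth = lambda x: np.sin(3 * x)                 # globally smooth model
peaked = lambda x: np.exp(-((x / 0.05) ** 2))    # highly localized peak

err_smooth = legendre_l2_error(smooth, 10)
err_peaked = legendre_l2_error(peaked, 10)
```

At the same truncation degree, the smooth model is essentially resolved while the localized peak retains a large residual: this is precisely the regime that motivates SSE.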
2.2 A sequential partitioning approach
To alleviate this limitation, while still capitalizing on the powerful convergence properties of spectral methods, SSE constructs a sequence of spectral expansions of manageable complexity on increasingly smaller subdomains of the original domain. Such subdomains are denoted by $\mathcal{D}^{\ell,p} \subseteq \mathcal{D}_X$, where $\ell$ is the expansion level and $p$ is a subdomain index within the level. Each expansion is performed only on the local residual from the previous level.

For illustration purposes, Figure 1 shows an example of sequential partitioning for a simple 2D bounded domain, obtained by splitting each subdomain into two equal subdomains across a random dimension. When $\ell = 0$, there is only a single subdomain $\mathcal{D}^{0,1} = \mathcal{D}_X$, and the residual is $R$ from Eq. (6).
While the idea of partitioning the input space into smaller subdomains and constructing a surrogate model in each is certainly not new in UQ (see, e.g., Le Maître et al. (2004); Wan and Karniadakis (2006)), the use of a sequence of residuals in SSE sets it apart from other divide-and-conquer methods.
In more formal terms, SSE in its basic form can be written as a multi-level expansion of the form:

$\mathcal{M}_{SSE}(X) = \sum_{\ell=0}^{L} \sum_{p=1}^{P_\ell} \mathbb{1}_{\mathcal{D}^{\ell,p}}(X)\, \tilde{R}^{\ell,p}(X)$ (7)

where $L$ is the total number of expansion levels considered, $P_\ell$ is the number of subdomains at level $\ell$, and $\mathbb{1}_{\mathcal{D}^{\ell,p}}$ is the indicator function of the subdomain $\mathcal{D}^{\ell,p}$. Finally, $\tilde{R}^{\ell,p}$ is the truncated expansion of the residual $R^{\ell,p}$ of the SSE up to level $\ell - 1$:

$R^{\ell,p}(x) = \mathcal{M}(x) - \sum_{k=0}^{\ell-1} \sum_{q=1}^{P_k} \mathbb{1}_{\mathcal{D}^{k,q}}(x)\, \tilde{R}^{k,q}(x)$ (8)
In the above equations, each residual term is expanded onto a local orthonormal basis as follows:

$R^{\ell,p}(X) = \sum_{j} a_j^{\ell,p}\, \psi_j^{\ell,p}(X)$ (9)

A local inner product is defined on the domain $\mathcal{D}^{\ell,p}$:

$\langle g, h \rangle_{\ell,p} = \int_{\mathcal{D}^{\ell,p}} g(x)\, h(x)\, f_X^{\ell,p}(x)\, \mathrm{d}x$ (10)

where:

$f_X^{\ell,p}(x) = \frac{f_X(x)}{V^{\ell,p}}\, \mathbb{1}_{\mathcal{D}^{\ell,p}}(x)$ (11)

is the joint PDF of the input parameters restricted to the subdomain $\mathcal{D}^{\ell,p}$ and rescaled by its probability mass $V^{\ell,p}$:

$V^{\ell,p} = \mathbb{P}\big(X \in \mathcal{D}^{\ell,p}\big) = \int_{\mathcal{D}^{\ell,p}} f_X(x)\, \mathrm{d}x$ (12)
A crucial aspect of SSE is that, by partitioning the entire input domain $\mathcal{D}_X$ into smaller subdomains $\mathcal{D}^{\ell,p}$, it trades the complexity of the single, often global expansion in Eq. (5) for a (possibly large) number of local expansions with much smaller truncation sets. In cases where the spectral basis in Eq. (5) is continuous, SSE results in a final piecewise continuous approximation, but no continuity is ensured across the boundaries of the subdomains. Mean-square convergence of the procedure is guaranteed by the spectral convergence within each level, which implies that the local variance of the residual in each subdomain decreases rapidly in expectation. In other words, for each increasing level $\ell$ in Eq. (7), new discontinuity boundaries are generated during the partitioning step, but the variance of the overall residual is reduced, thus resulting, in expectation, in discontinuities of lower amplitude. This behavior is analogous to that of regression trees (Friedman, 1991; Breiman, 2017).

2.3 The SSE algorithm
Algorithmically, SSE consists of a sequence of local refinements of a global spectral expansion on sequentially smaller subdomains $\mathcal{D}^{\ell,p}$. For notational simplicity, we introduce a set of local random vectors $X^{\ell,p}$ distributed according to the local PDF in Eq. (11). We further choose a partitioning strategy, which is discussed in Section 3.2.
The SSE algorithm can then be written as:

1. Initialization: set $\ell = 0$, $P_0 = 1$, $\mathcal{D}^{0,1} = \mathcal{D}_X$ and $R^{0,1} = \mathcal{M}$

2. For each subdomain $\mathcal{D}^{\ell,p}$, $p = 1, \dots, P_\ell$:

(a) Calculate the truncated expansion $\tilde{R}^{\ell,p}$ of the residual in the current subdomain

(b) Update the residual in the current subdomain

(c) Split the current subdomain into subdomains of the next level, based on the chosen partitioning strategy

(d) If $\ell < L$, set $\ell \leftarrow \ell + 1$ and go back to Step 2a; otherwise terminate the algorithm

3. Termination: return the full sequence of $\mathcal{D}^{\ell,p}$ and $\tilde{R}^{\ell,p}$ needed to compute Eq. (7).
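The steps above can be sketched in a few lines of Python. The following minimal one-dimensional implementation assumes a uniform input on [0, 1], uses low-degree polynomial least-squares fits (via numpy) as stand-ins for the truncated spectral expansions, and splits every subdomain in two halves of equal probability mass at each level; all function names and parameters are illustrative, not the paper's implementation.

```python
import numpy as np

def sse_fit(x, y, levels=3, degree=2, n_min=6):
    """Minimal 1-D SSE sketch: at each level, fit a low-degree polynomial to
    the residual in every subdomain, then split each subdomain in half.
    Returns the list of (lo, hi, poly) terms entering the sum of Eq. (7)."""
    terms = []
    residual = y.copy()
    domains = [(0.0, 1.0)]
    for level in range(levels + 1):
        next_domains = []
        for lo, hi in domains:
            mask = (x >= lo) & (x < hi)
            if mask.sum() >= n_min:                   # expand only with enough points
                p = np.polynomial.Polynomial.fit(x[mask], residual[mask], degree)
                residual[mask] -= p(x[mask])          # update the local residual
                terms.append((lo, hi, p))
            mid = 0.5 * (lo + hi)                     # equal-mass split (uniform input)
            next_domains += [(lo, mid), (mid, hi)]
        domains = next_domains
    return terms

def sse_predict(terms, x):
    """Sum all embedded expansions whose subdomain contains x."""
    y = np.zeros_like(x)
    for lo, hi, p in terms:
        mask = (x >= lo) & (x < hi)
        y[mask] += p(x[mask])
    return y

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 1, 400)
y_train = np.sin(8 * x_train) + np.exp(-200 * (x_train - 0.6) ** 2)
terms = sse_fit(x_train, y_train, levels=4)
```

Even with only quadratic local expansions, the hierarchy of residual corrections resolves both the global oscillation and the localized peak of this toy model.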

3 Building a stochastic spectral embedding from data
For it to be useful in practical applications, SSE needs to be "trainable" from a finite-size experimental design. Hereinafter, we consider an experimental design $\mathcal{X}$ and its corresponding model evaluations $\mathcal{Y} = \mathcal{M}(\mathcal{X})$ as the only data available for training.
Upon closer inspection of the algorithm in Section 2.3, the training phase of the SSE representation consists in estimating the following quantities from the available experimental design:

The expansion coefficients $a_j^{\ell,p}$ of the local residual in each level and subdomain (Eq. (9)).

A partitioning strategy at each level.

The total number of splitting levels, $L$.
In the following sections we introduce a comprehensive adaptive strategy based on sparse linear regression to perform each of these steps from a given experimental design.
3.1 Calculating the residual expansion coefficients
For a specific subdomain $\mathcal{D}^{\ell,p}$, a local spectral expansion of the residual needs to be constructed from the available experimental design. We therefore define the so-called local experimental design $\mathcal{X}^{\ell,p}$ as the subset of the original experimental design lying within the subdomain $\mathcal{D}^{\ell,p}$:

$\mathcal{X}^{\ell,p} = \mathcal{X} \cap \mathcal{D}^{\ell,p}$ (13)

A similar notation is used to identify the corresponding model responses, $\mathcal{Y}^{\ell,p}$. Using the auxiliary local random vector introduced in the previous section, the residual expansion in Eq. (9) reads:

$R^{\ell,p}(X^{\ell,p}) = \sum_{j} a_j^{\ell,p}\, \psi_j^{\ell,p}(X^{\ell,p})$ (14)

Given the local experimental design and a truncated local spectral basis, the task of identifying the coefficients can then be cast as a linear regression problem (see, e.g., Berveiller et al. (2006)):

$\hat{a}^{\ell,p} = \underset{a}{\arg\min} \sum_{x^{(i)} \in \mathcal{X}^{\ell,p}} \Big( R^{\ell,p}(x^{(i)}) - \sum_{j \in \mathcal{A}} a_j\, \psi_j^{\ell,p}(x^{(i)}) \Big)^2$ (15)

While in principle the regression problem in Eq. (15) can be solved through ordinary least squares, the recent literature on compressive sensing has amply demonstrated that sparse regression approaches can provide great benefits in terms of accuracy, especially for relatively small experimental designs (Donoho et al., 2006; Blatman and Sudret, 2011; Lüthen et al., 2020). A review of the available techniques lies outside the scope of this paper; they are extensively explored for one popular class of spectral representations (polynomial chaos expansions) in Lüthen et al. (2020).

3.2 Partitioning strategy
A second step necessary to construct an SSE from data is to identify a proper partitioning strategy between levels. Any strategy for partitioning the input domain can be employed in Eq. (7), under the sole condition that at each level $\ell$:

$\bigcup_{p=1}^{P_\ell} \mathcal{D}^{\ell,p} = \mathcal{D}_X$ (16)

While a comprehensive study of different partitioning strategies would be interesting, for the sake of simplicity we adopt hereinafter a rather simple approach, similar in spirit to regression trees (Breiman, 2017): we split every subdomain into two parts of equal probability mass along one of the input directions. Note that the splitting direction itself can be different for each subdomain, even on the same level.
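For a single input variable, the equal-probability-mass split can be sketched as follows; Python's statistics.NormalDist serves as a hypothetical (Gaussian) input distribution chosen purely for illustration. The split is performed at the median in the quantile space and the bounds are mapped back through the inverse CDF:

```python
from statistics import NormalDist

# Hypothetical 1-D input X ~ N(10, 2): split its (unbounded) domain into two
# subdomains of equal probability mass by working in the uniform quantile space.
dist = NormalDist(mu=10.0, sigma=2.0)

# quantile-space subdomains [0, 0.5) and [0.5, 1) ...
u_splits = [(0.0, 0.5), (0.5, 1.0)]

# ... mapped back to the physical space via the inverse CDF
def to_physical(u_lo, u_hi):
    lo = dist.inv_cdf(u_lo) if u_lo > 0.0 else float("-inf")
    hi = dist.inv_cdf(u_hi) if u_hi < 1.0 else float("inf")
    return lo, hi

subdomains = [to_physical(*u) for u in u_splits]  # [(-inf, 10.0), (10.0, inf)]
```

Each resulting subdomain carries probability mass 1/2 under the input PDF, which is what makes unbounded variables unproblematic for this strategy.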
Under very general conditions, it is possible to bijectively map any random vector $X$ with joint distribution $f_X$ to a vector $U$ of independent uniform random variables through an appropriate isoprobabilistic transform (e.g., the Rosenblatt transform, Rosenblatt (1952); Torre et al. (2019b)):

$U = \mathcal{T}(X)$ (17)

where $\mathcal{T}$ denotes the isoprobabilistic transform. This mapping simplifies the proposed partitioning strategy: splitting is performed in the uniformly distributed quantile space, and the resulting split domains are mapped back to the input space via the inverse transform $\mathcal{T}^{-1}$ (see Eq. (17)). This has several computational benefits, including the proper treatment of unbounded variables. Figure 2 shows graphically a two-dimensional example of partitioning in the quantile (uniform) space, and its corresponding mapping to unbounded random variables in the physical space.

In the general case, a strategy is needed to choose a specific splitting direction for each existing subdomain $\mathcal{D}^{\ell,p}$. Different heuristics can be used to make this choice, including purely random splitting (as in Figure 2), using the direction of maximum residual difference, or estimates of the variability of the residual in each direction (following the same rationale as in Shields (2018)). The optimal criterion can be application-specific, because it may in principle depend on the chosen spectral representation.

3.3 Sparse tree representation and expansion truncation
In a regression context, it is difficult to choose a priori a truncation $L$ on the maximum number of levels in the expansion in Eq. (7). Because the samples may be unequally distributed, some subdomains at a given level may be empty, or more formally $\mathcal{X}^{\ell,p} = \emptyset$ for some combinations of $\ell$ and $p$. We therefore take a straightforward approach to obviate this issue, by initializing the residual expansion in every level and subdomain to the null function. The coefficients are then updated only in the subdomains that satisfy $|\mathcal{X}^{\ell,p}| \geq N_{\min}$, where $N_{\min}$ is a parameter of the SSE algorithm that represents the minimum number of points in a subdomain required to justify an expansion. The SSE expansion is truncated when no new updates are possible given the current experimental design, or more formally:

$|\mathcal{X}^{L+1,p}| < N_{\min} \quad \text{for all } p \in \{1, \dots, P_{L+1}\}$ (18)

In addition to providing a suitable stopping criterion for the algorithm in Section 2.3, an added benefit of this strategy is that only the residual expansions that were effectively updated need to be stored in memory. This provides a degree of sparsity in the representation and can significantly reduce the memory footprint of the method, especially in the case of a large number of points in the experimental design.
Note that, for an experimental design of size $N$ and a minimum number of points per expansion $N_{\min}$, each equal-probability split halves the expected number of points per subdomain:

$\mathbb{E}\big[|\mathcal{X}^{\ell,p}|\big] = N / 2^{\ell}$ (19)

and therefore the number of levels satisfies, in expectation:

$L \lesssim \log_2(N / N_{\min})$ (20)
3.4 Error estimation
In the context of surrogate modeling, assessing the accuracy of the approximation is an important task. Arguably the best-known accuracy measure in function approximation is the so-called generalization error, which for SSE is given by:

$E_{gen} = \mathbb{E}\Big[\big(\mathcal{M}(X) - \mathcal{M}_{SSE}(X)\big)^2\Big]$ (21)

A direct estimation of this quantity is in general impossible, as it would require an extensive validation set. Instead, because we adopt a regression approach to calibrate the spectral decompositions in each subdomain, we estimate the generalization error through the leave-one-out cross-validation error (Chapelle et al., 2002; Blatman and Sudret, 2010) that is available for each of the local expansions.
For notational convenience, we introduce here the set of terminal domains $\mathcal{T}$, i.e. those domains that belong to the last expansion level in Eq. (7). By definition, $\mathcal{T}$ is a complete partition of the input domain $\mathcal{D}_X$. Because of the sequential nature of SSE, which locally refines the previous approximation level with the expansion of the residual in the current subdomain, an accurate estimate of the local generalization error in each of the terminal domains suffices to provide an estimate of the overall generalization error of the full SSE. If we denote by $E^{\ell,p}$ the local residual error:

$E^{\ell,p} = \mathbb{E}\Big[\big(\mathcal{M}(X^{\ell,p}) - \mathcal{M}_{SSE}(X^{\ell,p})\big)^2\Big]$ (22)

then the global generalization error is given by:

$E_{gen} = \sum_{\mathcal{D}^{\ell,p} \in \mathcal{T}} V^{\ell,p}\, E^{\ell,p}$ (23)

which is the sum of the terminal-domain errors weighted by the corresponding probability masses in Eq. (12).
To provide an estimator based on the available experimental design, we only need an estimator of $E^{\ell,p}$. Arguably the most common tool for the estimation of the generalization error in regression problems is $k$-fold cross-validation (see, e.g., Vapnik (2013)). The special case $k = N$ is also known as the leave-one-out error, denoted by $E_{LOO}$. For ordinary least-squares regression it can be calculated analytically from the expansion coefficients and the basis functions (Chapelle et al., 2002; Blatman and Sudret, 2010).
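The analytic leave-one-out formula for ordinary least squares can be sketched as follows (a generic illustration, not the specific implementation used here): each leave-one-out residual equals the ordinary residual divided by one minus the corresponding diagonal entry of the hat matrix.

```python
import numpy as np

def loo_error(Psi, y):
    """Analytic leave-one-out error for an OLS fit y ~ Psi @ c:
    err_i = (y_i - yhat_i) / (1 - h_i), with h_i the diagonal of the
    hat matrix Psi (Psi^T Psi)^{-1} Psi^T."""
    c, *_ = np.linalg.lstsq(Psi, y, rcond=None)
    yhat = Psi @ c
    # hat-matrix diagonal, computed without forming the full N x N matrix
    H_inv = np.linalg.inv(Psi.T @ Psi)
    h = np.einsum('ij,jk,ik->i', Psi, H_inv, Psi)
    return np.mean(((y - yhat) / (1.0 - h)) ** 2)

# toy regression problem: cubic-polynomial basis, noisy sine data
rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 30)
Psi = np.vander(x, 4)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(30)
e_loo = loo_error(Psi, y)
```

The formula gives the exact leave-one-out error at the cost of a single fit, which is what makes cross-validated error estimates affordable in every subdomain.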
Given the sparse tree representation described in Section 3.3, it cannot be guaranteed that the residual is expanded in every terminal domain. Therefore, during the splitting phase of the SSE algorithm (Step 2c in Section 2.3) we initialize the error of all the subdomains of the current subdomain to the leave-one-out error of the latter:

$\hat{E}^{\ell+1,p} = \hat{E}_{LOO}^{\ell,q}, \quad \mathcal{D}^{\ell+1,p} \subset \mathcal{D}^{\ell,q}$ (24)

Then, we update the error estimate in each subdomain during Step 2a only if the conditions for its expansion hold (see Section 3.3). As a result, every terminal domain is either assigned its own leave-one-out error, if it contains a residual expansion, or it inherits the leave-one-out error of the last ancestor domain that was expanded.
By using the leave-one-out error in each terminal domain as an estimator of its generalization error, the empirical estimator of Eq. (23) reads:

$\hat{E}_{gen} = \sum_{\mathcal{D}^{\ell,p} \in \mathcal{T}} V^{\ell,p}\, \hat{E}_{LOO}^{\ell,p}$ (25)

In most metamodeling applications, it is customary to normalize the estimated error by the variance of the experimental design responses, to obtain a dimensionless error measure. The relative error is thus defined as:

$\hat{\varepsilon} = \frac{\hat{E}_{gen}}{\widehat{\mathrm{Var}}[\mathcal{Y}]}$ (26)
4 Applications
In this section we aim to show the performance of SSE on a set of applications that can prove challenging for standard metamodeling techniques. Because of its widespread use in the uncertainty quantification of engineering models, we choose polynomial chaos expansions (Xiu and Karniadakis, 2002; Blatman and Sudret, 2011) (hereinafter PCE) as the spectral decomposition technique. This choice is also quite convenient due to several specific properties of PCE that combine well with SSE.
4.1 Synergies with polynomial chaos expansions
By using the same notation as in Section 2, and assuming that $X$ has independent components, the truncated polynomial chaos expansion of a finite-variance model can be written as (Le Gratiet et al., 2016):

$Y \approx \sum_{\alpha \in \mathcal{A}} a_\alpha\, \Psi_\alpha(X)$ (27)

where $\alpha$ is a multi-index that identifies the polynomial degree in each variable, $\mathcal{A}$ is a suitable truncation set (e.g. containing all multivariate polynomials up to a given maximum degree), and the $\Psi_\alpha$ form an orthogonal basis of multivariate polynomials. The latter can be obtained via tensor products of univariate polynomials as follows:

$\Psi_\alpha(x) = \prod_{i=1}^{M} \phi_{\alpha_i}^{(i)}(x_i)$ (28)

where $\phi_{\alpha_i}^{(i)}$ is a polynomial of degree $\alpha_i$ that belongs to the family of univariate polynomials orthogonal with respect to the marginal input PDF of $X_i$ and the inner product in Eq. (1).
An interesting property of these univariate polynomials, which synergizes well with our proposed SSE, is that it is possible to construct polynomials orthogonal with respect to almost any input PDF through Gram-Schmidt orthogonalization (for an extensive review, see Gautschi (2004); Ernst et al. (2012)). In the context of SSE, this property has a powerful implication: in each subdomain, the basis elements in Eq. (9) are still polynomial functions of the original input variables.
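A numerical Gram-Schmidt construction of such a basis can be sketched as follows (illustrative Python, assuming a bounded support and a hypothetical linear-ramp PDF; quadrature-based inner products stand in for the exact integrals):

```python
import numpy as np

def gram_schmidt_polys(pdf, support, max_degree, n_quad=400):
    """Build polynomials orthonormal w.r.t. an arbitrary bounded-support
    input PDF by Gram-Schmidt on the monomials.
    Returns a list of numpy Polynomial objects."""
    a, b = support
    # Gauss-Legendre quadrature rescaled to [a, b] for integrals against the PDF
    x, w = np.polynomial.legendre.leggauss(n_quad)
    x = 0.5 * (b - a) * x + 0.5 * (b + a)
    w = 0.5 * (b - a) * w * pdf(x)
    inner = lambda p, q: np.sum(w * p(x) * q(x))
    polys = []
    for j in range(max_degree + 1):
        p = np.polynomial.Polynomial.basis(j)   # monomial x^j
        for q in polys:                         # subtract projections on earlier polys
            p = p - inner(p, q) * q
        polys.append(p / np.sqrt(inner(p, p)))  # normalize to unit variance
    return polys

# example: a hypothetical linear-ramp PDF f(x) = 2x on [0, 1]
pdf = lambda x: 2.0 * x
polys = gram_schmidt_polys(pdf, (0.0, 1.0), 3)
```

The resulting basis is orthonormal under the chosen PDF while each basis element remains an ordinary polynomial in the original variable, which is exactly the property exploited by the flattened representation below.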
This property, together with the analytical integrability of polynomials, allows us to derive several statistics of interest of PCE-based SSE analytically. Let us first introduce the notion of flattened representation: because SSE is a polynomial in the original variables in every level and subdomain, this also holds for the terminal domains introduced in Section 3.4. Therefore, one can project the full SSE in Eq. (7) as a local PCE onto each terminal domain:

$\mathcal{M}_{SSE}(x) = \sum_{\alpha \in \mathcal{A}^{\ell,p}} b_\alpha^{\ell,p}\, \Psi_\alpha^{\ell,p}(x), \quad x \in \mathcal{D}^{\ell,p} \in \mathcal{T}$ (29)

where $\mathcal{A}^{\ell,p}$ is a suitable truncation set for the projection to be exact, and the $b_\alpha^{\ell,p}$ are the corresponding coefficients. The latter can easily be computed either analytically or exactly through quadrature. Note that, while the basis elements $\Psi_\alpha^{\ell,p}$ in the PCE in Eq. (29) correspond to those in Eq. (9) (they only depend on the input PDF in Eq. (11)), in general the coefficients will differ, i.e. $b_\alpha^{\ell,p} \neq a_\alpha^{\ell,p}$.
Because the PCE basis contains the constant term, it is straightforward to show that the expected value of Eq. (7) reads:

$\mu_{SSE} = \mathbb{E}[\mathcal{M}_{SSE}(X)] = \sum_{\mathcal{D}^{\ell,p} \in \mathcal{T}} V^{\ell,p}\, b_0^{\ell,p}$ (30)

which is the probability-weighted mean of the mean values of the flattened representations in Eq. (29). Similarly, the variance can be calculated as:

$\mathrm{Var}[\mathcal{M}_{SSE}(X)] = \sum_{\mathcal{D}^{\ell,p} \in \mathcal{T}} V^{\ell,p} \sum_{\alpha \in \mathcal{A}^{\ell,p}} \big(b_\alpha^{\ell,p}\big)^2 - \mu_{SSE}^2$ (31)
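Given the terminal-domain probability masses and the local orthonormal-PCE coefficients of the flattened representation, the two moments above reduce to weighted sums, as in this short Python sketch (the masses and coefficients below are made-up illustrative values):

```python
import numpy as np

def sse_moments(masses, coeffs):
    """Mean and variance of a flattened SSE from the terminal-domain
    probability masses V and the local orthonormal-PCE coefficients.
    Assumes coeffs[k][0] is the constant term of the k-th local expansion."""
    mean = sum(V * a[0] for V, a in zip(masses, coeffs))
    second_moment = sum(V * np.sum(np.asarray(a) ** 2) for V, a in zip(masses, coeffs))
    return mean, second_moment - mean ** 2

# two hypothetical terminal domains of equal probability mass
masses = [0.5, 0.5]
coeffs = [np.array([1.0, 0.3]), np.array([2.0, 0.1])]
mu, var = sse_moments(masses, coeffs)  # mu = 1.5
```

The variance combines the within-domain variances (non-constant squared coefficients) and the between-domain spread of the local means, as in a standard mixture decomposition.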
A number of other quantities of engineering interest (e.g. conditional variances, Sobol' sensitivity indices, etc.) can be derived similarly from the flattened representation. A selection of those is reported in Appendix A.

From a technical perspective, the flattened representation in Eq. (29) contains all the information needed to evaluate Eq. (7) at a new point, but at a much lower storage cost, as only the final sets of coefficients and basis indices need to be stored. This also has advantages when predicting the response at new points, because it only requires the evaluation of a single local expansion in the appropriate terminal domain, rather than that of all of its ancestors as in the original formulation in Eq. (7). More formally, for a new point $x$ it is sufficient to find the terminal domain $\mathcal{D}^{\ell,p} \in \mathcal{T}$ for which $x \in \mathcal{D}^{\ell,p}$, and to evaluate the flattened SSE from Eq. (29) there.
4.2 Example applications and testing strategy
To compare the performance of SSE over sparse PCE, we choose four reference problems of increasing complexity: (i) a onedimensional analytical function with localized nonpolynomial behavior, (ii) a dimensional analytical function with decreasing parametric importance in higher dimensions, (iii) an dimensional engineering model describing the performance function of a damped oscillator and (iv) a threedimensional discontinuous engineering model describing the snaptrough behavior of a truss structure.
Among the ingredients identified in Section 2.3 is a partitioning strategy to choose the splitting direction in every subdomain. After extensive testing, we found that splitting along the direction of highest variability of the residual proved to be the most effective, especially for smaller experimental designs. We therefore split each subdomain into two subdomains of equal probability mass along the direction with the maximum first-order Sobol' index (Sobol', 1993), as analytically derived from the coefficients of the local expansion (Sudret, 2008).
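The first-order Sobol' indices used for the split selection can be computed directly from PCE coefficients, following Sudret (2008): the index of variable i sums the squared coefficients of all basis terms that involve variable i alone. A minimal Python sketch with a hypothetical two-dimensional truncation set:

```python
import numpy as np

def first_order_sobol(alphas, coeffs):
    """First-order Sobol' indices from PCE coefficients: S_i sums the squared
    coefficients of basis terms depending on variable i alone, divided by the
    total variance (constant term excluded).
    alphas: list of multi-indices; coeffs: matching coefficients."""
    alphas = np.asarray(alphas)
    coeffs = np.asarray(coeffs)
    nonconst = alphas.sum(axis=1) > 0
    total_var = np.sum(coeffs[nonconst] ** 2)
    S = []
    for i in range(alphas.shape[1]):
        # terms whose total degree is carried entirely by variable i
        only_i = (alphas[:, i] > 0) & (alphas.sum(axis=1) == alphas[:, i])
        S.append(np.sum(coeffs[only_i] ** 2) / total_var)
    return np.array(S)

# hypothetical 2-D degree-2 truncation set and coefficients
alphas = [(0, 0), (1, 0), (0, 1), (2, 0), (1, 1), (0, 2)]
coeffs = [1.0, 0.8, 0.2, 0.1, 0.3, 0.0]
S = first_order_sobol(alphas, coeffs)
```

In the SSE context, the subdomain would then be split along the direction i with the largest S[i].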
In all applications we compare the convergence behavior of SSE vs. its spectral counterpart PCE as a function of the experimental design (ED) size. To assess the robustness of the results, we consider independent replications of each ED, and report the results as Tukey box plots. For each experimental design size, the SSE construction is terminated according to the criterion in Section 3.3.

As a spectral technique, we adopt the adaptive sparse PCE based on the LARS approach developed in Blatman and Sudret (2011), in its numerical implementation in UQLab (Marelli and Sudret, 2014, 2019). Each local expansion is therefore a degree- and q-norm-adaptive polynomial chaos expansion. We further introduce a rank truncation to cope with high-dimensional problems, i.e. we limit the maximum number of interacting input variables (Marelli and Sudret, 2019) to 2 at a time. The truncation set for each spectral expansion (Eq. (27)) thus reads:
$\mathcal{A} = \big\{\alpha \in \mathbb{N}^M : \|\alpha\|_q \leq p,\ \|\alpha\|_0 \leq 2\big\}$ (32)

where

$\|\alpha\|_q = \Big(\sum_{i=1}^{M} \alpha_i^q\Big)^{1/q}, \qquad \|\alpha\|_0 = \#\{i : \alpha_i > 0\}$ (33)

The q-norm and the maximum polynomial degree $p$ are both adaptively increased, with the maximum admissible degree being a parameter of each case study.

In all examples the SSE performance is compared to standard polynomial chaos expansions on the same ED. These PCEs are constructed with the same adaptive approach used for the SSE residual expansions; their maximum degree, however, is set to the highest value our computational budget admitted for a given dimensionality.
We compare the performance of SSE and PCE in terms of the relative mean squared error (MSE), a well-known error measure defined as:

$\varepsilon = \frac{\mathbb{E}\big[(\mathcal{M}(X) - \hat{\mathcal{M}}(X))^2\big]}{\mathrm{Var}[\mathcal{M}(X)]}$ (34)

where $\hat{\mathcal{M}}$ is either the PCE or the SSE surrogate model. This error measure is estimated by standard Monte Carlo simulation on a large validation sample.
4.3 Application 1: onedimensional analytical function
We first present a simple one-dimensional example, meant to illustrate how SSE behaves on a model for which PCE is expected to fail. The model is given by:

(35)

where the input is a uniformly distributed random variable. The first two terms in Eq. (35) (polynomial and sinusoidal) can be accurately approximated by a low-degree PCE, while the third term (a squared exponential) causes PCE to require an extremely high degree, due to the localized peak it introduces (see Figure 3).
In the same figure, we detail four SSE refinement stages, with and . For every step, the top panel gives a graphical representation of the subdomains identified by the algorithm, with the subdomains of level highlighted in orange. In the middle panel, we plot the true model (orange solid line) and the current SSE approximation (dashed blue line). In the bottom panel, we plot the corresponding residual as a solid blue line, on the same vertical scale as the middle panel for comparison.
In the first step in Figure LABEL:sub@fig:Ex1:detailedSteps:1, the main trend of the function is identified, leaving a residual that mainly consists of the sine oscillation and the exponential peak. In the following step (Figure LABEL:sub@fig:Ex1:detailedSteps:2), the approximation is not greatly improved in the subdomain , because the available maximum degree is not sufficiently high, resulting in a mostly constant polynomial correction. In subdomain , the same problem occurs, and the insufficient maximum degree yields only a small global improvement. In the next step (Figure LABEL:sub@fig:Ex1:detailedSteps:3), the residual in , and is significantly reduced to a very small oscillation around . After the final step (Figure LABEL:sub@fig:Ex1:detailedSteps:4), the overall approximation is highly accurate.
From the residual progression it can be seen that the algorithm needs more levels to accurately approximate the target function near regions of high complexity, i.e., near the exponential peak. While this property does not affect the convergence when an experimental design of fixed size is chosen, it can be exploited in adaptive experimental design settings (Wagner et al., 2020).
As expected, the final SSE accuracy increases with the size of the experimental design. In Figure 4, we compare SSE and PCE on a set of experimental design sizes of in terms of their relative mean squared error (MSE, Eq. (34)). PCE is constructed with a maximum adaptive degree of .
For the extremely small experimental design of , the SSE approach is comparable to PCE. As more experimental design points become available, SSE converges faster than PCE in terms of relative MSE, and from onwards SSE consistently outperforms PCE on this problem. At larger experimental designs, SSE can accurately reproduce the localized behavior of this test function, as it is not constrained by the global nature of PCE basis functions defined on the full domain. At the final ED size of , the SSE relative MSE is at least one order of magnitude smaller than the PCE error. There is considerable variability in the relative MSE between individual replications. This can be attributed to the squared-exponential peak in Eq. (35): depending on the input realizations in the experimental design, the peak is captured more or less accurately by the available data.
4.4 Application 2: 100-dimensional analytical function
With this example, we explore the scalability of SSE in high-dimensional problems. It uses a variant of a test function introduced in Zhou (1998). We modified the function to have a high nominal dimensionality () and a relatively low effective dimensionality, meaning that the majority of its variability can be attributed to a small number of input parameters. It takes the form
(36) 
The factor modifies the original function and serves as a dimension-dependent weight that decays exponentially, as:
(37) 
The input random vector is distributed according to a multivariate standard uniform distribution with independent marginals, .
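The decay rate in Eq. (37) is elided in this excerpt; with an assumed rate (illustration only), the resulting low effective dimensionality can be quantified as follows:

```python
import numpy as np

d = 100
k = np.arange(1, d + 1)
# Hypothetical dimension-dependent weights decaying exponentially, in the
# spirit of Eq. (37); the decay rate 0.5 is an assumption for illustration
w = np.exp(-0.5 * (k - 1))

# Share of the total weight carried by the first 5 of the 100 dimensions
leading_share = w[:5].sum() / w.sum()
```

With this decay rate, well over 90% of the total weight is concentrated in the first few dimensions, which is precisely the low-effective-dimensionality setting described above.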
Two contour plots of cross-sections of this function are shown for dimensions in Figure LABEL:sub@fig:Ex2:contour:1 and for in Figure LABEL:sub@fig:Ex2:contour:2. The effect of the decay factor is clearly visible and results in a close-to-constant behavior along the parametric dimension .
To manage the computational complexity, in this example we limit the maximum polynomial degree of SSE to and compute the SSE with total experimental design sizes of . For PCE, we choose a maximum degree of , the largest we could run on a standard desktop computer with 16 GB of RAM without incurring memory issues. The resulting comparison between PCE and SSE with respect to the relative MSE is plotted in Figure 6.
In this scenario, the SSE algorithm outperforms PCE on all investigated experimental designs. In fact, PCE seems to benefit only marginally from increasing the experimental design, with a relative error of for all considered values of . SSE, instead, shows a convincing convergence behavior, reducing its error by almost two orders of magnitude across the considered experimental designs. Given the high dimensionality of this problem, the approximation power of sparse PCE is limited by the curse of dimensionality rather than by the lack of data. By reducing the complexity of the spectral representation at each level, SSE can better exploit informative datasets without incurring similar computational bottlenecks.
4.5 Application 3: damped oscillator
Damped oscillators are a class of engineering models commonly used in structural reliability problems (Dubourg, 2011). This class of problems is known to be difficult to surrogate, due to its high nonlinearity and often localized behavior. A sketch of the oscillator considered in this example is displayed in Figure 7. It consists of a primary and a secondary system with masses , stiffnesses , and damping ratios . The subscripts and denote the primary and secondary system properties, respectively.
In this example we consider the limit state function of the damped oscillator given by
(38) 
where is the force capacity of the secondary spring, is the so-called peak factor, and is the relative displacement between the primary and the secondary system. The mean-square relative displacement of the secondary spring under a white-noise base acceleration is analytically given by:
(39)
where is the white noise intensity, and are the natural frequencies of the two subsystems, and the further abbreviations are used: , , and .
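For reference, a numerical sketch of this limit state is given below. Since the terms of Eq. (39) are elided in this excerpt, the implementation reproduces the form in which this classical benchmark commonly appears in the structural reliability literature (e.g., in Dubourg, 2011), and the parameter values are merely illustrative; both should be checked against the original formulation:

```python
import math

def limit_state(m_p, m_s, k_p, k_s, z_p, z_s, S0, Fs, p=3.0):
    """Damped-oscillator limit state in the form commonly found in the
    reliability literature (assumed here, not quoted from the paper)."""
    w_p = math.sqrt(k_p / m_p)          # primary natural frequency
    w_s = math.sqrt(k_s / m_s)          # secondary natural frequency
    w_a = 0.5 * (w_p + w_s)             # average frequency
    z_a = 0.5 * (z_p + z_s)             # average damping ratio
    gamma = m_s / m_p                   # mass ratio
    theta = (w_p - w_s) / w_a           # tuning parameter
    # mean-square relative displacement of the secondary spring
    ms_disp = (math.pi * S0 / (4.0 * z_s * w_s**3)
               * z_a * z_s / (z_p * z_s * (4.0 * z_a**2 + theta**2) + gamma * z_a**2)
               * (z_p * w_p**3 + z_s * w_s**3) * w_p
               / (4.0 * z_a * w_a**4))
    # failure when the peak spring force exceeds the force capacity Fs
    return Fs - p * k_s * math.sqrt(ms_disp)

# Evaluation at illustrative (mean-like) parameter values
g = limit_state(m_p=1.5, m_s=0.01, k_p=1.0, k_s=0.01,
                z_p=0.05, z_s=0.02, S0=100.0, Fs=15.0)
```

A positive value of the limit state corresponds to a safe configuration; the surrogate models in this section approximate this scalar response over the random inputs.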
All variables but the peak factor (set to ) are modelled as independent random variables and are summarized in the random vector . Their marginal distributions are lognormal, with the parameters given in Table 1.
Variable  Description  Distribution  Mean  C.O.V. 

primary mass  Lognormal  
secondary mass  Lognormal  
primary spring stiffness  Lognormal  
secondary spring stiffness  Lognormal  
primary damping ratio  Lognormal  
secondary damping ratio  Lognormal  
white noise intensity  Lognormal  
secondary spring force capacity  Lognormal 
For the convergence study, we choose experimental design sizes of . The maximum degrees are set to and for SSE and PCE, respectively. Figure 8 summarizes the results.
This benchmark is known to be quite difficult to approximate with standard surrogate modeling techniques, as is clear from the scale of the relative MSE in Figure 8. For the two smaller experimental design sizes , both PCE and SSE perform quite poorly, with SSE showing a similar median behavior but a much higher variability. For larger experimental designs, however, the SSE performance improves significantly over that of PCE, until at the relative MSE of SSE is half that of PCE. This behavior is in line with the previous findings: provided enough information, SSE can offer higher expressive power than its static counterpart.
4.6 Application 4: truss with discontinuous snap-through behavior
As a last example, we address another problem of engineering interest: the geometrically nonlinear two-bar truss structure shown in Figure 9.
The structure itself is defined by the initial inclination and length of the two bars. A peculiarity of this structure is that it exhibits the so-called snap-through behavior. At first, the vertical displacement of such a structure typically increases linearly with an increasing load (Figure LABEL:sub@fig:ex4:setup:1). Once a specific critical load is exceeded, the structure snaps through to another equilibrium point, from which the load can be increased further (Figure LABEL:sub@fig:ex4:setup:2). The main implication of this kind of behavior is that the response is discontinuous in some critical regions of the input space, which are in general unknown a priori.
The vertical displacement of the truss tip is related to the angle by
(40) 
At the same time, needs to satisfy the following constitutive equation that depends on the random vector :
(41) 
For a given realization of , this equation can be solved numerically for , whose value is then used in Eq. (40) to compute the corresponding vertical displacement.
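Equations (40) and (41) are not reproduced in this excerpt. The following sketch therefore only illustrates the solve-then-map pattern described above, using a hypothetical von Mises two-bar truss as a stand-in: the constitutive residual, the initial inclination, and all numerical values are assumptions for illustration, not the paper's model:

```python
import math
import numpy as np
from scipy.optimize import brentq

ALPHA0 = math.radians(10.0)   # assumed initial inclination of the bars
L = 1.0                       # assumed half-span of the truss

def residual(alpha, P, E, A):
    """Hypothetical stand-in for the constitutive equation (41):
    vertical equilibrium of a von Mises truss at inclination alpha."""
    strain = math.cos(ALPHA0) / math.cos(alpha) - 1.0
    return -2.0 * E * A * strain * math.sin(alpha) - P

def tip_displacement(P, E, A, n_scan=4000):
    """Solve the nonlinear equation for alpha (role of Eq. (41)), then map
    alpha to the vertical tip displacement (role of Eq. (40)). Snap-through
    appears as a jump onto the far solution branch."""
    alphas = np.linspace(ALPHA0 - 1e-9, -math.pi / 2 + 1e-6, n_scan)
    r = [residual(a, P, E, A) for a in alphas]
    for i in range(n_scan - 1):
        if r[i] * r[i + 1] <= 0.0:   # a sign change brackets the first root
            alpha = brentq(residual, alphas[i + 1], alphas[i], args=(P, E, A))
            return L * (math.tan(ALPHA0) - math.tan(alpha))
    raise RuntimeError("no equilibrium found in the scanned range")

d_small = tip_displacement(P=1.0, E=1.0e5, A=0.01)    # below the critical load
d_snap = tip_displacement(P=10.0, E=1.0e5, A=0.01)    # beyond it: snapped through
```

The jump between `d_small` and `d_snap` mimics the discontinuity of the true response surface: a small change in the load realization can move the solution from the pre-buckling branch to the snapped-through branch.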
In this study we set the constants and , and treat the parameters as independent random variables with marginals listed in Table 2 (Moustapha and Sudret, 2019). We investigate experimental designs of sizes and set the maximum polynomial degrees to and for SSE and PCE, respectively.
Variable  Description  Distribution  Mean  C.O.V. 

load  Gumbel  
Young’s modulus  Lognormal  
cross sectional area  Gaussian 
Figure 10 summarizes the convergence behavior of PCE and SSE in this benchmark. For all experimental designs, SSE outperforms sparse PCE.
The dispersion of the relative MSE is also significantly reduced. These observations can be explained by the well-known Gibbs phenomenon of spectral representations, which leads to large discrepancies close to discontinuities. The effect is far less severe (although still present) for SSE than for PCE, because it is restricted to the subdomains that actually contain the discontinuity. This behavior is investigated more closely in Figure 11, where a cross-section through is shown. It is created by setting to its mean value and drawing a map proportional to the pointwise discrepancy in the remaining directions. We adjust the transparency of the model response to reflect the underlying joint PDF: solid colors correspond to high probability, fading ones to low probability. Figures LABEL:sub@fig:ex4:comparison:PCEDelta and LABEL:sub@fig:ex4:comparison:SSEDelta show the relative pointwise error at an experimental design size of for PCE and SSE, respectively. Figure LABEL:sub@fig:ex4:comparison:LOO shows instead the domain-wise error estimator from Eq. (25).
All plots clearly show the effect of the discontinuity. For PCE, the Gibbs phenomenon is clearly visible: it causes a large error near the discontinuity and an oscillating error farther away from it. SSE naturally suffers from the same problem, but its effect remains localized close to the discontinuity and does not pollute farther regions. Furthermore, the domain-wise local error estimator in Figure LABEL:sub@fig:ex4:comparison:LOO gives a clear indication of the local loss of accuracy of SSE. In practical applications this information can be crucial, as it allows one to assess the confidence of the surrogate model predictions, and it could be used to adaptively enrich the experimental design close to critical regions (Wagner et al., 2020).
4.7 Error estimation accuracy
For practical applications, it is important that surrogate models offer insight into their own prediction accuracy. Most surrogate models provide global confidence estimates, typically computed through cross-validation techniques (e.g., the leave-one-out error), while some techniques also offer pointwise confidence bounds (e.g., Kriging (Santner et al., 2003) and bootstrap PCE (Marelli and Sudret, 2018)).
For SSE, the domain-wise error estimators of the local expansions give some insight into its local accuracy, as shown in the last case study (Section 4.6). The weighted sum of these domain-wise estimators can be used as a global estimator of the generalization error, as proposed in Section 3.4, Eq. (26). To assess the accuracy of this global estimator, in Figure 12 we plot it against the relative MSE on a validation set for all presented case studies. The diagonal dashed line corresponds to perfect error estimation: .
In all the applications, significantly underestimates the true error. For comparison, we also show the corresponding LOO estimator of PCE, which exhibits a similar bias towards lower errors. On the other hand, there is a strong correlation between the SSE estimator and the true validation error across all applications, which still makes a strong candidate for model selection in future extensions of the method that adaptively select the SSE hyperparameters (e.g., by adaptively choosing ).
4.8 Considerations on computational costs and scalability
For a metamodeling technique to be relevant in an engineering context, the associated training and prediction costs need to be negligible with respect to the cost of the underlying physics-based computational model. In this sense, SSE performs quite well, as its computational costs scale primarily with the size of the experimental design. Indeed, because the residual expansions (Eq. (9)) can be chosen with a very low degree or an otherwise strict truncation, the driving cost is the total number of subdomains expanded in the SSE sparse tree (see Section 3.3). Interestingly, the expected number of expansions depends only on the ratio between the total experimental design size and the minimum number of points needed in each subdomain to perform a residual expansion, , as in (see also Eq. (20)):
(42) 
In other words, the computational complexity increases at most linearly with the number of points in the experimental design. Additionally, the storage costs can be reduced further, to expansions, by using the flattened representation in Eq. (29). Therefore, as required, the training and evaluation costs of SSE are normally negligible with respect to those needed to produce a training set for any realistic engineering application.
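The exact expression of Eq. (42) is elided in this excerpt, but the at-most-linear growth can be illustrated with a simplified stand-in for the partitioning tree, in which a domain is split in half (by point count) until fewer than a minimum number of points remain:

```python
import numpy as np

def count_expansions(points, n_min):
    """Number of subdomains that receive a residual expansion when the
    domain is split recursively at the median until fewer than n_min
    points remain (a simplified stand-in for the SSE sparse tree)."""
    if len(points) < n_min:
        return 0
    mid = np.median(points)
    left, right = points[points <= mid], points[points > mid]
    if len(left) == 0 or len(right) == 0:   # guard against degenerate splits
        return 1
    return 1 + count_expansions(left, n_min) + count_expansions(right, n_min)

rng = np.random.default_rng(0)
n_min = 10
counts = {n: count_expansions(rng.uniform(size=n), n_min)
          for n in (100, 1000, 10000)}
# The number of expansions grows at most linearly with the ED size
```

In this stand-in, every expanded subdomain contains at least `n_min` points, so the total number of expansions is bounded by a small constant times the ratio of the ED size to `n_min`, in line with the linear scaling stated above.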
5 Conclusions
In an effort to extend the applicability of the powerful class of spectral-decomposition-based metamodels, we propose a novel metamodeling technique called stochastic spectral embedding (SSE), which exploits both recent advances in UQ (sparse spectral expansions) and in machine learning (regression trees). While our presentation is general in nature, we showed how well this approach synergizes with sparse polynomial chaos expansions. We also provided analytical formulas to calculate several statistical properties of the resulting model by means of the so-called flattened representation, which has additional benefits in terms of computational costs.
We tested the performance of SSE on both simple test functions and engineering-like examples of varying dimensionality and complexity, using a range of experimental design sizes, and compared it to our best-performing sparse PCE. In terms of generalization, SSE outperforms PCE in most cases, especially for highly complex models and large experimental designs.
We also demonstrated that the training costs of SSE scale, in expectation, linearly with the number of points in the experimental design. This compares favorably with most metamodeling techniques common in the UQ community (e.g., PCE or Kriging).
This performance, however, comes at the cost of trading the continuity of PCE for the piecewise continuity of SSE. It also implies the loss of the effective generalization error estimate provided by in linear regression. To mitigate this issue, we proposed the error estimate of Eq. (26). Despite its absolute scale being biased towards lower values, it shows a high correlation with the actual generalization error for all experimental design sizes and dimensionalities. This is a promising property for further research into the automatic selection of the hyperparameters of the algorithm (currently, the maximum degree of the residual expansions and the minimum number of points in each subdomain required to expand the residual) and into further enhancing its performance.
Acknowledgements
The PhD thesis of the second author is supported by ETH grant #44 171.
Appendix A Post-processing of PCE-based SSE
If polynomial chaos expansions are used to construct the residual expansions, several quantities of engineering interest can be computed analytically by post-processing the final SSE. In this section we derive expressions for (i) conditional expectations, (ii) partial variances, and (iii) Sobol' indices.
A.1 Conditional expectations
Conditional expectations describe the expectation of a multivariate function of a random vector , conditioned on a subset of its components assuming fixed values. Let be an independent random vector with PDF . Denote further by and two disjoint index sets such that , and by a random subvector with PDF . Additionally, define the complementary random vector . The conditional expectation of with respect to can then be written as
(43) 
where we used the flattened representation from Eq. (29). This corresponds to marginalizing over the parameters . Due to the local orthonormality of the SSE representation, an analytical expression for this integral can be found as
(44) 
where is the input mass in the marginalized dimensions and . Further, is an auxiliary random variable that is only defined in the subdomain and is a polynomial basis function of the nonmarginalized variables.
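To make the marginalization mechanics concrete, consider a single expansion on a two-dimensional uniform input. Because the tensorized basis is orthonormal, every term with a non-constant factor in the marginalized direction integrates to zero, so the conditional expectation retains only the terms that are constant in the marginalized variable. A minimal sketch with orthonormalized Legendre polynomials and arbitrary coefficients (not taken from the paper):

```python
import numpy as np

# Legendre polynomials orthonormal w.r.t. the uniform density 1/2 on [-1, 1]
P = {0: lambda x: np.ones_like(x),
     1: lambda x: np.sqrt(3.0) * x,
     2: lambda x: np.sqrt(5.0) / 2.0 * (3.0 * x**2 - 1.0)}

# A 2D expansion f(x1, x2) = sum_a c_a * P_{a1}(x1) * P_{a2}(x2)
coeffs = {(0, 0): 1.0, (1, 0): 0.5, (0, 2): -0.3, (2, 1): 0.8}

def f(x1, x2):
    return sum(c * P[a1](x1) * P[a2](x2) for (a1, a2), c in coeffs.items())

def cond_exp(x1):
    """Analytical E[f | X1 = x1]: keep only terms constant in x2 (a2 == 0)."""
    return sum(c * P[a1](x1) for (a1, a2), c in coeffs.items() if a2 == 0)

# Cross-check by numerical marginalization over x2 ~ U(-1, 1)
x2, w = np.polynomial.legendre.leggauss(10)
x1 = 0.37
numeric = np.sum(w * f(x1, x2)) / 2.0   # divide by 2: uniform density on [-1, 1]
```

The quadrature result matches the truncated expansion to machine precision, which is exactly the mechanism exploited by the analytical expression above; the SSE case additionally weights each local contribution by the input probability mass of its subdomain.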
The marginalization in Eq. (43) creates an additional problem: the resulting expression generally contains overlapping subdomains, because terminal subdomains in the full input space are not necessarily terminal in the lower-dimensional input space of the conditional expectation defined by . However, because the basis functions are polynomials, it is once again possible to perform the flattening process (see Section 3.4) and obtain disjoint subdomains. Denoting by the set of terminal subdomains in the conditional input variables , we can rewrite Eq. (44) as
(45) 
where is a suitable multiindex set allowing an exact representation of the polynomials and are the corresponding coefficients.