A probabilistic framework for approximating functions in active subspaces

09/18/2018, by Mario Teixeira Parente, et al.

This paper develops a comprehensive probabilistic setup to compute approximating functions in active subspaces. Constantine et al. proposed the active subspace method in (Constantine et al., 2014) to reduce the dimension of computational problems. It can be seen as an attempt to approximate a high-dimensional function of interest f by a low-dimensional one. To do this, a common approach is to integrate f over the inactive, i.e. non-dominant, directions with a suitable conditional density function. In practice, this can be done with a finite Monte Carlo sum, making not only the resulting approximation random in the inactive variable for each fixed input from the active subspace, but also its expectation, i.e. the integral of the low-dimensional function weighted with a probability measure on the active variable. In this regard we develop a fully probabilistic framework extending results from (Constantine et al., 2014, 2016). The results are supported by a simple numerical example.

1. Introduction

The notion of active subspaces refers to a recent and emerging set of tools for dimension reduction [8]. Reducing dimensions is one natural approach to simplify computational problems suffering from the curse of dimensionality, a phenomenon that results in an exponential growth of computational cost with increasing dimension. What is regarded as a high dimension is dictated by the actual problem considered. By "high," we mean a number of dimensions that leads to excessive computation times, large memory requirements, or even questions about the feasibility of the computation.

There exist different approaches besides active subspaces to reduce computational effort, especially in the context of Uncertainty Quantification (UQ) and Bayesian inversion [22]. For example, in [4, 13, 21] the authors consider low-rank approximations for the prior-preconditioned Hessian of the data misfit function to approximate the posterior covariance in computationally intensive linear Bayesian inverse problems. An extension to the nonlinear setting was proposed in [18]. A drawback of these methods is that they still require work in the full, high-dimensional space. A promising way of dimension reduction was proposed in [11], where the authors seek dominant directions in the parameter space that drive the update from the prior to the posterior distribution. These directions span the so-called likelihood-informed subspace (LIS) and are computed using the posterior expectation of the prior-preconditioned Hessian. The authors of [25] propose a new methodology that constructs a controlled ridge approximation [20] of the data misfit such that an upper bound on the KL divergence between the posterior and the corresponding approximation falls below a given threshold. The bound is obtained via logarithmic Sobolev inequalities [14]. The paper also contains a comparison to likelihood-informed and active subspaces. Most of these methods work only for scalar-valued functions. In [24], the authors perform gradient-based dimension reduction for functions in Hilbert spaces, i.e. also for vector-valued functions.

The active subspace method also aims to find a ridge approximation to a function of interest $f$. However, it exploits structure in the function's gradient, more precisely, the (prior-)averaged outer product of the gradient with itself. The technique has already been successfully applied to a wide range of complex problems of engineering or economic relevance, e.g. in hydrology [15], for a lithium-ion battery model [7], or to an elastoplasticity model for methane hydrates [23].

Independently of the concrete methodology, each approach to reduce dimension aims at unfolding the main and dominant information hidden in a low-dimensional structure. Active subspaces concentrate on directions in a computational subdomain in which $f$ is more sensitive, on average, than in other (orthogonal) directions. For that, the eigenvalues and corresponding eigenvectors of an uncentered covariance-type matrix, defined by the average of the outer product of the gradient $\nabla f$ with itself, are studied. The span of the eigenvectors with large corresponding eigenvalues forms the so-called active subspace. With the active subspace at hand, $f$ can be approximated by a low-dimensional ridge function depending on fewer variables.

A common approximation of $f$ uses a conditional expectation of $f$ over the complement of the active subspace, the inactive subspace, conditioned on the active variable, which is a linear combination of the variables from the original domain [8]. In practice, the conditional expectation is often approximated by a finite Monte Carlo sum. For this type of approximation, only a few samples are generally necessary since the function is, by construction, only mildly varying on the inactive subspace.
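In symbols, and anticipating the notation made precise in Section 3 (with $W_1$ spanning the active subspace, $W_2$ its orthogonal complement, and $z_i$ drawn from the conditional density of the inactive variable), the construction sketched here reads

$g(y) = \int f(W_1 y + W_2 z)\, \rho_{Z|Y}(z \mid y)\, dz \approx \frac{1}{N} \sum_{i=1}^{N} f(W_1 y + W_2 z_i).$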

The idea of active subspaces was introduced in [8] and exploited for an accelerated Markov chain Monte Carlo algorithm [3] in [10]. Theoretical considerations therein ignore stochasticity in the inactive subspace. In this paper, we include this aspect and perform a complete and rigorous analysis of approximating functions in active subspaces. Ultimately, we aim to provide a comprehensive probabilistic framework that generalizes the existing theoretical setting from [8, 10]. Our findings are supported by a simple test example.

The manuscript is structured as follows. Section 2 formulates the mentioned problem generally, without the notion of an active subspace. In Section 3, we explain and derive the concept of an active subspace in detail and set up a probabilistic setting for treating randomness in the inactive subspace. The main results on approximating functions in active subspaces via a Monte Carlo approximation of a conditional expectation are contained in Section 4 (see, e.g., Theorem 4.3 and Theorem 4.6). Additionally, a simple numerical example is shown to support the theoretical results. Section 5 is devoted to restating and extending a result from [10]. A summary and concluding comments are collected in Section 6.

2. Problem formulation

Suppose two random variables $Y$ and $Z$ follow a joint distribution with joint density $\rho_{Y,Z}$. Also, assume that the corresponding marginal and conditional densities are defined in the usual way [2, Sections 20 and 33]. Define

(2.1)   $g(y) := \mathbb{E}\big[f(Y, Z) \mid Y = y\big] = \int f(y, z)\, \rho_{Z|Y}(z \mid y)\, dz$

for a function $f$ which is integrable w.r.t. $\rho_{Z|Y}(\cdot \mid y)$ in its second argument for every $y$. Let us approximate $g$ by a finite Monte Carlo sum

(2.2)   $g_N(y) := \frac{1}{N} \sum_{i=1}^{N} f(y, z_i),$

where $z_i \sim \rho_{Z|Y}(\cdot \mid y)$ i.i.d. and $N$ denotes the number of Monte Carlo samples used for each $y$. That means $g_N(y)$ is a random variable for every $y$. Now, assume that a high-dimensional variable $x$ is divided into $y$ and $z$, i.e. $x = (y, z)$. For convenience, define a function

(2.3)   $f_{g_N}(x) = f_{g_N}(y, z) := g_N(y),$

defined on the corresponding high-dimensional domain.
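As a concrete, purely illustrative instance of this setting, the following minimal Python sketch evaluates the Monte Carlo sum (2.2) for a toy function $f$ and a standard bivariate normal joint density, for which the conditional of $Z$ given $Y = y$ is again normal; the correlation value, the function $f$, and all names are assumptions made for this sketch only.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(y, z):
    # Toy integrand; mildly varying in z (hypothetical choice).
    return np.sin(y) + 0.1 * z**2

def g_N(y, N=10, r=0.5):
    # Monte Carlo approximation (2.2) of g(y) for a standard bivariate
    # normal (Y, Z) with correlation r: Z | Y = y is normal with mean r*y
    # and variance 1 - r**2.
    z = rng.normal(loc=r * y, scale=np.sqrt(1.0 - r**2), size=N)
    return np.mean(f(y, z))

# g_N(y) is random: repeated evaluations at the same y differ.
print([round(g_N(1.0), 4) for _ in range(3)])
```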

The first main point of this manuscript is to give a rigorous description, in the context of active subspaces, of why the expression

(2.4)   $\mathbb{E}\big[f_{g_N}(Y, Z)\big] = \int g_N(y)\, \rho_Y(y)\, dy,$

where $\rho_Y$ denotes the marginal density of $Y$, is in general again random. The second point deals with the consequences (of treating this expression as non-deterministic) that lie in extending results from [8, 10] to the given probabilistic framework.

3. Active subspaces

Active subspaces, introduced in [5, 8, 10], try to find a ridge approximation [6, 20] to a function $f : \mathcal{X} \to \mathbb{R}$, $\mathcal{X} \subseteq \mathbb{R}^n$ open, i.e. $f(x) \approx g(A^\top x)$ for all $x \in \mathcal{X}$ with a function $g : \mathbb{R}^k \to \mathbb{R}$, $k \le n$, and a matrix $A \in \mathbb{R}^{n \times k}$. Obviously, one hopes that $k \ll n$ to sufficiently reduce the dimension. $A$ is computed to hold the directions in which $f$ is more sensitive, on average, than in other directions. This means that $f$ is nearly insensitive, on average, to directions in the null space of $A^\top$, since $g(A^\top (x + v)) = g(A^\top x)$ for each $v$ with $A^\top v = 0$. The notion "on average" is crucial in the following and means that sensitivities are weighted with a probability density function $\rho$ defined on $\mathbb{R}^n$.

In the following, let $\mathcal{X}_+$ denote the set of all $x$'s with a positive density value, i.e.

(3.1)   $\mathcal{X}_+ := \{x \in \mathbb{R}^n : \rho(x) > 0\}.$

We assume that $\mathcal{X}_+$ is open and hence that it contains none of its boundary points. Thus, $\rho$ is zero on the boundary $\partial\mathcal{X}_+$. Also, suppose that $\mathcal{X}_+$ is a continuity set, i.e. its boundary $\partial\mathcal{X}_+$ has Lebesgue measure zero. We will often make use of the fact that it is enough to integrate over $\mathcal{X}_+$ instead of $\mathbb{R}^n$ when weighting with $\rho$. In order to find the matrix $A$ for the ridge approximation, we assume that $f$ has partial derivatives that are square integrable w.r.t. $\rho$. Additionally, we assume that $\rho$ is bounded and continuous on $\mathcal{X}_+$.

To study sensitivities, we regard an orthogonal eigendecomposition of the averaged outer product of the gradient $\nabla f$ with itself, i.e.

(3.2)   $C := \int \nabla f(x)\, \nabla f(x)^\top \rho(x)\, dx = W \Lambda W^\top,$

where $\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_n)$ denotes the eigenvalue matrix with descending eigenvalues $\lambda_1 \ge \dots \ge \lambda_n$ and $W = [w_1, \dots, w_n]$ consists of all corresponding normed eigenvectors. The fact that $C$ is real symmetric implies that the eigenvectors can be chosen to give an orthonormal basis (ONB) of $\mathbb{R}^n$. Since $C$ is additionally positive semi-definite, it holds that $\lambda_i \ge 0$, $i = 1, \dots, n$. Note that the eigenvalues

(3.3)   $\lambda_i = w_i^\top C\, w_i = \int \big(\nabla f(x)^\top w_i\big)^2\, \rho(x)\, dx, \qquad i = 1, \dots, n,$

reflect averaged sensitivities of $f$ in the directions of the corresponding eigenvectors. That means that $f$ changes little, on average, in the directions of eigenvectors with small eigenvalues.

If it is possible to find a sufficiently large spectral gap, we can accordingly split $W = [W_1\ W_2]$. That is, $W_1 \in \mathbb{R}^{n \times k}$, $W_2 \in \mathbb{R}^{n \times (n-k)}$, where $W_1$ holds the directions in which $f$ is more sensitive, on average, than in the directions contained in $W_2$. The dimension $k$ denotes the number of eigenvalues before the spectral gap. The size of the gap is crucial for the approximation quality of the active subspace [5]. After splitting $W$, we can get a new parametrization of $x$. We write

(3.4)   $x = W W^\top x = W_1 W_1^\top x + W_2 W_2^\top x = W_1 y + W_2 z$

with $y := W_1^\top x \in \mathbb{R}^k$ and $z := W_2^\top x \in \mathbb{R}^{n-k}$. The variable $y$ is called the active variable and the column space of $W_1$, $\operatorname{ran}(W_1)$, the active subspace.
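The following Python sketch illustrates this construction numerically under simplifying assumptions (a standard normal density $\rho$, an analytically available gradient of a toy function, and a Monte Carlo estimate of the matrix in (3.2)); all names and the choice of test function are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, n_grad = 5, 1, 2000            # ambient dim, active dim, gradient samples

def grad_f(x):
    # Gradient of a toy, nearly ridge-like function f(x) = sin(a^T x) + 0.005*||x||^2.
    a = np.array([1.0, 0.5, 0.0, 0.0, 0.0])
    return np.cos(a @ x) * a + 0.01 * x

# Monte Carlo estimate of C = int grad f grad f^T rho dx for rho = N(0, I).
X = rng.standard_normal((n_grad, n))
G = np.array([grad_f(x) for x in X])
C = G.T @ G / n_grad

# Eigendecomposition C = W Lambda W^T with descending eigenvalues.
eigvals, W = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, W = eigvals[order], W[:, order]

W1, W2 = W[:, :k], W[:, k:]          # active and inactive directions
Y = X @ W1                           # active variables y = W1^T x
print("eigenvalues:", np.round(eigvals, 4))
```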

Notation

Throughout the remainder, we use some notation to avoid uninformative text. According to the previous lines, define $x(y, z) := W_1 y + W_2 z$. This will shorten text in the following equations. Also, for a compatible pair of a matrix $A$ and a set $S$, we define $AS := \{As : s \in S\}$. Additionally, for a set $S \subseteq \mathbb{R}^k \times \mathbb{R}^{n-k}$, we set $S_z := \{z : (y, z) \in S \text{ for some } y\}$, i.e. $S_z$ is the set of $z$-coordinates of points in $S$.

Probabilistic setting

Let $(\Omega, \mathcal{F}, \mathbb{P})$ be an abstract probability space. The random variable $X : \Omega \to \mathbb{R}^n$ stands for $x$ viewed as a random element whose push-forward measure $\mathbb{P}_X$ has Lebesgue density $\rho$. The random variables $Y := W_1^\top X$ and $Z := W_2^\top X$, representing random elements in the active and inactive subspace, also induce corresponding push-forward measures $\mathbb{P}_Y$ and $\mathbb{P}_Z$. It is possible to define a joint probability density function $\rho_{Y,Z}$ for the active and inactive variable with respect to the Lebesgue measure, i.e.

(3.5)   $\rho_{Y,Z}(y, z) := \rho(W_1 y + W_2 z).$

Note that $\rho_{Y,Z}$ inherits boundedness from $\rho$. The marginal densities are also defined in the usual way [2, Sections 20 and 33], i.e.

(3.6)   $\rho_Y(y) := \int_{\mathbb{R}^{n-k}} \rho(W_1 y + W_2 z)\, dz$

and

(3.7)   $\rho_Z(z) := \int_{\mathbb{R}^{k}} \rho(W_1 y + W_2 z)\, dy.$

Note that

(3.8)   $\mathbb{P}_Y(A) = \int_A \rho_Y(y)\, dy \quad \text{and} \quad \mathbb{P}_Z(B) = \int_B \rho_Z(z)\, dz$ for Borel sets $A \subseteq \mathbb{R}^k$ and $B \subseteq \mathbb{R}^{n-k}$.

For convenience, we define domains for the active and inactive variable, i. e.

(3.9)   $\mathcal{Y} := W_1^\top \mathcal{X}_+ \quad \text{and} \quad \mathcal{Z} := W_2^\top \mathcal{X}_+.$

Note that $\mathcal{Y}$ will be the domain of the low-dimensional function approximating $f$.

The following lemma shows that $\mathcal{Y}$ and $\mathcal{Z}$ can be characterized as the sets of vectors in the active and, respectively, inactive subspace with positive marginal density values. Let therefore

(3.10)   $\mathcal{Y}_+ := \{y \in \mathbb{R}^k : \rho_Y(y) > 0\} \quad \text{and} \quad \mathcal{Z}_+ := \{z \in \mathbb{R}^{n-k} : \rho_Z(z) > 0\}.$

We need the result for a proper definition of conditional densities.

Lemma 3.1.

It holds that

(3.11)   $\mathcal{Y} = \mathcal{Y}_+ \quad \text{and} \quad \mathcal{Z} = \mathcal{Z}_+.$
Proof.

We only show the equality for $\mathcal{Y}$ since the proof follows the same arguments for $\mathcal{Z}$. Take $y \in \mathcal{Y}$ arbitrary and choose $x \in \mathcal{X}_+$ such that $y = W_1^\top x$. Set $z := W_2^\top x$ and $c := \rho(x) > 0$. By the openness of $\mathcal{X}_+$ and the continuity of $\rho$ on $\mathcal{X}_+$, we can find a neighborhood $B_\delta(z)$ of $z$ such that

(3.12)   $\rho(W_1 y + W_2 z') \ge c/2 > 0$

for each $z' \in B_\delta(z)$. It follows that

(3.13)   $\rho_Y(y) = \int_{\mathbb{R}^{n-k}} \rho(W_1 y + W_2 z')\, dz' \ge \int_{B_\delta(z)} \rho(W_1 y + W_2 z')\, dz' > 0$

and thus $y \in \mathcal{Y}_+$. Conversely, choose $y \in \mathcal{Y}_+$. Since $\rho_Y(y) > 0$, there exists a $z$ such that $\rho(W_1 y + W_2 z) > 0$. Since $W_1 y + W_2 z \in \mathcal{X}_+$, it follows that $y = W_1^\top (W_1 y + W_2 z) \in W_1^\top \mathcal{X}_+ = \mathcal{Y}$. ∎

Corollary 3.2.

It holds that $\mathbb{P}_Y(\mathcal{Y}) = 1$ and $\mathbb{P}_Z(\mathcal{Z}) = 1$, i.e. $Y \in \mathcal{Y}$ a.s. and $Z \in \mathcal{Z}$ a.s.

Lemma 3.3.

The sets $\mathcal{Y}$ and $\mathcal{Z}$ are open in the respective topological spaces.

Proof.

We only show the openness of $\mathcal{Y}$ since the proof follows the same arguments for $\mathcal{Z}$. Let $y \in \mathcal{Y}$. By definition, there exists $x \in \mathcal{X}_+$ with $y = W_1^\top x$. Since $\mathcal{X}_+$ is open, there exists a ball $B_\varepsilon(x) \subseteq \mathcal{X}_+$. Since $\mathcal{Y} = \mathcal{Y}_+$ by Lemma 3.1, it suffices to show that $B_\varepsilon(y) \subseteq \mathcal{Y}_+$.

Take $y' \in B_\varepsilon(y)$. We compute

(3.14)   $\rho_Y(y') = \int_{\mathbb{R}^{n-k}} \rho(W_1 y' + W_2 z)\, dz$
(3.15)   $\quad \ge \int_{\{z\,:\,W_1 y' + W_2 z \in B_\varepsilon(x)\}} \rho(W_1 y' + W_2 z)\, dz > 0.$

Since $W$ is orthogonal and hence causes only a rotation, the set $\{z : W_1 y' + W_2 z \in B_\varepsilon(x)\}$ has positive Lebesgue measure, which justifies the last inequality above. The result follows by Lemma 3.1. ∎

In particular, the previous lemma implies that $\mathcal{Y} \in \mathcal{B}(\mathbb{R}^k)$ and $\mathcal{Z} \in \mathcal{B}(\mathbb{R}^{n-k})$. Another auxiliary result shows that, for the marginal densities, it is enough to integrate over $\mathcal{Z}$ and $\mathcal{Y}$, respectively.

Lemma 3.4.

It holds that

(3.16)   $\rho_Y(y) = \int_{\mathcal{Z}} \rho(W_1 y + W_2 z)\, dz \quad \text{and} \quad \rho_Z(z) = \int_{\mathcal{Y}} \rho(W_1 y + W_2 z)\, dy.$
Proof.

We only show the equality for $\rho_Y$ since the proof follows the same arguments for $\rho_Z$. Let $y \in \mathbb{R}^k$. We write

(3.17)   $\rho_Y(y) = \int_{\mathcal{Z}} \rho(W_1 y + W_2 z)\, dz + \int_{\mathbb{R}^{n-k} \setminus \mathcal{Z}} \rho(W_1 y + W_2 z)\, dz.$

For $z \in \mathbb{R}^{n-k} \setminus \mathcal{Z}$, it holds that $\rho(W_1 y + W_2 z) = 0$, since otherwise $W_1 y + W_2 z \in \mathcal{X}_+$ and hence $z = W_2^\top (W_1 y + W_2 z) \in \mathcal{Z}$, which is contradictory. It follows that in (3.17)

(3.18)   $\int_{\mathbb{R}^{n-k} \setminus \mathcal{Z}} \rho(W_1 y + W_2 z)\, dz = 0,$

which implies the desired result. ∎

As a consequence of Lemma 3.1, we are able to define a proper conditional density on $\mathcal{Z}$ given $y \in \mathcal{Y}$, i.e.

(3.19)   $\rho_{Z|Y}(z \mid y) := \frac{\rho_{Y,Z}(y, z)}{\rho_Y(y)} = \frac{\rho(W_1 y + W_2 z)}{\rho_Y(y)}.$

Note that $\rho_{Z|Y}(z \mid y) = 0$ for $z \in \mathbb{R}^{n-k} \setminus \mathcal{Z}$ and arbitrary $y \in \mathcal{Y}$, as was shown in the previous proof. It is possible to define a regular conditional probability distribution of $Z$ given $Y = y$ for $y \in \mathcal{Y}$,

(3.20)   $\mathbb{P}_{Z|Y}(A \mid y) := \mathbb{P}(Z \in A \mid Y = y), \qquad A \in \mathcal{B}(\mathbb{R}^{n-k}).$

For details of the construction, see e. g. [12]. It can be connected to the respective conditional density by

(3.21)   $\mathbb{P}_{Z|Y}(A \mid y) = \int_A \rho_{Z|Y}(z \mid y)\, dz, \qquad A \in \mathcal{B}(\mathbb{R}^{n-k}),$

for $y \in \mathcal{Y}$. For $y \notin \mathcal{Y}$, let us define $\rho_{Z|Y}(z \mid y) := 0$ for all $z \in \mathbb{R}^{n-k}$. In the following, the random variable $Z_y$, $y \in \mathcal{Y}$, is drawn from the conditional probability distribution defined in (3.20). However, it will be necessary to also regard $y$ as random, which we denote by the random variable $Y$. That is, the measure that $Z_Y$ is drawn from is also random. An abstract framework to deal with $\mathbb{P}_{Z|Y}(\cdot \mid Y)$ in this context is the notion of a random measure (see e.g. [17]).

In order to apply Fubini's theorem in Theorems 4.3 and 4.6, which requires product measurability of the function to be integrated, we need to prove a measurability result for the map

(3.22)   $\zeta : \mathcal{Y} \times \Omega \to \mathbb{R}^{n-k},$
(3.23)   $\zeta(y, \omega) := Z_y(\omega).$

It will be used in Lemma 4.1 to get a product measurable function. Note that we regard $\mathcal{Y}$ as a topological space equipped with the usual subspace topology.

Lemma 3.5.

The map $\zeta$ is $\big(\mathcal{B}(\mathcal{Y}) \otimes \mathcal{F}\big)$-$\mathcal{B}(\mathbb{R}^{n-k})$-measurable.

Proof.

Let .

For the moment, assume that

is real-valued and change its notation to . Let

denote the cumulative distribution function of

. Note that the map is -measurable for each . Let . Indeed, for , it holds that

(3.24)

The measurability follows from the product measurability of . By the probability integral transform, we can write

(3.25)

where and is the (generalized) inverse distribution function of . Hence, it suffices to show product measurability of . It holds that

(3.26)

The last step follows from the measurability of and the fact that is -measurable.

Now, let us assume that is again -valued. Also, let denote the cumulative distribution function of , . Similarly to the one-dimensional case, the map is -measurable for each , . Again, we can write

(3.27)

where and

(3.28)

is called a copula distribution of [19] and denotes the (generalized) inverse distribution of , . Hence, it suffices to show product measurability of . This follows by noting that the map is measurable and by applying the steps from the one-dimensional case component-wise. It follows that is -measurable. ∎

Notation

It is important to clarify some notation that is used throughout the remainder. We will use three different expectations for the integration of the random variables $X$, $Y$, and $Z_y$. The respective expectations will be denoted by $\mathbb{E}_X$, $\mathbb{E}_Y$, and $\mathbb{E}_{Z_y}$.

Also, we will often use a change of variables from $x$ to $(y, z)$ during integration. For that, a useful statement that is used frequently is proved in the subsequent lemma.

Lemma 3.6.

For any real-valued function $h$ that is integrable w.r.t. $\rho$, it holds that

(3.29)   $\int_{\mathcal{X}_+} h(x)\, \rho(x)\, dx = \int_{\mathcal{Y}} \int_{\mathcal{Z}} h(W_1 y + W_2 z)\, \rho(W_1 y + W_2 z)\, dz\, dy.$
Proof.

Define $\phi(y, z) := W_1 y + W_2 z$. Note that $\phi$ is a smooth bijection from $\mathbb{R}^k \times \mathbb{R}^{n-k}$ onto $\mathbb{R}^n$. Integration by substitution for multiple variables gives

(3.30)
(3.31)
(3.32)
(3.33)
(3.34)
(3.35)
(3.36)

In (3.32), we use that $|\det W| = 1$ for the orthogonal matrix $W$. ∎
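As a quick numerical sanity check of the change of variables in Lemma 3.6 (assuming a standard normal $\rho$, for which $Y = W_1^\top X$ and $Z = W_2^\top X$ are independent standard normal vectors because $W$ is orthogonal; the function $h$ and all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, N = 4, 2, 200_000

# Any orthogonal matrix W = [W1 W2]; here obtained from a QR decomposition.
W, _ = np.linalg.qr(rng.standard_normal((n, n)))
W1, W2 = W[:, :k], W[:, k:]

def h(x):
    # Illustrative integrand, x of shape (..., n).
    return np.sum(x**2, axis=-1) + np.sin(x[..., 0])

# Left-hand side of (3.29): integrate h against rho directly.
X = rng.standard_normal((N, n))
lhs = h(X).mean()

# Right-hand side: integrate over (y, z) with density rho(W1 y + W2 z).
Y = rng.standard_normal((N, k))
Z = rng.standard_normal((N, n - k))
rhs = h(Y @ W1.T + Z @ W2.T).mean()

print(lhs, rhs)   # both are Monte Carlo estimates of the same integral (about 4.0)
```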

4. Approximating functions in the active subspace

Once the active subspace is computed, the function $f$ can be approximated by a function on a lower-dimensional domain. One way to define a suitable approximation is by a conditional expectation, i.e.

(4.1)   $g(y) := \mathbb{E}\big[f(X) \mid Y = y\big]$
(4.2)   $\quad = \mathbb{E}\big[f(W_1 Y + W_2 Z) \mid Y = y\big]$
(4.3)   $\quad = \int_{\mathbb{R}^{n-k}} f(W_1 y + W_2 z)\, \rho_{Z|Y}(z \mid y)\, dz$
(4.4)   $\quad = \int_{\mathcal{Z}} f(W_1 y + W_2 z)\, \rho_{Z|Y}(z \mid y)\, dz$

for $y \in \mathcal{Y}$. Note that the last line is justified by the fact that $\rho_{Z|Y}(z \mid y) = 0$ for $z \notin \mathcal{Z}$ and arbitrary $y \in \mathcal{Y}$. The conditional expectation is known to be the best approximation of $f$ in the active variable in the mean-square sense [2]. To get a function on the same domain as $f$, i.e. on $\mathcal{X}$, let us define

(4.5)   $f_g(x) := g(W_1^\top x).$

In practice, the weighted integral in (4.4) can be approximated by a finite Monte Carlo sum with

(4.6)   $g_N(y) := \frac{1}{N} \sum_{i=1}^{N} f(W_1 y + W_2 z_i), \qquad z_i \sim \mathbb{P}_{Z|Y}(\cdot \mid y)\ \text{i.i.d.},$

for $y \in \mathcal{Y}$. Note that $g_N(y)$ is again random for every $y$. Similar to (4.5), we define a suitable function on $\mathcal{X}$ by

(4.7)   $f_{g_N}(x) := g_N(W_1^\top x).$
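Continuing the illustrative numerical setup from Section 3 (standard normal $\rho$, for which $Z$ given $Y = y$ is standard normal on $\mathbb{R}^{n-k}$ and independent of $y$), a minimal sketch of (4.6) and (4.7) could look as follows; the function $f$ and all names are again assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

def f(x):
    # Toy function of interest, x of shape (..., n) with n >= 2 (hypothetical).
    return np.sin(x[..., 0] + 0.5 * x[..., 1]) + 0.01 * np.sum(x**2, axis=-1)

def g_N(y, W1, W2, N=10):
    # Monte Carlo approximation (4.6) of the conditional expectation g(y).
    # For a standard normal rho, z is sampled from a standard normal;
    # in general it has to be drawn from rho_{Z|Y}(. | y).
    z = rng.standard_normal((N, W2.shape[1]))
    return f(y @ W1.T + z @ W2.T).mean()

def f_gN(x, W1, W2, N=10):
    # Low-dimensional approximation (4.7) evaluated on the original domain.
    y = x @ W1                        # active variable y = W1^T x
    return g_N(y, W1, W2, N=N)
```

Evaluating f_gN twice at the same x generally gives two different values, which is exactly the randomness discussed next.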

It is important to recognize the following relationship between the expectations of $f_{g_N}$ and $g_N$. It holds that

(4.8)   $\mathbb{E}_X\big[f_{g_N}(X)\big] = \mathbb{E}_Y\big[g_N(Y)\big].$

This equation is a crucial point in this manuscript. It makes clear that both expressions $\mathbb{E}_X[f_{g_N}(X)]$ and $\mathbb{E}_Y[g_N(Y)]$ are random. More explicitly, this can be seen by the following equations. Let $\omega \in \Omega$ be fixed. Then,

(4.9)
(4.10)
(4.11)
(4.12)
(4.13)

Note that in (4.11) the integration variable belonging to the inactive samples disappears, so that the integral w.r.t. $y$ remains random. The random variables within $g_N$ are not integrated out by the integral over $y$, i.e., in the terminology of formal languages, these variables are not bound. This leads to the fact that $\mathbb{E}_Y[g_N(Y)]$ is again random.

We can now regard the two iterated expectations of the squared difference between $f_g$ and $f_{g_N}$, taken w.r.t. the Monte Carlo samples and w.r.t. $X$ in either order. We will show that both are equal and regard the first. In the proof of Theorem 4.3, we thus compute the expectation of the mean squared error between $f_g$ and $f_{g_N}$ and have to change the order of integration, i.e. apply Fubini's theorem. In order to do this properly, we need to show the measurability of the integrand in the product space $\mathcal{Y} \times \Omega$.

Lemma 4.1.

The function $(y, \omega) \mapsto f\big(W_1 y + W_2 Z_y(\omega)\big)$ is product measurable on $\mathcal{Y} \times \Omega$.

Proof.

Lemma 3.5 proves product measurability of $\zeta$, defined in (3.22). This implies the result. ∎

One important result, already proved in [8, Theorem 3.1], gives an upper bound on the mean squared error of $f_g$ in terms of the eigenvalues corresponding to the inactive subspace. For the sake of completeness, it is stated and also proved in our notational context.

Theorem 4.2.

It holds that

(4.14)   $\mathbb{E}_X\big[(f(X) - f_g(X))^2\big] \le C_P\, (\lambda_{k+1} + \dots + \lambda_n),$

where $C_P$ is the Poincaré constant depending on $\rho$ and $\mathcal{X}_+$.

Proof.

Note that

(4.15)   $\int_{\mathcal{Z}} \big(f(W_1 y + W_2 z) - g(y)\big)\, \rho_{Z|Y}(z \mid y)\, dz = 0$

for every $y \in \mathcal{Y}$. It follows that

(4.16)
(4.17)
(4.18)
(4.19)

In (4.16), we use Lemma 3.6. (4.17) uses a Poincaré inequality which is applicable due to (4.15) [1]. The last line follows from [8, Lemma 2.2]. ∎

That means that if all eigenvalues corresponding to the inactive subspace are small or even zero, then the mean squared error of the conditional expectation is also small or zero. Conversely, if the inactive subspace is spanned by too many eigenvectors with rather large corresponding eigenvalues, then the approximation might be poor.

Lemma 4.1 not only proves that $f_{g_N}(x)$ is indeed a random variable, i.e. a measurable map from $\Omega$ to $\mathbb{R}$; it also prepares us to prove a crucial result. The main difference to [8, Theorem 3.2] is that the result is an upper bound on the expectation of the mean squared error of $f_{g_N}$ with respect to $f_g$.

Theorem 4.3.

It holds that

(4.20)   $\mathbb{E}_{Z_y}\Big[\mathbb{E}_X\big[(f_g(X) - f_{g_N}(X))^2\big]\Big] \le \frac{C_P}{N}\, (\lambda_{k+1} + \dots + \lambda_n).$
Proof.

For fixed $\omega \in \Omega$,

(4.21)
(4.22)

The last line is an application of Lemma 3.6. Note that for fixed $y \in \mathcal{Y}$ (and variable $\omega$) it holds that

(4.23)
(4.24)
(4.25)
(4.26)

In (4.24), we used that the samples $z_i$, $i = 1, \dots, N$, are independent and identically distributed. Taking expectations w.r.t. $Z_y$ for the expression in (4.22) gives

(4.27)
(4.28)
(4.29)
(4.30)
(4.31)
(4.32)
(4.33)
(4.34)

(4.27) applies Fubini's theorem, which is applicable since the integrand is product measurable due to Lemma 4.1. The result of (4.26) justifies (4.28). In (4.30), we use again that the $z_i$, $i = 1, \dots, N$, are independent and identically distributed for fixed $y$. Lemma 3.6 gives (4.32). The last step in (4.34) follows as in the proof of Theorem 4.2. ∎

The number of samples $N$ in the approximating sum shows up in the bound's denominator, which is common for Monte Carlo type approximations (the root mean squared error is $\mathcal{O}(N^{-1/2})$).
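A quick numerical illustration of this rate on a generic Monte Carlo mean (not the paper's example; all names are assumptions): quadrupling $N$ roughly halves the root mean squared error.

```python
import numpy as np

rng = np.random.default_rng(4)

def rmse_of_mc_mean(N, n_repeats=2000):
    # Root mean squared error of the Monte Carlo estimate of E[Z^2] = 1
    # for Z ~ N(0, 1), estimated over many independent repetitions.
    samples = rng.standard_normal((n_repeats, N))
    estimates = np.mean(samples**2, axis=1)
    return np.sqrt(np.mean((estimates - 1.0) ** 2))

for N in (10, 40, 160, 640):
    print(N, round(rmse_of_mc_mean(N), 4))   # error roughly halves as N quadruples
```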

Stability

In practice, the matrix $C$ in (3.2) and its eigendecomposition giving the active subspace are only approximately available. A well-investigated way to get an approximation is by using a finite Monte Carlo sum [5]. Independently of the concrete type of approximation, we only have a perturbed representation of the active and inactive subspace available, which we denote here by

(4.35)   $\hat{W} = [\hat{W}_1\ \hat{W}_2].$

This subsection is dedicated to the MSE analysis under perturbations. We recall the behavior of active subspaces with respect to perturbations from [8] and extend theorems where necessary. In the following, we will denote perturbed terms with a hat ($\hat{\ }$). For the sake of clarity, let us recall the definitions of the approximating function and its Monte Carlo version in the context of perturbed quantities. Analogously to the context without perturbations, the domain of $\hat{g}$ is denoted by $\hat{\mathcal{Y}}$. Define

(4.36)   $\hat{g}(y) := \int_{\mathbb{R}^{n-k}} f(\hat{W}_1 y + \hat{W}_2 z)\, \rho_{\hat{Z}|\hat{Y}}(z \mid y)\, dz$

and

(4.37)   $\hat{g}_N(y) := \frac{1}{N} \sum_{i=1}^{N} f(\hat{W}_1 y + \hat{W}_2 z_i)$

for $y \in \hat{\mathcal{Y}}$ and $z_i \sim \mathbb{P}_{\hat{Z}|\hat{Y}}(\cdot \mid y)$ i.i.d. Note that it is actually enough to integrate over $\hat{\mathcal{Z}}$ in (4.36). Let $\|\cdot\|$ denote the Euclidean norm throughout the rest of the manuscript and assume that

(4.38)   $\|W_1 W_1^\top - \hat{W}_1 \hat{W}_1^\top\| \le \varepsilon$

for some $\varepsilon > 0$.
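A small sketch of how such a perturbation could be quantified numerically; the projection-based subspace distance used below is a common choice and an assumption insofar as the precise form of (4.38) is concerned, and all names are illustrative.

```python
import numpy as np

def subspace_distance(W1, W1_hat):
    # Spectral norm of the difference of the orthogonal projectors onto
    # the column spaces of W1 and W1_hat.
    return np.linalg.norm(W1 @ W1.T - W1_hat @ W1_hat.T, ord=2)

rng = np.random.default_rng(5)
n, k = 5, 2
W, _ = np.linalg.qr(rng.standard_normal((n, n)))                  # "exact" W
W_hat, _ = np.linalg.qr(W + 1e-2 * rng.standard_normal((n, n)))   # perturbed W

eps = subspace_distance(W[:, :k], W_hat[:, :k])
print(eps)   # small for a small perturbation
```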

For subsequent statements, we need a small helping lemma. It is already stated in [8, Lemma 3.4]; however, our proof is slightly different (a different term appears in (4.41)).

Lemma 4.4.

Under the assumption in (4.38), it holds that

(4.39)
Proof.

By orthogonality of the columns of $W$ and $\hat{W}$, it holds that

(4.40)

Additionally,