# Mean Dimension of Ridge Functions

We consider the mean dimension of some ridge functions of spherical Gaussian random vectors of dimension d. If the ridge function is Lipschitz continuous, then the mean dimension remains bounded as d→∞. If instead, the ridge function is discontinuous, then the mean dimension depends on a measure of the ridge function's sparsity, and absent sparsity the mean dimension can grow proportionally to √(d). Preintegrating a ridge function yields a new, potentially much smoother ridge function. We include an example where, if one of the ridge coefficients is bounded away from zero as d→∞, then preintegration can reduce the mean dimension from O(√(d)) to O(1).

## 1 Introduction

Numerical integration of high dimensional functions is a very common and challenging problem. Under the right conditions, quasi-Monte Carlo (QMC) sampling and randomized QMC (RQMC) sampling can be very effective. A good result can be expected from (R)QMC if the following conditions, described in more detail below, all hold:

1. the (R)QMC points have highly uniform low dimensional projections,

2. the integrand is nearly a sum of low dimensional parts, and

3. those parts are regular enough to benefit from (R)QMC.

The first condition is a usual property of (R)QMC points. In a series of papers, Griebel, Kuo and Sloan [9, 10, 11] address the third condition by showing that the low dimensional parts of the integrand $f$ (defined there via the ANOVA decomposition) are at least as smooth as the original integrand and are often much smoother. They include conditions under which lower order ANOVA terms of functions with discontinuities (jumps) or discontinuities in their first derivative (kinks) are smooth. An alternative form of regularity, instead of smoothness, is for the low dimensional parts to have QMC-friendly discontinuities as described in [35]. In this article we explore sufficient conditions for the remaining second condition to hold. We use the mean dimension [26] to quantify the extent to which low dimensional components dominate the integrand.

This article is focused on ridge functions defined over $\mathbb{R}^d$. Ridge functions take the form $f(x) = g(\Theta^\mathsf{T}x)$ for an orthonormal projection matrix $\Theta \in \mathbb{R}^{d\times r}$ where $g:\mathbb{R}^r\to\mathbb{R}$, with $r=1$ being an important special case. Ridge functions are useful here because we can find their integrals via low dimensional integration or even closed form expressions. That lets us investigate the impact of some qualitative features of $g$ on the integration problem. Additionally, many functions in science and engineering are well approximated by ridge functions with small values of $r$ [3], so good performance on ridge functions could extend well to many functions in the natural sciences. As one more example, the value of a European option under geometric Brownian motion is a ridge function of the Brownian increments and this is what allows the formula of Black and Scholes to be applied [8].

Our main finding is that there is an enormous difference between functions with jumps and functions with kinks. This is perhaps surprising. Based on criteria for finite variation in the sense of Hardy and Krause, one might have thought that a jump in $d$ dimensions would be similar to a kink in $d+1$. Instead, we find that for Lipschitz continuous $g$, the mean dimension of $f$ is bounded as $d\to\infty$ and that bound can be quite low. For $g$ with step discontinuities, we find that the mean dimension can easily grow proportionally to $\sqrt{d}$. These effects were seen empirically in [28] where ridge functions were used to illustrate a scrambled Halton algorithm. Preintegration [12] turns a ridge function over $[0,1]^d$ with a jump into one with a kink, and ridge functions of Gaussian variables containing a jump can even become infinitely differentiable. The resulting Lipschitz constant need not be small. For a linear step function we find that preintegration can either increase mean dimension or reduce it from $O(\sqrt{d})$ to $O(1)$.

An outline of this paper is as follows. Section 2 provides notation and background concepts related to quasi-Monte Carlo and mean dimension. Section 3 introduces ridge functions and establishes upper bounds on their mean dimension in terms of Hölder and Lipschitz conditions and some spatially varying relaxations of those conditions. Corollary 1 there shows that a ridge function with Lipschitz constant $C$ and variance $\sigma^2$ cannot have a mean dimension larger than $rC^2/\sigma^2$ in any dimension $d\ge r$ for any projection $\Theta$. Section 4 considers ridge functions with jumps. They can have mean dimension growing proportionally to $\sqrt{d}$ and sparsity of $\theta$ makes a big difference. Section 5 considers the effects of preintegration on ridge functions. The preintegrated functions are also ridge functions with a Hölder constant no worse than the original function had. Preintegration can either raise or lower mean dimension. We give an example step function where preintegration leaves the mean dimension asymptotically proportional to $\sqrt{d}$ with an increased lead constant. In another example, preintegration can change the mean dimension from growing proportionally to $\sqrt{d}$ to having a finite bound as $d\to\infty$. Section 6 computes some mean dimensions using Sobol' indices. Section 7 has conclusions, and a discussion of how generally these results may apply. Section 8 is an appendix containing the longer proofs.

## 2 Background and notation

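We use $\varphi$ for the standard Gaussian probability density function and $\Phi$ for the corresponding cumulative distribution function. We consider integration with respect to a $d$-dimensional spherical Gaussian measure,

$$\mu \equiv \int_{\mathbb{R}^d} f(x)\,(2\pi)^{-d/2}e^{-\|x\|^2/2}\,\mathrm{d}x = \int_{(0,1)^d} f(\Phi^{-1}(x))\,\mathrm{d}x,$$

where the quantile function $\Phi^{-1}$ is applied componentwise. The (R)QMC approximations to $\mu$ take the form $\hat\mu = \frac1n\sum_{i=1}^n \tilde f(x_i)$ for points $x_i\in(0,1)^d$ and $\tilde f = f\circ\Phi^{-1}$. The spherical Gaussian distribution of $x$ is denoted $\mathcal{N}(0,I_d)$ or simply $\mathcal{N}(0,I)$ if $d$ is understood from context.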
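The change of variable above can be sketched in a few lines. This is only an illustration, assuming plain pseudo-random uniform points as a stand-in for (R)QMC points; the function name `gaussian_mean_via_uniform` and the example integrand are hypothetical and not from the paper.

```python
import math
import random
from statistics import NormalDist


def gaussian_mean_via_uniform(f, d, n, seed=0):
    """Estimate mu = E f(x) for x ~ N(0, I_d) by averaging f(Phi^{-1}(u))
    over n points u in (0,1)^d.  Pseudo-random points are used here;
    (R)QMC points would be substituted for u in practice."""
    rng = random.Random(seed)
    inv_cdf = NormalDist().inv_cdf  # the quantile function Phi^{-1}
    eps = 1e-12                     # keep arguments strictly inside (0, 1)
    total = 0.0
    for _ in range(n):
        u = [min(max(rng.random(), eps), 1.0 - eps) for _ in range(d)]
        x = [inv_cdf(ui) for ui in u]  # applied componentwise
        total += f(x)
    return total / n


# Hypothetical example: a kink ridge function max(theta^T x, 0) with a
# unit vector theta, whose exact mean is 1 / sqrt(2 * pi).
theta = [0.5, 0.5, 0.5, 0.5]
estimate = gaussian_mean_via_uniform(
    lambda x: max(sum(t * xi for t, xi in zip(theta, x)), 0.0), d=4, n=20000
)
```

The estimate should be close to $1/\sqrt{2\pi}\approx 0.399$, up to Monte Carlo error.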

### QMC and Koksma-Hlawka

For QMC, the Koksma-Hlawka inequality [14]

$$|\hat\mu-\mu| \le D_n^* \times \|\tilde f\|_{\mathrm{HK}} \tag{1}$$

bounds the error in terms of the star discrepancy $D_n^*$ of the points $x_1,\dots,x_n$ used and the total variation $\|\tilde f\|_{\mathrm{HK}}$ of $\tilde f$ in the sense of Hardy and Krause. Constructions with $D_n^* = O(\log(n)^{d-1}/n)$ are known [21, 5, 31], proving that QMC can be asymptotically better than Monte Carlo (MC) sampling which has a root mean squared error of $O(n^{-1/2})$. That argument requires $\|\tilde f\|_{\mathrm{HK}}<\infty$ which requires at a minimum that $\tilde f$ be a bounded function on $(0,1)^d$. Scrambled net RQMC has a root mean squared error that is $o(n^{-1/2})$ for any $\tilde f\in L^2(0,1)^d$ without requiring bounded variation [24].

### Kinks and jumps

A kink function is continuous with a discontinuity in its first derivative along some manifold. Griewank et al. [12] consider kink functions of the form $\max(\phi(x),0)$ where $\phi$ is smooth. The kink takes place within the set $\{x \mid \phi(x)=0\}$. A jump function has a step discontinuity along some manifold. Griewank et al. [12] consider jump functions of the form $\theta(x)1\{\phi(x)\ge 0\}$ where $\theta$ is also smooth. There can be jump discontinuities within the set $\{x\mid\phi(x)=0\}$. When $\theta=\phi$, the result is a kink function. In the rest of this paper, $\theta$ denotes a unit vector.

### ANOVA and mean dimension

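The ANOVA decomposition applies to any measurable and square integrable function of independent random inputs. In our case, those inputs will be either $\mathcal{N}(0,1)$ or $\mathbb{U}(0,1)$.

We use $1{:}d$ for $\{1,2,\dots,d\}$, and for $u\subseteq 1{:}d$, we write $|u|$ for the cardinality of $u$ and $-u$ for the complement $1{:}d\setminus u$. The point $x\in\mathbb{R}^d$ has components $x_j$ for $j\in 1{:}d$. The point $x_u\in\mathbb{R}^{|u|}$ has the components $x_j$ for $j\in u$. We abbreviate $x_{-\{j\}}$ to $x_{-j}$. For $u\subseteq 1{:}d$ and points $x,z\in\mathbb{R}^d$, the hybrid point $y=x_u{:}z_{-u}$ has $y_j=x_j$ for $j\in u$ and $y_j=z_j$ otherwise.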

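The ANOVA decomposition [15, 33, 6] of $f$ is $f(x)=\sum_{u\subseteq 1:d}f_u(x)$ where $f_u$ depends on $x$ only through $x_u$. For these functions, the line integral of $f_u$ over $x_j$ vanishes whenever $j\in u$, and from that it follows that $\mathbb{E}(f_u(x)f_v(x))=0$ when $u\ne v$ and then

$$\sigma^2=\sigma^2(f)=\mathbb{E}\bigl((f(x)-\mu)^2\bigr)=\sum_{u:|u|>0}\sigma_u^2$$

for variance components $\sigma_u^2=\mathrm{Var}(f_u(x))$ for $u\ne\emptyset$ and $\sigma^2_\emptyset=0$.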

The mean dimension of $f$ (in the superposition sense) is

$$\nu(f)=\frac{\sum_{u\subseteq 1:d}|u|\,\sigma_u^2}{\sum_{u\subseteq 1:d}\sigma_u^2}.$$

If we choose $u$ with probability proportional to $\sigma_u^2$ then $\nu(f)$ is the average of $|u|$. Effective dimension is commonly defined via a high quantile of that distribution such as the 99th percentile [2]. Such an effective dimension could well be larger than the mean dimension but it is more difficult to ascertain.
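As a small worked illustration of this definition, the sketch below computes the mean dimension from a table of variance components. The helper `mean_dimension` and the toy function $f(x)=x_1+x_2+x_1x_2$ with independent $\mathcal{N}(0,1)$ inputs are illustrative assumptions, not taken from the paper. For that toy function the ANOVA terms are $x_1$, $x_2$ and $x_1x_2$, each with variance one.

```python
def mean_dimension(variance_components):
    """Mean dimension in the superposition sense.

    variance_components maps each nonempty subset u (a frozenset of
    variable indices) to its variance component sigma_u^2.
    """
    total = sum(variance_components.values())
    if total <= 0:
        raise ValueError("mean dimension is undefined when the variance is zero")
    weighted = sum(len(u) * s2 for u, s2 in variance_components.items())
    return weighted / total


# Toy example: f(x) = x1 + x2 + x1*x2 with independent N(0,1) inputs.
components = {
    frozenset({1}): 1.0,
    frozenset({2}): 1.0,
    frozenset({1, 2}): 1.0,
}
nu = mean_dimension(components)  # (1 + 1 + 2) / 3
```

An additive function has every component on a singleton, so its mean dimension is exactly one.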

The mean dimension and a few other quantities that we use are not well defined when $\sigma^2=0$. In such cases, $f$ is constant almost everywhere and we will not ordinarily be interested in integrating it. We assume below, without necessarily stating it every time, that $\sigma^2>0$.

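Sobol' indices are used to quantify the importance of a variable or more generally a subset of them. We will use the (unnormalized) Sobol' total index for variable $j$,

$$\overline{\tau}_j^2=\sum_{u:j\in u}\sigma_u^2.$$

More generally, for $u\subseteq 1{:}d$, we set $\overline{\tau}_u^2=\sum_{v:v\cap u\ne\emptyset}\sigma_v^2$. An easy identity from [19] gives $\sum_{j=1}^d\overline{\tau}_j^2=\sum_{u\subseteq 1:d}|u|\,\sigma_u^2$. Sobol' [34] shows that

$$\overline{\tau}_j^2=\frac12\,\mathbb{E}\bigl((f(x_{-j}{:}x_j)-f(x_{-j}{:}z_j))^2\bigr)$$

when $x$ and $z$ are independent random vectors with the same product distribution on $\mathbb{R}^d$. As a result we find that

$$\nu(f)=\frac{1}{2\sigma^2}\,\mathbb{E}\Bigl(\sum_{j=1}^d\bigl(f(x)-f(x_{-j}{:}z_j)\bigr)^2\Bigr). \tag{2}$$

The expectation in the numerator of $\nu(f)$ is a $2d$-dimensional integral over independent $x$ and $z$. It is commonly evaluated by (R)QMC.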
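Identity (2) suggests a direct sampling estimate of the mean dimension. The following is a minimal sketch under stated assumptions: it uses plain Monte Carlo with Gaussian inputs rather than (R)QMC, it estimates $\sigma^2$ from the same sample, and the name `mean_dimension_mc` is hypothetical.

```python
import random


def mean_dimension_mc(f, d, n, seed=0):
    """Monte Carlo estimate of nu(f) from identity (2) for x ~ N(0, I_d).

    For each sample, every coordinate x_j is replaced in turn by an
    independent copy z_j and the squared change in f is accumulated.
    """
    rng = random.Random(seed)
    sum_f = sum_f2 = sum_diff2 = 0.0
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        fx = f(x)
        sum_f += fx
        sum_f2 += fx * fx
        for j in range(d):
            saved = x[j]
            x[j] = z[j]               # the hybrid point x_{-j}:z_j
            sum_diff2 += (fx - f(x)) ** 2
            x[j] = saved              # restore x before the next index
    variance = sum_f2 / n - (sum_f / n) ** 2
    return 0.5 * (sum_diff2 / n) / variance


# An additive function has mean dimension 1; the product x1*x2 has 2.
nu_additive = mean_dimension_mc(lambda x: sum(x), d=3, n=20000)
nu_product = mean_dimension_mc(lambda x: x[0] * x[1], d=2, n=20000)
```

Each sample costs $d+1$ evaluations of $f$, which is the usual price of estimating all $d$ total indices at once.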

### Low effective dimension

Applying (1) componentwise yields

$$|\hat\mu-\mu|\le\sum_{u}D_n^*(x_{1,u},\dots,x_{n,u})\times\|\tilde f_u\|_{\mathrm{HK}}.$$

The coordinate discrepancies $D_n^*(x_{1,u},\dots,x_{n,u})$ are known to decay rapidly when $|u|$ is small [5]. If also $\|\tilde f_u\|_{\mathrm{HK}}$ is negligible when $|u|$ is not small then $f$ can be considered to have low effective dimension and an apparent $O(n^{-1+\epsilon})$ error for QMC can be observed. Some other ways to decompose a function into a sum of $2^d$ functions, one for each subset of $1{:}d$, are described in [17]. For a survey of effective dimension methods in information based complexity, see [36].

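To avoid the dependence on finite variation and to control the logarithmic terms we will use a version of RQMC known as scrambled nets. Under scrambled net sampling [23] each $x_i\sim\mathbb{U}(0,1)^d$, while collectively $x_1,\dots,x_n$ remain digital nets with probability one, retaining their low discrepancy. The mean squared error of scrambled net sampling decomposes as

$$\mathbb{E}\bigl((\hat\mu-\mu)^2\bigr)=\sum_{|u|>0}\mathbb{E}\Bigl(\Bigl(\frac1n\sum_{i=1}^n\tilde f_u(x_i)\Bigr)^2\Bigr)=\sum_{|u|>0}\mathrm{Var}\Bigl(\frac1n\sum_{i=1}^n\tilde f_u(x_i)\Bigr) \tag{3}$$

where expectation refers to randomness in the $x_i$ [24]. If $\tilde f\in L^2(0,1)^d$, then

$$\mathrm{Var}\Bigl(\frac1n\sum_{i=1}^n\tilde f_u(x_i)\Bigr)=o\Bigl(\frac1n\Bigr)\quad\text{and}\quad\mathrm{Var}\Bigl(\frac1n\sum_{i=1}^n\tilde f_u(x_i)\Bigr)\le\frac{\Gamma\sigma_u^2}{n}, \tag{4}$$

for some gain coefficient $\Gamma<\infty$ [25]. If also $\tilde f_u$ is smooth enough, then

$$\mathrm{Var}\Bigl(\frac1n\sum_{i=1}^n\tilde f_u(x_i)\Bigr)=O\Bigl(\frac{\log(n)^{|u|-1}}{n^3}\Bigr). \tag{5}$$

If large $|u|$ have negligible $\sigma_u^2$ and small $|u|$ are smooth enough for (5) to hold then RQMC may attain nearly $O(n^{-3/2})$ root mean squared error. The logarithmic factors in (5) cannot make the variance much larger than the MC rate because the bound in (4) applies for finite $n$.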

The ANOVA decomposition of $\tilde f$ on $(0,1)^d$ is essentially the same as that of $f$ on $\mathbb{R}^d$. Specifically, $\tilde f_u(x)=f_u(\Phi^{-1}(x))$.

Discontinuities can lead to severe deterioration in the asymptotic behavior of RQMC. He and Wang [13] obtain MSE rates of $O(n^{-1-1/(2d-1)+\epsilon})$ for jump discontinuities of the form $f(x)=g(x)1\{x\in\Omega\}$ where the set $\Omega$ has a boundary with $(d-1)$-dimensional Minkowski content. When $\Omega$ is the Cartesian product of a hyper-rectangle and a $d'$-dimensional set with a boundary of $(d'-1)$-dimensional Minkowski content, then $d'$ takes the place of $d$ in their rate. The smaller $d'$ is, the more 'QMC-friendly' the discontinuity is.

## 3 Ridge functions

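We let $x\sim\mathcal{N}(0,I_d)$, choose an orthonormal matrix $\Theta\in\mathbb{R}^{d\times r}$, and define the ridge function

$$f(x)=g(\Theta^\mathsf{T}x), \tag{6}$$

where $g:\mathbb{R}^r\to\mathbb{R}$. We must always have $r\le d$ because otherwise $\Theta^\mathsf{T}\Theta=I_r$ is impossible to attain. Our main interest is in $r\ll d$. Ridge functions can also be defined for $x$ in a hypercube but then the domain of $g$ becomes a complicated polyhedron called a zonotope [3].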

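When $r=1$, we write

$$f(x)=g(\theta^\mathsf{T}x), \tag{7}$$

where $\theta\in\mathbb{R}^d$ is a unit vector. Then, because $\theta^\mathsf{T}x\sim\mathcal{N}(0,1)$ we find that

$$\mu=\int_{-\infty}^{\infty}g(z)\varphi(z)\,\mathrm{d}z\quad\text{and}\quad\sigma^2=\int_{-\infty}^{\infty}(g(z)-\mu)^2\varphi(z)\,\mathrm{d}z.$$

We can get the answer $\mu$ and the corresponding RMSE $\sigma/\sqrt{n}$ under MC by one dimensional integration. For some $g$, one or both of these quantities are available in closed form. Note that $\mu$ and $\sigma^2$ above are both independent of $\theta$ and even of $d$. For more general $r$ we find that $\mu$ and $\sigma^2$ are $r$-dimensional integrals that do not depend on $\Theta$ or on $d$. Apart from a few remarks, we focus mostly on the case with $r=1$.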

By symmetry we can take all $\theta_j\ge 0$. It is reasonable to expect that sparse vectors $\theta$ will make the problem of intrinsically lower dimension. Sparsity is typically defined via small values of $\|\theta\|_0$. It is common to use instead a proxy measure $\|\theta\|_1$, with smaller values representing greater sparsity, relaxing an $L_0$ quantity to an $L_1$ quantity. By this measure, the 'least sparse' unit vectors are of the form $(\pm1,\dots,\pm1)/\sqrt{d}$ while the sparsest are of the form $\pm e_j$ where $e_j$ is the $j$'th standard Euclidean basis vector.
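To make the two extremes concrete, the short sketch below evaluates the $L_1$ proxy for a sparsest and a least sparse unit vector. The dimension $d=16$ and the helper name `l1_norm` are arbitrary choices for illustration.

```python
import math


def l1_norm(theta):
    """The sparsity proxy: sum of absolute values of the components."""
    return sum(abs(t) for t in theta)


d = 16
sparsest = [1.0] + [0.0] * (d - 1)      # a standard basis vector
least_sparse = [1.0 / math.sqrt(d)] * d  # all components equal in size

l1_sparsest = l1_norm(sparsest)          # equals 1
l1_least_sparse = l1_norm(least_sparse)  # equals sqrt(d)
```

For unit vectors the proxy therefore ranges from $1$ to $\sqrt{d}$.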

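We will need some fractional absolute moments of the $\mathcal{N}(0,1)$ distribution. For $\eta>-1$ define

$$M_\eta=\int_{-\infty}^{\infty}|y|^\eta\varphi(y)\,\mathrm{d}y=\frac{2^{\eta/2}}{\sqrt{\pi}}\,\Gamma\Bigl(\frac{\eta+1}{2}\Bigr). \tag{8}$$

This is from formula (18) in an unpublished report of Winkelbauer [37]. It can be verified directly by change of variable to $x=y^2/2$.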
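Formula (8) is easy to check numerically. The sketch below is an illustration only: the function names are hypothetical and the midpoint rule on a truncated range is just one convenient way to approximate the integral.

```python
import math


def abs_moment(eta):
    """Closed form (8) for E|y|^eta with y ~ N(0,1), valid for eta > -1."""
    if eta <= -1:
        raise ValueError("the moment is finite only for eta > -1")
    return 2.0 ** (eta / 2.0) / math.sqrt(math.pi) * math.gamma((eta + 1.0) / 2.0)


def abs_moment_numeric(eta, upper=12.0, m=200000):
    """Midpoint-rule approximation of 2 * int_0^upper y^eta * phi(y) dy."""
    h = upper / m
    total = 0.0
    for i in range(m):
        y = (i + 0.5) * h
        total += y ** eta * math.exp(-0.5 * y * y)
    return 2.0 * total * h / math.sqrt(2.0 * math.pi)


m1 = abs_moment(1.0)  # sqrt(2 / pi)
m2 = abs_moment(2.0)  # the variance of N(0,1), namely 1
```

The special cases $M_1=\sqrt{2/\pi}$, $M_2=1$ and $M_4=3$ recover familiar Gaussian moments.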

###### Theorem 1

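Let $f$ be a ridge function described by (6) for $x\sim\mathcal{N}(0,I_d)$, where $g$ satisfies a Hölder condition $|g(y)-g(y')|\le C\|y-y'\|^\alpha$ for $y,y'\in\mathbb{R}^r$, $C<\infty$, and $0<\alpha\le1$. Then the mean dimension of $f$ satisfies

$$\nu(f)\le\Bigl(\frac{C}{\sigma}\Bigr)^2 2^{\alpha-1}M_{2\alpha}\times\sum_{j=1}^d\Bigl(\sum_{k=1}^r\Theta_{jk}^2\Bigr)^{\alpha}, \tag{9}$$

where $\sigma^2$ does not depend on $d$.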

###### Proof

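Let $x$ and $z$ be independent $\mathcal{N}(0,I_d)$ random vectors. For $j\in1{:}d$, let $\Theta_{j\cdot}$ be the $j$'th row of $\Theta$ as a row vector. Then $\Theta^\mathsf{T}(x_{-j}{:}z_j)=\Theta^\mathsf{T}x+\Theta_{j\cdot}^\mathsf{T}(z_j-x_j)$. Next

$$\overline{\tau}_j^2=\frac12\,\mathbb{E}\Bigl(\bigl(g(\Theta^\mathsf{T}x)-g(\Theta^\mathsf{T}x+\Theta_{j\cdot}^\mathsf{T}(z_j-x_j))\bigr)^2\Bigr)\le\frac{C^2}{2}\|\Theta_{j\cdot}\|^{2\alpha}\,\mathbb{E}\bigl(|z_j-x_j|^{2\alpha}\bigr)=C^2\,2^{\alpha-1}M_{2\alpha}\|\Theta_{j\cdot}\|^{2\alpha} \tag{10}$$

because $z_j-x_j\sim\mathcal{N}(0,2)$. Summing (10) over $j$ and dividing by $\sigma^2$ gives (9). Finally, $\sigma^2$ depends only on the distribution of $\Theta^\mathsf{T}x\sim\mathcal{N}(0,I_r)$, which is independent of $d$.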

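If $\alpha\ge1/2$, then we recognize $\sum_{j=1}^d\bigl(\sum_{k=1}^r\Theta_{jk}^2\bigr)^\alpha$ as $\|\Theta^\mathsf{T}\|_{2,2\alpha}^{2\alpha}$ where $\|\cdot\|_{p,q}$ is a matrix norm [22]. For $\alpha<1/2$, we get $q=2\alpha<1$ and this is then not a norm. If $Q\in\mathbb{R}^{r\times r}$ is an orthogonal matrix, then $\|\Theta_{j\cdot}Q\|=\|\Theta_{j\cdot}\|$ for every row $j$. Now $g(\Theta^\mathsf{T}x)=g_Q((\Theta Q)^\mathsf{T}x)$ for $g_Q(y)=g(Qy)$, so we can replace $\Theta$ in (9) by $\Theta Q$. For $\alpha=1$, we get $\|\Theta\|_F^2=r$, the squared Frobenius norm of $\Theta$, and the bound in (9) simplifies to reveal a proportional dependence on $r$.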

###### Corollary 1

Let $f$ be a ridge function described by (6) where $g$ is Lipschitz continuous with constant $C$ and with $x\sim\mathcal{N}(0,I_d)$, for $d\ge r$. Then

$$\nu(f)\le r\times\Bigl(\frac{C}{\sigma}\Bigr)^2$$

where $\sigma^2$ does not depend on $d$.

###### Proof

Take $\alpha=1$ in Theorem 1, using $M_2=1$ and $\sum_{j=1}^d\sum_{k=1}^r\Theta_{jk}^2=r$.

The bound in Theorem 1 and its corollaries is conservative. It allows for the possibility that $|g(y)-g(y')|=C\|y-y'\|^\alpha$ for all pairs of points $y,y'$. If that would hold for $\alpha=1$ and $r=1$, then it would imply that $g$ is linear. To see why, note that any triangle with points $(y_1,g(y_1))$, $(y_2,g(y_2))$, and $(y_3,g(y_3))$, for distinct $y_1<y_2<y_3$ would have one angle equal to $\pi$. A linear function would then have mean dimension $1$, the smallest possible value when $\sigma^2>0$. A less conservative bound is in Section 3.1 below. The next result shows that the bound has a dimensional effect when $\alpha<1$.

###### Corollary 2

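Let $f$ be a ridge function given by (7) with $x\sim\mathcal{N}(0,I_d)$, where $g$ is Hölder continuous with constant $C$ and exponent $\alpha\in(0,1]$ and $\theta$ is a unit vector for $d\ge1$. Then

$$\nu(f)\le\Bigl(\frac{C}{\sigma}\Bigr)^2 2^{\alpha-1}M_{2\alpha}\,d^{1-\alpha}.$$

###### Proof

From Theorem 1, $\nu(f)\le(C/\sigma)^2 2^{\alpha-1}M_{2\alpha}\sum_{j=1}^d|\theta_j|^{2\alpha}$. The largest value this can take arises for $\theta_j=\pm1/\sqrt{d}$. Then $|\theta_j|^{2\alpha}=d^{-\alpha}$ and so $\sum_{j=1}^d|\theta_j|^{2\alpha}=d^{1-\alpha}$ as required.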
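The bound of Theorem 1 in the case $r=1$ can be evaluated directly. The sketch below is illustrative: `holder_bound` and its arguments are hypothetical names, and the values of $C$, $\sigma$ and $\alpha$ in the example are arbitrary.

```python
import math


def abs_moment(eta):
    """E|y|^eta for y ~ N(0,1), from formula (8)."""
    return 2.0 ** (eta / 2.0) / math.sqrt(math.pi) * math.gamma((eta + 1.0) / 2.0)


def holder_bound(theta, C, sigma, alpha):
    """Upper bound (9) on the mean dimension for r = 1:
    (C/sigma)^2 * 2^(alpha-1) * M_{2 alpha} * sum_j |theta_j|^(2 alpha)."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    row_sum = sum(abs(t) ** (2.0 * alpha) for t in theta)
    return (C / sigma) ** 2 * 2.0 ** (alpha - 1.0) * abs_moment(2.0 * alpha) * row_sum


d = 100
least_sparse = [1.0 / math.sqrt(d)] * d

# With alpha = 1/2 the least sparse vector attains the d^(1 - alpha) factor.
bound_half = holder_bound(least_sparse, C=1.0, sigma=1.0, alpha=0.5)
# With alpha = 1 the bound is (C/sigma)^2 for every unit vector.
bound_lipschitz = holder_bound([0.6, 0.8], C=2.0, sigma=0.5, alpha=1.0)
```

A sparse $\theta$ makes $\sum_j|\theta_j|^{2\alpha}$ smaller, so the same function gives a tighter bound than Corollary 2 in that case.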

### 3.1 Spatially varying Hölder and Lipschitz constants

A Lipschitz or Hölder inequality provides a bound on $|g(y)-g(y')|$ that holds for all $y,y'$. The numerator in $\nu(f)$ is a weighted average of $(f(x)-f(x_{-j}{:}z_j))^2$ over points $x,z$ and indices $j$, and for a ridge function that reduces to a weighted average of $(g(y)-g(y'))^2$. Applying a Lipschitz or Hölder inequality bounds an $L_2$ quantity by the square of an $L_\infty$ quantity.

We say that $g$ satisfies a spatially varying Hölder condition if for some $0<\alpha\le1$ there is a function $C(y)$ such that

$$|g(y)-g(y')|\le C(y)\|y-y'\|^\alpha \tag{11}$$

holds for all $y$ and $y'$. If $\alpha=1$, then $g$ satisfies a spatially varying Lipschitz condition. The well known locally Lipschitz condition is different. It requires that every $y$ be within a neighborhood $U_y$ on which $g$ has a finite Lipschitz constant $C_y$. Equation (11) is stronger because it also bounds $|g(y)-g(y')|$ for $y'\notin U_y$.

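We will use a Hölder inequality via $p>1$ and $q>1$ satisfying $1/p+1/q=1$ to slightly modify the proof in Theorem 1. Under (11)

$$\begin{aligned}
\sigma^2\nu(f)&\le\frac12\sum_{j=1}^d\mathbb{E}\bigl(C(\Theta^\mathsf{T}x)^2\,\|\Theta_{j\cdot}^\mathsf{T}(z_j-x_j)\|^{2\alpha}\bigr)\\
&\le\frac12\,\mathbb{E}\bigl(|C(y)|^{2p}\bigr)^{1/p}\sum_{j=1}^d\mathbb{E}\bigl(\|\Theta_{j\cdot}^\mathsf{T}(z_j-x_j)\|^{2\alpha q}\bigr)^{1/q}\quad(\text{with } y\sim\mathcal{N}(0,I_r))\\
&\le2^{\alpha-1}\,\mathbb{E}\bigl(|C(y)|^{2p}\bigr)^{1/p}M_{2\alpha q}^{1/q}\sum_{j=1}^d\|\Theta_{j\cdot}^\mathsf{T}\|^{2\alpha}.
\end{aligned} \tag{12}$$

Allowing $p=1$ would have made $q=\infty$ and then the supremum norm of $|z_j-x_j|^{2\alpha}$ would be infinite, leading to a useless bound. For $p=\infty$, we interpret $\mathbb{E}(|C(y)|^{2p})^{1/p}$ as $\sup_y C(y)^2$, recovering Theorem 1. The bound (12) simplifies for $\alpha=1$ and for $r=1$. Under both simplifications,

$$\nu(f)\le\frac{1}{\sigma^2}\,\mathbb{E}\bigl(C(y)^{2p}\bigr)^{1/p}M_{2q}^{1/q}.$$

To get a finite bound for $\nu(f)$ it suffices for $C(y)$ to have a finite moment of order $2+\epsilon$ for some $\epsilon>0$.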

### 3.2 A kink function

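As a prototypical kink function, consider $f$ given by (7) with $g(y)=\max(y-t,0)$ for some threshold $t$. This is Lipschitz continuous with $C=1$. Using indefinite integrals $\int y\varphi(y)\,\mathrm{d}y=-\varphi(y)$ and $\int y^2\varphi(y)\,\mathrm{d}y=\Phi(y)-y\varphi(y)$, the first two moments of $g$ are

$$\begin{aligned}
\mu(t)&=\int_{-\infty}^{\infty}\max(y-t,0)\varphi(y)\,\mathrm{d}y=\varphi(t)-t\Phi(-t),\quad\text{and}\\
(\mu^2+\sigma^2)(t)&=\int_{-\infty}^{\infty}\max(y-t,0)^2\varphi(y)\,\mathrm{d}y=\Phi(-t)(1+t^2)-t\varphi(t),\quad\text{so}\\
\sigma^2(t)&=\Phi(-t)(1+t^2)-t\varphi(t)-\varphi(t)^2+2t\varphi(t)\Phi(-t)-t^2\Phi(-t)^2.
\end{aligned}$$

Because $C=1$, we get $\nu(f)\le1/\sigma^2(t)$. For $t=0$, we get $\mu=1/\sqrt{2\pi}$ and $\sigma^2=1/2-1/(2\pi)$ and then

$$\nu(f)\le\frac{1}{1/2-1/(2\pi)}=\frac{2\pi}{\pi-1}\doteq2.934,$$

for any $d\ge1$ and any unit vector $\theta\in\mathbb{R}^d$.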
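The closed forms for the kink function are simple to evaluate for other thresholds. This sketch transcribes them using only the standard library; the function names are hypothetical.

```python
import math


def phi(y):
    """Standard Gaussian density."""
    return math.exp(-0.5 * y * y) / math.sqrt(2.0 * math.pi)


def Phi(y):
    """Standard Gaussian cumulative distribution function."""
    return 0.5 * math.erfc(-y / math.sqrt(2.0))


def kink_mean(t):
    """mu(t) = E max(y - t, 0) for y ~ N(0,1)."""
    return phi(t) - t * Phi(-t)


def kink_variance(t):
    """sigma^2(t): second moment of max(y - t, 0) minus mu(t)^2."""
    second_moment = Phi(-t) * (1.0 + t * t) - t * phi(t)
    return second_moment - kink_mean(t) ** 2


def kink_bound(t):
    """Corollary 1 with C = 1 and r = 1: nu(f) <= 1 / sigma^2(t)."""
    return 1.0 / kink_variance(t)


bound_at_zero = kink_bound(0.0)  # 2 * pi / (pi - 1)
```

Because $\sigma^2(t)$ shrinks as $t$ grows, the bound $1/\sigma^2(t)$ becomes much weaker for large positive thresholds.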

### 3.3 The least sparse case

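The least sparse unit vectors have all $|\theta_j|=1/\sqrt{d}$. Because $\mathcal{N}(0,I)$ is symmetric we may take $\theta=(1,\dots,1)^\mathsf{T}/\sqrt{d}$. In this case, it is easy to compute $\nu(f)$ using Sobol' indices. By symmetry, $\nu(f)$ equals a three dimensional integral

$$\nu(f)=\frac{d}{2\sigma^2}\,\mathbb{E}\Bigl(\Bigl(g\Bigl(\sqrt{1-\tfrac1d}\,y+\frac{x_j}{\sqrt{d}}\Bigr)-g\Bigl(\sqrt{1-\tfrac1d}\,y+\frac{z_j}{\sqrt{d}}\Bigr)\Bigr)^2\Bigr) \tag{13}$$

for any $j\in1{:}d$, where $y$, $x_j$ and $z_j$ are independent $\mathcal{N}(0,1)$ random variables. Furthermore, by comparing results for $d$ to those for $d'<d$ we can see some impact from sparsity because the least sparse unit vector for dimension $d'$ will give the same answer as a very sparse $d$ dimensional vector with $d-d'$ zeros and the remaining components equal.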

## 4 Jumps

While both kinks and jumps may have smooth low dimensional ANOVA components, jumps do not necessarily have the same low mean dimension. They are also sensitive to sparsity of $\theta$.

### 4.1 Linear step functions

First we consider a step function $f(x)=1\{\theta^\mathsf{T}x>t\}$. We get upper and lower bounds for the mean dimension of this function in terms of the nominal dimension $d$ and $\|\theta\|_1$, our sparsity measure. Over the range from sparsest to least sparse, $1\le\|\theta\|_1\le\sqrt{d}$.

###### Theorem 2

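Let $f(x)=1\{\theta^\mathsf{T}x>t\}$ for a threshold $t$ and a unit vector $\theta\in\mathbb{R}^d$. Then, for $x\sim\mathcal{N}(0,I)$,

$$\nu(f)\le\frac{\|\theta\|_1}{\Phi(t)\Phi(-t)\sqrt{2\pi}}\Bigl(\sqrt{2}+2\sqrt{\log\bigl(\|\theta\|_1^{-1}d\bigr)}\Bigr)=O\bigl(\sqrt{d\log(d)}\bigr).$$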

###### Proof

See Section 8.1 of the Appendix.

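The rate $\sqrt{d\log(d)}$ in Theorem 2 arises for $\|\theta\|_1=\sqrt{d}$. More generally we get

$$O\bigl(\|\theta\|_1\sqrt{\log(d/\|\theta\|_1)}\bigr)=O\bigl(\|\theta\|_1\sqrt{\log(d)}\bigr).$$

For instance, if $\theta$ has $k$ components equal to $1/\sqrt{k}$ and the rest equal to zero, then the upper bound is $O(\sqrt{k\log(d)})$. There can thus be a significant improvement due to sparsity of $\theta$.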

###### Theorem 3

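Let $f(x)=1\{\theta^\mathsf{T}x>t\}$ for a threshold $t$ and a unit vector $\theta\in\mathbb{R}^d$. Then, for $x\sim\mathcal{N}(0,I)$,

$$\nu(f)\ge\frac{\|\theta\|_1}{\Phi(t)\Phi(-t)\,2^{3/2}\pi}\,e^{-t^2-1}.$$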

###### Proof

See Section 8.3 of the Appendix.

The proof of Theorem 3 requires a certain lower bound on a bivariate Gaussian probability. We did not find many such lower bounds in the literature, so this may be new and may be of independent interest.

###### Lemma 1

Let

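$$\begin{pmatrix}x\\ y\end{pmatrix}\sim\mathcal{N}\left(\begin{pmatrix}0\\ 0\end{pmatrix},\begin{pmatrix}1&\rho\\ \rho&1\end{pmatrix}\right)$$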

with and choose . Then

 Pr(x>t,y

###### Proof

See Section 8.2 of the Appendix.

Choosing $\theta=(1,\dots,1)^\mathsf{T}/\sqrt{d}$ in Theorem 3 provides an example of a set of jump functions with mean dimension bounded below by a positive multiple of $\sqrt{d}$. Here again sparsity plays a role in the bound.

The bounds in both Theorems 2 and 3 depend on . The upper bound argument in Theorem 2 uses a mean value approximation where could be replaced by a value just over , yielding for that

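$$\nu(f)\le\frac{2\sqrt{\log(d/\|\theta\|_1)}}{\Phi(t)\Phi(-t)}\bigl(o(1)+\varphi(t)(1+o(1))\bigr)=O\Bigl(\frac{2\sqrt{\log(d/\|\theta\|_1)}}{\Phi(t)}\,\frac{t^2+1}{t}\Bigr)$$

by a Mills' ratio inequality as $t\to\infty$. As a result the upper bound is not as sensitive to large $t$ as the presence of $\Phi(t)\Phi(-t)$ in the denominator from Theorem 2 would suggest.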

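The case $t=0$ is simpler. We find

$$\begin{aligned}
\nu(f)&=\frac{1}{\Phi(0)^2}\sum_{j=1}^d\Pr\bigl(\theta^\mathsf{T}x>0,\ \theta^\mathsf{T}x+\theta_j(z_j-x_j)<0\bigr)\\
&=4\sum_{j=1}^d\int_0^\infty\varphi(x)\,\Phi\bigl(-\rho_jx/\sqrt{1-\rho_j^2}\bigr)\,\mathrm{d}x,\qquad\rho_j=1-\theta_j^2\\
&=\sum_{j=1}^d\frac{2}{\pi}\Bigl(\frac{\pi}{2}+\arctan\bigl(-\rho_j/\sqrt{1-\rho_j^2}\bigr)\Bigr),
\end{aligned}$$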

using a definite integral from Section 2.5.2 of [30]. After some algebra

 ν(f)=2πd∑j=1arcsin(|θj|)⩾2π∥θ∥1. (14)

Now as . Therefore holds if holds as . Thus there is no asymptotic factor when and, from the details of our proof, we suspect it is not present for other .

### 4.2 More general indicator functions

It is reasonable to expect indicator functions to have such large mean dimension for more general sets than just half spaces in $\mathbb{R}^d$ under a spherical Gaussian distribution. Here we sketch a generalization. First, for an indicator function $f(x)=1\{x\in\Omega\}$ of a measurable set $\Omega\subset\mathbb{R}^d$ we have

$$\nu(f)=\sum_{j=1}^d\mathbb{E}\bigl(\Pr(x\in\Omega\mid x_{-j})\Pr(x\in\Omega^c\mid x_{-j})\bigr)\big/[\mu(1-\mu)] \tag{15}$$

for $\mu=\Pr(x\in\Omega)\in(0,1)$. The numerator expectations are with respect to random $x_{-j}$, and (15) holds for any distribution on $\mathbb{R}^d$ with independent components, including $\mathbb{U}(0,1)^d$ and $\mathcal{N}(0,I)$. We work with the latter case in what follows.
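A tiny worked instance of (15) may help. The sketch below is an illustration under assumptions not made in the text at this point: it uses the uniform distribution on $(0,1)^2$ and the set $\Omega=\{x_1+x_2>1\}$, for which $\Pr(x\in\Omega\mid x_{-j})$ equals the value of the other coordinate. The function name is hypothetical.

```python
def indicator_mean_dimension_2d(m=1000):
    """Evaluate (15) for the indicator of {x1 + x2 > 1} under U(0,1)^2.

    Given the other coordinate equal to v, Pr(x in Omega | x_{-j}) = v,
    so each of the two numerator terms is E(v * (1 - v)) for v ~ U(0,1).
    That one dimensional integral is approximated by the midpoint rule.
    """
    h = 1.0 / m
    term = sum((i + 0.5) * h * (1.0 - (i + 0.5) * h) for i in range(m)) * h
    numerator = 2.0 * term  # the two coordinates contribute equally
    mu = 0.5                # Pr(x1 + x2 > 1)
    return numerator / (mu * (1.0 - mu))


nu_corner = indicator_mean_dimension_2d()
```

Here $\mathbb{E}(v(1-v))=1/6$, so the result is $(2/6)/(1/4)=4/3$.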

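As in [10, 11] we take $\Omega=\{x\mid\phi(x)\ge0\}$ and place conditions on $\phi$. Let $\phi$ be strictly monotone in each coordinate $x_j$. Without loss of generality, suppose that $\phi$ is strictly increasing in each $x_j$. Suppose additionally that $\lim_{x_j\to\infty}\phi(x)>0$ and $\lim_{x_j\to-\infty}\phi(x)<0$ for all $j\in1{:}d$ and all $x_{-j}\in\mathbb{R}^{d-1}$.

For any $x_{-j}$, there is a unique value $x_j=x_j^*(x_{-j})$ for which $\phi(x)=0$. We write $x^*=x_{-j}{:}x_j^*$ and sometimes suppress its dependence on $x_{-j}$. We can make a linear approximation to the boundary of $\Omega$ at $x^*$ via $\theta^{*\mathsf{T}}x=t^*$ where both $\theta^*$, the normalized gradient of $\phi$, and $t^*$ depend on $x^*$. By monotonicity of $\phi$, each $\theta_j^*\ge0$. Let

$$\begin{aligned}
\delta_j(x_{-j})&\equiv\Pr(\phi(x)\ge0\mid x_{-j})\Pr(\phi(x)<0\mid x_{-j})\\
&=\Phi\Bigl(\frac{\sum_{\ell\ne j}x_\ell\theta_\ell^*(x_{-j})-t^*(x_{-j})}{\theta_j^*(x_{-j})}\Bigr)\,\Phi\Bigl(\frac{t^*(x_{-j})-\sum_{\ell\ne j}x_\ell\theta_\ell^*(x_{-j})}{\theta_j^*(x_{-j})}\Bigr),\quad\text{and}\\
\delta(x)&=\sum_{j=1}^d\delta_j(x_{-j}).
\end{aligned}$$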

Now . In words, is what we would get by sampling , finding the boundary points corresponding to the component directions , summing the corresponding values, and averaging the results over all samples. Each point leads to consideration of points . This process produces an unequally weighted average over points of a sum of values determined by the tangent plane at .

For a linear $\phi$, we get $\theta^*(x_{-j})=\theta$, and we find from Theorem 3 that $\nu(f)$ is then bounded below by a multiple of $\|\theta\|_1$ which can be as large as $\sqrt{d}$. For more general $\phi$, the boundary set $\partial\Omega$ is no longer an affine flat, the sparsity measure $\|\theta^*\|_1$ varies spatially over $\partial\Omega$, and so does the length $|t^*|$. A large mean dimension, comparable to $\sqrt{d}$, could arise if $\phi$ has a nonsparse gradient over an appreciable proportion of $\partial\Omega$.

If the assumption that fails, or if fails, for some value , then we can no longer find the corresponding point . In that case, the given value of and contribute nothing to the numerator of . The mean dimension can still be large due to contributions from other values of and from other . A similar issue came up in [11] where existence of for every proved not to be satisfied by an integrand from computational finance, and also proved not to be necessary for the smoothing effect of ANOVA to hold.

### 4.3 Cusps of general order

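For $d\ge1$ and $p\ge0$, consider a cusp of order $p$ given by

$$f_{d,p}(x)=\Bigl(\sum_{j=1}^dx_j-(d-1)\Bigr)_+^p \tag{16}$$

taking $y_+^0=1\{y>0\}$. Now $f_{d,0}\notin\mathrm{BVHK}$ for $d\ge2$ [27], $f_{d,1}\notin\mathrm{BVHK}$ for $d\ge3$, and more generally $f_{d,p}\notin\mathrm{BVHK}$ for $d\ge p+2$. The higher the dimension, the greater smoothness is required to have finite variation. The boundary is not parallel to any of the coordinate axes, so this integrand is not QMC-friendly in any way.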

These functions are carefully constructed to be among the simplest with the prescribed level of smoothness. As a result, we may find their mean dimension analytically.

###### Theorem 4

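The function $f_{d,p}$ defined above for $x\sim\mathbb{U}[0,1]^d$ has mean dimension

$$\nu(f_{d,p})=d\times\frac{\dfrac{\Gamma(2p+1)}{\Gamma(2p+d+1)}-\Bigl(\dfrac{\Gamma(p+1)}{\Gamma(p+2)}\Bigr)^2\dfrac{\Gamma(2p+3)}{\Gamma(2p+d+2)}}{\dfrac{\Gamma(2p+1)}{\Gamma(2p+d+1)}-\Bigl(\dfrac{\Gamma(p+1)}{\Gamma(p+d+1)}\Bigr)^2}.$$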

###### Proof

See Section 8.4 of the Appendix.
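The expression in Theorem 4 can be evaluated numerically. The following is a sketch only: `cusp_mean_dimension` is a hypothetical name, log-gamma is used to avoid overflow, and the subtraction in the formula can lose accuracy when $d$ is large.

```python
import math


def cusp_mean_dimension(d, p):
    """Evaluate the Theorem 4 expression for the cusp f_{d,p} on [0,1]^d."""
    lg = math.lgamma
    # E(f^2) = Gamma(2p+1) / Gamma(2p+d+1)
    second_moment = math.exp(lg(2 * p + 1) - lg(2 * p + d + 1))
    # (Gamma(p+1)/Gamma(p+2))^2 * Gamma(2p+3) / Gamma(2p+d+2)
    conditional = math.exp(
        2.0 * (lg(p + 1) - lg(p + 2)) + lg(2 * p + 3) - lg(2 * p + d + 2)
    )
    # mu^2 = (Gamma(p+1) / Gamma(p+d+1))^2
    mean_squared = math.exp(2.0 * (lg(p + 1) - lg(p + d + 1)))
    variance = second_moment - mean_squared
    if variance <= 0.0:
        raise ValueError("the variance is zero, so the mean dimension is undefined")
    return d * (second_moment - conditional) / variance


nu_jump_2d = cusp_mean_dimension(2, 0)  # indicator of {x1 + x2 > 1}
nu_kink_1d = cusp_mean_dimension(1, 1)  # any function of one variable gives 1
```

For $d=2$ and $p=0$ this gives $4/3$, and for $d=1$ with $p>0$ it returns $1$, as any nonconstant function of a single variable must.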

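The functions $f_{d,0}$ have jumps. Taking $p=0$ in Theorem 4 yields

$$\nu(f_{d,0})=\frac{d\Bigl(\dfrac{\Gamma(1)}{\Gamma(d+1)}-\Bigl(\dfrac{\Gamma(1)}{\Gamma(2)}\Bigr)^2\dfrac{\Gamma(3)}{\Gamma(d+2)}\Bigr)}{\dfrac{\Gamma(1)}{\Gamma(d+1)}-\Bigl(\dfrac{\Gamma(1)}{\Gamma(d+1)}\Bigr)^2}=d\times\frac{1-\dfrac{2}{d+1}}{1-\dfrac{1}{d!}}.$$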

Thus