# Sequential sampling for optimal weighted least squares approximations in hierarchical spaces

We consider the problem of approximating an unknown function u ∈ L^2(D,ρ) from its evaluations at given sampling points x^1, …, x^n ∈ D, where D ⊂ ℝ^d is a general domain and ρ is a probability measure. The approximation is picked in a linear space V_m, where m = dim(V_m), and computed by a weighted least squares method. Recent results show the advantages of picking the sampling points at random according to a well-chosen probability measure μ that depends both on V_m and ρ. With such a random design, the weighted least squares approximation is proved to be stable with high probability, and to have accuracy comparable to that of the exact L^2(D,ρ)-orthogonal projection onto V_m, in a near-linear sampling regime n ∼ m log m. The present paper is motivated by the adaptive approximation context, in which one typically generates a nested sequence of spaces (V_m)_{m≥1} with increasing dimension. Although the measure μ = μ_m changes with V_m, it is possible to recycle the previously generated samples by interpreting μ_m as a mixture between μ_{m−1} and an update measure σ_m. Based on this observation, we discuss sequential sampling algorithms that maintain the stability and approximation properties uniformly over all spaces V_m. Our main result is that the total number of computed samples at step m remains of the order m log m with high probability. Numerical experiments confirm this analysis.


## 1 Introduction

Least squares approximations are ubiquitously used in numerical computation when trying to reconstruct an unknown function u defined on some domain D from its observations y_1, …, y_n at a limited number of points x^1, …, x^n ∈ D. In its simplest form, the method amounts to minimizing the least squares fit

$$\frac{1}{n}\sum_{i=1}^n |y_i - v(x^i)|^2, \qquad (1.1)$$

over a set of functions v that are subject to certain constraints expressing a prior on the unknown function u. There are two classical approaches for imposing such constraints:

• Add a penalty term to the least squares fit. Classical instances include norms of reproducing kernel Hilbert spaces or norms that promote sparsity of v when expressed in a certain basis of functions.

• Limit the search of v to a space V_m of finite dimension m. Classical instances include spaces of algebraic or trigonometric polynomials, wavelets, or splines.

The present paper is concerned with the second approach, in which approximability by the space V_m may be viewed as a prior on the unknown function. We measure accuracy in the Hilbertian norm

$$\|v\| = \Big(\int_D |v(x)|^2\, d\rho\Big)^{1/2} = \|v\|_{L^2(D,\rho)}, \qquad (1.2)$$

where ρ is a probability measure over D. We denote by ⟨·,·⟩ the associated inner product. The error of best approximation is defined by

$$e_m(u) := \min_{v\in V_m} \|u - v\|, \qquad (1.3)$$

and is attained by P_m u, the L^2(D,ρ)-orthogonal projection of u onto V_m. Since the least squares approximation ~u is picked in V_m, it is natural to compare ∥u − ~u∥ with e_m(u). In particular, the method is said to be near-optimal (or instance optimal with constant C) if the comparison

$$\|u - \tilde u\| \le C\, e_m(u), \qquad (1.4)$$

holds for all u, where C ≥ 1 is some fixed constant.

The present paper is motivated by applications where the sampling points are not prescribed and can be chosen by the user. Such a situation typically occurs when the evaluation of u is performed either by a computer simulation or a physical experiment, depending on a vector x of input parameters that can be set by the user. This evaluation is typically costly, and the goal is to obtain a satisfactory surrogate model from a minimal number of evaluations

$$y_i = u(x^i). \qquad (1.5)$$

For a given probability measure ρ and approximation space V_m of interest, a relevant question is therefore whether instance optimality can be achieved with a sample size n that is moderate, ideally linear in m.

Recent results of [10, 16] for polynomial spaces and of [6] in a general approximation setting show that this objective can be achieved by certain random sampling schemes in the more general framework of weighted least squares methods. The approximation ~u is then defined as the solution to

$$\min_{v\in V_m}\, \frac{1}{n}\sum_{i=1}^n w(x^i)\,|y_i - v(x^i)|^2, \qquad (1.6)$$

where w is a positive function and the x^i are independently drawn according to a probability measure μ, such that the pair (w, μ) satisfies the constraint

$$w\, d\mu = d\rho. \qquad (1.7)$$

The choice of a sampling measure μ that differs from the error norm measure ρ appears to be critical in order to obtain instance optimal approximations with an optimal sampling budget.

In particular, it is shown in [10, 6] that there exists an optimal choice of (w, μ) such that the weighted least squares method is stable with high probability and instance optimal in expectation, in a near-linear regime where n exceeds m only by logarithmic factors. The optimal sampling measure and weight are given by

$$d\mu_m = \frac{k_m}{m}\, d\rho \quad\text{and}\quad w_m = \frac{m}{k_m}, \qquad (1.8)$$

where k_m is the so-called Christoffel function defined by

$$k_m(x) = \sum_{j=1}^m |\varphi_j(x)|^2, \qquad (1.9)$$

with (φ_1, …, φ_m) any L^2(D,ρ)-orthonormal basis of V_m.
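For intuition, k_m and the optimal density k_m/m are straightforward to evaluate once an orthonormal basis is available. The following sketch is our own illustration, using the Legendre basis orthonormal with respect to the uniform measure on [−1,1] (not one of the paper's test cases), via the classical three-term recurrence:

```python
import numpy as np

def christoffel(x, m):
    """Evaluate k_m(x) = sum_{j=0}^{m-1} phi_j(x)^2 for the Legendre basis,
    orthonormal w.r.t. the uniform measure on [-1, 1]."""
    x = np.asarray(x, dtype=float)
    # Legendre recurrence; the orthonormal polynomial of degree j w.r.t.
    # dx/2 on [-1, 1] is sqrt(2j+1) * P_j, hence the (2j+1) factors below.
    P_prev, P = np.ones_like(x), x
    k = np.ones_like(x)  # degree-0 term: phi_1 = 1
    for j in range(1, m):
        k += (2 * j + 1) * P ** 2
        P_prev, P = P, ((2 * j + 1) * x * P - j * P_prev) / (j + 1)
    return k

# Sanity check: k_m integrates to m against rho, so k_m / m is a density.
x = np.linspace(-1.0, 1.0, 200001)
k = christoffel(x, 5)
dx = x[1] - x[0]
integral = np.sum(0.5 * (k[:-1] + k[1:])) * dx / 2  # trapezoid, d(rho) = dx/2
print(integral)  # close to m = 5
```

Dividing `christoffel(x, m)` by `m` gives the density of the optimal sampling measure μ_m with respect to ρ, which integrates to one.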

In many practical applications, the space V_m is picked within a family that has the nestedness property

$$V_1 \subset V_2 \subset \cdots \qquad (1.10)$$

and accuracy is improved by raising the dimension m. The sequence (V_m)_{m≥1} may either be defined a priori, or adaptively generated, which means that the way V_m is refined into V_{m+1} may depend on the result of the least squares computation. Examples of such hierarchical adaptive or non-adaptive schemes include in particular:

• Mesh refinement in low dimension performed by the progressive addition of hierarchical basis functions or wavelets, which is relevant for approximating piecewise smooth functions, such as images or shock profiles, see [3, 8, 9].

• Sparse polynomial approximation in high dimension, which is relevant for the treatment of certain parametric and stochastic PDEs, see [4].

In this setting, we face the difficulty that the optimal measure μ_m defined by (1.8) varies together with V_m.

In order to maintain an optimal sampling budget, one should avoid the option of drawing an entirely new sample

$$S_m = \{x_m^1, \dots, x_m^n\} \qquad (1.11)$$

of increasing size at each step m. In the particular case where the V_m are the univariate polynomials of degree m − 1 and ρ is a Jacobi-type measure on [−1,1], it is known that μ_m converges weakly to the equilibrium measure μ* defined by

$$d\mu^*(y) = \frac{dy}{\pi\sqrt{1-y^2}}, \qquad (1.12)$$

with a uniform equivalence

$$c_1\, \mu^* \le \mu_m \le c_2\, \mu^*, \qquad (1.13)$$

see [13, 17]. This suggests the option of replacing all μ_m by the single μ*, as studied in [16]. Unfortunately, such an asymptotic behaviour is not encountered for most general choices of spaces V_m. An equivalence of the form (1.13) was proved in [14] for sparse multivariate polynomials, however with a ratio c_2/c_1 that increases exponentially with the dimension d, which theoretically impacts in a similar manner the sampling budget needed for stability.

In this paper, we discuss sampling strategies that are based on the observation that the optimal measure enjoys the mixture property

$$\mu_{m+1} = \Big(1 - \frac{1}{m+1}\Big)\mu_m + \frac{1}{m+1}\,\sigma_{m+1}, \quad\text{where}\quad d\sigma_m := |\varphi_m|^2\, d\rho. \qquad (1.14)$$

As noticed in [12], this leads naturally to sequential sampling strategies, where the sample S_m is recycled for generating S_{m+1}. The main contribution of this paper is to analyze such a sampling strategy and prove that the two following properties can be jointly achieved in an expectation or high probability sense:

1. Stability and instance optimality of the weighted least squares method hold uniformly over all V_m.

2. The total sampling budget after step m is linear in m up to logarithmic factors.

The rest of the paper is organized as follows. We recall in §2 stability and approximation estimates from [10, 6] concerning the weighted least squares method in a fixed space V_m. We then describe in §3 a sequential sampling strategy based on (1.14) and establish the above optimality properties 1) and 2). These optimality properties are numerically illustrated in §4 for the algorithm and two of its variants.

## 2 Optimal weighted least squares

We denote by ∥·∥_n the discrete Euclidean norm defined by

$$\|v\|_n^2 := \frac{1}{n}\sum_{i=1}^n w(x^i)\,|v(x^i)|^2, \qquad (2.1)$$

and by ⟨·,·⟩_n the associated inner product. The solution ~u to (1.6) may be thought of as an orthogonal projection of u onto V_m for this norm. Expanding

$$\tilde u = \sum_{j=1}^m c_j\,\varphi_j \qquad (2.2)$$

in the basis of V_m, the coefficient vector c = (c_1, …, c_m)^T is the solution to the linear system

$$G_m c = d, \qquad (2.3)$$

where G_m is the Gramian matrix for the ⟨·,·⟩_n inner product, with entries

$$G_{j,k} := \langle \varphi_j, \varphi_k\rangle_n = \frac{1}{n}\sum_{i=1}^n w(x^i)\,\varphi_j(x^i)\,\varphi_k(x^i), \qquad (2.4)$$

and the vector d has entries d_j = (1/n) ∑_{i=1}^n w(x^i) y_i φ_j(x^i). The solution always exists and is unique when G_m is invertible.
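In matrix form, computing ~u amounts to assembling G_m and d and solving (2.3). A minimal sketch, with a generic `basis` callable standing in for the orthonormal system (our own illustration, not the authors' code):

```python
import numpy as np

def weighted_least_squares(basis, x, y, w):
    """Solve (1.6): basis(x) returns the n-by-m matrix Phi with
    Phi[i, j] = phi_{j+1}(x^i). Returns the coefficient vector c of (2.3)."""
    Phi = basis(x)                       # n x m design matrix
    n = len(x)
    W = w / n                            # weights w(x^i) / n
    G = Phi.T @ (W[:, None] * Phi)       # Gramian, entries as in (2.4)
    d = Phi.T @ (W * y)                  # right-hand side of (2.3)
    return np.linalg.solve(G, d)

# Toy check: a function lying in V_m is recovered exactly (here w = 1).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 50)
basis = lambda x: np.vander(x, 3, increasing=True)   # monomials 1, x, x^2
y = 2 - x + 0.5 * x**2
c = weighted_least_squares(basis, x, y, np.ones_like(x))
print(np.round(c, 6))  # close to [2, -1, 0.5]
```

Note that the monomial basis here is for illustration only; the stability analysis of this section concerns an L^2(D,ρ)-orthonormal basis.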

Since the x^i are drawn independently according to μ, the relation (1.7) implies that E(⟨v, ṽ⟩_n) = ⟨v, ṽ⟩ for all v, ṽ ∈ V_m, and in particular

$$\mathbb{E}(G_m) = I. \qquad (2.5)$$

The stability and accuracy analysis of the weighted least squares method can be related to the amount of deviation between G_m and its expectation, measured in the spectral norm. Recall that for m × m matrices M, this norm is defined as ∥M∥_2 := max_{|v|=1} |Mv|, where |·| denotes the Euclidean norm. This deviation also describes the closeness of the norms ∥·∥ and ∥·∥_n over the space V_m, since one has

$$\|G_m - I\|_2 \le \delta \iff (1-\delta)\|v\|^2 \le \|v\|_n^2 \le (1+\delta)\|v\|^2, \quad v \in V_m. \qquad (2.6)$$

Note that this closeness also implies the bound

$$\kappa(G_m) \le \frac{1+\delta}{1-\delta} \qquad (2.7)$$

on the condition number of G_m.

Following [6], we use the particular value δ = 1/2, and define ~u as the solution to (1.6) when ∥G_m − I∥_2 ≤ 1/2, and set ~u := 0 otherwise. The probability of the latter event can be estimated by a matrix tail bound, noting that

$$G_m = \frac{1}{n}\sum_{i=1}^n X_i, \qquad (2.8)$$

where the X_i are independent realizations of the rank-one random matrix

$$X := w(x)\,\big(\varphi_j(x)\,\varphi_k(x)\big)_{j,k=1,\dots,m}, \qquad (2.9)$$

where x is distributed according to μ. The matrix Chernoff bound, see Theorem 1.1 in [18], gives

$$\Pr\Big(\|G_m - I\|_2 > \frac12\Big) \le 2m\, e^{-\gamma n/K}, \qquad \gamma := \frac{1-\ln 2}{2}, \qquad (2.10)$$

where K is an almost sure bound for ∥X∥_2 = w(x) k_m(x). Since E(w(x) k_m(x)) = ∫_D k_m dρ = m, it follows that K is always larger than m. With the choice (1.8) for the sampling measure, one has exactly w_m k_m = m, which leads to the following result.

###### Theorem 2.1.

Assume that the sampling measure μ = μ_m and weight function w = w_m are given by (1.8). Then for any ε > 0, the condition

$$n \ge c\, m\, (\ln(2m) - \ln\varepsilon), \qquad c := \gamma^{-1} = \frac{2}{1-\ln 2}, \qquad (2.11)$$

implies the stability property

$$\Pr\Big(\|G_m - I\|_2 > \frac12\Big) \le \varepsilon. \qquad (2.12)$$

The probabilistic inequality (2.12) induces an instance optimality estimate in expectation, since in the event where ∥G_m − I∥_2 ≤ 1/2, one has

$$\|u-\tilde u\|^2 = \|u-P_m u\|^2 + \|\tilde u - P_m u\|^2 \le e_m(u)^2 + 2\,\|\tilde u - P_m u\|_n^2 \le e_m(u)^2 + 2\,\|u - P_m u\|_n^2, \qquad (2.13)$$

where we have used the norm equivalence (2.6) with δ = 1/2 and the fact that ~u is the ∥·∥_n-orthogonal projection of u onto V_m, and therefore

$$\mathbb{E}(\|u-\tilde u\|^2) \le e_m(u)^2 + 2\,\mathbb{E}(\|u - P_m u\|_n^2) + \varepsilon\|u\|^2 = 3\, e_m(u)^2 + \varepsilon\|u\|^2. \qquad (2.14)$$

By a more refined reasoning, the constant 3 can be replaced by a value that tends to 1 as n becomes large, see [6].

In summary, when using the optimal sampling measure defined by (1.8), stability and instance optimality can be achieved in the near-linear regime

$$n = n_\varepsilon(m) := \lceil c\, m\, (\ln(2m) - \ln\varepsilon)\rceil, \qquad (2.15)$$

where ε controls the probability of failure. To simplify notation later, we also set n_ε(0) := 0.
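For concreteness, the sample budget (2.15) can be tabulated as follows (a small helper we add for illustration; the convention n_ε(0) = 0 is our reading of the text):

```python
import math

C = 2.0 / (1.0 - math.log(2.0))  # c = gamma^{-1} from (2.11), approx 6.52

def n_eps(m, eps):
    """Sample budget n_eps(m) = ceil(c * m * (ln(2m) - ln(eps))) from (2.15)."""
    if m == 0:
        return 0  # convention n_eps(0) = 0
    return math.ceil(C * m * (math.log(2 * m) - math.log(eps)))

for m in (1, 10, 100, 1000):
    print(m, n_eps(m, 1e-2))  # grows like m log m, far slower than m^2
```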

## 3 An optimal sequential sampling procedure

In the following analysis of a sequential sampling scheme, we assume a sequence of nested spaces (V_m)_{m≥1} and corresponding basis functions (φ_j)_{j≥1} to be given, such that for each m, (φ_1, …, φ_m) is an orthonormal basis of V_m. In practical applications, such spaces may either be fixed in advance, or adaptively selected, that is, the choice of V_{m+1} depends on the computation of the weighted least squares approximation (1.6) in V_m. In view of the previous result, one natural objective is to generate sequences of samples S_m distributed according to the different measures μ_m, with

$$\#(S_m) = n_\varepsilon(m) \qquad (3.1)$$

for some prescribed ε > 0.

The simplest option for generating such sequences would be to directly draw the samples from μ_m for each m separately. Since we ask that #(S_m) is proportional to m up to logarithmic factors, this leads to a total cost after step m given by

$$C_m = \sum_{k=1}^m \#(S_k), \qquad (3.2)$$

which increases faster than quadratically with m. Instead, we want to recycle the existing samples of S_m when generating S_{m+1}, so as to arrive at a scheme whose total cost remains comparable to #(S_m), that is, close to linear in m.

To this end, we use the mixture property (1.14). In what follows, we assume that a procedure for sampling from each update measure σ_j is available. In the univariate case, standard methods are inverse transform sampling or rejection sampling. These methods may in turn serve in the multivariate case when the φ_j are tensor product basis functions on a product domain and ρ is itself of tensor product type, since the σ_j are then product measures that can be sampled via their univariate factor measures. We first observe that, in order to draw x distributed according to μ_m, for some fixed m, we can proceed as follows:

$$\text{Draw } j \text{ uniformly at random from } \{1,\dots,m\}, \text{ then draw } x \text{ from } \sigma_j. \qquad (3.3)$$
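In code, the two-stage draw (3.3) is immediate once σ_j-samplers are available. In the sketch below, `sample_sigma` is a placeholder callable (our assumption for illustration), not one of the actual update measures of the paper:

```python
import random

def sample_mu(m, sample_sigma, rng=random):
    """Draw x ~ mu_m via the mixture (3.3): pick j uniformly in {1, ..., m},
    then draw from the update measure sigma_j."""
    j = rng.randint(1, m)   # uniform on {1, ..., m} (both endpoints included)
    return sample_sigma(j)

# Placeholder sigma_j sampler, only to illustrate the calling convention.
rng = random.Random(0)
draws = [sample_mu(5, lambda j: rng.gauss(0, j), rng) for _ in range(1000)]
print(len(draws))
```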

Let now (n(m))_{m≥1} be an increasing sequence representing the prescribed sizes of the samples S_m. Suppose that we are given S_m = {x^1, …, x^{n(m)}} i.i.d. according to μ_m. In order to obtain the new sample S_{m+1} i.i.d. according to μ_{m+1}, we can proceed as stated in the following Algorithm 1.

Algorithm 1 requires a fixed number n(m+1) − n(m) of samples from μ_{m+1} and an additional number ~n(m) of samples from σ_{m+1}. The latter is a random variable that can be expressed as

$$\tilde n(m) = \sum_{i=1}^{n(m)} b_i^{m+1}, \qquad (3.4)$$

where for each fixed m, the b_i^{m+1} are i.i.d. Bernoulli random variables with Pr(b_i^{m+1} = 1) = 1/(m+1). Moreover, (b_i^{m+1})_{i,m} is a collection of independent random variables. This immediately gives an expression for the total cost after m − 1 successive applications of Algorithm 1, beginning with n(1) samples from μ_1, as the random variable

$$C_m := n(m) + s(m), \qquad s(m) := \sum_{k=1}^{m-1} \tilde n(k) = \sum_{k=1}^{m-1}\sum_{i=1}^{n(k)} b_i^{k+1}. \qquad (3.5)$$
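Algorithm 1 itself is not reproduced in this excerpt; the following sketch implements one plausible reading of the recycling step implied by (1.14) and (3.4): each previous sample is kept with probability m/(m+1) and otherwise replaced by a fresh draw from σ_{m+1}, after which the set is topped up to the new size by mixture draws (3.3). The `sample_sigma` callable is again a placeholder:

```python
import random

def algorithm1_step(S, m, n_next, sample_sigma, rng=random):
    """Given S i.i.d. ~ mu_m, return a sample of size n_next i.i.d. ~ mu_{m+1},
    together with the number of fresh sigma-draws used (the ~n(m) of (3.4))."""
    S_next, fresh = [], 0
    for x in S:
        if rng.random() < 1.0 / (m + 1):        # Bernoulli b_i^{m+1}
            S_next.append(sample_sigma(m + 1))  # replace by a sigma_{m+1} draw
            fresh += 1
        else:
            S_next.append(x)                    # recycle the old mu_m sample
    while len(S_next) < n_next:                 # top up via the mixture (3.3)
        j = rng.randint(1, m + 1)
        S_next.append(sample_sigma(j))
    return S_next, fresh

rng = random.Random(0)
S = [rng.gauss(0, 1) for _ in range(10)]        # placeholder mu_1 sample
S2, used = algorithm1_step(S, 1, 15, lambda j: rng.gauss(0, 1), rng)
print(len(S2), used)
```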

We now focus on the particular choice

$$n(m) := n_\varepsilon(m), \qquad (3.6)$$

as in (2.15), for a prescribed ε > 0. This particular choice ensures that, for all m,

$$\Pr\Big(\|G_m - I\|_2 \ge \frac12\Big) \le \varepsilon, \qquad (3.7)$$

where G_m denotes the Gramian for S_m according to (2.4).

We first estimate E(C_m) for this choice of n(m). For this purpose, we note that E(~n(k)) = n(k)/(k+1) and use the following lemma.

###### Lemma 3.1.

For m ≥ 1 and 0 < ε ≤ 1,

$$\frac12\, n(m) - 2c \le \sum_{k=1}^m \frac{n(k)}{k+1} \le n(m) + 1. \qquad (3.8)$$
###### Proof:.

For the upper bound, we note that n(k) ≤ ck(ln(2k) − ln ε) + 1 and

$$\sum_{k=1}^m \frac{1}{k+1}\big(ck(\ln(2k)-\ln\varepsilon)+1\big) \le c\sum_{k=1}^m (\ln(2k)-\ln\varepsilon) + \sum_{k=1}^m \frac{1}{k+1} \le c\,(m\ln 2 + \ln m! - m\ln\varepsilon) + \ln(m+1). \qquad (3.9)$$

By the Stirling bound for m!,

$$\ln m! \le m(\ln m - 1) + \frac12 \ln m + 1. \qquad (3.10)$$

Combining this with (3.9) gives

$$\sum_{k=1}^m \frac{n(k)}{k+1} \le c\,m(\ln 2m - \ln\varepsilon) - c(m-1) + \frac{c}{2}\ln m + \ln(m+1) \le n(m) + 1, \qquad (3.11)$$

where the last inequality is verified for the value of c in (2.11) by direct evaluation for small m and by monotonicity. For the lower bound, we estimate

$$\sum_{k=1}^m \frac{n(k)}{k+1} \ge \frac{cm}{2}(\ln 2 - \ln\varepsilon) + c\sum_{k=1}^m \frac{k}{k+1}\ln k. \qquad (3.12)$$

Using monotonicity and integration by parts, with x − ln(x+1) as an antiderivative of x/(x+1), we obtain

$$\sum_{k=1}^m \frac{k}{k+1}\ln k \ge \int_1^m \frac{x}{x+1}\ln x\, dx = -\int_1^m \frac{x-\ln(x+1)}{x}\, dx + \Big[(x-\ln(x+1))\ln x\Big]_1^m = -(m-1) + \int_1^m \frac{\ln(x+1)}{x}\, dx + (m-\ln(m+1))\ln m. \qquad (3.13)$$

Moreover,

$$\int_1^m \frac{\ln(x+1)}{x}\, dx = \int_0^{\ln m} \ln(1+e^t)\, dt \ge \int_0^{\ln m} t\, dt = \frac12 \ln^2 m, \qquad (3.14)$$

and using this in (3.13) gives

$$\sum_{k=1}^m \frac{k}{k+1}\ln k \ge \frac12\, m\ln m + \frac{1}{2c} + R(m), \qquad (3.15)$$

where

$$R(m) = \frac12\, m(\ln m - 2) - \ln(m+1)\ln m + \frac12\ln^2 m + 1 - \frac{1}{2c} > -2. \qquad (3.16)$$

The latter inequality can be directly verified for the first few values of m and then follows for larger m by monotonicity. Thus, we obtain

$$\sum_{k=1}^m \frac{n(k)}{k+1} \ge \frac12\big[c\,m(\ln 2m - \ln\varepsilon) + 1\big] - 2c \ge \frac12\, n(m) - 2c, \qquad (3.17)$$

which concludes the proof of the lemma. ∎

As an immediate consequence, we obtain the following estimate of the total cost in expectation.

###### Theorem 3.2.

With n(m) := n_ε(m), the total cost C_m after step m of Algorithm 1 satisfies

$$n(m) + \frac12\, n(m-1) - 2c \le \mathbb{E}(C_m) \le n(m) + n(m-1) + 1. \qquad (3.18)$$

We next derive estimates for the probability that the upper bound in the above estimate is substantially exceeded by C_m. The following Chernoff inequality can be found in [1]. For the convenience of the reader, we give a standard short proof following [15].

###### Lemma 3.3.

Let N ∈ ℕ and let X_1, …, X_N be a collection of independent Bernoulli random variables with Pr(X_i = 1) = p_i for all i. Set X̄ := E(X_1 + … + X_N) = p_1 + … + p_N. Then, for all τ > 0,

$$\Pr\big(X_1+\dots+X_N \ge (1+\tau)\bar X\big) \le (1+\tau)^{-(1+\tau)\bar X}\, e^{\tau\bar X}.$$

In particular, for all 0 < τ ≤ 1,

$$\Pr\big(X_1+\dots+X_N \ge (1+\tau)\bar X\big) \le e^{-\tau^2\bar X/3}.$$
###### Proof:.

For t > 0 and a := (1+τ)X̄, by Markov's inequality, and using that 1 + x ≤ e^x for all x ∈ ℝ,

$$\Pr(X_1+\dots+X_N \ge a) \le e^{-ta}\,\mathbb{E}\big(e^{t(X_1+\dots+X_N)}\big) = e^{-ta}\prod_{i=1}^N \big(1+p_i(e^t-1)\big) \le e^{-ta}\, e^{\bar X(e^t-1)}.$$

Now take t = ln(1+τ), which gives the first bound. Moreover, for 0 < τ ≤ 1, one has (1+τ)ln(1+τ) − τ ≥ τ²/3, which gives the second. ∎
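As a quick numerical sanity check of Lemma 3.3 (a simulation we add for illustration, not part of the proof), one can compare the empirical tail of a Bernoulli sum with the bound e^{−τ²X̄/3}:

```python
import math, random

def chernoff_check(N, p, tau, trials=2000, seed=0):
    """Empirical frequency of X_1+...+X_N >= (1+tau)*Xbar vs the bound."""
    rng = random.Random(seed)
    xbar = N * p                      # here all p_i = p, so Xbar = N*p
    thresh = (1 + tau) * xbar
    hits = sum(
        sum(rng.random() < p for _ in range(N)) >= thresh
        for _ in range(trials)
    )
    return hits / trials, math.exp(-tau**2 * xbar / 3)

freq, bound = chernoff_check(N=200, p=0.1, tau=0.5)
print(freq, bound)  # the empirical tail stays below the bound
```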

We now apply this result to the random part s(m) of the total sampling cost as defined in (3.5), and obtain the following probabilistic estimate.

###### Theorem 3.4.

With n(m) := n_ε(m), the total cost C_m after step m of Algorithm 1 satisfies, for any 0 < τ ≤ 1,

$$\Pr\big(C_m \ge n(m) + (1+\tau)(n(m-1)+1)\big) \le M_\tau\, e^{-\frac{\tau^2}{6}\, n(m-1)} \le M_\tau\Big(\frac{2(m-1)}{\varepsilon}\Big)^{-\frac{\tau^2 c}{6}(m-1)} \qquad (3.19)$$

with M_τ := e^{2cτ²/3}.

###### Proof:.

We apply Lemma 3.3 to the independent variables b_i^{k+1} which appear in s(m). Since E(s(m)) ≤ n(m−1) + 1 by Lemma 3.1, this gives

$$\Pr\big(C_m \ge n(m) + (1+\tau)\,\mathbb{E}(s(m))\big) = \Pr\big(s(m) \ge (1+\tau)\,\mathbb{E}(s(m))\big) \le e^{-\tau^2\,\mathbb{E}(s(m))/3}. \qquad (3.20)$$

The result follows by using the lower bound on E(s(m)) from Lemma 3.1. ∎

With the choice n(m) = n_ε(m), we are ensured that, for each value of m separately, the failure probability in (3.7) is bounded by ε. When the intermediate results in each step are used to drive an adaptive selection of the sequence of basis functions, however, it will typically be of interest to ensure the stronger uniform statement that, with high probability, ∥G_m − I∥_2 ≤ 1/2 for all m. To achieve this, we now consider a slight modification of the above results with an m-dependent choice of failure probability, to ensure that G_m remains well-conditioned with high probability jointly for all m. We define the sequence of failure probabilities

$$\varepsilon(m) = \frac{6\,\varepsilon_0}{(\pi m)^2}, \qquad m \ge 1, \qquad (3.21)$$

for a fixed ε_0 > 0, so that ∑_{m≥1} ε(m) = ε_0, and now analyze the repeated application of Algorithm 1, using

$$n(m) := n_{\varepsilon(m)}(m) = \Big\lceil c\,m\Big(\ln(2m) + 2\ln m - \ln\frac{6\varepsilon_0}{\pi^2}\Big)\Big\rceil \qquad (3.22)$$

samples for S_m. Note that n_{ε(m)}(m) differs from n_{ε_0}(m) only by a further term of order m ln m.

###### Lemma 3.5.

For m ≥ 1, let n(m) be defined as in (3.22) with ε(m) as in (3.21). Then

$$\frac12\, n(m) - 6c \le \sum_{k=1}^m \frac{n(k)}{k+1} \le n(m) + 1. \qquad (3.23)$$
###### Proof:.

For the upper bound in (3.23), note that n(k) ≤ ck(ln(2k) − ln(6ε_0/π²) + ln k²) + 1. Using (3.10),

$$\sum_{k=1}^m \ln k^2 \le m\ln m^2 + 2\Big(1 - m + \frac12\ln m\Big) \le m\ln m^2. \qquad (3.24)$$

Thus, for all m ≥ 1,

$$\sum_{k=1}^m \frac{n(k)}{k+1} \le c\sum_{k=1}^m \frac{k}{k+1}\Big(\ln(2k) - \ln\frac{6\varepsilon_0}{\pi^2}\Big) + \sum_{k=1}^m \frac{1}{k+1} + c\sum_{k=1}^m \frac{k\ln k^2}{k+1} \le c\,m\Big(\ln 2m - \ln\frac{6\varepsilon_0}{\pi^2}\Big) + 1 + c\,m\ln m^2 \le n(m) + 1,$$

where we have used (3.9) and (3.11) together with (3.24). Let us now deal with the lower bound. For all m ≥ 1,

$$\sum_{k=1}^m \frac{n(k)}{k+1} \ge \sum_{k=1}^m \frac{ck}{k+1}\Big(\ln(2k) - \ln\frac{6\varepsilon_0}{\pi^2} + \ln k^2\Big) \ge \frac12\Big(c\,m\Big(\ln(2m) - \ln\frac{6\varepsilon_0}{\pi^2}\Big) + 1\Big) - 2c + c\sum_{k=1}^m \frac{k}{k+1}\ln k^2. \qquad (3.25)$$

Moreover, using (3.15) and (3.16),

$$c\sum_{k=1}^m \frac{k}{k+1}\ln k^2 \ge 2c\Big(\frac{m}{2}\ln m + \frac{1}{2c} - 2\Big).$$

Thus, combining the previous bound with (3.25) and the definition of n(m) concludes the proof of the lemma. ∎

As a consequence, analogously to (3.18) we have

$$\mathbb{E}(C_m) \le n(m) + n(m-1) + 1. \qquad (3.26)$$

Using the above lemma as in Theorem 3.4 combined with a union bound, we arrive at the following uniform stability result.

###### Theorem 3.6.

Let ε(m) be defined as in (3.21). Then applying Algorithm 1 with n(m) as in (3.22), one has

$$\Pr\Big(\exists\, m \in \mathbb{N} \colon \|G_m - I\|_2 \ge \frac12\Big) \le \varepsilon_0,$$

and for any 0 < τ ≤ 1 and all m, the random variable C_m satisfies

$$\Pr\big(C_m \ge n(m) + (1+\tau)(n(m-1)+1)\big) \le M_\tau\, e^{-\frac{\tau^2}{6}\, n(m-1)} \qquad (3.27)$$

with M_τ := e^{2cτ²}.

In summary, applying Algorithm 1 successively to generate the samples S_m, we can ensure that ∥G_m − I∥_2 ≤ 1/2 holds uniformly for all steps m with probability at least 1 − ε_0. The corresponding total cost C_m for generating S_m can exceed a fixed multiple of n(m) only with a probability that rapidly approaches zero as m increases.

###### Remark 3.7.

Algorithm 1 and the above analysis can be adapted to create samples from μ_{m+q}, q > 1, using those for μ_m, which corresponds to adding q basis functions in one step. In this case,

$$\mu_{m+q} = \frac{m}{m+q}\,\mu_m + \frac{q}{m+q}\sum_{j=m+1}^{m+q}\frac{1}{q}\,\sigma_j, \qquad (3.28)$$

where samples from the second mixture component can be obtained by mixture sampling as in (3.3).

## 4 Numerical illustration

In our numerical tests, we consider two different types of orthonormal bases and corresponding target measures ρ on D:

1. On the one hand, we consider the case where D = ℝ, ρ is the standard Gaussian measure on ℝ, and V_m is the vector space spanned by the Hermite polynomials of degree up to m − 1, normalized to one with respect to the L^2(ℝ,ρ) norm, for all m ≥ 1. This case is an instance of a polynomial approximation method where the function u and its approximants from V_m might be unbounded on D, as is the Christoffel function k_m.

2. On the other hand, we consider the case where D = [0,1] with ρ the uniform measure on D, and the approximation spaces V_m are generated by Haar wavelet refinement. The Haar wavelets are of the form ψ_{j,k} = 2^{j/2} ψ(2^j · − k) with ψ := χ_{[0,1/2)} − χ_{[1/2,1)}, for scale levels j ≥ 0 and positions k = 0, …, 2^j − 1. In adaptive approximation, the spaces are typically generated by including, as the scale level j grows, those values of k for which the coefficient of ψ_{j,k} for the approximated function is expected to be large. This can be described by growing a finite tree within the hierarchical structure induced by the dyadic indexing of the Haar wavelet family: the indices (j+1, 2k) or (j+1, 2k+1) can be selected only if (j,k) has already been. In our experiment, we generate the spaces V_m by letting such a tree grow at random, starting from the root. This selection of (V_m)_{m≥1} is done once and used for all further tests. The resulting sampling measures μ_m exhibit the local refinement behavior of the corresponding approximation spaces V_m.

As seen further on, although the spaces and measures are quite different, the cost of the sampling algorithm behaves similarly in these two cases. While we have used univariate domains for the sake of numerical simplicity, we expect a similar behaviour in multivariate cases, since our results are immune to the spatial dimension d. As already mentioned, the sampling method is then facilitated when the φ_j are tensor product functions and ρ is a product measure.

The sampling measures μ_m are shown in Figure 1. In case (ii), the measures σ_j are uniform measures on dyadic intervals in [0,1], from which we can sample directly, using O(1) operations per sample. In case (i), we have instead dσ_j = H_{j−1}² g dy, where H_k denotes the Hermite polynomial of degree k normalized with respect to the Gaussian measure and g the standard Gaussian density. Several strategies for sampling from these densities are discussed in [6]. For our following tests, we use inverse transform sampling: we draw t uniformly distributed in [0,1], and obtain a sample from σ_j by solving Φ_j(x) = t, where Φ_j denotes the cumulative distribution function of σ_j. To solve these equations, in order to ensure robustness, we use a simple bisection scheme. Each point value of Φ_j can be obtained using O(j) operations, exploiting the fact that for j ≥ 1,

$$\Phi_j = \Phi - \sum_{k=1}^{j-1}\frac{H_k H_{k-1}}{\sqrt{k}}\, g,$$

where Φ denotes the cumulative distribution function of the standard Gaussian measure. The bisection thus takes O(j ln(1/η)) operations to converge to accuracy η. For practical purposes, the sampling for case (i) can be accelerated by precomputation of interpolants for Φ_j.

In each of our numerical tests, we consider the distributions of ∥G_m − I∥_2 and of the total cost C_m resulting from the sequential generation of sample sets S_m for m = 1, 2, … by Algorithm 1. In each case, the quantiles of the corresponding distributions are estimated from 1000 test runs. Figure 2 shows the results for Algorithm 1 in case (i) with n(m) as in (3.6), for a fixed choice of ε. We know from Theorem 2.1 that, for each m, one has ∥G_m − I∥_2 ≤ 1/2 with probability greater than 1 − ε. In fact, this bound appears to be fairly pessimistic, since no sample with ∥G_m − I∥_2 > 1/2 is encountered in the present test. In Figure 2(b), we show the ratio C_m/n(m). Recall that C_m is defined in (3.5) as the total number of samples used in the repeated application of Algorithm 1 to produce S_m, so that this ratio provides a comparison to the cost of directly sampling from μ_m. The results are in agreement with Theorems 3.2 and 3.4, which show in particular that C_m stays close to n(m) + n(m−1) with very high probability.

Using the same setup with n(m) as in (3.22), for a fixed ε_0, leads to the expected improved uniformity in m. The corresponding results are shown in Figure 3, with sampling costs that are in agreement with (3.26) and Theorem 3.6. Since the effects of replacing n_ε(m) by n_{ε(m)}(m) are very similar in all further tests, we only show results for n(m) = n_ε(m) with fixed ε in what follows.

While the simple scheme in Algorithm 1 already ensures near-optimal costs with high probability, there are practical variants that can yield better quantitative performance. A first such variant is given in Algorithm 2. Instead of deciding for each previous sample separately whether it will be re-used, here a queue of previous samples is kept, from which samples are extracted in order until the previous sample set is exhausted. Clearly, the costs of this scheme are bounded from above by those of Algorithm 1.
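Algorithm 2 is likewise not reproduced in this excerpt; the sketch below implements one plausible reading of the queue variant described above: each slot of the new sample draws its mixture component, and the μ_m-component is served from the queue of previous samples until it is exhausted. As before, `sample_sigma` is a placeholder callable:

```python
import random
from collections import deque

def algorithm2_step(S, m, n_next, sample_sigma, rng=random):
    """Queue variant: fill each slot of the new sample from the mixture (1.14);
    the mu_m-component is served from the queue of old samples while it lasts,
    then drawn afresh via (3.3). Returns the new sample and the number of
    fresh draws used."""
    queue, S_next, fresh = deque(S), [], 0
    for _ in range(n_next):
        if rng.random() < 1.0 / (m + 1):        # sigma_{m+1} component
            S_next.append(sample_sigma(m + 1))
            fresh += 1
        elif queue:                             # recycle an old mu_m sample
            S_next.append(queue.popleft())
        else:                                   # queue exhausted: draw mu_m
            S_next.append(sample_sigma(rng.randint(1, m)))
            fresh += 1
    return S_next, fresh

rng = random.Random(0)
S = [rng.gauss(0, 1) for _ in range(10)]
S2, used = algorithm2_step(S, 1, 15, lambda j: rng.gauss(0, 1), rng)
print(len(S2), used)
```

By construction, this variant never uses more fresh draws per step than the per-sample Bernoulli scheme, which is consistent with the cost comparison stated above.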

As expected, in the results for Algorithm 2 applied in case (i), which are shown in Figure 4, we find an estimate of the distribution of ∥G_m − I∥_2 that is essentially identical to the one for Algorithm 1 in Figure 2(a). The costs, however, are substantially more favorable than those in Figure 2(b): using Algorithm 2, the successive sampling of S_1, …, S_m uses only a small fraction of additional samples when compared to directly sampling S_m alone.