# On estimation of nonsmooth functionals of sparse normal means

We study the problem of estimation of the value $N_\gamma(\theta)=\sum_{i=1}^d|\theta_i|^\gamma$ for $0<\gamma\le 1$ based on the observations $y_i=\theta_i+\epsilon\xi_i$, $i=1,\dots,d$, where $\theta=(\theta_1,\dots,\theta_d)$ are unknown parameters, $\epsilon>0$ is known, and $\xi_i$ are i.i.d. standard normal random variables. We find the non-asymptotic minimax rate of estimation on the class $B_0(s)$ of $s$-sparse vectors and we propose estimators achieving this rate.


## 1. Introduction

In recent years, there has been a growing interest in statistical estimation of non-smooth functionals [1, 6, 13, 14, 7, 8, 2, 5]. Some of these papers deal with the normal means model [1, 2], addressing the problems of estimation of the $\ell_1$-norm and of the sparsity index, respectively. In the present paper, we analyze a family of non-smooth functionals including, in particular, the $\ell_1$-norm. We establish non-asymptotic minimax optimal rates of estimation on the classes of sparse vectors and we construct estimators achieving these rates.

Assume that we observe

 (1) $y_i=\theta_i+\varepsilon\xi_i,\quad i=1,\dots,d,$

where $\theta=(\theta_1,\dots,\theta_d)\in\mathbb{R}^d$ is an unknown vector of parameters, $\varepsilon>0$ is a known noise level, and $\xi_1,\dots,\xi_d$ are i.i.d. standard normal random variables. We consider the problem of estimating the functionals

 $N_\gamma(\theta)=\sum_{i=1}^{d}|\theta_i|^{\gamma},\quad 0<\gamma\le 1,$

assuming that the vector $\theta$ is $s$-sparse, that is, belongs to the class

 $B_0(s)=\big\{\theta\in\mathbb{R}^d:\ \|\theta\|_0\le s\big\}.$

Here, $\|\theta\|_0$ denotes the number of nonzero components of $\theta$ and $1\le s\le d$. We measure the accuracy of an estimator $\hat T$ of $N_\gamma(\theta)$ by the maximal quadratic risk over $B_0(s)$:

 $\sup_{\theta\in B_0(s)}\mathbf{E}_\theta\big[(\hat T-N_\gamma(\theta))^2\big].$

Here and in the sequel, we denote by $\mathbf{E}_\theta$ the expectation with respect to the joint distribution $\mathbf{P}_\theta$ of $(y_1,\dots,y_d)$ satisfying (1).
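For concreteness, here is a minimal simulation of model (1); the values of $d$, $s$, $\varepsilon$, $\gamma$ are ours, chosen for illustration. It also shows why the naive plug-in estimator $\sum_i|y_i|^\gamma$ is unsuitable: each of the $d-s$ pure-noise coordinates contributes roughly $\varepsilon^\gamma\mathbf{E}|\xi|^\gamma$, which accumulates into a large positive bias.

```python
import numpy as np

def N_gamma(theta, gamma):
    """The functional N_gamma(theta) = sum_i |theta_i|^gamma."""
    return float(np.sum(np.abs(theta) ** gamma))

rng = np.random.default_rng(0)
d, s, eps, gamma = 1000, 10, 0.1, 0.5     # illustrative values

theta = np.zeros(d)                        # an s-sparse vector in B_0(s)
theta[:s] = 1.0

y = theta + eps * rng.standard_normal(d)   # observations from model (1)

print(N_gamma(theta, gamma))   # true value: N_gamma(theta) = 10
print(N_gamma(y, gamma))       # naive plug-in, dominated by the noise bias
```

With these values the plug-in exceeds the true value by hundreds of percent, which is what the thresholding and polynomial-approximation constructions below are designed to avoid.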

In this paper, for all $1\le s\le d$, we propose rate optimal estimators in a non-asymptotic minimax sense, that is, estimators $\hat N$ such that

 $\sup_{\theta\in B_0(s)}\mathbf{E}_\theta\big[(\hat N-N_\gamma(\theta))^2\big]\asymp R_{d,s}(\gamma,\varepsilon):=\inf_{\hat T}\sup_{\theta\in B_0(s)}\mathbf{E}_\theta\big[(\hat T-N_\gamma(\theta))^2\big],$

where $\inf_{\hat T}$ denotes the infimum over all estimators and, for two quantities $a$ and $b$ possibly depending on $(\gamma,s,d,\varepsilon)$, we write $a\asymp b$ if there exist positive constants $c$ and $C$ that may depend only on $\gamma$ such that $cb\le a\le Cb$. We also establish the following explicit non-asymptotic characterization of the minimax risk $R_{d,s}(\gamma,\varepsilon)$:

 (2) $R_{d,s}(\gamma,\varepsilon)\asymp\begin{cases}\varepsilon^{2\gamma}s^2\log^{\gamma}\!\big(1+d/s^2\big),&\text{if } s^2\le d,\\[2pt]\varepsilon^{2\gamma}s^2\log^{-\gamma}\!\big(1+s^2/d\big),&\text{if } s^2>d.\end{cases}$

Note that the rate on the right-hand side of (2) is an increasing function of $s$: it is slightly greater than $\varepsilon^{2\gamma}s^2$ for $s$ much smaller than $\sqrt{d}$, of order $\varepsilon^{2\gamma}d$ for $s\asymp\sqrt{d}$, and slightly smaller than $\varepsilon^{2\gamma}s^2$ for $s$ much greater than $\sqrt{d}$.

In the case $s=d$, $\gamma=1$, $\varepsilon=1$, the same minimax risk was studied in Cai and Low [1], where it was proved that

 $R_{d,d}(1,1)=\inf_{\hat T}\sup_{\theta\in\mathbb{R}^d}\mathbf{E}_\theta\big[(\hat T-N_1(\theta))^2\big]\asymp\frac{d^2}{\log d},$

and it was also claimed that a bound of the same form, with $d$ replaced by $s$, holds over $B_0(s)$ for $s$ comfortably greater than $\sqrt{d}$, which agrees with (2).

We see from (2) that, for the general sparsity classes $B_0(s)$ and any $0<\gamma\le 1$, there exist two different regimes with an elbow at $s\asymp\sqrt{d}$. We call them the sparse zone and the dense zone. The estimation methods for these two regimes are quite different. In the sparse zone, where $s$ is smaller than $\sqrt{d}$, we show that one can use suitably adjusted thresholding to achieve optimality. In this zone, rate optimal estimators can be obtained based on the techniques developed in [3] to construct minimax optimal estimators of linear and quadratic functionals. In the dense zone, where $s$ is greater than $\sqrt{d}$, we use another approach. We follow the general scheme of estimation of non-smooth functionals from [9], and our construction is especially close in spirit to [1]. Specifically, we consider the best polynomial approximation of the function $x\mapsto|x|^\gamma$ in a neighborhood of the origin and plug in unbiased estimators of the coefficients of this polynomial. Outside of this neighborhood, for the observations $y_i$ such that $|y_i|$ is, roughly speaking, greater than the "noise level" of the order $\varepsilon\sqrt{\log d}$, we use $|y_i|^\gamma$ as an estimator of $|\theta_i|^\gamma$. The main difference from the estimator suggested in [1] for $s=d$ lies in the fact that, for the polynomial approximation part, we need to introduce a block structure with exponentially increasing blocks and carefully chosen thresholds depending on $s$. This is needed to achieve optimal bounds for all $s$ in the dense zone and not only for $s=d$ (or $s$ comfortably greater than $\sqrt{d}$).

This paper is organized as follows. In Section 2, we introduce the estimators and state the upper bounds for their risks. Section 3 provides the matching lower bounds. The rest of the paper is devoted to the proofs. In particular, some useful results from approximation theory are collected in Section 6.

## 2. Definition of estimators and upper bounds for their risks

In this section, we propose two different estimators for the dense and sparse regimes, defined by the inequalities $s^2\ge 4d$ and $s^2\le 4d$, respectively. Recall that, in the Introduction, we used the inequalities $s^2\ge d$ and $s^2\le d$, respectively, to define the two regimes. The factor 4 that we introduce in the definition here is a matter of convenience for the proofs. We note that such a change does not influence the final result, since the optimal rate (cf. (2)) is the same, up to a constant, for all $s$ such that $d\le s^2\le 4d$.

### 2.1. Dense zone: $s^2\ge 4d$

For any positive integer $K$, we denote by $P_{\gamma,K}$ the best approximation of the function $x\mapsto|x|^\gamma$ by polynomials of degree at most $2K$ on the interval $[-1,1]$, that is,

 $P_{\gamma,K}=\operatorname{argmin}_{P\in\mathcal{P}_{2K}}\ \max_{x\in[-1,1]}\big||x|^\gamma-P(x)\big|,$

where $\mathcal{P}_{2K}$ is the class of all real polynomials of degree at most $2K$. Since $|x|^\gamma$ is an even function, it suffices to consider approximation by polynomials of even degree. The quality of the best polynomial approximation of $|x|^\gamma$ is described by Lemma 7 below.

We denote by $a_{\gamma,0},a_{\gamma,2},\dots,a_{\gamma,2K}$ the coefficients of the canonical representation of $P_{\gamma,K}$:

 $P_{\gamma,K}(x)=\sum_{k=0}^{K}a_{\gamma,2k}x^{2k},\quad x\in\mathbb{R},$

and by $H_k$ the $k$th Hermite polynomial:

 $H_k(x)=(-1)^k e^{x^2/2}\frac{d^k}{dx^k}e^{-x^2/2},\quad k\in\mathbb{N},\ x\in\mathbb{R}.$
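The polynomials $H_k$ defined above are the "probabilists'" Hermite polynomials, which is exactly the convention of NumPy's `numpy.polynomial.hermite_e` module. The following quick check (our own illustration) confirms the convention and the classical identity $\mathbf{E}\big[\sigma^k H_k(X/\sigma)\big]=\theta^k$ for $X\sim\mathcal{N}(\theta,\sigma^2)$, which is what makes the Hermite terms unbiased estimators of monomials in the construction below.

```python
import numpy as np
from numpy.polynomial import hermite_e as He

# H_k above is the "probabilists'" Hermite polynomial He_k.
H2 = He.HermiteE.basis(2)   # He_2(x) = x^2 - 1
H4 = He.HermiteE.basis(4)   # He_4(x) = x^4 - 6x^2 + 3
print(H2(1.5), H4(1.5))     # 1.25, -5.4375

# Classical identity: if X ~ N(theta, sigma^2),
# then E[sigma^k H_k(X / sigma)] = theta^k.
rng = np.random.default_rng(1)
theta_i, sigma = 0.7, 1.3
X = theta_i + sigma * rng.standard_normal(10**6)
print(sigma**4 * H4(X / sigma).mean())   # approximately theta_i^4 = 0.2401
```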

To construct the estimator in the dense zone, we use the sample duplication device, i.e., we transform $y_1,\dots,y_d$ into randomized observations as follows. Let $z_1,\dots,z_d$ be i.i.d. $\mathcal{N}(0,\varepsilon^2)$ random variables independent of $(y_1,\dots,y_d)$. Set

 $y_{1,i}=y_i+z_i,\qquad y_{2,i}=y_i-z_i,\qquad i=1,\dots,d.$

Then $y_{j,i}\sim\mathcal{N}(\theta_i,\sigma^2)$ for $j=1,2$, where $\sigma=\sqrt{2}\,\varepsilon$, and the random variables $y_{1,1},\dots,y_{1,d},y_{2,1},\dots,y_{2,d}$ are mutually independent.
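A quick numerical sanity check of the duplication device (dimension and noise level are ours): conditionally on $\theta$, $\mathrm{Cov}(y_{1,i},y_{2,i})=\varepsilon^2-\varepsilon^2=0$, and since the pair is jointly Gaussian, the two duplicated samples are independent, each with variance $\sigma^2=2\varepsilon^2$.

```python
import numpy as np

rng = np.random.default_rng(2)
d, eps = 10**6, 0.3                        # illustrative values
theta = rng.uniform(-1.0, 1.0, d)

y = theta + eps * rng.standard_normal(d)   # model (1)
z = eps * rng.standard_normal(d)           # z_i iid N(0, eps^2), independent of y
y1, y2 = y + z, y - z                      # duplicated samples

print(np.var(y1 - theta))                     # approx sigma^2 = 2 * eps^2 = 0.18
print(np.var(y2 - theta))                     # approx 0.18 as well
print(np.mean((y1 - theta) * (y2 - theta)))   # approx 0: the pair decorrelates
```

The price of duplication is a noise level inflated from $\varepsilon$ to $\sigma=\sqrt{2}\,\varepsilon$, which only affects constants in the risk bounds.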

Define the estimator $\hat N_\gamma$ of $N_\gamma(\theta)$ as follows:

 (3) $\hat N_\gamma=\sum_{i=1}^{d}\xi_\gamma(y_{1,i},y_{2,i}),$

where

 $\xi_\gamma(u,v)=\sum_{l=0}^{L}\hat P_{\gamma,K_l,M_l}(u)\,\mathbb{1}_{\sigma t_{l-1}<|v|\le\sigma t_l}+|u|^\gamma\,\mathbb{1}_{|v|>\sigma t_L},$

and

 (4) $\begin{cases}\hat P_{\gamma,K,M}(u)=\sum_{k=1}^{K}\sigma^{2k}a_{\gamma,2k}M^{\gamma-2k}H_{2k}(u/\sigma),\\ K_l=4^l c\log(s^2/d),\\ M_l=2^{l+1}\sigma\sqrt{2\log(s^2/d)},\\ t_l=2^l\sqrt{2\log(s^2/d)},\quad t_{-1}=0,\\ L\ \text{is the smallest integer such that}\ 2^L\ge 3\sqrt{\log(d)/\log(s^2/d)}.\end{cases}$

Here and in what follows, $\mathbb{1}_{A}$ denotes the indicator function of an event $A$, and $c>0$ is a constant that will be chosen small enough (see the proof of Theorem 1 below).
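The block structure of (3)–(4) can be sketched in code. This is our own illustrative implementation, with two loudly flagged simplifications: the minimax polynomial $P_{\gamma,K}$ is replaced by a Chebyshev least-squares fit of $|x|^\gamma$ (a hypothetical stand-in of comparable accuracy), and the constant `c=0.02` is an arbitrary small choice, not the one from the proofs. The helper names `even_coeffs`, `P_hat`, `xi_gamma` are ours.

```python
import numpy as np
from numpy.polynomial import Polynomial, chebyshev as Ch
from numpy.polynomial import hermite_e as He

def even_coeffs(gamma, K):
    # Stand-in for P_{gamma,K}: degree-2K Chebyshev least-squares fit of
    # |x|^gamma on [-1, 1] instead of the true best uniform approximation.
    x = np.cos(np.linspace(0.0, np.pi, 4001))
    fit = Ch.Chebyshev.fit(x, np.abs(x) ** gamma, 2 * K)
    return fit.convert(kind=Polynomial).coef[0::2]   # a_{gamma,0}, a_{gamma,2}, ...

def P_hat(u, gamma, K, M, sigma):
    # \hat P_{gamma,K,M}(u) from (4); each Hermite term is an unbiased
    # estimator of the corresponding monomial in theta_i.
    a = even_coeffs(gamma, K)
    return sum(sigma ** (2 * k) * a[k] * M ** (gamma - 2 * k)
               * He.HermiteE.basis(2 * k)(u / sigma) for k in range(1, K + 1))

def xi_gamma(u, v, gamma, s, d, eps, c=0.02):
    # One summand of estimator (3); u = y_{1,i}, v = y_{2,i}.
    sigma = np.sqrt(2.0) * eps
    r = np.log(s**2 / d)
    L = int(np.ceil(np.log2(3.0 * np.sqrt(np.log(d) / r))))
    t_prev = 0.0
    for l in range(L + 1):
        t_l = 2**l * np.sqrt(2.0 * r)
        if sigma * t_prev < abs(v) <= sigma * t_l:
            K_l = max(1, int(4**l * c * r))
            M_l = 2 ** (l + 1) * sigma * np.sqrt(2.0 * r)
            return P_hat(u, gamma, K_l, M_l, sigma)
        t_prev = t_l
    return abs(u) ** gamma if abs(v) > sigma * t_prev else 0.0
```

The full estimator is then `sum(xi_gamma(y1[i], y2[i], gamma, s, d, eps) for i in range(d))`; the point of the sketch is the mechanism: the second, independent sample $y_{2,i}$ selects the block, while the first sample $y_{1,i}$ enters either a Hermite polynomial block or the plug-in term $|u|^\gamma$.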

We will show that the estimator $\hat N_\gamma$ is optimal in a non-asymptotic minimax sense on the class $B_0(s)$ in the dense zone. The next theorem provides an upper bound on the risk of $\hat N_\gamma$ in this zone.

###### Theorem 1.

Let the integers $s$ and $d$ be such that $s^2\ge 4d$, and let $0<\gamma\le 1$. Then the estimator $\hat N_\gamma$ defined in (3) satisfies

 $\sup_{\theta\in B_0(s)}\mathbf{E}_\theta\big[(\hat N_\gamma-N_\gamma(\theta))^2\big]\le\frac{C\varepsilon^{2\gamma}s^2}{\log^{\gamma}(s^2/d)},$

where $C>0$ is a constant depending only on $\gamma$.

### 2.2. Sparse zone: $s^2\le 4d$

If $s$ belongs to the sparse zone, we do not invoke the sample duplication, and we use the estimator

 (5) $\hat N^{*}_\gamma=\sum_{i=1}^{d}\big\{|y_i|^\gamma-\varepsilon^\gamma\alpha_\gamma\big\}\,\mathbb{1}_{y_i^2>2\varepsilon^2\log(1+d/s^2)},$

where

 $\alpha_\gamma=\frac{\mathbf{E}\big(|\xi|^\gamma\,\mathbb{1}_{\xi^2>2\log(1+d/s^2)}\big)}{\mathbf{P}\big(\xi^2>2\log(1+d/s^2)\big)}\quad\text{for}\ \xi\sim\mathcal{N}(0,1).$
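A sketch of the sparse-zone estimator (5); the implementation is ours, and $\alpha_\gamma$ is approximated by Monte Carlo rather than exact integration. Note that $\alpha_\gamma$ is simply the conditional mean $\mathbf{E}\big(|\xi|^\gamma\,\big|\,\xi^2>2\log(1+d/s^2)\big)$, so the centering makes the pure-noise coordinates that survive the threshold contribute zero on average:

```python
import numpy as np

def alpha_gamma(gamma, T, n=10**7, seed=3):
    # alpha_gamma = E(|xi|^gamma | xi^2 > T) with T = 2*log(1 + d/s^2),
    # approximated here by Monte Carlo.
    xi = np.random.default_rng(seed).standard_normal(n)
    return np.mean(np.abs(xi[xi**2 > T]) ** gamma)

def N_hat_sparse(y, gamma, s, eps):
    # Estimator (5): threshold, then recenter by the null conditional mean.
    d = len(y)
    T = 2.0 * np.log(1.0 + d / s**2)
    kept = y[y**2 > eps**2 * T]
    return np.sum(np.abs(kept) ** gamma - eps**gamma * alpha_gamma(gamma, T))

# Pure noise (theta = 0): the surviving terms are centered, so the estimate
# stays near the true value N_gamma(0) = 0, unlike the uncentered sum.
rng = np.random.default_rng(4)
d, s, eps, gamma = 10_000, 20, 1.0, 0.5      # sparse zone: s^2 = 400 <= 4d
y = eps * rng.standard_normal(d)
T = 2.0 * np.log(1.0 + d / s**2)
print(N_hat_sparse(y, gamma, s, eps))        # small, near 0
print(np.sum(np.abs(y[y**2 > T]) ** gamma))  # uncentered sum: large
```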

The next theorem establishes an upper bound on the risk of this estimator.

###### Theorem 2.

Let the integers $s$ and $d$ be such that $1\le s\le d$ and $s^2\le 4d$. Then the estimator $\hat N^{*}_\gamma$ defined in (5) satisfies

 $\sup_{\theta\in B_0(s)}\mathbf{E}_\theta\big[(\hat N^{*}_\gamma-N_\gamma(\theta))^2\big]\le C\varepsilon^{2\gamma}s^2\log^{\gamma}(1+d/s^2),$

where $C>0$ is a constant depending only on $\gamma$.

Note that, intuitively, the optimal estimator in the sparse zone can be viewed as an example of applying the following routine developed in [3]. We start from the optimal estimator in the non-sparse case $s=d$ and we threshold every term. Then, we center every term by its mean under the assumption that there is no signal. Finally, we choose a threshold that makes the best compromise between the first and second type errors in the support estimation problem. The only subtle ingredient in applying this argument in the present context is that we drop the polynomial part, which would almost always be removed by thresholding. In fact, one can notice that the polynomial approximation is only useful in a neighborhood of the origin, but in the sparse zone we forgo estimating the small components of $\theta$.

## 3. Lower bounds

We denote by $\mathcal{L}$ the set of all monotone non-decreasing functions $\ell:[0,\infty)\to[0,\infty)$ such that $\ell(0)=0$ and $\ell\not\equiv 0$.

###### Theorem 3.

Let $s$ and $d$ be integers such that $1\le s\le d$, and let $\ell$ be any loss function in the class $\mathcal{L}$. There exist positive constants $c_1$ and $c_2$ depending only on $\gamma$ and $\ell$ such that

 $\inf_{\hat T}\sup_{\theta\in B_0(s)}\mathbf{E}_\theta\,\ell\Big(c_1\big(\varepsilon^\gamma s\log^{\gamma/2}(1+d/s^2)\big)^{-1}\big|\hat T-N_\gamma(\theta)\big|\Big)\ge c_2,$

where $\inf_{\hat T}$ denotes the infimum over all estimators.

The proof follows the lines of the proof of the lower bound in [3, Theorem 1], with the only difference that the functional considered there should be replaced by $N_\gamma$. Note that, though Theorem 3 is valid for all $1\le s\le d$, the bound becomes suboptimal in the dense zone.

###### Theorem 4.

Let $s$ and $d$ be integers such that $1\le s\le d$, and let $\ell$ be any loss function in the class $\mathcal{L}$. There exist positive constants $c_1$ and $c_2$ depending only on $\gamma$ and $\ell$, and a constant $\bar c>0$ depending only on $\gamma$, such that, if $s^2\ge\bar c\,d$, then

 $\inf_{\hat T}\sup_{\theta\in B_0(s)}\mathbf{E}_\theta\,\ell\Big(c_1\big(\varepsilon^\gamma s\log^{-\gamma/2}(s^2/d)\big)^{-1}\big|\hat T-N_\gamma(\theta)\big|\Big)\ge c_2,$

where $\inf_{\hat T}$ denotes the infimum over all estimators.

In the case of the quadratic loss $\ell(t)=t^2$, combining these two theorems with the bounds of Theorems 1 and 2 immediately leads to relation (2).

## 4. Proofs of the upper bounds

Throughout the proofs, we denote by $C$ positive constants that can depend only on $\gamma$ and may take different values on different appearances.

### 4.1. Proof of Theorem 1

Denote by $S$ the support of $\theta$. We have

 $(\hat N_\gamma-N_\gamma(\theta))^2\le 4\Big(\sum_{i\in S}\mathbf{E}_\theta\xi_\gamma(y_{1,i},y_{2,i})-\sum_{i\in S}|\theta_i|^\gamma\Big)^2+4\Big(\sum_{i\in S}\xi_\gamma(y_{1,i},y_{2,i})-\sum_{i\in S}\mathbf{E}_\theta\xi_\gamma(y_{1,i},y_{2,i})\Big)^2+4\Big(\sum_{i\notin S}\mathbf{E}_\theta\xi_\gamma(y_{1,i},y_{2,i})\Big)^2+4\Big(\sum_{i\notin S}\xi_\gamma(y_{1,i},y_{2,i})-\sum_{i\notin S}\mathbf{E}_\theta\xi_\gamma(y_{1,i},y_{2,i})\Big)^2,$

so that

 (6) $\mathbf{E}_\theta\big[(\hat N_\gamma-N_\gamma(\theta))^2\big]\le 4s^2\max_{i\in S}B_i^2+4s\max_{i\in S}V_i+4d^2\max_{i\notin S}B_i^2+4d\max_{i\notin S}V_i,$

where $B_i=\mathbf{E}_\theta\,\xi_\gamma(y_{1,i},y_{2,i})-|\theta_i|^\gamma$ is the bias of $\xi_\gamma(y_{1,i},y_{2,i})$ as an estimator of $|\theta_i|^\gamma$ and $V_i$ is its variance. We now bound separately the four terms in (6).

Bias for $i\notin S$. If $\theta_i=0$, then using Lemma 2 we obtain

 $|B_i|=\sigma^\gamma\,\mathbf{E}|\xi|^\gamma\,\mathbf{P}(|\xi|>t_L)\le C\sigma^\gamma e^{-t_L^2/2},\qquad\xi\sim\mathcal{N}(0,1).$

The last exponential is smaller than $d^{-9}$ by the definition of $L$, so that

 (7) $d^2\max_{i\notin S}B_i^2\le C\sigma^{2\gamma}d\le\frac{C\sigma^{2\gamma}s^2}{\log(s^2/d)}.$
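The identity $|B_i|=\sigma^\gamma\mathbf{E}|\xi|^\gamma\,\mathbf{P}(|\xi|>t_L)$ rests on two facts that are easy to confirm numerically (our own check, values of $\gamma$ ours): the Hermite terms of $\hat P_{\gamma,K,M}$ have zero mean when $\theta_i=0$, and $\mathbf{E}|\xi|^\gamma$ has the standard closed form $2^{\gamma/2}\Gamma((\gamma+1)/2)/\sqrt{\pi}$.

```python
import numpy as np
from math import gamma as Gamma, sqrt, pi
from numpy.polynomial import hermite_e as He

g = 0.5
rng = np.random.default_rng(5)
xi = rng.standard_normal(10**6)

# E|xi|^gamma = 2^{gamma/2} Gamma((gamma+1)/2) / sqrt(pi)  (standard fact)
closed = 2 ** (g / 2) * Gamma((g + 1) / 2) / sqrt(pi)
print(closed, np.mean(np.abs(xi) ** g))      # both approximately 0.822

# Zero-mean Hermite terms at theta_i = 0: E[H_{2k}(xi)] = 0 for k >= 1,
# so the polynomial blocks do not contribute to the bias B_i.
print(He.HermiteE.basis(2)(xi).mean())       # approximately 0
print(He.HermiteE.basis(4)(xi).mean())       # approximately 0
```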

Variance for $i\notin S$. If $\theta_i=0$, then

 (8) $V_i\le\sum_{l=0}^{L}\mathbf{E}\hat P^2_{\gamma,K_l,M_l}(\sigma\xi)\,\mathbf{P}(|\xi|>t_{l-1})+\sigma^{2\gamma}\,\mathbf{E}|\xi|^{2\gamma}\,\mathbf{P}(|\xi|>t_L),\qquad\xi\sim\mathcal{N}(0,1).$

The last term in (8) is bounded from above as in the previous item. Next, in view of Lemma 3,

 $\mathbf{E}\hat P^2_{\gamma,K_0,M_0}(\sigma\xi)\le C\sigma^{2\gamma}6^{2K_0}(M_0/\sigma)^2\le C\sigma^{2\gamma}\log(s^2/d)\Big(\frac{s^2}{d}\Big)^{2c\log 6}\le\frac{C\sigma^{2\gamma}s^2}{d\log(s^2/d)}$

if $c>0$ is chosen small enough. Here, we use the assumption $s^2\ge 4d$. For $l\ge 1$, we use Lemma 3 to obtain

 $\mathbf{E}\hat P^2_{\gamma,K_l,M_l}(\sigma\xi)\,\mathbf{P}(|\xi|>t_{l-1})\le C\sigma^{2\gamma}6^{2K_l}e^{-t_{l-1}^2/2}(M_l/\sigma)^2\le C\sigma^{2\gamma}4^l\log(s^2/d)\Big(\frac{s^2}{d}\Big)^{(2c\log 6-1/4)4^l}\le\frac{C\sigma^{2\gamma}}{4^l\log(s^2/d)}$

if we choose $c$ such that $2c\log 6<1/4$. In conclusion, under this choice of $c$, using the facts that $\sum_{l\ge 0}4^{-l}\le C$ and $\log(s^2/d)\ge(\log 4)^{1-\gamma}\log^{\gamma}(s^2/d)$, we get

 (9) $d\max_{i\notin S}V_i\le\frac{C\sigma^{2\gamma}s^2}{\log^{\gamma}(s^2/d)}.$

Bias for $i\in S$. If $\theta_i\ne 0$, the bias has the form

 $B_i=\sum_{l=0}^{L}\mathbf{E}\hat P_{\gamma,K_l,M_l}(X)\,\mathbf{P}\big(\sigma t_{l-1}<|X|\le\sigma t_l\big)+\mathbf{E}|X|^\gamma\,\mathbf{P}\big(|X|>\sigma t_L\big)-|\theta_i|^\gamma,$

where $X\sim\mathcal{N}(\theta_i,\sigma^2)$ (each expectation factor and each probability factor correspond to the independent observations $y_{1,i}$ and $y_{2,i}$, respectively). We will analyze this expression separately in three different ranges of values of $|\theta_i|$.

Case $0<|\theta_i|\le 2\sigma t_0$. In this case, we use the bound

 $|B_i|\le\max_l\big|\mathbf{E}\hat P_{\gamma,K_l,M_l}(X)-|\theta_i|^\gamma\big|+\big|\mathbf{E}|X|^\gamma-|\theta_i|^\gamma\big|\,\mathbf{P}(|X|>\sigma t_L),$

where $X\sim\mathcal{N}(\theta_i,\sigma^2)$. Since $|\theta_i|\le M_l$ for all $l$, we can use Lemma 4 to obtain

 (10) $\big|\mathbf{E}\hat P_{\gamma,K_l,M_l}(X)-|\theta_i|^\gamma\big|\le C\Big(\frac{M_l}{K_l}\Big)^{\gamma}\le\frac{C\sigma^\gamma}{\log^{\gamma/2}(s^2/d)}.$

In addition, using Lemma 1 we get

 $\big|\mathbf{E}|X|^\gamma-|\theta_i|^\gamma\big|\,\mathbf{P}(|X|>\sigma t_L)\le C\sigma^\gamma\,\mathbf{P}\big(|\xi|>t_L-|\theta_i|/\sigma\big)\le C\sigma^\gamma\,\mathbf{P}(|\xi|>t_0)\le\frac{C\sigma^\gamma}{\log(s^2/d)},$

where $\xi\sim\mathcal{N}(0,1)$ and we have used the inequalities $t_L-|\theta_i|/\sigma\ge t_L-2t_0\ge t_0$ and $\mathbf{P}(|\xi|>t_0)\le e^{-t_0^2/2}=d/s^2\le 1/\log(s^2/d)$. It follows that

 (11) $s^2\max_{0<|\theta_i|\le 2\sigma t_0}B_i^2\le\frac{C\sigma^{2\gamma}s^2}{\log^{\gamma}(s^2/d)}.$

Case $2\sigma t_0<|\theta_i|\le 2\sigma t_L$. Let $l_0$ be the integer such that $2\sigma t_{l_0-1}<|\theta_i|\le 2\sigma t_{l_0}$. We have

 (12) $|B_i|\le\sum_{l=0}^{l_0-1}\big|\mathbf{E}\hat P_{\gamma,K_l,M_l}(X)-|\theta_i|^\gamma\big|\cdot\mathbf{P}\big(\sigma t_{l-1}<|X|\le\sigma t_l\big)+\max_{l\ge l_0}\big|\mathbf{E}\hat P_{\gamma,K_l,M_l}(X)-|\theta_i|^\gamma\big|+\big|\mathbf{E}|X|^\gamma-|\theta_i|^\gamma\big|,$

where $X\sim\mathcal{N}(\theta_i,\sigma^2)$. Analogously to (10), we find

 $\max_{l\ge l_0}\big|\mathbf{E}\hat P_{\gamma,K_l,M_l}(X)-|\theta_i|^\gamma\big|\le\frac{C\sigma^\gamma}{\log^{\gamma/2}(s^2/d)}.$

Next, Lemma 1 and the fact that $|\theta_i|>2\sigma t_0$ imply

 (13) $\big|\mathbf{E}|X|^\gamma-|\theta_i|^\gamma\big|\le C\sigma^\gamma\big(\sigma/|\theta_i|\big)^{2-\gamma}\le\frac{C\sigma^\gamma}{\log^{1-\gamma/2}(s^2/d)}\le\frac{C\sigma^\gamma}{\log^{\gamma/2}(s^2/d)}.$

Finally, we consider the first sum on the right-hand side of (12). Notice that

 $\mathbf{P}\big(\sigma t_{l-1}<|X|\le\sigma t_l\big)\le e^{-\theta_i^2/(8\sigma^2)},\qquad l=0,\dots,l_0-1,$

since $\sigma t_l\le|\theta_i|/2$ for $l\le l_0-1$. Using these inequalities and Lemma 5, we get

 $\sum_{l=0}^{l_0-1}\big|\mathbf{E}\hat P_{\gamma,K_l,M_l}(X)\big|\cdot\mathbf{P}\big(\sigma t_{l-1}<|X|\le\sigma t_l\big)\le C\sigma^\gamma\sum_{l=0}^{l_0-1}6^{K_l}K_l^{3/2}e^{(c-1)\theta_i^2/(8\sigma^2)}\le C\sigma^\gamma\sum_{l=0}^{l_0-1}t_l^3\,e^{(c\log 6+c-1)t_l^2/2}.$

Choose $c$ small enough that $c\log 6+c-1\le-1/2$. As $t_l\ge 2^l t_0$, the sum is dominated by its first term, and this yields

 $\sum_{l=0}^{l_0-1}\big|\mathbf{E}\hat P_{\gamma,K_l,M_l}(X)\big|\cdot\mathbf{P}\big(\sigma t_{l-1}<|X|\le\sigma t_l\big)\le C\sigma^\gamma e^{-(1/2)\log(s^2/d)}\le\frac{C\sigma^\gamma}{\log^{\gamma/2}(s^2/d)}.$

Furthermore,

 (14) $\sum_{l=0}^{l_0-1}|\theta_i|^\gamma\,\mathbf{P}\big(\sigma t_{l-1}<|X|\le\sigma t_l\big)\le l_0|\theta_i|^\gamma e^{-\theta_i^2/(8\sigma^2)}\le C\log\Big(\frac{\theta_i^2}{2\sigma^2\log(s^2/d)}\Big)|\theta_i|^\gamma e^{-\theta_i^2/(8\sigma^2)}\le C\sigma^\gamma e^{-\theta_i^2/(16\sigma^2)},$

where we have used that the function $u\mapsto\log(u)\,u^{\gamma/2}e^{-u/16}$ is bounded on $[1,\infty)$. Since $|\theta_i|>2\sigma t_0$, this also implies that (14) does not exceed

 $\frac{C\sigma^\gamma}{\log^{1/2}(s^2/d)}\le\frac{C\sigma^\gamma}{\log^{\gamma/2}(s^2/d)}.$

Combining the above arguments yields

 (15) $s^2\max_{2\sigma t_0<|\theta_i|\le 2\sigma t_L}B_i^2\le\frac{C\sigma^{2\gamma}s^2}{\log^{\gamma}(s^2/d)}.$

Case $|\theta_i|>2\sigma t_L$. Recall that the bias has the form

 $B_i=\sum_{l=0}^{L}\mathbf{E}\hat P_{\gamma,K_l,M_l}(X)\,\mathbf{P}\big(\sigma t_{l-1}<|X|\le\sigma t_l\big)+\mathbf{E}|X|^\gamma\,\mathbf{P}\big(|X|>\sigma t_L\big)-|\theta_i|^\gamma,$

where $X\sim\mathcal{N}(\theta_i,\sigma^2)$. Using Lemma 5, we get

 $\Big|\sum_{l=0}^{L}\mathbf{E}\hat P_{\gamma,K_l,M_l}(X)\,\mathbf{P}\big(\sigma t_{l-1}<|X|\le\sigma t_l\big)\Big|\le\max_{l=0,\dots,L}\big|\mathbf{E}\hat P_{\gamma,K_l,M_l}(X)\big|\,\mathbf{P}\big(|X|\le\sigma t_L\big)\le C\sigma^\gamma 6^{K_L}K_L^{3/2}e^{c\theta_i^2/(8\sigma^2)}e^{-\theta_i^2/(8\sigma^2)}\le C\sigma^\gamma(\log d)^{3/2}\,6^{9c\log d}\,e^{9(c-1)\log d},$

and the last upper bound is smaller than $C\sigma^\gamma/\log^{\gamma/2}(s^2/d)$ if $c$ is small enough. On the other hand, it follows from (13) that $\big|\mathbf{E}|X|^\gamma-|\theta_i|^\gamma\big|\le C\sigma^\gamma\log^{-\gamma/2}(s^2/d)$. Thus,

 $\big|\mathbf{E}|X|^\gamma\,\mathbf{P}(|X|>\sigma t_L)-|\theta_i|^\gamma\big|\le\big|\mathbf{E}|X|^\gamma-|\theta_i|^\gamma\big|+|\theta_i|^\gamma\,\mathbf{P}\big(|X|\le\sigma t_L\big)\le C\sigma^\gamma\log^{-\gamma/2}(s^2/d)+|\theta_i|^\gamma e^{-\theta_i^2/(8\sigma^2)}\le C\sigma^\gamma\log^{-\gamma/2}(s^2/d).$

Finally, we get

 (16) $s^2\max_{|\theta_i|>2\sigma t_L}B_i^2\le\frac{C\sigma^{2\gamma}s^2}{\log^{\gamma}(s^2/d)}.$

Variance for $i\in S$. We consider the same three cases as in the previous item. For the first two cases, it suffices to use the coarse bound granting that, for all $\theta_i$,

 (17) $V_i\le\mathbf{E}_\theta\big[\xi_\gamma^2(y_{1,i},y_{2,i})\big]=\sum_{l=0}^{L}\mathbf{E}\hat P^2_{\gamma,K_l,M_l}(X)\,\mathbf{P}\big(\sigma t_{l-1}<|X|\le\sigma t_l\big)+\mathbf{E}|X|^{2\gamma}\,\mathbf{P}\big(|X|>\sigma t_L\big),$

where $X\sim\mathcal{N}(\theta_i,\sigma^2)$.

Case $0<|\theta_i|\le 2\sigma t_0$. In this case, we deduce from (17) that

 $V_i\le\max_{l=0,\dots,L}\mathbf{E}\hat P^2_{\gamma,K_l,M_l}(X)+\mathbf{E}|X|^{2\gamma},$

where $X\sim\mathcal{N}(\theta_i,\sigma^2)$. Lemma 4 and the fact that $|\theta_i|\le 2\sigma t_0$ imply

 $V_i\le CM_L^{2\gamma}2^{8K_L}+\sigma^{2\gamma}+|\theta_i|^{2\gamma}\le C\sigma^{2\gamma}\log^{\gamma}(d)\,d^{72c\log 2}+C\sigma^{2\gamma}\log(s^2/d).$

Hence, if $c$ is small enough, we conclude that

 (18) $s\max_{0<|\theta_i|\le 2\sigma t_0}V_i\le\frac{C\sigma^{2\gamma}s^2}{\log^{\gamma}(s^2/d)}.$

Case $2\sigma t_0<|\theta_i|\le 2\sigma t_L$. As in the previous item, we denote by $l_0$ the integer such that $2\sigma t_{l_0-1}<|\theta_i|\le 2\sigma t_{l_0}$. We deduce from (17) that

 $V_i\le\max_{l=0,\dots,l_0-1}\mathbf{E}\hat P^2_{\gamma,K_l,M_l}(X)\,\mathbf{P}\big(|X|\le\sigma t_{l_0-1}\big)+\max_{l=l_0,\dots,L}\mathbf{E}\hat P^2_{\gamma,K_l,M_l}(X)+\mathbf{E}|X|^{2\gamma},$

where $X\sim\mathcal{N}(\theta_i,\sigma^2)$. The last two terms on the right-hand side are controlled as in the previous case. For the first term, we find using Lemma 5 that, for $l\le l_0-1$,

 (19) $\mathbf{E}\hat P^2_{\gamma,K_l,M_l}(X)\,\mathbf{P}\big(|X|\le\sigma t_{l_0-1}\big)\le C\sigma^{2\gamma}\big(\sigma/M_0\big)^{4-2\gamma}6^{2K_{l_0-1}}e^{c\log(1+4/c)\theta_i^2/(4\sigma^2)}e^{-\theta_i^2/(8\sigma^2)}\le C\sigma^{2\gamma}\log^{-1}(s^2/d)\,e^{(c\log 6+4c\log(1+4/c)-1/2)t_{l_0-1}^2}.$

Choosing $c$ small enough allows us to obtain the desired bound

 (20) $s\max_{2\sigma t_0<|\theta_i|\le 2\sigma t_L}V_i\le\frac{C\sigma^{2\gamma}s^2}{\log^{\gamma}(s^2/d)}.$

Case $|\theta_i|>2\sigma t_L$. We first note that

 Var(|y1,i|γ1|y2,i|>σtL) =P(