 # Rate-optimal nonparametric estimation for random coefficient regression models

Random coefficient regression models are a popular tool for analyzing unobserved heterogeneity, and have seen renewed interest in the recent econometric literature. In this paper we obtain the optimal pointwise convergence rate for estimating the density in the linear random coefficient model over Hölder smoothness classes, and in particular show how the tail behavior of the design density impacts this rate. In contrast to previous suggestions, the estimator that we propose and that achieves the optimal convergence rate does not require dividing by a nonparametric density estimate. The optimal choice of the tuning parameters in the estimator depends on the tail parameter of the design density and on the smoothness level of the Hölder class, and we also study adaptive estimation with respect to both parameters.


## 1 Introduction

In this paper we consider the linear random coefficient regression model, in which i.i.d. (independent and identically distributed) data $(X_j, Y_j)$, $j = 1, \dots, n$, are observed according to

$$Y_j = A_{0,j} + A_{1,j} X_j. \tag{1.1}$$

Therein, $(A_{0,j}, A_{1,j})$ are unobserved i.i.d. random variables with the bivariate Lebesgue density $f_A$, while $(A_{0,j}, A_{1,j})$ and $X_j$ are independent. Note that (1.1) represents a randomized extension of the standard linear regression model. We shall derive the optimal convergence rates for estimating $f_A$ over Hölder smoothness classes in the case where the $X_j$ have a Lebesgue density $f_X$ with polynomial tail behaviour, as specified in Assumption 1.1 below.

From a parametric point of view with focus on means and variances of the random coefficients, a multivariate version of model (1.1) is studied by . They assume the coefficients to be mutually independent. The nonparametric analysis of model (1.1) has been initiated by  and .  use Fourier methods to construct an estimator of $f_A$. They do not derive the optimal convergence rate, though. Furthermore, their estimator is rather involved as it requires a nonparametric estimator of a conditional characteristic function, which is then plugged into a regularized Fourier inversion.

Extensions of model (1.1) have seen renewed interest in the econometrics literature in recent years.  suggest a nonparametric estimator in a multivariate version of model (1.1). They only obtain its convergence rate for very heavy tailed regressors. Moreover, their estimator requires dividing by a nonparametric density estimator for a transformed version of the regressors. This involves an additional smoothing step, and potentially renders the estimator unstable.  propose a specification test for model (1.1) against a general nonseparable model as the alternative, while  suggest multiscale tests for qualitative hypotheses on . Extensions and modifications of model (1.1) are studied in , , , , , ,  and in .

In this paper, we consider the basic model (1.1) under the following condition.

###### Assumption 1.1 (Design density).

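For some constants $\beta$ and $C_X \ge c_X > 0$, the density $f_X$ satisfies

$$C_X (1+|x|)^{-\beta-2} \;\ge\; f_X(x) \;\ge\; c_X \cdot (1+|x|)^{-\beta-2} \qquad \forall x \in \mathbb{R}. \tag{1.2}$$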

We analyze precisely how the tail parameter of influences the optimal rate of convergence of at a given point in a minimax sense in case . Note that the heavy tailed setting which is studied in  corresponds to in Assumption 1.1. To our best knowledge a rigorous study of the minimax convergence rate in the more realistic case of is missing so far. In this paper we fill this gap and derive optimal rates, which are fundamentally new and not known from any other nonparametric estimation problem.

Inspired by , the estimator that we propose and that achieves the optimal convergence rate is a Priestley-Chao-type estimator in which we exploit the order statistics of a transformed version of the design variables. Thus, in particular, it does not require dividing by a nonparametric density estimator. The optimal choice of the tuning parameters depends both on the parameter and on the smoothness parameter of the Hölder class, which is reminiscent of the estimation problem in  and in contrast to usual adaptation problems in nonparametric curve estimation, in which the smoothing parameters shall adapt only to an unknown smoothness level. Here we show how to make the estimator adaptive with respect to both of these parameters.

The paper is organized as follows. In Section 2 we introduce our estimation procedure. Section 3 is devoted to upper and lower risk bounds, which yield minimax rate optimality; while Section 4 deals with adaptivity. The proofs and technical lemmata are deferred to Section 5.

Let us fix some notation: $\psi_A$ denotes the characteristic function of the $(A_{0,j}, A_{1,j})$, while $\psi_{U|Z}$ is the conditional characteristic function of the random variable $U$ given the random variable $Z$.

## 2 The estimator

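In order to construct an estimator for $f_A$ in model (1.1), we transform the data $(X_j, Y_j)$ into $(U_j, Z_j)$ via

$$U_j = Y_j / \sqrt{1 + X_j^2}, \qquad (\cos Z_j, \sin Z_j) = (1, X_j) / \sqrt{1 + X_j^2},$$

so that $Z_j \in (-\pi/2, \pi/2)$ a.s., $(A_{0,j}, A_{1,j})$ and $Z_j$ are independent, and

$$U_j = A_{0,j} \cos Z_j + A_{1,j} \sin Z_j. \tag{2.1}$$

Then the conditional characteristic function of $U_j$ given $Z_j$ equals

$$\psi_{U|Z}(t|z) = \psi_A(t \cos z, t \sin z). \tag{2.2}$$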
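The transformation can be checked numerically. The following is a minimal sketch, not part of the paper: the function name `transform`, the Gaussian coefficient distribution and the Student-t design are illustrative assumptions. It uses that $Z_j = \arctan X_j$ satisfies $(\cos Z_j, \sin Z_j) = (1, X_j)/\sqrt{1+X_j^2}$ and verifies (2.1) on simulated data.

```python
import numpy as np


def transform(X, Y):
    """Map (X_j, Y_j) to (U_j, Z_j) as in Section 2 (Z_j = arctan X_j)."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    U = Y / np.sqrt(1.0 + X**2)
    Z = np.arctan(X)  # cos(Z) = 1/sqrt(1+X^2), sin(Z) = X/sqrt(1+X^2)
    return U, Z


# Illustrative simulation (all distributions here are assumptions).
rng = np.random.default_rng(0)
n = 1000
A0 = rng.normal(1.0, 0.5, size=n)   # hypothetical random intercepts
A1 = rng.normal(-0.5, 1.0, size=n)  # hypothetical random slopes
X = rng.standard_t(3, size=n)       # design with polynomial tails
Y = A0 + A1 * X                     # model (1.1)

U, Z = transform(X, Y)
# Identity (2.1): U_j = A_{0,j} cos Z_j + A_{1,j} sin Z_j
identity_holds = np.allclose(U, A0 * np.cos(Z) + A1 * np.sin(Z))
```

Working with $Z_j$ instead of $X_j$ maps the real line onto the bounded interval $(-\pi/2, \pi/2)$, so heavy tails of the design show up as sparsity of the $Z_j$ near the endpoints.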

By Fourier inversion, integral substitution into polar coordinates (with signed radius) and (2.2) we deduce that

$$f_A(a) = \frac{1}{(2\pi)^2} \iint \exp(-i a' b)\, \psi_A(b)\, db = \frac{1}{(2\pi)^2} \int_{\mathbb{R}} \int_{-\pi/2}^{\pi/2} |t| \exp\big(-it(a_0 \cos z + a_1 \sin z)\big)\, \psi_{U|Z}(t|z)\, dz\, dt. \tag{2.3}$$

Equation (2.3) motivates us to estimate $f_A$ by an empirical version of the conditional characteristic function $\psi_{U|Z}$, which is directly accessible from the data $(U_j, Z_j)$. For that purpose choose a function $w$ which satisfies the following assumption.

###### Assumption 2.1 (Kernel).

For a number the function is even, compactly supported, times continuously differentiable on the whole real line and satisfies as well as for all .

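Now we consider the regularized version of $f_A$ by kernel smoothing as follows

$$\tilde f_A(a;h) = \frac{1}{(2\pi)^2} \int_{\mathbb{R}} \int_{-\pi/2}^{\pi/2} w(th)\, |t| \exp\big(-it(a_0 \cos z + a_1 \sin z)\big)\, \psi_{U|Z}(t|z)\, dz\, dt = \int_{-\pi/2}^{\pi/2} \int_{\mathbb{R}} K(u - a_0 \cos z - a_1 \sin z; h)\, f_{U|Z}(u|z)\, du\, dz, \tag{2.4}$$

where

$$K(x;h) := \frac{1}{(2\pi)^2} \int_{\mathbb{R}} w(th)\, |t| \exp(itx)\, dt. \tag{2.5}$$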

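Inspired by (2.4) we introduce the following Priestley-Chao-type estimator of the density $f_A$,

$$
\begin{aligned}
\hat f_A(a;h,\delta) &= \sum_{j=1}^{n-1} K\big(U_{[j]} - a_0 \cos Z_{(j)} - a_1 \sin Z_{(j)}; h\big)\, (Z_{(j+1)} - Z_{(j)}) \cdot \mathbf{1}_{\{-\pi/2+\delta \le Z_{(j)} \le Z_{(j+1)} \le \pi/2-\delta\}} \\
&= \frac{1}{(2\pi)^2} \int_{\mathbb{R}} w(th)\, |t| \sum_{j=1}^{n-1} \exp\big(it(U_{[j]} - a_0 \cos Z_{(j)} - a_1 \sin Z_{(j)})\big) \cdot (Z_{(j+1)} - Z_{(j)})\, \mathbf{1}_{\{-\pi/2+\delta \le Z_{(j)} \le Z_{(j+1)} \le \pi/2-\delta\}}\, dt,
\end{aligned} \tag{2.6}
$$

where $(U_{[j]}, Z_{(j)})$, $j = 1, \dots, n$, denotes the sample $(U_j, Z_j)$, $j = 1, \dots, n$, sorted such that $Z_{(1)} \le \dots \le Z_{(n)}$, and where $h$ is a classical bandwidth parameter and $\delta$ is a threshold parameter, both of which remain to be selected. By the parameter $\delta$ we cut off the subset of the interval $(-\pi/2, \pi/2)$ in which the $Z_j$ are sparse.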
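The estimator (2.6) is a weighted sum over the spacings of the ordered $Z_{(j)}$ and can be sketched in a few lines. The code below is an illustration under stated assumptions, not the authors' implementation: the weight function `w` is a hypothetical choice (even, supported on $[-1,1]$, $w(0)=1$) that is not claimed to satisfy Assumption 2.1 for any particular order, and the integral in (2.5) is approximated by the trapezoidal rule. Since $w(th)|t|$ is even in $t$, the kernel reduces to a cosine integral over $[0, 1/h]$.

```python
import numpy as np


def w(t):
    # Hypothetical weight function (an assumption, not from the paper):
    # even, supported on [-1, 1], with w(0) = 1.
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= 1.0, (1.0 - t**2) ** 3, 0.0)


def kernel_K(x, h, m=2001):
    """Approximate K(x; h) from (2.5) by the trapezoidal rule.

    Because w(t h)|t| is even in t, the imaginary part vanishes and
    K(x; h) = 2 (2 pi)^{-2} * int_0^{1/h} w(t h) t cos(t x) dt.
    """
    x = np.atleast_1d(np.asarray(x, dtype=float))
    t = np.linspace(0.0, 1.0 / h, m)
    integrand = w(t * h) * t * np.cos(np.outer(x, t))
    dt = t[1] - t[0]
    integral = dt * (integrand.sum(axis=1)
                     - 0.5 * (integrand[:, 0] + integrand[:, -1]))
    return 2.0 * integral / (2.0 * np.pi) ** 2


def estimate_fA(a, X, Y, h, delta):
    """Priestley-Chao-type estimate of f_A at a = (a0, a1), following (2.6)."""
    a0, a1 = a
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    U = Y / np.sqrt(1.0 + X**2)
    Z = np.arctan(X)
    order = np.argsort(Z)           # sort the sample by Z
    Zs, Us = Z[order], U[order]     # Z_(j) and the concomitant U_[j]
    gaps = np.diff(Zs)              # Z_(j+1) - Z_(j), j = 1, ..., n-1
    lo, hi = -np.pi / 2 + delta, np.pi / 2 - delta
    keep = (Zs[:-1] >= lo) & (Zs[1:] <= hi)   # indicator in (2.6)
    args = Us[:-1] - a0 * np.cos(Zs[:-1]) - a1 * np.sin(Zs[:-1])
    return float(np.sum(kernel_K(args, h) * gaps * keep))
```

A call such as `estimate_fA((0.0, 0.0), X, Y, h=0.5, delta=0.1)` returns the estimate at the origin. No density estimate of the design appears in the denominator: the spacings $Z_{(j+1)} - Z_{(j)}$ play the role of the integration weights.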

## 3 Upper and lower risk bounds

We consider the following Hölder smoothness class of densities.

###### Definition 3.1.

For a point , a smoothness index and constants define the class of densities as follows: is Hölder-smooth of the degree in the neighborhood , that is, is -times continuously differentiable in and its partial derivatives satisfy

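$$\left| \frac{\partial^s f_A}{\partial x^k \partial y^{s-k}}(x,y) - \frac{\partial^s f_A}{\partial x^k \partial y^{s-k}}(a_0,a_1) \right| \le c_A \cdot \big|(x,y) - a\big|^{\alpha - s}, \tag{3.1}$$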

for all and

. Furthermore, assume that the Fourier transform

of is weakly differentiable and its weak derivative satisfies

$$\int \operatorname*{ess\,sup}_{y} \big|\nabla \psi_A(x,y)\big|\, dx \le c_B, \tag{3.2}$$

and that for all .

The first theorem provides an upper bound on the convergence rate for the estimator in (2.6).

###### Theorem 3.2.

Consider model (1.1) and assume that satisfies (1.2) for some . If satisfies Assumption 2.1 for , and if and are chosen such that

$$\delta \asymp n^{-\frac{1}{\beta+1}} \quad \text{and} \quad h \asymp n^{-\frac{1}{(\alpha+2)(\beta+1)}},$$

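then the estimator (2.6) attains the following asymptotic risk upper bound over the function class $\mathcal{F}$,

$$\sup_{f_A \in \mathcal{F}} E_{f_A}\Big[\big|\hat f_A(a;h,\delta) - f_A(a)\big|^2\Big] = O\Big(n^{-\frac{2\alpha}{(\alpha+2)(\beta+1)}}\Big).$$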

The following theorem yields that the convergence rates which our estimator (2.6) achieves according to Theorem 3.2 are optimal for the pointwise risk in the minimax sense.

###### Theorem 3.3.

Fix and the constants , sufficiently large for any and . Let be an arbitrary sequence of estimators of where is based on the data , , for each . Assume that satisfies (1.2). Then

$$\liminf_{n \to \infty}\; n^{\frac{2\alpha}{(\alpha+2)(\beta+1)}} \sup_{f_A \in \mathcal{F}} E_{f_A}\Big[\big|\hat f_n(0) - f_A(0)\big|^2\Big] > 0.$$

The convergence rates from Theorems 3.2 and 3.3 differ significantly from standard rates in nonparametric estimation. While they become faster as $\alpha$ increases, they become slower as $\beta$ gets larger. It is remarkable that they do not approach the (squared) parametric rate $n^{-1}$ but the slower rate $n^{-2/(\beta+1)}$ for large $\alpha$.
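The dependence of the rate on the two parameters is easy to tabulate. The helper functions below are a small illustrative sketch (the names `rate_exponent` and `tuning_parameters` are ours); they evaluate the exponent $2\alpha/((\alpha+2)(\beta+1))$ and the tuning parameters of Theorem 3.2 with all proportionality constants set to one, which is an assumption made only for illustration.

```python
def rate_exponent(alpha, beta):
    """Exponent r such that the squared pointwise risk is O(n^{-r})."""
    return 2.0 * alpha / ((alpha + 2.0) * (beta + 1.0))


def tuning_parameters(n, alpha, beta):
    """delta and h as in Theorem 3.2, with constants set to 1 (assumption)."""
    delta = n ** (-1.0 / (beta + 1.0))
    h = n ** (-1.0 / ((alpha + 2.0) * (beta + 1.0)))
    return delta, h


# The exponent grows with alpha, shrinks with beta, and stays below 2/(beta+1).
for alpha in (1.0, 2.0, 10.0, 1000.0):
    print(alpha, rate_exponent(alpha, beta=3.0))
```

With these choices the squared bias $h^{2\alpha}$ equals $n^{-r}$ for $r$ the exponent above, which is the balance used in the proof of Theorem 3.2.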

The case . An analysis of the proof of Theorem 3.2 shows that in case , choosing and gives the rate

$$\sup_{f_A \in \mathcal{F}} E_{f_A}\Big[\big|\hat f_A(a;h,\delta) - f_A(a)\big|^2\Big] = O\Big(n^{-\frac{2\alpha}{2\alpha+4}}\Big);$$

in case $\beta = 1$, an additional logarithmic factor occurs. The upper bound no longer depends on $\beta$ in this regime. For ,  obtain the faster rate ; their rate is in but could be transferred to a pointwise rate. However, they additionally impose the assumption that the density is uniformly bounded with a bounded support, which implies that is also uniformly bounded. Under this additional assumption, one can show that our estimator also achieves the rate for , even with the choice . See Remark 5.1.

### 4.1 Adaptation with respect to β for given smoothness

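Assume that (1.2) holds with unknown $\beta$. We consider the following selection rule for $\delta$. Write

$$\sum_{j,n,\delta} := \sum_{j=1}^{n-1} \mathbf{1}_{\{-\pi/2+\delta \le Z_{(j)} \le Z_{(j+1)} \le \pi/2-\delta\}} \tag{4.1}$$

for the sum over the indices $j$ for which $-\pi/2+\delta \le Z_{(j)} \le Z_{(j+1)} \le \pi/2-\delta$. Further, if there are at least two observations in the interval $[-\pi/2+\delta, \pi/2-\delta]$ so that the sum in (4.1) is not empty, we set

$$L_n(\delta) = \min\{Z_j : Z_j \ge -\pi/2+\delta\}, \qquad W_n(\delta) = \max\{Z_j : Z_j \le \pi/2-\delta\}, \tag{4.2}$$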

otherwise we put and . Define the function

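$$C_n(\delta) := \sum_{j,n,\delta} (Z_{(j+1)} - Z_{(j)})^2 + \delta^{-1} \sum_{j,n,\delta} (Z_{(j+1)} - Z_{(j)})^3 + \big(L_n(\delta) + \pi/2\big)^2 + \big(\pi/2 - W_n(\delta)\big)^2 + \delta^2,$$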

which is continuous except at the sites , and for . Now choose such that

$$C_n(\hat\delta_n) \le \exp(-n) + \inf_{\delta \in [1/n, \pi/4]} C_n(\delta). \tag{4.3}$$

The next proposition shows that there is no loss in the convergence rate if only is unknown.

###### Proposition 4.1.

Consider model (1.1) and assume that satisfies (1.2) for some unknown . Choose satisfying the Assumption 2.1 for for given . If is chosen in (4.3) and

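$$\hat h_n = \big(C_n(\hat\delta_n)\big)^{\frac{1}{2(\alpha+2)}},$$

then the estimator $\hat f_A(a;\hat h_n,\hat\delta_n)$ attains the following asymptotic risk upper bound over the function class $\mathcal{F}$,

$$\sup_{f_A \in \mathcal{F}} E_{f_A}\Big[\big|\hat f_A(a;\hat h_n,\hat\delta_n) - f_A(a)\big|^2\Big] = O\Big(n^{-\frac{2\alpha}{(\alpha+2)(\beta+1)}}\Big).$$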
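The selection rule only involves the ordered $Z_{(j)}$ and can be sketched as follows. This is an illustration under two loudly stated assumptions: the near-minimizer in (4.3) is replaced by a plain grid search over $[1/n, \pi/4]$, and when fewer than two observations fall into $[-\pi/2+\delta, \pi/2-\delta]$ the code sets $L_n(\delta) = W_n(\delta) = 0$, a placeholder convention chosen here because the convention used in the text is not recoverable. The function names are ours.

```python
import numpy as np


def C_n(delta, Z_sorted):
    """Criterion C_n(delta) of Section 4.1 for sorted Z values."""
    lo, hi = -np.pi / 2 + delta, np.pi / 2 - delta
    inside = Z_sorted[(Z_sorted >= lo) & (Z_sorted <= hi)]
    if inside.size >= 2:
        gaps = np.diff(inside)           # spacings of consecutive Z inside
        L, W = inside[0], inside[-1]     # L_n(delta), W_n(delta) from (4.2)
    else:
        gaps = np.zeros(0)
        L, W = 0.0, 0.0                  # ASSUMPTION: placeholder convention
    return (np.sum(gaps**2) + np.sum(gaps**3) / delta
            + (L + np.pi / 2) ** 2 + (np.pi / 2 - W) ** 2 + delta**2)


def select_delta_h(X, alpha, grid_size=400):
    """Grid-search version of (4.3) and the bandwidth of Proposition 4.1."""
    Z = np.sort(np.arctan(np.asarray(X, dtype=float)))
    n = Z.size
    grid = np.linspace(1.0 / n, np.pi / 4, grid_size)
    values = np.array([C_n(d, Z) for d in grid])
    k = int(np.argmin(values))
    delta_hat = float(grid[k])
    c_hat = float(values[k])
    h_hat = c_hat ** (1.0 / (2.0 * (alpha + 2.0)))
    return delta_hat, h_hat, c_hat
```

The criterion mirrors the data-dependent terms in the risk bound of the proof of Theorem 3.2, so minimizing it does not require knowledge of $\beta$; only the smoothness $\alpha$ enters through the exponent of $\hat h_n$.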

### 4.2 Adaptation by the Lepski method

Finally we consider adaptivity with respect to both parameters and based on a combination of Lepski’s method, see  and , and the choice (4.3). Consider the grid of bandwidths

$$h_k = \hat\delta_n^{1/2}\, q^k, \qquad k \in K_n = \{0, \dots, K\},$$

where , and is defined in (4.3). Fix and denote

$$\hat f_k = \hat f_A(a; h_k, \hat\delta_n).$$

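For some constant $C_{\mathrm{Lep}}$ to be chosen we let

$$\hat k = \max\big\{k \in K_n : |\hat f_k - \hat f_l|^2 \le C_{\mathrm{Lep}}\, \sigma(l,n) \;\; \forall\, l \le k,\ l \in K_n\big\},$$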

where

$$\sigma(k,n) = h_k^{-4}\, C_n(\hat\delta_n) \log n, \qquad k \in K_n.$$

###### Theorem 4.2.

Consider model (1.1) and assume that satisfies (1.2) for some unknown . Choose according to Assumption 2.1 for some . Then for sufficiently large , we have for every that

where .

Thus a usual logarithmic penalty occurs in the pointwise rate under Hölder smoothness constraints.
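The index selection in Section 4.2 can be sketched generically. In the code below the function names are ours, and the grid factor `q`, the grid size `K` and the constant `C_lep` are treated as user-supplied inputs, because their admissible ranges are not recoverable from the text; the estimates `f_hat[k]` would come from evaluating the estimator (2.6) at the bandwidths $h_k$.

```python
import math

import numpy as np


def bandwidth_grid(delta_hat, q, K):
    """Bandwidths h_k = delta_hat^{1/2} * q^k for k = 0, ..., K."""
    return np.sqrt(delta_hat) * q ** np.arange(K + 1)


def sigma(h, c_hat, n):
    """Threshold sigma(k, n) = h_k^{-4} * C_n(delta_hat) * log n."""
    return np.asarray(h, dtype=float) ** (-4.0) * c_hat * math.log(n)


def lepski_index(f_hat, sig, C_lep):
    """Largest k with |f_k - f_l|^2 <= C_lep * sigma(l, n) for all l <= k."""
    best = 0  # k = 0 is always admissible since |f_0 - f_0|^2 = 0
    for k in range(len(f_hat)):
        admissible = all((f_hat[k] - f_hat[l]) ** 2 <= C_lep * sig[l]
                         for l in range(k + 1))
        if admissible:
            best = k
    return best


# Toy numbers (purely illustrative): the third estimate deviates too much.
toy_index = lepski_index([1.0, 1.05, 1.5], [0.1, 0.01, 0.001], C_lep=1.0)
```

In the toy call, index 2 is rejected because $(1.5 - 1.0)^2 = 0.25$ exceeds the threshold $0.1$ at $l = 0$, so the rule returns index 1.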

## 5 Proofs

In the proofs we drop in and in from the notation.

### 5.1 Proofs for Section 3

###### Proof of Theorem 3.2.

By passing to Cartesian coordinates in (2.4) we can write

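$$\tilde f_A(a;h) = \frac{1}{(2\pi)^2} \int_{\mathbb{R}^2} \exp(-i a' b)\, \psi_A(b)\, w(h\|b\|)\, db = \big(f_A * \tilde w(\cdot/h)/h^2\big)(a), \qquad \tilde w(a) = \frac{1}{(2\pi)^2} \int_{\mathbb{R}^2} \exp(-i a' b)\, w(\|b\|)\, db.$$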

Assumption 2.1 guarantees that is a kernel of order . Then, using Taylor approximation as usual in kernel regularization, see p. 37–38 in  for the argument in case of non-compactly supported kernels, the following asymptotic rate of the regularization bias term occurs

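$$\big|f_A(a) - \tilde f_A(a;h)\big| = \Big|f_A(a) - \int \tilde w(z)\, f_A(a - hz)\, dz\Big| \le C_{\mathrm{Bias}}(\alpha, w, c_A, c_M) \cdot h^{\alpha}, \tag{5.1}$$

where the constant factor $C_{\mathrm{Bias}}$ only depends on $\alpha$, $w$, $c_A$ and $c_M$.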

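Now let $\sigma_Z$ denote the $\sigma$-field generated by $Z_1, \dots, Z_n$, and consider the conditional bias-variance decomposition

$$E\Big[\big|\hat f_A(a;h,\delta) - \tilde f_A(a;h)\big|^2\Big] = E\big[\operatorname{Var}\big(\hat f_A(a;h,\delta) \,\big|\, \sigma_Z\big)\big] + E\Big[\big|E\big[\hat f_A(a;h,\delta) \,\big|\, \sigma_Z\big] - \tilde f_A(a;h)\big|^2\Big].$$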

Since the are independent given the , observing from (2.5) that , we may bound

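$$
\begin{aligned}
\operatorname{Var}\big(\hat f_A(a;h,\delta) \,\big|\, \sigma_Z\big) &\le \sum_{j,n,\delta} (Z_{(j+1)} - Z_{(j)})^2 \cdot \int_{\mathbb{R}} K^2\big(u - a_0 \cos Z_{(j)} - a_1 \sin Z_{(j)}; h\big)\, f_{U|Z}(u|Z_{(j)})\, du \\
&\le \mathrm{const.} \cdot h^{-4} \cdot \sum_{j,n,\delta} (Z_{(j+1)} - Z_{(j)})^2,
\end{aligned} \tag{5.2}
$$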

where the constant factor only depends on . Therein we use the notation (4.1). For the conditional expectation, we obtain that

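$$E\big[\hat f_A(a;h,\delta) \,\big|\, \sigma_Z\big] = \frac{1}{(2\pi)^2} \int_{\mathbb{R}} w(th)\, |t| \int_{-\pi/2}^{\pi/2} \tilde\psi(t,z)\, dz\, dt$$

where we set

$$\tilde\psi(t,z) = \sum_{j,n,\delta} \psi_{U|Z}(t|Z_{(j)}) \exp\big(-it a_0 \cos Z_{(j)} - it a_1 \sin Z_{(j)}\big)\, \mathbf{1}_{[Z_{(j)}, Z_{(j+1)})}(z).$$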

We deduce that

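$$\big|E\big[\hat f_A(a;h,\delta) \,\big|\, \sigma_Z\big] - \tilde f_A(a;h)\big|^2 \le I_1 + I_2 + I_3, \tag{5.3}$$

where

$$
\begin{aligned}
I_1 &:= \frac{3}{(2\pi)^4} \Big| \int_{L_n(\delta)}^{W_n(\delta)} \int_{\mathbb{R}} w(th)\, |t| \Big( \tilde\psi(t,z) - \exp\big(-it(a_0 \cos z + a_1 \sin z)\big) \cdot \psi_{U|Z}(t|z) \Big)\, dt\, dz \Big|^2, \\
I_2 &:= \frac{3}{(2\pi)^4} \Big| \int_{-\pi/2}^{L_n(\delta)} \int_{\mathbb{R}} w(th)\, |t| \exp\big(-it(a_0 \cos z + a_1 \sin z)\big) \cdot \psi_{U|Z}(t|z)\, dt\, dz \Big|^2, \\
I_3 &:= \frac{3}{(2\pi)^4} \Big| \int_{W_n(\delta)}^{\pi/2} \int_{\mathbb{R}} w(th)\, |t| \exp\big(-it(a_0 \cos z + a_1 \sin z)\big) \cdot \psi_{U|Z}(t|z)\, dt\, dz \Big|^2,
\end{aligned}
$$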

where and are defined in (4.2). If there are no two consecutive in the interval , then (indeed ) and we may put and in the view of term .

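First, consider the term $I_3$. For simplicity, assume that $w$ is supported in $[-1,1]$ and is bounded by $1$. Using the Cauchy-Schwarz inequality, it holds that

$$I_3 \le \frac{3}{(2\pi)^4} \int_{-1/h}^{1/h} t^2\, dt \int_{-1/h}^{1/h} \Big| \int_{W_n(\delta)}^{\pi/2} \exp\big(-it(a_0 \cos z + a_1 \sin z)\big) \cdot \psi_{U|Z}(t|z)\, dz \Big|^2 dt \le \frac{4}{(2\pi)^4} \cdot h^{-4} \cdot \big(\pi/2 - W_n(\delta)\big)^2.$$

Analogously, since $I_2$ integrates over $z \in [-\pi/2, L_n(\delta)]$, we establish that

$$I_2 \le \frac{4}{(2\pi)^4} \cdot h^{-4} \cdot \big(L_n(\delta) + \pi/2\big)^2.$$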

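Finally, consider the term $I_1$. In case when there are two consecutive $Z_{(j)}$ in the interval $[-\pi/2+\delta, \pi/2-\delta]$ so that the sum in (4.1) is not empty, it holds that

$$I_1 \le \frac{12}{(2\pi)^4}\, h^{-2} \cdot \Big\{ \sum_{j,n,\delta} \int_{Z_{(j)}}^{Z_{(j+1)}} \int_{|t| \le 1/h} \Big| \tilde\psi(t,z) - \exp\big(-it(a_0 \cos z + a_1 \sin z)\big) \cdot \psi_{U|Z}(t|z) \Big|\, dt\, dz \Big\}^2.$$

Now, for $z \in [Z_{(j)}, Z_{(j+1)})$, we get that

$$
\begin{aligned}
\Big| \tilde\psi(t,z) &- \exp\big(-it(a_0 \cos z + a_1 \sin z)\big)\, \psi_{U|Z}(t|z) \Big| \\
&= \Big| \psi_{U|Z}(t|Z_{(j)}) \exp\big(-it a_0 \cos Z_{(j)} - it a_1 \sin Z_{(j)}\big) - \psi_{U|Z}(t|z) \cdot \exp\big(-it(a_0 \cos z + a_1 \sin z)\big) \Big| \\
&\le \big| \psi_{U|Z}(t|Z_{(j)}) - \psi_{U|Z}(t|z) \big| + |t| \cdot |a| \cdot (Z_{(j+1)} - Z_{(j)}) \\
&= \big| \psi_A(t \cos Z_{(j)}, t \sin Z_{(j)}) - \psi_A(t \cos z, t \sin z) \big| + |t| \cdot |a| \cdot (Z_{(j+1)} - Z_{(j)}),
\end{aligned}
$$

according to (2.2). Then, the term $I_1$ obeys the upper bound

$$I_1 \le \mathrm{const.} \cdot \Big\{ |a|^2 \cdot h^{-6} \cdot \sum_{j,n,\delta} (Z_{(j+1)} - Z_{(j)})^3 + h^{-4} \Big( \sum_{j,n,\delta} (Z_{(j+1)} - Z_{(j)}) \int_{Z_{(j)}}^{Z_{(j+1)}} \frac{1}{\cos z}\, dz \Big)^2 \Big\}. \tag{5.4}$$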

Again, the constant factor only depends on and . Using the Cauchy-Schwarz inequality yields that the second summand in (5.4) is bounded from above by

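$$h^{-4} \int_{-\pi/2+\delta}^{\pi/2-\delta} \frac{1}{\cos^2 z}\, dz \cdot \sum_{j,n,\delta} (Z_{(j+1)} - Z_{(j)})^3 \asymp h^{-4} \delta^{-1} \cdot \sum_{j,n,\delta} (Z_{(j+1)} - Z_{(j)})^3.$$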

Finally, if there are no two consecutive in the interval , we simply have , by uniform boundedness of and by restricting to . Collecting terms, we obtain that

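$$
\begin{aligned}
E\Big[\big|\hat f_A(a;h,\delta) - \tilde f_A(a;h)\big|^2 \,\Big|\, \sigma_Z\Big] \le{} & \mathrm{const.} \cdot h^{-4} \Big\{ \big(\pi/2 - W_n(\delta)\big)^2 + \big(L_n(\delta) + \pi/2\big)^2 + \sum_{j,n,\delta} (Z_{(j+1)} - Z_{(j)})^2 + \delta^{-1} \cdot \sum_{j,n,\delta} (Z_{(j+1)} - Z_{(j)})^3 \Big\} \\
& + \mathrm{const.} \cdot \Big\{ |a|^2 h^{-6} \cdot \sum_{j,n,\delta} (Z_{(j+1)} - Z_{(j)})^3 + \mathbf{1}\big(Z_{(j)} < -\pi/2+\delta \text{ or } Z_{(j+1)} > \pi/2-\delta,\ j = 1, \dots, n-1\big) \Big\}.
\end{aligned} \tag{5.5}
$$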

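Since $\beta > 1$,

$$\int_{\delta}^{\pi/2} u^{-\beta}\, du \asymp \delta^{1-\beta}, \qquad \int_{\delta}^{\pi/2} u^{-2\beta}\, du \asymp \delta^{1-2\beta}.$$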

From (5.1) and (5.5) and Lemma 5.1.1 we obtain for that

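$$E\Big[\big|\hat f_A(a;h,\delta) - f_A(a)\big|^2\Big] \le \mathrm{const.} \cdot \Big\{ h^{2\alpha} + h^{-4} \Big( \delta + \frac{1}{c_Z n \delta^{\beta}} \Big)^2 + h^{-4} n^{-1} \delta^{1-\beta} + h^{-6} n^{-2} \delta^{1-2\beta} + h^{-4} \delta^{-1} n^{-2} \delta^{1-2\beta} + n \exp\big(-c_Z (n-1) (\pi/4)^{\beta}\big) \Big\}.$$

Upon inserting the rates for $\delta$ and $h$ we obtain the result.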
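The last step can be spelled out. The following worked computation is a sketch that only uses the risk bound displayed above and the choices of $\delta$ and $h$ from Theorem 3.2, taken here with equality instead of $\asymp$ and with all constants ignored; the exponentially small term is dropped because it decays faster than any power of $n$.

```latex
\begin{aligned}
&\delta = n^{-\frac{1}{\beta+1}}
  \;\Longrightarrow\; n^{-1} = \delta^{\beta+1},
  \qquad \frac{1}{n\,\delta^{\beta}} = \delta, \\[2pt]
&h^{-4}\Big(\delta + \frac{1}{c_Z\, n\, \delta^{\beta}}\Big)^{2} \asymp h^{-4}\delta^{2},
  \qquad h^{-4}\, n^{-1}\delta^{1-\beta} = h^{-4}\delta^{2},
  \qquad h^{-4}\delta^{-1}\, n^{-2}\delta^{1-2\beta} = h^{-4}\delta^{2}, \\[2pt]
&h^{-6}\, n^{-2}\delta^{1-2\beta} = h^{-6}\delta^{3}
  = h^{-4}\delta^{2}\cdot h^{-2}\delta, \\[2pt]
&h = \delta^{\frac{1}{\alpha+2}} = n^{-\frac{1}{(\alpha+2)(\beta+1)}}
  \;\Longrightarrow\;
  h^{2\alpha} = h^{-4}\delta^{2} = \delta^{\frac{2\alpha}{\alpha+2}}
  = n^{-\frac{2\alpha}{(\alpha+2)(\beta+1)}},
  \qquad h^{-2}\delta = \delta^{\frac{\alpha}{\alpha+2}} \to 0 .
\end{aligned}
```

Hence the squared bias $h^{2\alpha}$ and the dominating stochastic terms $h^{-4}\delta^2$ are balanced, the term $h^{-6}\delta^3$ is of smaller order, and every term is $O\big(n^{-2\alpha/((\alpha+2)(\beta+1))}\big)$.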

###### Remark 5.1.

If is uniformly bounded, then instead of (5.2) in our analysis, we can obtain the sharper bound

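$$\operatorname{Var}_{f_A}\big(\hat f_A(a;h,\delta) \,\big|\, \sigma_Z\big) \le \mathrm{const.} \cdot h^{-3} \cdot \sum_{j,n,\delta} (Z_{(j+1)} - Z_{(j)})^2$$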

since , which eventually leads to the rate in case .

###### Proof of Theorem 3.3.

We introduce the functions

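$$f_{A,\theta}(a_0,a_1) := \alpha_n \beta_n\, f_0(\alpha_n a_0, \beta_n a_1) + c_L \cdot \theta \cdot \cos(2\beta_n a_1) \cdot \alpha_n \beta_n\, \phi(\alpha_n a_0, \beta_n a_1),$$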

for , some constant and some sequences and which remain to be selected; moreover we specify

 f0(a0,a1):=1π2(1+a