Rate-optimal nonparametric estimation for random coefficient regression models

02/14/2019 ∙ by Hajo Holzmann, et al. ∙ Philipps-Universität Marburg

Random coefficient regression models are a popular tool for analyzing unobserved heterogeneity, and have seen renewed interest in the recent econometric literature. In this paper we obtain the optimal pointwise convergence rate for estimating the density in the linear random coefficient model over Hölder smoothness classes, and in particular show how the tail behavior of the design density impacts this rate. In contrast to previous suggestions, the estimator that we propose and that achieves the optimal convergence rate does not require dividing by a nonparametric density estimate. The optimal choice of the tuning parameters in the estimator depends on the tail parameter of the design density and on the smoothness level of the Hölder class, and we also study adaptive estimation with respect to both parameters.


1 Introduction

In this paper we consider the linear random coefficient regression model, in which i.i.d. (independent and identically distributed) data $(X_i, Y_i)$, $i = 1, \dots, n$, are observed according to

$Y_i = \beta_{0,i} + \beta_{1,i}\, X_i.$   (1.1)

Therein the random coefficients $(\beta_{0,i}, \beta_{1,i})$ are unobserved i.i.d. random variables with a bivariate Lebesgue density $f$, and the coefficients and the design variables $X_i$ are independent. Note that (1.1) represents a randomized extension of the standard linear regression model. We shall derive the optimal convergence rates for estimating $f$ over Hölder smoothness classes in the case when the $X_i$ have a Lebesgue density with polynomial tail behaviour, as specified in Assumption 1.1 below.
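To make the data-generating mechanism concrete, the following minimal simulation sketch draws data from model (1.1) in the notation above; the particular coefficient and design distributions are purely illustrative and are not taken from the paper.

import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Illustrative choices (not from the paper): correlated Gaussian random
# coefficients (beta_0, beta_1) and a design X with polynomial tails.
beta = rng.multivariate_normal(mean=[1.0, -0.5],
                               cov=[[1.0, 0.3], [0.3, 0.5]], size=n)
X = rng.standard_t(df=3, size=n)

# Model (1.1): Y_i = beta_{0,i} + beta_{1,i} * X_i
Y = beta[:, 0] + beta[:, 1] * X

# Only (X_i, Y_i) are observed; the goal is to estimate the joint
# density f of (beta_0, beta_1) nonparametrically from these pairs.
data = np.column_stack([X, Y])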

From a parametric point of view, with a focus on means and variances of the random coefficients, a multivariate version of model (1.1) is studied by [10]; they assume the coefficients to be mutually independent. The nonparametric analysis of model (1.1) was initiated by [3] and [4]. [2] use Fourier methods to construct an estimator of the density $f$. They do not derive the optimal convergence rate, though. Furthermore, their estimator is rather involved, as it requires a nonparametric estimator of a conditional characteristic function, which is then plugged into a regularized Fourier inversion.

Extensions of model (1.1) have seen renewed interest in the econometrics literature in recent years. [12] suggest a nonparametric estimator in a multivariate version of model (1.1), but they only obtain its convergence rate for very heavy-tailed regressors. Moreover, their estimator requires dividing by a nonparametric density estimator for a transformed version of the regressors. This involves an additional smoothing step and potentially renders the estimator unstable. [5] propose a specification test for model (1.1) against a general nonseparable model as the alternative, while [6] suggest multiscale tests for qualitative hypotheses on $f$. Extensions and modifications of model (1.1) are studied in [1], [7], [8], [9], [11], [16], [17] and in [18].

In this paper, we consider the basic model (1.1) under the following condition.

Assumption 1.1 (Design density).

For some constants (a positive scale constant and a tail parameter), the design density $f_X$ of the $X_i$ satisfies the polynomial tail condition

(1.2)

We analyze precisely how the tail parameter of the design density $f_X$ influences the optimal rate of convergence for estimating $f$ at a given point in the minimax sense. Note that the heavy-tailed setting studied in [12] corresponds to a particular value of the tail parameter in Assumption 1.1. To the best of our knowledge, a rigorous study of the minimax convergence rate in the more realistic case of design densities with lighter, polynomially decaying tails has been missing so far. In this paper we fill this gap and derive optimal rates, which are fundamentally new and not known from any other nonparametric estimation problem.

Inspired by [11], the estimator that we propose, and that achieves the optimal convergence rate, is a Priestley-Chao-type estimator in which we exploit the order statistics of a transformed version of the design variables. Thus, in particular, it does not require dividing by a nonparametric density estimator. The optimal choice of the tuning parameters depends both on the tail parameter of the design density and on the smoothness parameter of the Hölder class, which is reminiscent of the estimation problem in [13] and in contrast to usual adaptation problems in nonparametric curve estimation, in which the smoothing parameters only need to adapt to an unknown smoothness level. Here we show how to make the estimator adaptive with respect to both of these parameters.

The paper is organized as follows. In Section 2 we introduce our estimation procedure. Section 3 is devoted to upper and lower risk bounds, which yield minimax rate optimality; while Section 4 deals with adaptivity. The proofs and technical lemmata are deferred to Section 5.

Let us fix some notation: we write $\psi$ for the characteristic function of the random coefficients, and for random variables $Z$ and $W$ we write $\varphi_{Z \mid W}(\cdot \mid w)$ for the conditional characteristic function of $Z$ given $W = w$.

2 The estimator

In order to construct an estimator of $f$ in model (1.1), we transform the observed data $(X_i, Y_i)$ into new variables via

so that, almost surely, the transformed design variables are bounded, they are independent of the random coefficients, and

(2.1)

Then the conditional characteristic function of the transformed response given the transformed design variable equals

(2.2)

By Fourier inversion, integral substitution into polar coordinates (with signed radius) and (2.2) we deduce that

(2.3)
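For orientation, the underlying two-dimensional Fourier inversion formula in polar coordinates with signed radius is the following standard identity, stated in our own notation with $\psi$ the characteristic function of the coefficient density $f$; display (2.3) combines it with (2.2):

$$
f(b_1, b_2) \;=\; \frac{1}{4\pi^2} \int_{\mathbb{R}^2} \psi(t_1, t_2)\, e^{-i(t_1 b_1 + t_2 b_2)}\, dt_1\, dt_2
\;=\; \frac{1}{4\pi^2} \int_0^{\pi}\!\!\int_{\mathbb{R}} |t|\, \psi(t\cos\alpha,\, t\sin\alpha)\, e^{-i t (b_1 \cos\alpha + b_2 \sin\alpha)}\, dt\, d\alpha,
$$

where $\psi(t_1, t_2) = \int e^{i(t_1 b_1 + t_2 b_2)} f(b_1, b_2)\, db_1\, db_2$, the inner integral runs over the signed radius $t \in \mathbb{R}$ and the outer integral over angles $\alpha \in [0, \pi)$, with Jacobian $|t|$.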

Equation (2.3) motivates us to estimate $f$ by means of an empirical version of the conditional characteristic function, which is directly accessible from the transformed data. For that purpose, choose a function $K$ which satisfies the following assumption.

Assumption 2.1 (Kernel).

For an integer $\ell$, the function $K$ is even, compactly supported, $\ell$ times continuously differentiable on the whole real line, and satisfies $K(0) = 1$ as well as $K^{(j)}(0) = 0$ for all $j = 1, \dots, \ell$.
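As a concrete illustration, a flat-top taper of the following kind satisfies conditions of this type (even, compactly supported, smooth, identically one near the origin, so that all derivatives vanish at zero); the specific construction below is our own choice and is not the paper's kernel.

import numpy as np

def flat_top_taper(t):
    """Even, compactly supported smooth function with value 1 on [-1/2, 1/2]
    and value 0 outside [-1, 1]; in particular K(0) = 1 and all derivatives
    of K vanish at 0 (illustrative choice, not the paper's kernel)."""
    t = np.abs(np.asarray(t, dtype=float))
    out = np.zeros_like(t)
    out[t <= 0.5] = 1.0
    mid = (t > 0.5) & (t < 1.0)

    def g(x):
        # standard C-infinity cutoff: 0 for x <= 0, smooth and positive for x > 0
        return np.where(x > 0, np.exp(-1.0 / np.maximum(x, 1e-300)), 0.0)

    s = t[mid]
    # smooth partition-of-unity transition from 1 (at |t| = 1/2) to 0 (at |t| = 1)
    out[mid] = g(1.0 - s) / (g(1.0 - s) + g(s - 0.5))
    return out

For instance, flat_top_taper(0.0) returns 1.0 and flat_top_taper(1.2) returns 0.0.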

Now we consider the following regularized version of $f$, obtained by kernel smoothing,

(2.4)

where

(2.5)

Inspired by (2.4), we introduce the following Priestley-Chao-type estimator of the density $f$,

(2.6)

where the transformed design points are sorted into their order statistics, where $h$ is a classical bandwidth parameter, and where $\tau$ is a threshold parameter, both of which remain to be selected. By the parameter $\tau$ we cut off the subset of the interval in which the transformed design points are sparse.
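For intuition, the following sketch shows the structural idea of a Priestley-Chao-type estimator: spacings of the ordered (transformed) design points serve as integration weights, so no division by a nonparametric design-density estimate is needed. It estimates a generic conditional mean and is only an illustration of this mechanism, not the estimator (2.6), which additionally involves the regularized Fourier inversion (2.4) and the threshold $\tau$.

import numpy as np

def priestley_chao(w, z, eval_points, h,
                   kernel=lambda u: np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)):
    """Generic Priestley-Chao-type estimator of m(w0) = E[Z | W = w0].

    Spacings w_(i) - w_(i-1) of the ordered design points act as integration
    weights, so no nonparametric estimate of the design density is required.
    Structural illustration only; not the paper's estimator (2.6).
    """
    order = np.argsort(w)
    w_sorted, z_sorted = w[order], z[order]
    spacings = np.diff(w_sorted)                  # w_(i) - w_(i-1), i = 2..n
    estimates = []
    for w0 in np.atleast_1d(eval_points):
        weights = kernel((w0 - w_sorted[1:]) / h) / h
        estimates.append(np.sum(spacings * weights * z_sorted[1:]))
    return np.array(estimates)

The same mechanism underlies (2.6): replacing the normalization of a classical kernel estimator by the spacings of the order statistics removes the need to estimate the design density in a denominator.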

3 Upper and lower risk bounds

We consider the following Hölder smoothness class of densities.

Definition 3.1.

For a point $z_0 \in \mathbb{R}^2$, a smoothness index $s > 0$ and positive constants, define the class of densities as follows: $f$ is Hölder-smooth of degree $s$ in a neighborhood of $z_0$, that is, $f$ is $\lfloor s \rfloor$-times continuously differentiable in that neighborhood and its partial derivatives of order $\lfloor s \rfloor$ satisfy

(3.1)

for all points in that neighborhood. Furthermore, assume that the Fourier transform of $f$ is weakly differentiable, that its weak derivative satisfies

(3.2)

and that an analogous bound holds for all arguments.
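For orientation, a pointwise Hölder condition of this type in two dimensions, stated in our own notation with $C$ the Hölder constant, reads:

$$
\bigl|\partial^{\alpha} f(z) - \partial^{\alpha} f(z')\bigr| \;\le\; C\, \|z - z'\|^{\,s - \lfloor s \rfloor}
\qquad \text{for all } z, z' \text{ near } z_0 \text{ and all multi-indices } \alpha \text{ with } |\alpha| = \lfloor s \rfloor.
$$

Condition (3.1) is a localized condition of this kind.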

The first theorem provides an upper bound on the convergence rate for the estimator in (2.6).

Theorem 3.2.

Consider model (1.1) and assume that the design density $f_X$ satisfies (1.2). If $K$ satisfies Assumption 2.1 for a suitable order $\ell$, and if the bandwidth $h$ and the threshold $\tau$ are chosen such that

then the estimator (2.6) attains the following asymptotic risk upper bound over the function class from Definition 3.1,

The following theorem shows that the convergence rates which our estimator (2.6) achieves according to Theorem 3.2 are optimal for the pointwise risk in the minimax sense.

Theorem 3.3.

Fix the point and take the constants of Definition 3.1 sufficiently large. Let an arbitrary sequence of estimators of $f$ be given, where the $n$-th estimator is based on the data $(X_1, Y_1), \dots, (X_n, Y_n)$. Assume that the design density $f_X$ satisfies (1.2). Then

The convergence rates from Theorems 3.2 and 3.3 differ significantly from standard rates in nonparametric estimation. While they become faster as the smoothness increases, they become slower as the tails of the design density become lighter. It is remarkable that, as the smoothness grows, they do not approach the (squared) parametric rate but a slower rate.

The boundary case of the tail parameter. An analysis of the proof of Theorem 3.2 shows that, in this case, a suitable choice of the bandwidth and the threshold gives the rate

in a boundary sub-case, an additional logarithmic factor occurs. The upper bound no longer depends on the tail parameter in this regime. For this setting, [12] obtain a faster rate; their rate is stated for a different risk but could be transferred to a pointwise rate. However, they additionally impose the assumption that the density $f$ is uniformly bounded with bounded support, which implies that the conditional density of the transformed response given the transformed design is also uniformly bounded. Under this additional assumption, one can show that our estimator also achieves this faster rate, even with a simpler choice of the threshold. See Remark 5.1.

4 Adaptation

4.1 Adaptation with respect to the tail parameter for given smoothness

Assume that (1.2) holds with an unknown tail parameter. We consider the following selection rule for the threshold $\tau$. Write

(4.1)

for the sum over those indices for which the transformed design points fall into the interval under consideration. Further, if there are at least two observations in this interval, so that the index set is not empty, we set

(4.2)

otherwise we assign default values to these quantities. Define the function

which is continuous except at finitely many data-dependent sites. Now choose the threshold $\tau$ such that

(4.3)
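A hypothetical sketch of a data-driven threshold rule of this general type follows; the function name, the spacing criterion and the constant are placeholders of our own and do not reproduce the rule (4.3).

import numpy as np

def select_threshold(w, h, spacing_factor=1.0):
    """Hypothetical threshold rule (placeholder, not the paper's (4.3)):
    starting from the centre of the transformed design points, extend the
    retained interval outwards as long as consecutive spacings of the order
    statistics remain below spacing_factor * h; the returned tau measures the
    sparse boundary region that is cut off."""
    w_sorted = np.sort(w)
    spacings = np.diff(w_sorted)
    centre = len(w_sorted) // 2
    lo, hi = centre, centre
    while lo > 0 and spacings[lo - 1] <= spacing_factor * h:
        lo -= 1
    while hi < len(spacings) and spacings[hi] <= spacing_factor * h:
        hi += 1
    # retained interval is [w_sorted[lo], w_sorted[hi]]
    tau = max(w_sorted[lo] - w_sorted[0], w_sorted[-1] - w_sorted[hi])
    return tau, (w_sorted[lo], w_sorted[hi])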

The next proposition shows that there is no loss in the convergence rate if only the tail parameter is unknown.

Proposition 4.1.

Consider model (1.1) and assume that the design density $f_X$ satisfies (1.2) with an unknown tail parameter. Choose $K$ satisfying Assumption 2.1 for a suitable order $\ell$, for the given smoothness level. If the threshold is chosen according to (4.3) and

then the estimator attains the following asymptotic risk upper bound over the function class from Definition 3.1,

4.2 Adaptation by the Lepski method

Finally, we consider adaptivity with respect to both the tail parameter and the smoothness level, based on a combination of Lepski's method, see [14] and [15], with the choice (4.3). Consider the grid of bandwidths

where the grid endpoints depend on the sample size, and where the threshold is defined in (4.3). Fix an upper bound for the smoothness and denote

For some constant to be chosen we let

where
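As a generic sketch of Lepski's bandwidth selection principle (with assumed names: estimate_at(h) stands for a pointwise estimator with bandwidth h and sigma(h) for an associated stochastic error level; this is not the authors' exact procedure), one may proceed as follows.

import numpy as np

def lepski_bandwidth(bandwidth_grid, estimate_at, sigma, kappa=1.0):
    """Generic Lepski-type selection: among a grid of bandwidths, pick the
    largest h whose estimate is consistent, up to kappa times the stochastic
    error levels, with the estimates at all smaller bandwidths.

    estimate_at(h) -> float : pointwise estimate with bandwidth h (assumed)
    sigma(h)       -> float : stochastic error level for bandwidth h (assumed)
    """
    grid = sorted(bandwidth_grid, reverse=True)          # largest first
    estimates = {h: estimate_at(h) for h in grid}
    for i, h in enumerate(grid):
        admissible = all(
            abs(estimates[h] - estimates[g]) <= kappa * (sigma(h) + sigma(g))
            for g in grid[i + 1:]
        )
        if admissible:
            return h                                     # largest admissible h
    return grid[-1]                                      # fall back to smallest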

Theorem 4.2.

Consider model (1.1) and assume that the design density $f_X$ satisfies (1.2) with an unknown tail parameter. Choose $K$ according to Assumption 2.1 for some suitable order. Then, for a sufficiently large constant in the selection rule, we have for every admissible choice of the smoothness and tail parameters that

where the rate coincides, up to a logarithmic factor in the sample size, with the optimal rate from Theorem 3.2.

Thus a usual logarithmic penalty occurs in the pointwise rate under Hölder smoothness constraints.

5 Proofs

In the proofs we suppress some arguments and indices from the notation for brevity.

5.1 Proofs for Section 3

Proof of Theorem 3.2.

By passing to Cartesian coordinates in (2.4) we can write

Assumption 2.1 guarantees that the kernel appearing in this representation is a kernel of order $\ell$. Then, using a Taylor approximation as is usual in kernel regularization, see p. 37–38 in [19] for the argument in the case of non-compactly supported kernels, the following asymptotic rate of the regularization bias term occurs,

(5.1)

where the constant factor depends only on the kernel and on the parameters of the Hölder class.

Now let $\mathcal{X}$ denote the $\sigma$-field generated by the design variables, and consider the conditional bias-variance decomposition of the pointwise risk.
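In generic form, writing $\hat f(z_0)$ for the estimator at the point of interest, this is the standard identity

$$
\mathbb{E}\bigl[\bigl(\hat f(z_0) - f(z_0)\bigr)^2\bigr]
\;=\;
\mathbb{E}\Bigl[\bigl(\mathbb{E}[\hat f(z_0) \mid \mathcal{X}] - f(z_0)\bigr)^2\Bigr]
\;+\;
\mathbb{E}\bigl[\operatorname{Var}\bigl(\hat f(z_0) \mid \mathcal{X}\bigr)\bigr].
$$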

Since the transformed responses are independent given the design, and observing the bound that follows from (2.5), we may bound

(5.2)

where the constant factor depends only on the kernel. Therein we use the notation (4.1). For the conditional expectation, we obtain that

where we set

We deduce that

(5.3)

where

where the quantities involved are defined in (4.2). If there are no two consecutive transformed design points in the relevant interval, then the corresponding sum is empty, and we may assign the default values from Section 4.1 in view of the term under consideration.

First, consider the first of these terms. For simplicity, assume that $K$ is supported in a fixed compact interval and bounded by a fixed constant. Using the Cauchy-Schwarz inequality, it holds that

Analogously we establish that

Finally, consider the last of these terms. In the case when there are two consecutive transformed design points in the relevant interval, so that the sum in (4.1) is not empty, it holds that

Now, for the arguments under consideration, we get that

according to (2.2). Then, the term under consideration obeys the upper bound

(5.4)

Again, the constant factor depends only on the kernel and on the parameters of the Hölder class. Using the Cauchy-Schwarz inequality yields that the second summand in (5.4) is bounded from above by

Finally, if there are no two consecutive transformed design points in the relevant interval, the corresponding contribution can be bounded directly, by uniform boundedness of the integrand and by restriction to the relevant range. Collecting terms, we obtain that

(5.5)

Since ,

From (5.1), (5.5) and Lemma 5.1.1 we obtain, for the stated choices of the tuning parameters, that

Upon inserting the chosen rates for the bandwidth and the threshold, we obtain the result.

Remark 5.1.

If the conditional density of the transformed response given the transformed design is uniformly bounded, then, instead of (5.2), our analysis yields the sharper bound

since the relevant conditional second moments are then uniformly bounded, which eventually leads to the faster rate in the boundary case discussed after Theorem 3.3.

Proof of Theorem 3.3.

We introduce the functions

for a suitable range of arguments, some constant and some sequences which remain to be selected; moreover we specify