Adaptive estimation in the linear random coefficients model when regressors have limited variation

We consider a linear model where the coefficients (intercept and slopes) are random and independent of the regressors, whose support is a proper subset. When the density has finite weighted L2 norm, for well chosen weights, the joint density of the random coefficients is identified. Lower bounds on the supremum risk for the estimation of the density are derived for this model and a related white noise model. We present an estimator, its rates of convergence, and a data-driven rule which delivers adaptive estimators. An R package RandomCoefficients that implements our estimator is available on CRAN.


1. Introduction

For a random variable α and random vectors X and β of dimension p, the linear random coefficients model is

 (1) Y=α+β⊤X,
 (2) (α,β⊤) and X are independent.

The researcher has at her disposal observations of (Y, X⊤) but does not observe the realizations of (α, β⊤). α subsumes the intercept and error term, and the vector of slope coefficients β is heterogeneous (i.e., varies across observations). For example, a researcher interested in the effect of class size on pupils’ achievements might want to allow some pupils to be more sensitive than others to a decrease in the size and to estimate the density of the effect. (α, β⊤) corresponds to multidimensional unobserved heterogeneity and X to observed heterogeneity. Restricting unobserved heterogeneity to a scalar, as when only α is random, can have undesirable implications such as monotonicity in the literature on policy evaluation (see [24]). Parametric assumptions are often made for convenience and can drive the results (see [29]). For this reason, this paper considers a nonparametric setup. Model (1) is also a type of linear model with homogeneous slopes and heteroscedasticity, hence the averages of the coefficients are easy to obtain. However, the law of the coefficients, their quantiles, prediction intervals as in [3], welfare measures, and treatment and counterfactual effects, which depend on the distribution of the coefficients, can be of great interest.

Estimation of the density of random coefficients when the support of X is ℝ^p and X has heavy enough tails has been studied in [4, 31]. These papers notice that the inverse problem is related to a tomography problem (see, e.g., [11, 12]) involving the Radon transform. Assuming that the support of X is ℝ^p amounts to assuming that the law of angles has full support; moreover, a lower bound on the density of X is assumed so that the law of the angles is nondegenerate. When p = 1, this is implied by densities of X which follow a Cauchy distribution. The corresponding tomography problem has a nonuniform and estimable density of angles and the dimension can be larger than in tomography due to more than one regressor. More general specifications of random coefficients models are important in econometrics (see, e.g., [25, 30] and references therein) and there has been recent interest in nonparametric tests (see [10, 19]).

This paper considers the case where the support of X is a proper (i.e., strict) subset. This is a much more useful and realistic framework for the random coefficients model. When p = 1, this is related to limited angle tomography (see, e.g., [20, 32]). There, one has measurements over a subset of angles and the unknown density has support in the unit disk. This is too restrictive for a density of random coefficients and implies that α has compact support, ruling out usual parametric assumptions on error terms. Due to (2), the conditional characteristic function of Y given X = x at t is the Fourier transform of the joint density of (α, β⊤) at (t, tx). Hence, the family of conditional characteristic functions indexed by x in the support of X gives access to the Fourier transform of this density on a double cone with apex 0. When α = 0, the support of β is compact, and the support of X is an arbitrary compact set of nonempty interior, this is the problem of out-of-band extrapolation or super-resolution (see, e.g., [5] sections 11.4 and 11.5). Because we allow α to be nonzero, we generalize this approach. Estimation of the density of the random coefficients is a statistical inverse problem for which the deterministic problem is the inversion of a truncated Fourier transform (see, e.g., [2] and the references therein). The companion paper [23] presents conditions on the law of (α, β⊤) and the support of X that imply nonparametric identification. It considers weak conditions on α, which could have infinite absolute moments, and the marginals of β could have heavy tails. In this paper, we obtain rates of convergence when the marginals of β do not have heavy tails but can have noncompact support.
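The link between conditional characteristic functions and the Fourier transform of the joint density can be checked numerically. The sketch below is illustrative only: it assumes p = 1, a bivariate Gaussian law for (α, β) chosen purely for convenience (the paper is nonparametric), and the convention F[f](ξ) = ∫ exp(i ξ⊤b) f(b) db, which may differ from the paper's normalisation. The function names are hypothetical.

```python
import numpy as np

def gaussian_fourier_transform(xi, mean, cov):
    """Fourier transform E[exp(i xi.b)] of the N(mean, cov) density (assumed convention)."""
    xi = np.asarray(xi, dtype=float)
    return np.exp(1j * xi @ mean - 0.5 * xi @ cov @ xi)

def conditional_cf_check(t, x, n=200_000, seed=0):
    """Compare the empirical c.f. of Y given X = x at t with F[f](t, t x)."""
    rng = np.random.default_rng(seed)
    mean = np.array([1.0, -0.5])                # illustrative mean of (alpha, beta)
    cov = np.array([[1.0, 0.3], [0.3, 0.5]])    # illustrative covariance
    coefs = rng.multivariate_normal(mean, cov, size=n)
    # By independence (2), Y given X = x has the law of alpha + beta * x.
    y = coefs[:, 0] + coefs[:, 1] * x
    empirical = np.mean(np.exp(1j * t * y))
    exact = gaussian_fourier_transform([t, t * x], mean, cov)
    return empirical, exact

if __name__ == "__main__":
    emp, ex = conditional_cf_check(t=0.7, x=0.4)
    print(abs(emp - ex))  # small Monte Carlo error
```

Letting x range over a bounded set while t ranges over the real line traces out the points (t, tx), which is why only a double cone of frequencies is observed when the regressors have limited variation.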

A related approach is extrapolation. It is used in [41] to perform deconvolution of compactly supported densities while allowing the Fourier transform of the error density to vanish on a set of positive measure. In this paper, the relevant operator is viewed as a composition of two operators based on partial Fourier transforms. One involves a truncated Fourier transform and we make use of properties of the singular value decomposition rather than extrapolation.

Similar to [26, 33], we study optimality in the minimax sense. We obtain lower bounds under weak to strong integrability in the first argument for this and a white noise model. We present an estimator involving: series-based estimation of the partial Fourier transform of the density with respect to the first variable, interpolation around zero, and inversion of the partial Fourier transform. We give rates of convergence and use a Goldenshluger-Lepski type method to obtain data-driven estimators. We consider estimation of

in Appendix B.5. We present a numerical method to compute the estimator which is implemented in the R package RandomCoefficients.

2. Notations

and stand for the positive and nonnegative integers, for , (resp. ) for the minimum (resp. maximum) between and , and for the indicator function. Bold letters are used for vectors. For all , is the vector, whose dimension will be clear from the text, where each entry is . The iterated logarithms are and, for and large enough, . for stands for the norm of a vector. For all , functions with values in , and , denote by , , and . For a differentiable function of real variables, denotes and its support. is the space of infinitely differentiable functions. The inverse of a mapping , when it exists, is denoted by . We denote the interior of by and its closure by . When is measurable and a function from to , is the space of complex-valued square integrable functions equipped with . This is denoted by when . When , we have and . Denote by the set of densities, by such that , and by the product of functions (e.g., ) or measures. The Fourier transform of is and is also the Fourier transform in . For all , denote the Paley-Wiener space by , by the projector from to (), and, for all , by

 (3)

Abusing notation, we sometimes use for the function in . assigns the value 0 outside and is the partial Fourier transform of with respect to the first variable. For a random vector , is its law, its density, the truncated density of given , its support, and the conditional density. For a sequence of random variables , means that, for all , there exists such that for all such that holds. In the absence of constraint, we drop the notation . With a single index the notation requires a bound holding for all values of the index (the usual notation if the random variables are bounded in probability).

3. Preliminaries

Assumption 1.
1. and exist;

2. , where and is even, nondecreasing on , such that and , with ;

3. There exists and and we have at our disposal i.i.d and an estimator based on independent of ;

4. is a set of densities on such that, for , for all , and , and, for which tends to 0, we have

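 $\frac{1}{v(n_0,\mathcal{E})}\sup_{f_{X|\mathcal{X}}\in\mathcal{E}}\left\|\hat f_{X|\mathcal{X}}-f_{X|\mathcal{X}}\right\|^2_{L^\infty(\mathcal{X})}=O_p(1).$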

We maintain this assumption for all results presenting upper bounds. When , , for , might not exist. Due to Theorem 3.14 in [18], if there exist , , and equal to 0 for large enough, such that

for all , then which implies (H1.2). Marginal distributions can have an infinite moment generating function, hence be heavy-tailed, and their Fourier transforms can belong to a quasi-analytic class without being analytic. From now on, we use or for . This rules out heavy tails and nonanalytic Fourier transforms. When , integrability in amounts to , but other allow for noncompact . Though with a different scalar product, we have and (see Theorem IX.13 in [45]), for , is the set of square-integrable functions whose Fourier transforms have an analytic continuation on . In particular the Laplace transform is finite near 0. Equivalently, if is a density, it does not have heavy tails. The condition in (H1.4) is not restrictive because we can write (1) as , take and such that , and there is a one-to-one mapping between and . We assume (H1.4) because the estimator involves estimators of in denominators. Alternative solutions exist when (see, e.g., [36]) only. Assuming the availability of an estimator of using the preliminary sample is common in the deconvolution literature (see, e.g., [15]). By using estimators of for a well chosen rather than of , the assumption that and in (H1.4) becomes very mild. This is feasible because of (2).

3.1. Inverse problem in Hilbert spaces

Estimation of is a statistical ill-posed inverse problem. The operator depends on and . From now on, the functions and are those of (H1.2). We have, for all and , , where

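 (4) $\mathcal{K}:\ L^2(w\otimes W^{\otimes p})\to L^2(\mathbb{R}\times[-1,1]^p),\quad f\mapsto\left((t,u)\mapsto\mathcal{F}[f](t,tx_0u)\,|tx_0|^{p/2}\right).$
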
Proposition 1.

is continuously embedded into . Moreover, is injective and continuous, and not compact if .

The case corresponds to mild integrability assumptions in the first variable when the SVD of does not exist. This makes it difficult to prove rates of convergence even for estimators which do not rely explicitly on the SVD such as the Tikhonov and Landweber method (Gerchberg algorithm in out-of-band extrapolation, see, e.g., [5]). Rather than work with directly, we use that is the composition of operators which are easier to analyze

 (5) for $t\in\mathbb{R}$, $\mathcal{K}[f](t,\star)=\mathcal{F}_{tx_0}\left[\mathcal{F}_{1st}[f](t,\cdot)\right](\star)\,|tx_0|^{p/2}$ in $L^2([-1,1]^p)$.

For all , either or , and , belongs to and, for , admits an SVD, where both orthonormal systems are complete. This is a tensor product of the SVD when that we denote by , where is in decreasing order repeated according to multiplicity, and are orthonormal systems of, respectively, and . This holds for the following reason. Because , , , and is even, we obtain and . The operator is a compact positive definite self-adjoint operator (see [44] and [49] for the two choices of ). Its eigenvalues in decreasing order repeated according to multiplicity are denoted by and a basis of eigenfunctions by . The other elements of the SVD are and .

Proposition 2.

For all , is a basis of .

The singular vectors are the Prolate Spheroidal Wave Functions (hereafter PSWF, see, e.g., [44]). They can be extended as entire functions in and form a complete orthogonal system of for which we use the same notation. They are useful to carry out interpolation and extrapolation (see, e.g., [40]) with Hilbertian techniques. In this paper, for all , plays the role of the Fourier transform in the definition of . The weight allows for larger classes than and noncompact . This is useful even if is compact when the researcher does not know a superset containing . The useful results on the corresponding SVD and a numerical algorithm to compute it are given in [22].
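To give a feel for how fast the singular values of a truncated Fourier transform decay, here is a rough numerical sketch. It is not the algorithm of [22] or of the RandomCoefficients package. It assumes p = 1, the weight W equal to the indicator of [-1,1], and the operator F_c[f](x) = ∫_{-1}^{1} exp(i c x y) f(y) dy on L2([-1,1]); this normalisation is an assumption and may differ from the paper's. Under it, F_c* F_c has kernel 2 sin(c(x-y))/(c(x-y)), which is discretised with Gauss-Legendre quadrature (a Nyström approximation).

```python
import numpy as np

def truncated_fourier_singular_values(c, n_nodes=200):
    """Approximate singular values of F_c on L2([-1,1]) (assumed normalisation).

    The eigenvalues of F_c^* F_c, whose kernel is 2 sin(c(x-y)) / (c(x-y)),
    are approximated by a symmetric Nystrom discretisation.
    """
    nodes, weights = np.polynomial.legendre.leggauss(n_nodes)
    diff = nodes[:, None] - nodes[None, :]
    # np.sinc(u) = sin(pi u) / (pi u), so this equals 2 sin(c d) / (c d).
    kernel = 2.0 * np.sinc(c * diff / np.pi)
    sqrt_w = np.sqrt(weights)
    sym = sqrt_w[:, None] * kernel * sqrt_w[None, :]
    eigenvalues = np.linalg.eigvalsh(sym)[::-1]          # decreasing order
    return np.sqrt(np.clip(eigenvalues, 0.0, None))      # singular values

if __name__ == "__main__":
    sv = truncated_fourier_singular_values(c=5.0)
    print(sv[:8])  # a few sizeable values, then a very sharp drop
```

Only roughly 2c/π singular values are of non-negligible size, after which they fall off faster than exponentially. This is consistent with the strategy of truncating the series at a level that depends on t.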

3.2. Sets of smooth and integrable functions

Define, for all and increasing, , , , , , and ,

and when we replace by , where

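 (6) $b_m(t):=\left\langle\mathcal{F}_{1st}[f](t,\cdot),\varphi^{W,c(t)}_m\right\rangle_{L^2(W^{\otimes p})},\quad \theta_{q,k}(t):=\left(\sum_{m\in\mathbb{N}_0^p:\ |m|_q=k}|b_m(t)|^2\right)^{1/2}.$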

The first inequality in the definition of defines the notion of smoothness for functions in analyzed in this paper. It involves a maximum of two terms, thus two inequalities: the first corresponds to smoothness in the first variable and the second to smoothness in the other variables. The additional inequality imposes integrability in the first variable. The asymmetry in the treatment of the first and remaining variables is due to the fact that, in the statistical problem, only the random slopes are multiplied by regressors which have limited variation and we make integrability assumptions in the first variable which are as mild as possible. The use of the Fourier transform to express smoothness in the first variable is classical. For the remaining variables, we choose a framework that allows for both functions with compact and noncompact support and work with the bases for . For functions with compact support, it is possible to use Fourier series and we make a comparison in Section B.4. The use of different bases for different values of is motivated by (5). Though the spaces are chosen for mathematical convenience, we analyze all types of smoothness. The smoothness being unknown anyway, we provide an adaptive estimator. We analyze two values of and show that the choice of the norm matters for the rates of convergence for supersmooth functions.

Remark 1.

The next model is related to (1) under Assumption 1 when is known:

 (7) $dZ(t)=\mathcal{K}[f](t,\cdot)\,dt+\frac{\sigma}{\sqrt{n}}\,dG(t),\quad t\in\mathbb{R},$

where plays the role of , is known, and is a complex two-sided cylindrical Gaussian process on . This means, for Hilbert-Schmidt from to a separable Hilbert space , is a Gaussian process in of covariance (see [17]). Taking , where , and are independent two-sided Brownian motions, the system of independent equations

 (8) $Z_m(t):=\int_0^t\sigma^{W,c(s)}_m b_m(s)\,ds+\frac{\sigma}{\sqrt{n}}B_m(t),\quad t\in\mathbb{R},$

where, and , is equivalent to (7). Because is small when is large or is small (see Lemma B.4), the estimator of Section 4.1 truncates large values of and does not rely on small values of but uses interpolation.

Remark 2.

[32] considers a Gaussian sequence model corresponding to (7), is the Radon transform, , is a two-sided cylindrical Wiener process, and is a weighted space of functions with support in the unit disk of for which has a SVD with a known rate of decay of the singular values.

3.3. Interpolation

Define, for all , the operator

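 (9) $\mathcal{I}_{\underline{a},\epsilon}[f]:=\sum_{m\in\mathbb{N}_0}\frac{\rho^{W_{[-1,1]},\underline{a}\epsilon}_m}{\left(1-\rho^{W_{[-1,1]},\underline{a}}_m\right)\epsilon}\left\langle f,\mathcal{C}_{1/\epsilon}\left[g^{W_{[-1,1]},\underline{a}\epsilon}_m\right]\right\rangle_{L^2(\mathbb{R}\setminus(-\epsilon,\epsilon))}\mathcal{C}_{1/\epsilon}\left[g^{W_{[-1,1]},\underline{a}\epsilon}_m\right]$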

on with domain . For all , is a distribution.

Proposition 3.

For all , we have and, for all , in and, for and all ,

 (10)

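If , only relies on and on , so (9) provides an analytic formula to carry out interpolation on of functions in . Else, (10) provides an upper bound on the error made by approximating by on when approximates outside . We use interpolation when the variance of an initial estimator $\hat f_0$ of is large due to its values near 0 but is small and work with

 $\forall t\in\mathbb{R},\ \hat f(t)=\hat f_0(t)\,\mathbb{1}\{|t|\ge\epsilon\}+\mathcal{I}_{\underline{a},\epsilon}[\hat f_0](t)\,\mathbb{1}\{|t|<\epsilon\},$

in which case, (10) yields

 (11) $\left\|f-\hat f\right\|^2_{L^2(\mathbb{R})}\le\left(1+2C(\underline{a},\epsilon)\right)\left\|f-\hat f_0\right\|^2_{L^2(\mathbb{R}\setminus(-\epsilon,\epsilon))}+2\left(1+C(\underline{a},\epsilon)\right)\left\|f-\mathcal{P}_{\underline{a}}[f]\right\|^2_{L^2(\mathbb{R})}.$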

When is compact, is taken such that . Else, goes to infinity so the second term in (11) goes to 0. is taken such that is constant because, due to (3.87) in [44], and (10) and (11) become useless. Then is constant and we set . When , we get and .

3.4. Risk

The risk of an estimator is the mean integrated squared error (MISE)

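 $\mathcal{R}^W_{n_0}\left(\hat f_{\alpha,\beta},f_{\alpha,\beta}\right):=\mathbb{E}\left[\left\|\hat f_{\alpha,\beta}-f_{\alpha,\beta}\right\|^2_{L^2(1\otimes W^{\otimes p})}\,\middle|\,\mathcal{G}_{n_0}\right].$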

When and , it is , else,

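 (12) $\mathbb{E}\left[\left\|\hat f_{\alpha,\beta}-f_{\alpha,\beta}\right\|^2_{L^2(\mathbb{R}^{p+1})}\,\middle|\,\mathcal{G}_{n_0}\right]\le\left\|W^{-1}\right\|^p_{L^\infty(\mathbb{R})}\mathcal{R}^W_{n_0}\left(\hat f_{\alpha,\beta},f_{\alpha,\beta}\right).$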

We consider a risk conditional on for simplicity of the treatment of the random regressors with unknown law. We adopt the minimax approach and consider the supremum risk. The lower bounds involve a function (for rate) and take the form

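 (13) $\exists\nu>0:\ \liminf_{n\to\infty}\ \inf_{\hat f_{\alpha,\beta}}\ \sup_{f_{\alpha,\beta}\in\mathcal{H}^{q,\phi,\omega}_{w,W}(l)\cap\mathcal{D}}\mathbb{E}\left[\left\|\hat f_{\alpha,\beta}-f_{\alpha,\beta}\right\|^2_{L^2(\mathbb{R}^{p+1})}\right]\ge\nu\,r(n).$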

When we replace by , by , and consider model (8), we refer to (13’); when we also replace by , we refer to (13”), where is the set of functions in such that is not arbitrarily concentrated close to 0: for all , .

4. Estimation

The sets of densities in the supremum risk and of estimators in this section depend on . The rates of convergence depend on via .

4.1. Estimator considered

For all , and such that for and for , a regularized inverse is obtained by:

1. for all , obtain a preliminary approximation of

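 $F^{q,N,T,0}_1(t,\cdot):=\mathbb{1}\{|t|\le T\}\sum_{|m|_q\le N(t)}\frac{c_m(t)}{\sigma^{W,tx_0}_m}\varphi^{W,tx_0}_m,\quad c_m(t):=\left\langle\mathcal{F}\left[f_{Y|X=x_0\cdot}\right](t),g^{W,tx_0}_m\right\rangle_{L^2([-1,1]^p)},$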
2. for all , ,

3. .

To deal with the statistical problem, we carry out (S.1)-(S.3) replacing c_m(t) by the estimator

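 (14) $\hat c_m(t):=\frac{1}{n}\sum_{j=1}^n\frac{e^{itY_j}}{x_0^p\,\hat f^{\delta}_{X|\mathcal{X}}(X_j)}\,\overline{g^{W,tx_0}_m}\left(\frac{X_j}{x_0}\right)\mathbb{1}\{X_j\in\mathcal{X}\},$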

where and is a trimming factor converging to zero with . This yields the estimators , , and . We use as a final estimator of which always has a smaller risk than (see [25, 48]). We use for the sample size required for an ideal estimator where is known to achieve the rate of the plug-in estimator. The upper bounds below take the form

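 (15) $\frac{1}{r(n_e)}\sup_{f_{\alpha,\beta}\in\mathcal{H}^{q,\phi,\omega}_{w,W}(l,M)\cap\mathcal{D},\ f_{X|\mathcal{X}}\in\mathcal{E}}\mathcal{R}^W_{n_0}\left(\hat f^{q,N,T,\epsilon}_{\alpha,\beta},f_{\alpha,\beta}\right)=O_p(1).$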

When we use instead the restriction , we refer to (15’).
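The sample coefficient in (14) is a weighted empirical characteristic function. A minimal sketch for the scalar case p = 1 follows. It is not the code of the RandomCoefficients package, and several ingredients are assumptions made for illustration: the set of retained regressor values is taken to be [-x0, x0], the trimming is implemented as max(f_hat, delta), and the callables g and f_hat are hypothetical stand-ins for the singular function and the preliminary density estimator.

```python
import numpy as np

def c_hat(t, y, x, g, f_hat, x0, delta):
    """Sample analogue of c_m(t) for p = 1 (illustrative assumptions only).

    g     : callable on [-1, 1], stand-in for the singular function g_m
    f_hat : callable, stand-in for the estimated density of X on [-x0, x0]
    delta : lower trimming level for the density (assumed trimming rule)
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    inside = np.abs(x) <= x0                       # indicator that X_j is retained
    terms = np.zeros(x.shape, dtype=complex)
    dens = np.maximum(f_hat(x[inside]), delta)     # trimmed density in the denominator
    terms[inside] = (np.exp(1j * t * y[inside])
                     * np.conj(g(x[inside] / x0)) / (x0 * dens))
    return terms.mean()                            # average over all n observations

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n = 100_000
    x = rng.uniform(-1.0, 1.0, n)                  # regressor with limited variation
    y = rng.normal(size=n) + 1.0 * x               # alpha ~ N(0,1), beta = 1
    g0 = lambda u: np.full_like(u, 1.0 / np.sqrt(2.0))   # constant unit-norm function
    dens = lambda v: np.full_like(v, 0.5)                # true uniform density
    print(c_hat(1.0, y, x, g0, dens, x0=1.0, delta=1e-3))
```

In this toy design the target is the integral over [-1,1] of exp(-t²/2) exp(i t x0 u) g(u), which equals √2 exp(-1/2) sin(1) at t = 1 and x0 = 1, so the printed value can be compared with it up to Monte Carlo error.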

4.2. Logarithmic rates when ω is a power

The first result below involves, for all and , the inverse of which is such that, for all , is increasing.

Theorem 1.

Let , , , , , , for , and . (15) holds with in the following cases

1. , , , , and ,

2. , , , and

 ¯¯¯¯¯N(t)=ln(ne)2⎛⎜ ⎜⎝1l{|t|>π4Rx0}2σ+p−kq+π