    # Distribution free testing for linear regression. Extension to general parametric regression

Recently, a distribution free approach to testing parametric hypotheses, based on unitary transformations, has been suggested in <cit.> and further studied in <cit.> and <cit.>. In this note we show that the transformation takes an extremely simple form in distribution free testing of linear regression. We then extend it to general parametric regression with vector-valued covariates.

## Authors

07/02/2019


## 1 Introduction. An illustrative example with linear regression

The situation we consider in this paper is that of classical parametric regression: given a sequence of pairs of random variables $(X_i, Y_i)$, $i=1,\dots,n$, where $Y_i$ is the response variable, while $X_i$ is the explanatory variable, or covariate, of this pair, consider the regression of $Y_i$ on $X_i$,

$$Y_i = m(X_i) + \epsilon_i.$$

We assume that, given the covariates, the errors $\epsilon_i$ are i.i.d. with expected value zero and finite variance; for the sake of simplicity we assume this variance equals 1.

We are interested in the classical problem of testing that the regression function $m$ belongs to a specified parametric family of functions $\{m_\theta\}$, which depend on a finite-dimensional parameter $\theta$ and which satisfy more or less usual regularity assumptions as functions of this $\theta$.

Our aim is to describe a new method to build an asymptotically distribution free theory for testing such a hypothesis. More specifically, we will construct an asymptotically distribution free version of the regression empirical process, so that functionals of this process, used as test statistics, will be asymptotically distribution free. The core of the method is the application of unitary operators, as described relatively recently in Khmaladze [2013, 2016] and studied in Roberts and Nguyen.

Earlier, an asymptotically distribution free transformation of the regression empirical process was suggested in Khmaladze and Koul, where the limit distribution of the transformed process was that of standard Brownian motion. In this paper, the transformed process will converge to a standard projection of the standard Brownian motion, and the transformation will take a surprisingly simple form, convenient in everyday practice. As in Khmaladze and Koul, this transformation entails no loss of statistical information.

The shortest way to show how the method works is to consider the simplest linear regression model. That is, in

$$Y_i = X_i\theta + \epsilon_i,\quad i=1,\dots,n, \qquad\text{or, in vector form,}\quad Y = X\theta + \epsilon, \tag{1}$$

the covariates $X_i$ and the coefficient $\theta$ are one-dimensional. On the probabilistic nature of the covariates we will make practically no assumptions. We will only use their empirical distribution function

$$F_n(x) = \frac{1}{n}\sum_{i=1}^n I(X_i\le x)$$

and assume that, as the number of observed pairs increases, it weakly converges to some limiting distribution function $F$ – an assumption of ergodic nature. Whenever we use the time transformation $t=F(x)$, we will also assume that $F$ is continuous. All expectations below will be conditional expectations given the vector of covariates.

Consider the estimated errors, or residuals,

$$\hat\epsilon = Y - X\hat\theta = Y - z\,\langle Y, z\rangle,$$

where $z = X/\|X\|$ is the normalised vector of covariates, so that $X\hat\theta = z\langle Y, z\rangle$ for the least squares estimator $\hat\theta$. The natural object to base a goodness of fit test upon is the partial sums process

$$\hat w_n(x) = \frac{1}{\sqrt n}\sum_{i=1}^n\hat\epsilon_i\, I(X_i\le x).$$

However, the distribution of the vector $\hat\epsilon$ depends on the covariates: its covariance matrix has the form

$$E\hat\epsilon\hat\epsilon^T = I - zz^T.$$

As to the limit in distribution of the process $\hat w_n$, it is a projection of some Brownian motion, but not the Brownian bridge. Its distribution remains dependent on the behaviour of the covariates, and the limit distribution of statistics based on this process, in particular of its supremum, will not be easy to calculate.

However, consider new residuals $\hat e$ obtained from $\hat\epsilon$ by the unitary transformation

$$U_{a,b} = I - \frac{\langle a-b,\,\cdot\,\rangle}{1-\langle a,b\rangle}(a-b)$$

with $n$-dimensional vectors $a$ and $b$ of unit norm (if $a=b$ we take $U_{a,b}=I$). This operator is unitary, it maps $a$ into $b$ and $b$ into $a$, and it maps any vector orthogonal to $a$ and $b$ to itself; see, e.g., Khmaladze, Sec. 2. Now choose $a=z$ and choose $b$ equal to $r=(1/\sqrt n,\dots,1/\sqrt n)^T$, a vector not depending on the covariates at all. Since the vector of residuals $\hat\epsilon$ is orthogonal to the vector $z$, we obtain:

$$\hat e = U_{z,r}\,\hat\epsilon = \hat\epsilon - \frac{\langle\hat\epsilon, r\rangle}{1-\langle z, r\rangle}(r - z).$$

These new residuals have covariance matrix

$$E\hat e\hat e^T = I - rr^T.$$

This would be the covariance matrix of the residuals in the problem of testing

$$Y_i = \theta + \epsilon_i,\quad i=1,2,\dots,n, \tag{2}$$

which is completely free from covariates. Yet the transformation of $\hat\epsilon$ to $\hat e$ is one-to-one, and therefore $\hat e$ contains the same "statistical information", whichever way we measure it, as $\hat\epsilon$. One could say that testing the linear regression (1) and testing (2) is the same problem.
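In numerical terms, this transformation is one line of code. The sketch below (the simulated data and all variable names are ours, for illustration) computes the residuals $\hat\epsilon$, rotates them by $U_{z,r}$, and exposes two consequences of unitarity that serve as convenient checks: $\hat e$ is orthogonal to $r$, since $\langle\hat e, r\rangle = \langle\hat\epsilon, z\rangle = 0$, and $\|\hat e\| = \|\hat\epsilon\|$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.uniform(0.0, 2.0, n)          # covariates; their law is irrelevant for the test
eps = rng.normal(size=n)
Y = 1.5 * X + eps                     # model (1) with theta = 1.5 (our choice)

z = X / np.linalg.norm(X)             # normalised vector of covariates
eps_hat = Y - z * (Y @ z)             # residuals: Y minus its projection on z
r = np.full(n, 1.0 / np.sqrt(n))      # target unit vector, free of covariates

# new residuals e_hat = U_{z,r} eps_hat, using <eps_hat, z> = 0
e_hat = eps_hat - (eps_hat @ r) / (1.0 - z @ r) * (r - z)
```

Since the rotation is unitary and maps $z$ to $r$, the new residuals sum to zero (orthogonality to $r$) while keeping the norm of $\hat\epsilon$ unchanged.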

The partial sum process based on the new residuals,

$$\hat w_{n,e}(x) = \frac{1}{\sqrt n}\sum_{i=1}^n\hat e_i\, I(X_i\le x),$$

will converge in distribution, with the time transformation $t=F(x)$, to the standard Brownian bridge. Therefore the limit distribution of all classical statistics will be free from covariates and known.

Figure 1: The smooth line is the Kolmogorov distribution function. The two others are simulated distributions of $\max_x|\hat w_{n,e}(x)|$ for two entirely different behaviours of the covariates: in one case the $X_i$-s have the uniform distribution on $[0,2]$, while in the other they have the Gaussian distribution $N(1,2)$. 200 replications of samples of size $n=200$.
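The comparison in Figure 1 is easy to reproduce. The sketch below (our own code; the coefficient, sample sizes and seeds are our choices) simulates $\max_x|\hat w_{n,e}(x)|$ under the two covariate designs mentioned in the caption; under the null hypothesis, both samples of statistics should approximately follow the Kolmogorov distribution.

```python
import numpy as np

def sup_stat(X, rng):
    """sup_x |w_hat_{n,e}(x)| for one simulated sample from model (1)."""
    n = len(X)
    Y = 1.5 * X + rng.normal(size=n)
    z = X / np.linalg.norm(X)
    eps_hat = Y - z * (Y @ z)                    # residuals
    r = np.full(n, 1.0 / np.sqrt(n))
    e_hat = eps_hat - (eps_hat @ r) / (1.0 - z @ r) * (r - z)
    # partial sums of the new residuals in the order of the covariates
    w = np.cumsum(e_hat[np.argsort(X)]) / np.sqrt(n)
    return np.abs(w).max()

rng = np.random.default_rng(1)
stats_unif = [sup_stat(rng.uniform(0, 2, 200), rng) for _ in range(200)]
stats_norm = [sup_stat(rng.normal(1, np.sqrt(2), 200), rng) for _ in range(200)]
```

Plotting the two empirical distribution functions of `stats_unif` and `stats_norm` against the Kolmogorov distribution function reproduces the picture in Figure 1.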

Asymptotically distribution free tests, even if only for the case of linear regression, have been of interest for a long time. To achieve this distribution free-ness, different forms of residuals have been suggested, various decompositions of the residual vector, especially when covariates are multidimensional, have been studied, and approximations for quadratic forms in the residuals have been developed. The assumption of normality, arbitrary as it is in many cases, has been made more or less casually. With some freedom of speech, one could say that a mathematical lace has been created; a good source for this material is the book by Cook and Weisberg. The net result, however, is that only chi-square tests have been obtained. Distribution free forms of other classical statistics were never considered and constructed. We refer to McCullagh and Nelder for much of the existing theory of linear models. The most recent review on goodness of fit problems in regression which we know of is Gonzalez Manteiga and Crujeiras.

Note that the initial regression process of this paper, not yet asymptotically distribution free, is different from what was used in previous work, including relatively recent work. Although partial sum processes like $\hat w_n$ form one of the main objects of asymptotic theory, often a different form of such processes is considered, one simple example of which would be

$$\frac{1}{\sqrt n}\sum_{i=1}^n(X_i-\bar X_n)\, I(\hat\epsilon_i\le x) \tag{3}$$

(see a more sophisticated form of the weight function in the recent paper by Chown and Müller). Here the scanning runs over the values of the residuals. This is a very natural way of scanning when the statistical problems considered pertain to the distribution of the errors. An example, studied in the well known papers Dette and Munk, Dette and Hetzler, Dette et al and loc. cit. Chown and Müller, is the problem of testing heterogeneity of errors. The same scanning is basically unavoidable in the study of the distribution of i.i.d. errors, cf. Koul et al, and in the analysis of the distribution of innovations in autoregression models, see Müller et al.

In our current situation of testing the form of the regression function, it is natural to wish to see, in the case there is a deviation from the model, for what region of values of the covariate the deviation takes place, and scanning in the $X_i$-s allows this. Even in the simple case when the covariate is just discrete time, taking values $1,2,\dots,n$, it would be strange not to examine the sequence of residuals in this time, but instead to look at the order statistics based on them, which scanning as in (3) would imply. These considerations motivate the form of the regression processes $\hat w_n$ and $\hat w_{n,e}$. To make the illustrative example of this section of more immediate practical use, and to explain better the asymptotic behaviour of the regression empirical process, in the next Section 2 we consider the general form of one-dimensional linear regression. In the following Section 3 we consider general parametric regression. In this case the time transformation considered in (iii) of Proposition 2 below again leads to distribution free-ness if $F$ is continuous. If $F$ is discrete, then the method suggested in Khmaladze, Sec. 2, can easily be used. In Section 4 we consider multidimensional covariates. The transformation of $\hat\epsilon$ to $\hat e$ will not change, but to standardise the distribution of the regressors one could use normalisation by an estimator of the density of $X$, cf., e.g., Einmahl and Khmaladze, Can et al. Here, however, we consider an approach borrowed from the theory of optimal transportation, or the Monge–Kantorovich transportation problem, see, e.g., Villani. Very interesting probabilistic/statistical applications of this theory have recently been given in del Barrio et al and Segers.

## 2 General linear regression on R

Consider the standard linear regression on the real line,

$$Y_i = \theta_0 + X_i\theta_1 + \epsilon_i,\quad i=1,\dots,n, \qquad\text{or}\quad Y = \theta_0\mathbf 1 + X\theta_1 + \epsilon. \tag{4}$$

The $\mathbf 1$ here denotes a vector with all coordinates equal to the number 1. Instead of (4), consider its slightly modified and more convenient form

$$Y_i = \theta_0 + (X_i-\bar X)\theta_1 + \epsilon_i,\quad i=1,\dots,n, \qquad\text{or, in vector form,}\quad Y = \theta_0\mathbf 1 + (X-\bar X\mathbf 1)\theta_1 + \epsilon. \tag{5}$$

The least squares estimators of $\theta_0$ and $\theta_1$ are

$$\hat\theta_0 = \frac{1}{n}\sum_{j=1}^n Y_j \quad\text{and}\quad \hat\theta_1 = \frac{1}{\sum_{j=1}^n(X_j-\bar X)^2}\sum_{j=1}^n Y_j(X_j-\bar X).$$

Using again the notation $r = \mathbf 1/\sqrt n$ and the notation

$$\tilde z = \frac{1}{\sqrt{\sum_{j=1}^n(X_j-\bar X)^2}}\,(X-\bar X\mathbf 1)$$

for the normalised vector of centered covariates, one can write the residuals as

$$\hat\epsilon = Y - \hat\theta_0\mathbf 1 - \hat\theta_1(X-\bar X\mathbf 1),$$

or in more succinct form,

$$\hat\epsilon = Y - \langle Y, r\rangle r - \langle Y,\tilde z\rangle\tilde z.$$

Substitution of the linear regression model (5) for $Y$ produces a representation of the vector of residuals through the vector of errors $\epsilon$:

$$\hat\epsilon = \epsilon - \langle\epsilon, r\rangle r - \langle\epsilon,\tilde z\rangle\tilde z. \tag{6}$$

This represents $\hat\epsilon$ as the projection of $\epsilon$ orthogonal to $r$ and $\tilde z$.

From this it follows that the covariance matrix of $\hat\epsilon$ is

$$E\hat\epsilon\hat\epsilon^T = I - rr^T - \tilde z\tilde z^T,$$

and thus it still depends on the values of the covariates. The regression process with these residuals,

$$\hat w_n(x) = \frac{1}{\sqrt n}\sum_{i=1}^n\hat\epsilon_i\, I(X_i\le x),$$

will therefore have a limit distribution which depends on the covariates.

It is possible to say more about the geometric structure of $\hat w_n$ and its limiting process: namely, the limiting process will be a double projection of Brownian motion, orthogonal to the functions $F$ and

$$H(x) = \int^x h(y)\,dF(y), \quad\text{with}\quad h(x) = \frac{x - \int y\,dF(y)}{\sqrt{\int\big(z - \int y\,dF(y)\big)^2\,dF(z)}}.$$

Here one can think of $H$ as a continuous time "trace" of $\tilde z$.

To show this structure of $\hat w_n$, denote by $I_x$ the vector with coordinates $I(X_i\le x)$. Then we can write

$$\hat w_n(x) = \frac{1}{\sqrt n}\langle\hat\epsilon, I_x\rangle = \frac{1}{\sqrt n}\Big[\langle\epsilon, I_x\rangle - \langle\epsilon, r\rangle\langle r, I_x\rangle - \langle\epsilon,\tilde z\rangle\langle\tilde z, I_x\rangle\Big].$$

For the first term on the right hand side, considered as a process in $x$ and denoted $w_n(x)$, we see that

$$w_n(x) = \frac{1}{\sqrt n}\langle\epsilon, I_x\rangle = \frac{1}{\sqrt n}\sum_{i=1}^n\epsilon_i\, I(X_i\le x) \tag{7}$$

is a process of partial sums of i.i.d. random variables, with $Ew_n(x)=0$ and $Ew_n^2(x)=F_n(x)$. Therefore $w_n$ converges in distribution to Brownian motion in time $F$. Now consider the second term: since $r=\mathbf 1/\sqrt n$, it equals $w_n(\infty)F_n(x)$.

The third term produces the following expression:

$$\frac{1}{\sqrt n}\sum_{j=1}^n\epsilon_j(X_j-\bar X)\,\frac{1}{\sum_{j=1}^n(X_j-\bar X)^2}\sum_{i=1}^n(X_i-\bar X)I(X_i\le x) = \int(y-\bar X)\,w_n(dy)\,\frac{1}{\int(y-\bar X)^2\,dF_n(y)}\int^x(y-\bar X)\,dF_n(y) = \int h_n(y)\,w_n(dy)\int^x h_n(y)\,dF_n(y),$$

where

$$h_n(x) = \frac{x-\bar X}{\sqrt{\int(y-\bar X)^2\,dF_n(y)}}.$$

This function obviously has unit $L_2(F_n)$-norm and is orthogonal to constant functions. Overall, we see that

$$\hat w_n(x) = w_n(x) - w_n(\infty)F_n(x) - \int h_n(y)\,w_n(dy)\int^x h_n(y)\,dF_n(y), \tag{8}$$

and the right hand side of (8) is the orthogonal projection of $w_n$ which annihilates $F_n$ and $H_n(x)=\int^x h_n(y)\,dF_n(y)$. As a consequence, if $w_n$ converges in distribution to $w(F(\cdot))$, then $\hat w_n$ converges to the corresponding projection of this Brownian motion.

What we propose now is, again, to replace the residuals $\hat\epsilon$ by new residuals $\hat e$, constructed as their unitary transformation. As a preliminary step, assume that the covariates are listed in increasing order, $X_1\le X_2\le\cdots\le X_n$. One can assume this without loss of generality: even if it entails re-shuffling of our initial pairs of observations, the probability measure we work under will not change, because the re-shuffled errors will still be independent of the permuted covariates and will still form an i.i.d. sequence.

Now introduce another vector $\tilde r$, different from $\tilde z$, which also has unit norm and is orthogonal to $r$. Define

$$\hat e = U_{\tilde z,\tilde r}\,\hat\epsilon = \hat\epsilon - \frac{\langle\hat\epsilon,\tilde r-\tilde z\rangle}{1-\langle\tilde z,\tilde r\rangle}(\tilde r-\tilde z) = \hat\epsilon - \frac{\langle\hat\epsilon,\tilde r\rangle}{1-\langle\tilde z,\tilde r\rangle}(\tilde r-\tilde z),$$

where the second equality is true because the vector $\hat\epsilon$ is orthogonal to the vector $\tilde z$, see (6). Thus the calculation of the new residuals is as simple as in the previous case of (1).
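As in Section 1, the computation is elementary. The sketch below (our own code; coefficients and seed are illustrative choices) computes the residuals (6) for the model (5), rotates them by $U_{\tilde z,\tilde r}$ with the choice (10) of $\tilde r$, and checks that the new residuals are orthogonal to $r$ exactly and to $\tilde r$ up to the $O(1/n^2)$ discretisation error in the norm of (10).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
X = np.sort(rng.normal(1.0, np.sqrt(2.0), n))   # covariates listed in increasing order
Y = 0.5 + 2.0 * X + rng.normal(size=n)          # model (5); coefficients are our choice

i = np.arange(1, n + 1)
r = np.full(n, 1.0 / np.sqrt(n))
Xc = X - X.mean()
z_tilde = Xc / np.linalg.norm(Xc)               # normalised centered covariates
r_tilde = np.sqrt(12.0 / n) * (i / n - (n + 1) / (2 * n))   # choice (10)

eps_hat = Y - (Y @ r) * r - (Y @ z_tilde) * z_tilde          # residuals as in (6)
e_hat = eps_hat - (eps_hat @ r_tilde) / (1.0 - z_tilde @ r_tilde) * (r_tilde - z_tilde)
```

The two orthogonality relations mirror statement (i) of Proposition 1: the covariance structure of the new residuals involves only $r$ and $\tilde r$, neither of which depends on the covariates.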

Let us summarise the properties of $\hat e$ in the following proposition. In this, for the transition to the limit as $n\to\infty$, it is natural to assume that $\tilde r$ can be represented through some piece-wise continuous function $\tilde r(\cdot)$ on $[0,1]$:

$$\tilde r_i = \frac{1}{\sqrt n}\,\tilde r\Big(\frac{i}{n}\Big), \tag{9}$$

in which case we have the convergence

$$\frac{1}{\sqrt n}\sum_{i=1}^{nt}\tilde r_i = \frac{1}{n}\sum_{i=1}^{nt}\tilde r\Big(\frac{i}{n}\Big) \to \int_0^t\tilde r(s)\,ds = Q(t)$$

and

$$\sum_{i=1}^{nt}\tilde r_i^2 = \frac{1}{n}\sum_{i=1}^{nt}\tilde r^2\Big(\frac{i}{n}\Big) \to \int_0^t\tilde r^2(s)\,ds.$$

Orthogonality of the vector $\tilde r$ to the vector $r$ implies orthogonality of the function $\tilde r(\cdot)$ to constant functions, or $\int_0^1\tilde r(s)\,ds = 0$. For example, $\tilde r$ can be chosen as

$$\tilde r_i = \sqrt{\frac{12}{n}}\Big[\frac{i}{n} - \frac{n+1}{2n}\Big]. \tag{10}$$
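The choice (10) can be checked numerically in a few lines (the code below is our own sketch). Exact summation gives $\langle\tilde r, r\rangle = 0$, $\|\tilde r\|^2 = 1 - 1/n^2$, and, for the corresponding $Q_n$, the value $-\sqrt3/4$ at $t=1/2$, consistent with the limit $Q(t) = -\sqrt3\,t(1-t)$.

```python
import numpy as np

n = 500
i = np.arange(1, n + 1)
r = np.full(n, 1.0 / np.sqrt(n))
r_tilde = np.sqrt(12.0 / n) * (i / n - (n + 1) / (2 * n))   # formula (10)

# Q_n(t) = n^{-1/2} * partial sums of r_tilde, cf. (9)
Q = np.cumsum(r_tilde) / np.sqrt(n)
```

The norm of the discrete vector is 1 only up to $O(1/n^2)$; this is the asymptotically negligible correction alluded to in the text.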
###### Proposition 1.

(i) The covariance matrix of $\hat e$ is

$$E\hat e\hat e^T = I - rr^T - \tilde r\tilde r^T$$

and therefore does not involve the covariates, as soon as $\tilde r$ does not.

(ii) If (9) is true, then the regression empirical process based on $\hat e$,

$$\hat w_{n,e}(x) = \frac{1}{\sqrt n}\sum_{i=1}^n\hat e_i\, I(X_i\le x),$$

has the covariance function

$$E\hat w_{n,e}(x)\hat w_{n,e}(y) = F_n(\min(x,y)) - F_n(x)F_n(y) - Q_n(F_n(x))\,Q_n(F_n(y)) + O(1/n),$$

where $Q_n(t) = \frac{1}{n}\sum_{i=1}^{nt}\tilde r(i/n)$. In the case of (10),

$$Q_n(F_n(x)) \sim -\sqrt 3\,F_n(x)\big(1-F_n(x)\big),\quad n\to\infty.$$

(iii) As a corollary of (ii), the process $\hat w_{n,e}$, with the change of time $t=F(x)$, converges in distribution to the projection of standard Brownian motion on $[0,1]$ orthogonal to the function identically equal to 1 and to $\tilde r$.

The main step in the proof of (i) is to express $\hat e$ through $\epsilon$:

$$U_{\tilde z,\tilde r}\,\hat\epsilon = U_{\tilde z,\tilde r}\,\epsilon - \langle\epsilon, r\rangle U_{\tilde z,\tilde r}\, r - \langle\epsilon,\tilde z\rangle U_{\tilde z,\tilde r}\,\tilde z = U_{\tilde z,\tilde r}\,\epsilon - \langle\epsilon, r\rangle r - \langle\epsilon,\tilde z\rangle\tilde r,$$

where the second equality is correct because $U_{\tilde z,\tilde r}\,r = r$ and $U_{\tilde z,\tilde r}\,\tilde z = \tilde r$ by the definition of $U_{\tilde z,\tilde r}$. Therefore

$$\hat e = U_{\tilde z,\tilde r}\,\hat\epsilon = \epsilon - \frac{\langle\epsilon,\tilde r-\tilde z\rangle}{1-\langle\tilde z,\tilde r\rangle}(\tilde r-\tilde z) - \langle\epsilon, r\rangle r - \langle\epsilon,\tilde z\rangle\tilde r.$$

The calculation of the covariance matrix of the right hand side is now not difficult, using the shorthand formulas $E\epsilon\epsilon^T = I$ and $E\langle a,\epsilon\rangle\langle\epsilon,b\rangle = \langle a,b\rangle$. After some algebra we obtain the expression given in (i).

To show (ii), use the vector notation $I_x$ for the indicators:

$$E\hat w_{n,e}(x)\hat w_{n,e}(y) = \frac{1}{n}E\langle I_x,\hat e\rangle\langle\hat e, I_y\rangle = \frac{1}{n}I_x^T\big(I - rr^T - \tilde r\tilde r^T\big)I_y.$$

Opening the brackets in the last expression, one can find that

$$\frac{1}{n}\langle I_x, I_y\rangle = F_n(\min(x,y)) \quad\text{and}\quad \frac{1}{n}\langle I_x, r\rangle\langle I_y, r\rangle = F_n(x)F_n(y),$$

while

$$\frac{1}{n}\langle I_x,\tilde r\rangle\langle I_y,\tilde r\rangle = \frac{1}{n}\sum_{i=1}^n\tilde r\Big(\frac{i}{n}\Big)I(X_i\le x)\cdot\frac{1}{n}\sum_{i=1}^n\tilde r\Big(\frac{i}{n}\Big)I(X_i\le y) = \frac{1}{n}\sum_{i=1}^{nF_n(x)}\tilde r\Big(\frac{i}{n}\Big)\cdot\frac{1}{n}\sum_{i=1}^{nF_n(y)}\tilde r\Big(\frac{i}{n}\Big) = Q_n(F_n(x))\,Q_n(F_n(y)),$$

which proves (ii).

The statement (iii) follows if we note that the covariance function of $\hat w_{n,e}$ in time $t=F(x)$ converges to $\min(s,t) - st - Q(s)Q(t)$, and that orthogonality of the function $\tilde r(\cdot)$ to the function identically equal to 1 makes the last expression the covariance of the Gaussian process

$$w(t) - t\,w(1) - Q(t)\int_0^1\tilde r(s)\,w(ds),$$

which indeed is the projection described in (iii).

In both regression models (1) and (5) the process $\hat w_n$ turns out to be a projection of a Brownian motion, but for different values of the covariates these projections are different. However, it is geometrically clear that it should be possible to rotate one projection into another, and this one into still another, thus creating a class of equivalent projections – those which can be mapped into each other. One can then choose a single representative in each equivalence class, call it standard, and rotate any other projection into this standard one. What was done in this and the previous section is that we selected two standard projections and constructed the rotation of the other ones into these two.

The usefulness of this approach depends on how practically simple the rotation will be. To us, the transformation of $\hat\epsilon$ into $\hat e$ looks very simple.

Finally, note that the model (5) includes two estimated parameters while the model (1) includes only one. However, since the vector $r$ is already "standard", independent of the covariates, there is no need to "rotate" it to any other vector. Therefore in both cases a one-dimensional rotation is sufficient. The situation when one needs to rotate several vectors at once, as in the general form of parametric regression, will be considered in the next Section 3.

## 3 General parametric regression

Now consider testing the regression model

$$Y_i = m_\theta(X_i) + \epsilon_i,\quad i=1,\dots,n, \qquad\text{or, in vector form,}\quad Y = m_\theta(X) + \epsilon, \tag{11}$$

where $m_\theta(X)$ denotes the vector with coordinates $m_\theta(X_i)$, and $m_\theta$ is a regression function depending on a $d$-dimensional parameter $\theta$. We will assume some regularity of $m_\theta$ with respect to $\theta$, namely that it is continuously differentiable in $\theta$. An obvious example when this condition is true is given by polynomial regression

$$m_\theta(x) = \theta_1 p_1(x) + \theta_2 p_2(x) + \cdots + \theta_d p_d(x),$$

where the $p_k$ may form a system of (orthogonal) polynomials, or splines (see, e.g., Harrell, Sec. 2.4.3), or trigonometric polynomials. There certainly are also many examples where $m_\theta$ is not linear in $\theta$.

Now denote by

$$\dot m_\theta(x) = \Big(\frac{\partial}{\partial\theta_1}m_\theta(x),\dots,\frac{\partial}{\partial\theta_d}m_\theta(x)\Big)^T$$

the $d$-dimensional vector-function of partial derivatives. Then $\dot m_\theta(X)$ is a matrix with $n$ rows and $d$ columns. We assume that for every $\theta$ the coordinates of $\dot m_\theta$ are linearly independent as functions of $x$, which heuristically means that the model does not include unnecessary parameters.

Let now $\hat\theta$ denote the least squares estimator of $\theta$, which is an appropriate solution of the least squares equation

$$\sum_{i=1}^n\dot m_{\hat\theta}(X_i)\big[Y_i - m_{\hat\theta}(X_i)\big] = 0.$$

Without digressing to exact justification (which can be found, e.g., in Bates and Watts), assume that the Taylor expansion in $\theta$ is valid and that, together with normalisation by $\sqrt n$, it leads to

$$\frac{1}{\sqrt n}\sum_{i=1}^n\dot m_\theta(X_i)\big[Y_i - m_\theta(X_i)\big] - R_n\sqrt n\,(\hat\theta-\theta) + \rho_n = 0$$

with a non-degenerate $d\times d$ matrix

$$R_n = \frac{1}{n}\sum_{i=1}^n\dot m_\theta(X_i)\,\dot m_\theta^T(X_i) = \int\dot m_\theta(x)\,\dot m_\theta^T(x)\,dF_n(x)$$

and a $d$-dimensional vector of remainder terms $\rho_n$, negligible in probability. Below, for terms asymptotically negligible in this sense, we will use the notation $o_P(1)$. From the previous display we obtain the asymptotic representation for $\sqrt n(\hat\theta-\theta)$:

$$\sqrt n\,(\hat\theta-\theta) = R_n^{-1}\frac{1}{\sqrt n}\sum_{i=1}^n\dot m_\theta(X_i)\big[Y_i - m_\theta(X_i)\big] + o_P(1).$$

As the final step, expand the differences $Y_i - m_{\hat\theta}(X_i)$ up to the linear term and substitute the expression for $\hat\theta-\theta$ to get

$$Y_i - m_{\hat\theta}(X_i) = Y_i - m_\theta(X_i) - \dot m_\theta^T(X_i)\,R_n^{-1}\frac{1}{n}\sum_{j=1}^n\dot m_\theta(X_j)\big[Y_j - m_\theta(X_j)\big] + o_P(1),$$

or

$$\hat\epsilon_i = \epsilon_i - \dot m_\theta^T(X_i)\,R_n^{-1}\frac{1}{n}\sum_{j=1}^n\dot m_\theta(X_j)\,\epsilon_j + o_P(1).$$

In vector form this becomes

$$\hat\epsilon = \epsilon - \dot m_\theta^T R_n^{-1}\frac{1}{n}\langle\dot m_\theta,\epsilon\rangle + o_P(1), \tag{12}$$

an expression directly analogous to (6). It also describes the vector of residuals as being, asymptotically, a projection of the vector of errors $\epsilon$, parallel to the $n$-dimensional vectors of derivatives

$$\Big(\frac{\partial}{\partial\theta_1}m_\theta(X_i)\Big)_{i=1}^n,\ \dots,\ \Big(\frac{\partial}{\partial\theta_d}m_\theta(X_i)\Big)_{i=1}^n.$$

It will be notationally simpler, while computationally not difficult, to change these linearly independent vectors to orthonormal ones. Namely, introduce the functions

$$\mu_{\theta k}(x) = \Big(R_n^{-1/2}\,\dot m_\theta(x)\Big)_k,\quad k=1,\dots,d,$$

and then the vectors $\mu_{\theta k}$ with coordinates

$$\mu_{\theta k,i} = \frac{1}{\sqrt n}\,\mu_{\theta k}(X_i),\quad i=1,\dots,n. \tag{13}$$

The two notations are convenient each in its place: $\mu_{\theta k}$ as a vector in $\mathbb R^n$ will be useful in expressions like (14), while $\mu_{\theta k}(\cdot)$ as a function will be useful in integral expressions like (15). Their respective norms are equal:

$$\sum_{i=1}^n\mu_{\theta k,i}^2 = \int\mu_{\theta k}^2(x)\,dF_n(x).$$

Which of these two objects we use will be visible in the notation and clear from the context.
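The normalisation by $R_n^{-1/2}$ is easy to carry out in practice. The sketch below (our own code, using quadratic polynomial regression as the example family) builds the derivative matrix, computes $R_n^{-1/2}$ through an eigendecomposition, and checks that the resulting vectors (13) are exactly orthonormal: their Gram matrix is $R_n^{-1/2} R_n R_n^{-1/2} = I$.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
X = rng.uniform(-1.0, 1.0, n)

# derivative matrix dot m_theta(X) for m_theta(x) = th1 + th2*x + th3*x^2
M = np.column_stack([np.ones(n), X, X**2])
Rn = M.T @ M / n                                 # the matrix R_n of the text

# R_n^{-1/2} via the eigendecomposition of the symmetric matrix R_n
vals, vecs = np.linalg.eigh(Rn)
Rn_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T

mu = (M @ Rn_inv_sqrt) / np.sqrt(n)              # columns: the vectors mu_{theta k} of (13)
```

This exact orthonormality (not merely asymptotic) is what makes the vectors $\mu_{\theta k}$ a convenient starting point for the rotations below.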

Now we can write (12) as

$$\hat\epsilon = \epsilon - \sum_{k=1}^d\mu_{\theta k}\,\langle\mu_{\theta k},\epsilon\rangle + o_P(1), \tag{14}$$

where the leading term on the right hand side is the projection of $\epsilon$ orthogonal to the vectors $\mu_{\theta 1},\dots,\mu_{\theta d}$. As a consequence, one can show that the following analogue of the representation (8) is true:

$$\hat w_n(x) = \frac{1}{\sqrt n}\sum_{i=1}^n\big[Y_i - m_{\hat\theta}(X_i)\big]I(X_i\le x) = w_n(x) - \sum_{k=1}^d\int_{z\le x}\mu_{\theta k}(z)\,dF_n(z)\int\mu_{\theta k}(z)\,w_n(dz) + o_P(1). \tag{15}$$

This, again, describes $\hat w_n$ as, asymptotically, a projection of $w_n$ orthogonal to the functions $\mu_{\theta k}$. We are ready to describe the rotation of this projection into another, standard, projection, and of $\hat\epsilon$ into a vector of new residuals.

With some freedom of speech, we may say that one can choose these new residuals in any way we wish; for example, choose them independent of any covariates. In particular, let $r_1(\cdot)$ be the function on $[0,1]$ identically equal to 1, and with this let the vectors $r_k$ be defined by $r_{k,i} = \frac{1}{\sqrt n}r_k(\frac{i}{n})$, where the system of functions $r_1,\dots,r_d$ is such that

$$\frac{1}{n}\sum_{i=1}^n r_k\Big(\frac{i}{n}\Big)r_l\Big(\frac{i}{n}\Big) = \delta_{kl},\quad k,l=1,\dots,d.$$

If we derive a unitary operator which maps the orthonormal vectors $\mu_{\theta k}$ into the vectors $r_k$, then this operator will map $\hat\epsilon$ into $\hat e$, and the covariance matrix of these new residuals will be defined solely by the $r_k$.

As a side and rather inconsequential remark, we note that it would be immediate to choose orthonormal polynomials on $[0,1]$, i.e. such that

$$\int_0^1 r_k(s)\,r_l(s)\,ds = \delta_{kl},$$

which are continuous and bounded functions. Such polynomials will not exactly satisfy the orthogonality condition in the previous display, but will require small corrections, asymptotically negligible for large $n$. Inserting these corrections into our notation would make the text more complicated without revealing any new feature of the transformation we want to discuss. Therefore, in notation, we will identify orthogonal polynomials in continuous time with those orthonormal on the grid $\{i/n,\ i=1,\dots,n\}$.
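A concrete such system, used here purely as an illustration, is the shifted Legendre polynomials, whose first member is the function identically equal to 1. The sketch below (closed forms are standard; the grid check is ours) verifies their orthonormality on $[0,1]$ by the midpoint rule.

```python
import numpy as np

# first three shifted Legendre polynomials, orthonormal in L_2[0, 1]
r_funcs = [
    lambda s: np.ones_like(s),                               # r_1 = 1
    lambda s: np.sqrt(3.0) * (2.0 * s - 1.0),                # r_2
    lambda s: np.sqrt(5.0) * (6.0 * s**2 - 6.0 * s + 1.0),   # r_3
]

# midpoint-rule approximation of the Gram matrix on a fine grid
s = (np.arange(100_000) + 0.5) / 100_000
gram = np.array([[np.mean(rk(s) * rl(s)) for rl in r_funcs] for rk in r_funcs])
```

On the grid $i/n$ the same polynomials are orthonormal only up to the small corrections mentioned above, which is exactly the discrepancy the text agrees to ignore in notation.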

It is essential that the structure of the resulting operator allows convenient handling. We present it here as a product of one-dimensional unitary operators. This allows coding of the operator in a loop, and it was tried for the case of contingency tables with about a 30-dimensional parameter in Nguyen.

Suppose in the one-dimensional unitary operator $U_{a,b}$ we choose $a = \mu_{\theta 1}$ and $b = r_1$, and apply the resulting operator to the vector $r_2$:

$$U_{\mu_{\theta 1},r_1}\,r_2 = \tilde r_2.$$

Then the product

$$K_2 = U_{\mu_{\theta 2},\tilde r_2}\times U_{\mu_{\theta 1},r_1}$$

is a unitary operator which maps the vectors $\mu_{\theta 1},\mu_{\theta 2}$ to the vectors $r_1,r_2$ and vice versa, and leaves vectors orthogonal to these four vectors unchanged. For general $k$, define $\tilde r_k$ by

$$\tilde r_k = K_{k-1}\,r_k,\quad k=2,\dots,d.$$
###### Lemma 1.

The product

$$K_d = U_{\mu_{\theta d},\tilde r_d}\times\cdots\times U_{\mu_{\theta 1},r_1}$$

is a unitary operator which maps $\mu_{\theta k}$ to $r_k$, $k=1,\dots,d$, and vice versa, and leaves vectors orthogonal to all the $\mu_{\theta k}$ and $r_k$ unchanged.

The proof of this lemma was given, e.g., in Khmaladze, Section 3.4. It may be of independent interest for statistics of directional data, where an explicit expression for rotations is needed. Therefore, for the reader's convenience, at the end of this section we give an essentially shorter proof.
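The product structure can be exercised numerically. The sketch below is an illustration in the spirit of Lemma 1 rather than a transcription of it: it accumulates one-dimensional rotations one factor at a time, pulling each target vector $r_k$ back through the rotations built so far (one of the equivalent ways to arrange the factors), and then checks that the resulting orthogonal matrix maps each $\mu_{\theta k}$ to $r_k$. All function names are ours.

```python
import numpy as np

def U(a, b):
    """Matrix of U_{a,b}: unitary, swaps the unit vectors a and b,
    and fixes every vector orthogonal to both."""
    if np.isclose(a @ b, 1.0):
        return np.eye(len(a))           # a = b: take U_{a,b} = I
    d = (a - b)[:, None]
    return np.eye(len(a)) - d @ d.T / (1.0 - a @ b)

def product_rotation(mus, rs):
    """Product of one-dimensional rotations mapping the orthonormal
    system mus onto the orthonormal system rs, built in a loop."""
    K = np.eye(len(mus[0]))
    for mu, r in zip(mus, rs):
        r_back = np.linalg.solve(K, r)  # pull r_k back through the current product
        K = K @ U(mu, r_back)           # append one more one-dimensional rotation
    return K

# a small check with random orthonormal systems (d = 3, n = 8)
rng = np.random.default_rng(4)
mus, _ = np.linalg.qr(rng.normal(size=(8, 3)))
rs, _ = np.linalg.qr(rng.normal(size=(8, 3)))
K = product_rotation(mus.T, rs.T)
```

Each factor touches only a two-dimensional subspace, so the whole operator costs $d$ rank-one updates; this is what makes the construction practical even for parameters of dimension around 30.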

Thus, in the proposition below we denote

$$\hat e = K_d\,\hat\epsilon, \tag{16}$$

and recall that the $X_i$-s are numbered in increasing order. We also write

$$E\hat\epsilon\hat\epsilon^T \sim I - \sum_{k=1}^d\mu_{\theta k}\mu_{\theta k}^T$$

in the sense that, for any sequence of $n$-vectors $b_n$ with bounded norms,

$$E\langle b_n,\hat\epsilon\rangle^2 \sim \langle b_n,b_n\rangle - \sum_{k=1}^d\langle b_n,\mu_{\theta k}\rangle^2,\quad n\to\infty.$$

This notion of equivalence is used in the proposition below.

###### Proposition 2.

Suppose the regression function is regular, in the sense that, for every $\theta$, the matrix $R_n$ is of full rank and converges to a limiting matrix of full rank, and (14) is true. Suppose the functions $r_1,\dots,r_d$ are continuous and bounded on $[0,1]$. Then

(i) for the covariance matrix of the residuals $\hat e$ the following is true:

$$E\hat e\hat e^T \sim I - \sum_{k=1}^d r_k r_k^T,\quad n\to\infty;$$

(ii) for the empirical regression process based on the residuals of (16),

$$\hat w_{n,e}(x) = \frac{1}{\sqrt n}\sum_{i=1}^n\hat e_i\, I(X_i\le x),$$

the following convergence of the covariance function is true:

$$E\hat w_{n,e}(x)\hat w_{n,e}(y) \to F(\min(x,y)) - \sum_{k=1}^d Q_k(F(x))\,Q_k(F(y)),\quad n\to\infty,$$

where $Q_k(t) = \int_0^t r_k(s)\,ds$;
moreover,
(iii) the process $\hat w_{n,e}$, with the time change $t=F(x)$, converges in distribution to the projection of standard Brownian motion on $[0,1]$ orthogonal to the functions $r_1,\dots,r_d$.

To prove (i) we do not need the explicit form of the operator $K_d$; instead we note that, according to (14), up to an asymptotically negligible term, $\hat\epsilon$ is the projection of $\epsilon$ orthogonal to the collection of $n$-vectors $\mu_{\theta 1},\dots,\mu_{\theta d}$. According to the lemma above, these vectors are mapped by the operator $K_d$ to the $n$-vectors $r_1,\dots,r_d$, and the operator is unitary. Therefore the vector $\hat\epsilon$ will be mapped into the vector $\hat e$ which, up to an asymptotically negligible term, will be the projection of $\epsilon$ orthogonal to $r_1,\dots,r_d$:

$$\hat e = \epsilon - \sum_{k=1}^d r_k\,\langle r_k,\epsilon\rangle + o_P(1). \tag{17}$$

The covariance matrix of this vector is the expression given in (i).

To prove (ii), replace $\hat e$ by its main term in (17) in the expected value

$$E\hat w_{n,e}(x)\hat w_{n,e}(y) = \frac{1}{n}E\langle I_x,\hat e\rangle\langle\hat e, I_y\rangle \sim \frac{1}{n}I_x^T\Big(I - \sum_{k=1}^d r_k r_k^T\Big)I_y.$$

Here, since every $r_k(\cdot)$ is continuous and bounded,

$$\frac{1}{\sqrt n}\,I_x^T r_k = \frac{1}{n}\sum_{i=1}^n r_k\Big(\frac{i}{n}\Big)I(X_i\le x) \sim \int_{z\le x} r_k(F(z))\,dF(z).$$

The statement (iii) of convergence in distribution follows not from the unitarity of $K_d$ as such, but from the simplicity of its structure, reflected in (17). We have

$$\hat w_{n,e}(x) \sim \frac{1}{\sqrt n}\Big\langle I_x,\ \epsilon - \sum_{k=1}^d r_k\langle r_k,\epsilon\rangle\Big\rangle = \frac{1}{\sqrt n}\langle I_x,\epsilon\rangle - \frac{1}{\sqrt n}\sum_{k=1}^d\langle I_x,r_k\rangle\langle r_k,\epsilon\rangle.$$

The first inner product on the right hand side, denoted $w_n(x)$ in (7), converges in distribution to $F$-Brownian motion. The expression for $\frac{1}{\sqrt n}\langle I_x,r_k\rangle$ we considered above, while

$$\langle r_k,\epsilon\rangle = \frac{1}{\sqrt n}\sum_{i=1}^n r_k\Big(\frac{i}{n}\Big)\epsilon_i = \frac{1}{\sqrt n}\sum_{i=1}^n r_k(F_n(X_i))\,\epsilon_i = \int r_k(F_n(x))\,w_n(dx).$$

Thus, overall representation of