1 Introduction. An illustrative example with linear regression
The situation we consider in this paper is that of classical parametric regression: given a sequence of pairs of random variables $(X_i, Y_i)$, $i = 1, \dots, n$, where $Y_i$ is the response variable, while $X_i$ is the explanatory variable, or covariate, of this $Y_i$, consider the regression of $Y$ on $X$,
$$Y_i = m(X_i) + \varepsilon_i, \qquad i = 1, \dots, n.$$
We assume that, given the covariates $X_1, \dots, X_n$, the errors $\varepsilon_1, \dots, \varepsilon_n$ are i.i.d. and have expected value zero and finite variance; for the sake of simplicity we assume this variance equals 1.
We are interested in the classical problem of testing that the regression function $m$ belongs to a specified parametric family of functions $\{m(\cdot, \theta)\}$, which depend on a finite-dimensional parameter $\theta$ and which satisfy more or less the usual regularity assumptions as functions of this $\theta$.
Our aim is to describe a new method of building an asymptotically distribution free theory for testing such hypotheses. More specifically, we will construct an asymptotically distribution free version of the regression empirical process, so that functionals of this process, used as test statistics, will be asymptotically distribution free. The core of the method is based on the application of unitary operators, as described relatively recently in Khmaladze [2013, 2016] and studied in Roberts and Nguyen.
Earlier, an asymptotically distribution free transformation of the regression empirical process was suggested in Khmaladze and Koul. For $d$-dimensional covariates, the limit distribution of the transformed process was that of standard Brownian motion. In this paper, the transformed process will converge to a standard projection of the standard Brownian motion, and the transformation will take a surprisingly simple form, convenient in everyday practice. As in Khmaladze and Koul, this transformation entails no loss of statistical information.
The shortest way to show how the method works is to consider the simplest linear regression model. That is, in
$$Y_i = \beta X_i + \varepsilon_i, \qquad i = 1, \dots, n, \qquad (1)$$
the covariates $X_i$ and the coefficient $\beta$ are one-dimensional. On the probabilistic nature of the covariates $X_i$ we will make practically no assumptions. We will only use their empirical distribution function
$$F_n(x) = \frac{1}{n} \sum_{i=1}^n \mathbb{1}\{X_i \le x\},$$
and assume that, as the number $n$ of observed pairs increases, it converges weakly to some limiting distribution function $F$ – an assumption of ergodic nature. Whenever we use the time transformation $t = F(x)$, we will also assume that $F$ is continuous. All expectations below will be conditional expectations given the vector of numbers $X_1, \dots, X_n$.
Consider the estimated errors, or residuals,
$$\hat\varepsilon = \varepsilon - q\,\langle q, \varepsilon \rangle,$$
where $q = X/\|X\|$ is the normalised vector of covariates. The natural object to base a goodness of fit test upon is given by the partial sums process
$$\hat v_n(x) = \frac{1}{\sqrt n} \sum_{i=1}^n \hat\varepsilon_i \, \mathbb{1}\{X_i \le x\}.$$
However, the distribution of the vector $\hat\varepsilon$ depends on the covariates: its covariance matrix has the form
$$E \hat\varepsilon \hat\varepsilon^T = I - q q^T.$$
As to the limit in distribution of the process $\hat v_n$, it is a projection of some Brownian motion, but not the Brownian bridge. Its distribution remains dependent on the behaviour of the covariates, and the limit distribution of statistics based on this process, in particular of its supremum, will not be easy to calculate.
However, consider new residuals $\hat e$, obtained from $\hat\varepsilon$ by the unitary transformation
$$U_{a,b} = I - \frac{(a - b)(a - b)^T}{1 - \langle a, b \rangle},$$
with $n$-dimensional vectors $a$ and $b$ of unit norm, $\langle a, b \rangle \neq 1$; if $a = b$ we take $U_{a,b} = I$. This operator is unitary: it maps $a$ into $b$ and $b$ into $a$, and it maps any vector orthogonal to $a$ and $b$ to itself, see, e.g., Khmaladze, Sec. 2. Now choose $a = q$ and choose $b$ equal to $r = (1/\sqrt n, \dots, 1/\sqrt n)^T$, a vector not depending on the covariates at all. Since the vector $\hat\varepsilon$ of residuals is orthogonal to the vector $q$, we obtain:
$$\hat e = U_{q,r}\, \hat\varepsilon = \hat\varepsilon + \frac{\langle r, \hat\varepsilon \rangle}{1 - \langle q, r \rangle}\,(q - r).$$
These new residuals have covariance matrix
$$E \hat e \hat e^T = I - r r^T.$$
This would be the covariance matrix of the residuals in the problem of testing the hypothesis of a constant mean,
$$Y_i = \mu + \varepsilon_i, \qquad i = 1, \dots, n, \qquad (2)$$
which is completely free from covariates. Yet the transformation of $\hat\varepsilon$ to $\hat e$ is one-to-one, and therefore $\hat e$ contains the same “statistical information”, whichever way we measure it, as $\hat\varepsilon$. One could say that the problem of testing the linear regression (1) and that of testing (2) is the same problem.
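The one-dimensional unitary operator used here, of the form $U_{a,b} = I - (a-b)(a-b)^T/(1-\langle a,b\rangle)$, is simple to verify numerically. The following sketch (with hypothetical random unit vectors, not data from the paper) checks its three defining properties:

```python
import numpy as np

def unitary_map(a, b):
    """Reflection-type operator mapping unit vector a to b and b to a,
    and fixing every vector orthogonal to both (assumes <a, b> != 1)."""
    d = a - b
    return np.eye(len(a)) - np.outer(d, d) / (1.0 - a @ b)

rng = np.random.default_rng(0)
n = 6
a = rng.normal(size=n); a /= np.linalg.norm(a)
b = rng.normal(size=n); b /= np.linalg.norm(b)
U = unitary_map(a, b)

assert np.allclose(U @ a, b)             # maps a into b
assert np.allclose(U @ b, a)             # and b into a
assert np.allclose(U @ U.T, np.eye(n))   # unitary

# any vector orthogonal to both a and b is left unchanged
M = np.column_stack([a, b])
c = rng.normal(size=n)
c -= M @ np.linalg.lstsq(M, c, rcond=None)[0]
assert np.allclose(U @ c, c)
```

Since the operator differs from the identity only by a rank-one matrix, applying it to a vector costs $O(n)$ rather than $O(n^2)$ operations.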
The partial sum process based on the new residuals,
$$v_n(x) = \frac{1}{\sqrt n} \sum_{i=1}^n \hat e_i \, \mathbb{1}\{X_i \le x\},$$
will converge in distribution, with the time transformation $t = F(x)$, to the standard Brownian bridge. Therefore, the limit distributions of all classical statistics will be free from covariates and known.
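That the rotated residuals have a covariance matrix free of the covariates amounts to the matrix identity $U_{q,r}(I - qq^T)U_{q,r}^T = I - rr^T$, which follows from unitarity and $U_{q,r}q = r$. A quick numerical check, with a hypothetical random covariate vector:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
x = rng.normal(size=n)
q = x / np.linalg.norm(x)          # normalised covariate vector
r = np.ones(n) / np.sqrt(n)        # target vector, free of the covariates

d = q - r
U = np.eye(n) - np.outer(d, d) / (1.0 - q @ r)

C_old = np.eye(n) - np.outer(q, q)   # covariance of the initial residuals
C_new = U @ C_old @ U.T              # covariance of the rotated residuals

assert np.allclose(C_new, np.eye(n) - np.outer(r, r))
```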
Asymptotically distribution free tests, even if only for the case of linear regression, have been of central interest for a long time. To achieve this distribution free-ness, different forms of residuals have been suggested, various decompositions, especially when covariates are multidimensional, have been studied, and approximations for quadratic forms in the residuals have been developed. The assumption of normality, arbitrary as it is in many cases, has been made more or less casually. If one is allowed somewhat free speech, one could say that a mathematical lace has been created. A good source for this material is the book by Cook and Weisberg. In the dry residue, only the chi-square tests have been obtained. Distribution free forms of other classical statistics were never considered or constructed. We refer to McCullagh and Nelder for much of the existing theory of linear models. The most recent review on goodness of fit problems in regression which we know of is González-Manteiga and Crujeiras.
Note that the initial regression process of this paper, not yet asymptotically distribution free, is different from what was used in previous work, including relatively recent work. Although partial sum processes like $\hat v_n$ form one of the main objects of asymptotic theory, quite often a different form of such processes is considered, one simple example of which would be
$$\frac{1}{\sqrt n} \sum_{i=1}^n w(X_i) \left[ \mathbb{1}\{\hat\varepsilon_i \le u\} - E\, \mathbb{1}\{\hat\varepsilon_i \le u\} \right] \qquad (3)$$
(see a more sophisticated form of the weight function in the recent paper by Chown and Müller). Here scanning over the values of the residuals is used. This is a very natural way of scanning when the statistical problems considered pertain to the distribution of the errors. An example, studied in the well known papers of Dette and Munk, Dette and Hetzler, Dette et al., and loc. cit. Chown and Müller, is the problem of testing heterogeneity of the errors. The same scanning is basically unavoidable in the study of the distribution of i.i.d. errors, cf. Koul et al., and in the analysis of the distribution of innovations in autoregression models, see Müller et al.
In our current situation of testing the form of the regression function, it is natural to wish to see, in the case when there is a deviation from the model, for what region of values of the covariate the deviation takes place, and scanning in $x$-s will allow this. Even in the simple case when the covariate is just discrete time, taking values $1, 2, \dots, n$, it would be strange not to examine the sequence of residuals in this time, but instead to look at the order statistics based on them, which scanning as in (3) would imply. These considerations motivate the form of the regression processes $\hat v_n$ and $v_n$. To make the illustrative example of this section of more immediate practical use, and to explain better the asymptotic behaviour of the regression empirical process, in the next Section 2 we consider the general form of one-dimensional linear regression. In the following Section 3 we consider general parametric regression. In this case the time transformation, considered in (iii) of Proposition 2 below, again leads to distribution free-ness if $F$ is continuous. If $F$ is discrete, then the method suggested in Khmaladze, Sec. 2, can easily be used. In Section 4 we consider multidimensional covariates. The transformation of $\hat\varepsilon$ to $\hat e$ will not change, but to standardise the distribution of the regressors one could use normalisation by an estimator of the density of the covariates, cf., e.g., Einmahl and Khmaladze, Can et al. Here, however, we consider an approach borrowed from the theory of optimal transportation, or the Monge–Kantorovich transportation problem, see, e.g., Villani. Very interesting probabilistic/statistical applications of this theory have been given recently in del Barrio et al. and Segers.
2 General linear regression on $\mathbb{R}$
Consider the standard linear regression on the real line,
$$Y = \alpha \mathbf{1} + \beta X + \varepsilon. \qquad (4)$$
The $\mathbf{1}$ here denotes a vector with all coordinates equal to the number 1. Instead of (4) consider its slightly modified and more convenient form
$$Y = \alpha' \mathbf{1} + \beta\,(X - \bar X \mathbf{1}) + \varepsilon, \qquad \alpha' = \alpha + \beta \bar X. \qquad (5)$$
The least squares estimators of $\alpha'$ and $\beta$ are
$$\hat\alpha' = \bar Y, \qquad \hat\beta = \frac{\langle X - \bar X \mathbf{1}, Y \rangle}{\|X - \bar X \mathbf{1}\|^2}.$$
Using again the notation $p = \mathbf{1}/\sqrt n$ and the notation
$$q = \frac{X - \bar X \mathbf{1}}{\|X - \bar X \mathbf{1}\|}$$
for the normalised vector of centred covariates, one can write the residuals as
$$\hat\varepsilon_i = Y_i - \hat\alpha' - \hat\beta\,(X_i - \bar X),$$
or in more succinct form
$$\hat\varepsilon = Y - p\,\langle p, Y \rangle - q\,\langle q, Y \rangle.$$
Substitution of the linear regression model (5) for $Y$ produces a representation of the vector of residuals through the vector of errors $\varepsilon$:
$$\hat\varepsilon = \varepsilon - p\,\langle p, \varepsilon \rangle - q\,\langle q, \varepsilon \rangle. \qquad (6)$$
This represents $\hat\varepsilon$ as the projection of $\varepsilon$ orthogonal to $p$ and $q$.
From this it follows that the covariance matrix of $\hat\varepsilon$ is
$$E \hat\varepsilon \hat\varepsilon^T = I - p p^T - q q^T,$$
and thus it still depends on the values of the covariates. The regression process with these residuals,
$$\hat v_n(x) = \frac{1}{\sqrt n} \sum_{i=1}^n \hat\varepsilon_i \, \mathbb{1}\{X_i \le x\},$$
will therefore have a limit distribution which depends on the covariates.
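The double-projection structure of these residuals can be checked directly: the residual-forming matrix of the least squares fit with an intercept coincides with $I - pp^T - qq^T$, where $p = \mathbf{1}/\sqrt n$ and $q$ is the normalised vector of centred covariates. A sketch with a hypothetical covariate sample:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10
x = rng.normal(size=n)

p = np.ones(n) / np.sqrt(n)        # normalised vector of 1-s
xc = x - x.mean()
q = xc / np.linalg.norm(xc)        # normalised centred covariates

# hat matrix of the least squares fit for the design (1, x)
X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T

# residual-forming matrix = double projection orthogonal to p and q
assert np.allclose(np.eye(n) - H, np.eye(n) - np.outer(p, p) - np.outer(q, q))
```

The identity holds because $p$ and $q$ form an orthonormal basis of the column span of the design matrix, $q$ being orthogonal to $p$ by centring.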
It is possible to say more about the geometric structure of $\hat v_n$ and its limiting process: namely, the limiting process will be a double projection of Brownian motion, orthogonal to the function identically equal to 1 and to a function $g$. Here one can think of $g$ as a continuous time “trace” of the vector $q$.
To show this structure of $\hat v_n$, denote by $\mathbb{1}_x$ the vector with coordinates $\mathbb{1}\{X_i \le x\}$, $i = 1, \dots, n$. Then we can write
$$\hat v_n(x) = \frac{1}{\sqrt n} \langle \mathbb{1}_x, \hat\varepsilon \rangle = \frac{1}{\sqrt n} \langle \mathbb{1}_x, \varepsilon \rangle - \langle p, \varepsilon \rangle\, \frac{1}{\sqrt n} \langle \mathbb{1}_x, p \rangle - \langle q, \varepsilon \rangle\, \frac{1}{\sqrt n} \langle \mathbb{1}_x, q \rangle.$$
For the first term on the right hand side, considered as a process in $x$ and denoted
$$w_n(x) = \frac{1}{\sqrt n} \sum_{i=1}^n \varepsilon_i \, \mathbb{1}\{X_i \le x\}, \qquad (7)$$
we can see that it is a process of partial sums of i.i.d. random variables, with $E w_n(x) = 0$ and $E w_n^2(x) = F_n(x)$. Therefore $w_n$ converges in distribution to Brownian motion in time $F$, i.e. to $w(F(\cdot))$. Now consider the second term: here
$$\frac{1}{\sqrt n} \langle \mathbb{1}_x, p \rangle = F_n(x).$$
The third term produces the following expression:
$$\frac{1}{\sqrt n} \langle \mathbb{1}_x, q \rangle \to G(x) = \int_{-\infty}^x g \, dF.$$
This function $g$, obviously, has unit $L_2(F)$-norm and is orthogonal to the function identically equal to 1. Overall, we see that
$$\hat v_n(x) = w_n(x) - F_n(x) \int dw_n - G(x) \int g \, dw_n + o_P(1), \qquad (8)$$
and the right hand side of (8) is the orthogonal projection of $w_n$ which annihilates the functions $F_n$ and $G$. As a consequence of this, if $w_n$ converges in distribution to the Brownian motion $w$, then the limit of $\hat v_n$ is the corresponding projection of this Brownian motion.
What we propose now is, again, to replace the residuals $\hat\varepsilon$ by other residuals, $\hat e$, constructed as their unitary transformation. As a preliminary step, assume that the covariates are listed in increasing order, $X_1 \le X_2 \le \dots \le X_n$. One can assume this without loss of generality: even if it entails re-shuffling of our initial pairs of observations, the probability measure we work under will not change, because the re-shuffled errors will still be independent of the permuted covariates and will still form an i.i.d. sequence.
Now introduce another vector $r$, different from $q$, which also has unit norm and is orthogonal to $p$. Define
$$\hat e = U_{q,r}\, \hat\varepsilon.$$
Let us summarise the properties of $\hat e$ in the following proposition. In this, for the transition to the limit as $n \to \infty$, it is natural to assume that $r$ can be represented through some piece-wise continuous function $\rho$ on $[0, 1]$:
$$r_i = \frac{1}{\sqrt n}\, \rho\!\left(\frac{i}{n}\right), \qquad i = 1, \dots, n, \qquad (9)$$
in which case we have the convergence
$$\frac{1}{\sqrt n} \sum_{i \le nt} r_i \to \int_0^t \rho(s)\, ds. \qquad (10)$$
Orthogonality of the vector $r$ to the vector $p$ implies orthogonality of the function $\rho$ to functions equal to a constant, or $\int_0^1 \rho(s)\, ds = 0$. For example, $\rho$ can be chosen as $\rho(t) = \sqrt 2 \cos(\pi t)$.
(i) The covariance matrix of $\hat e$ is
$$E \hat e \hat e^T = I - p p^T - r r^T,$$
and therefore does not incorporate the covariates as soon as $r$ does not incorporate them.
(ii) If (9) is true, then the regression empirical process based on $\hat e$,
$$v_n(x) = \frac{1}{\sqrt n} \sum_{i=1}^n \hat e_i \, \mathbb{1}\{X_i \le x\},$$
has the covariance function
$$E\, v_n(x)\, v_n(y) = F_n(x \wedge y) - F_n(x) F_n(y) - R_n(x) R_n(y),$$
where $R_n(x) = \frac{1}{\sqrt n} \sum_{i: X_i \le x} r_i$. In the case of (10),
$$R_n(x) \to R(F(x)) = \int_0^{F(x)} \rho(s)\, ds.$$
(iii) As a corollary of (ii), the process $v_n$, with the change of time $t = F(x)$, converges in distribution to the projection of standard Brownian motion on $[0, 1]$ orthogonal to the functions $1$ and $\rho$.
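The requirements on $r$ — unit norm and orthogonality to the constant vector — are easy to meet in practice. A discretised cosine is one concrete choice (a hypothetical example, not necessarily the one intended in the text), evaluated at midpoints of the grid so that both discrete constraints hold exactly:

```python
import numpy as np

n = 100
i = np.arange(1, n + 1)
# r_i = rho((i - 1/2)/n) / sqrt(n) with rho(t) = sqrt(2) cos(pi t);
# midpoints make the discrete norm and orthogonality exact
r = np.sqrt(2.0 / n) * np.cos(np.pi * (i - 0.5) / n)

assert np.isclose(np.linalg.norm(r), 1.0)  # unit norm
assert np.isclose(r.sum(), 0.0)            # orthogonal to the vector of 1-s
```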
The main step in the proof of (i) is to express $\hat e$ through $\hat\varepsilon$:
$$\hat e = \hat\varepsilon - \frac{\langle q - r, \hat\varepsilon \rangle}{1 - \langle q, r \rangle}\,(q - r) = \hat\varepsilon + \frac{\langle r, \hat\varepsilon \rangle}{1 - \langle q, r \rangle}\,(q - r),$$
where the second equality is correct because $\langle q, \hat\varepsilon \rangle = 0$ by the definition of $\hat\varepsilon$. Calculation of the covariance matrix of the right hand side is now not difficult, using the shorthand formulas $E \hat\varepsilon \hat\varepsilon^T = I - p p^T - q q^T$ and $E \hat\varepsilon\, \langle r, \hat\varepsilon \rangle = (I - p p^T - q q^T)\, r$. After some algebra we obtain the expression given in (i).
To show (ii), use vector notation for $v_n(x)$:
$$v_n(x) = \frac{1}{\sqrt n} \langle \mathbb{1}_x, \hat e \rangle = \frac{1}{\sqrt n} \langle \mathbb{1}_x, \hat\varepsilon \rangle + \frac{\langle r, \hat\varepsilon \rangle}{1 - \langle q, r \rangle}\, \frac{1}{\sqrt n} \langle \mathbb{1}_x, q - r \rangle.$$
Opening the brackets in the last expression, one can find that
$$E\, v_n(x)\, v_n(y) = F_n(x \wedge y) - F_n(x) F_n(y) - R_n(x) R_n(y),$$
which proves (ii).
The statement (iii) follows if we note that the covariance function of $v_n$, in the time $t = F(x)$, converges to $t \wedge s - t s - R(t) R(s)$, and that orthogonality of the function $\rho$ to the function identically equal to 1 makes the last expression the covariance of the Gaussian process
$$v(t) = w(t) - t\, w(1) - R(t) \int_0^1 \rho(s)\, dw(s),$$
which indeed is the projection described in (iii).
In both regression models (1) and (5) the process turns out to be a projection of a Brownian motion, but for different values of the covariates these projections are different. However, it is geometrically clear that it should be possible to rotate one projection into another, and this other one into still another, thus creating a class of equivalent projections – those which can be mapped into each other. Then one can choose a single representative of each equivalence class, call it standard, and rotate any other projection into this standard one. What was done in this and the previous section is that we selected two standard projections and constructed the rotation of the other ones into these two.
The usefulness of this approach depends on how practically simple the rotation will be. To us, the transformation of $\hat\varepsilon$ into $\hat e$ looks very simple.
Finally, note that the model (5) includes two estimated parameters, while the model (1) includes only one. However, since the vector $p$ is already “standard”, independent of the covariates, there is no need to “rotate” it to any other vector. Therefore in both cases a one-dimensional rotation is sufficient. The situation when one needs to rotate several vectors at once, as well as the general form of parametric regression, will be considered in the next Section 3.
3 General parametric regression
Now consider testing the regression model
$$Y = m(X, \theta) + \varepsilon,$$
where $m(X, \theta)$ denotes the vector with coordinates $m(X_i, \theta)$, $i = 1, \dots, n$, and $m(x, \theta)$ is a regression function depending on a $d$-dimensional parameter $\theta$. We will assume some regularity of $m(x, \theta)$ with respect to $\theta$, namely that it is continuously differentiable in $\theta$. An obvious example when this condition is true is given by polynomial regression
$$m(x, \theta) = \sum_{k=1}^d \theta_k\, \psi_k(x),$$
where the $\psi_k$ may form a system of (orthogonal) polynomials, or splines (see, e.g., Harrell, Sec. 2.4.3), or trigonometric polynomials. There certainly are also many examples where $m(x, \theta)$ is not linear in $\theta$.
Denote by
$$\dot m(x, \theta) = \frac{\partial}{\partial \theta}\, m(x, \theta)$$
the $d$-dimensional vector-function of the partial derivatives. Then $\dot m(X, \theta)$ is an $n \times d$-matrix, with $n$ rows and $d$ columns. We assume that for every $\theta$ the coordinates of $\dot m(\cdot, \theta)$ are linearly independent as functions of $x$, which heuristically means that the model does not include unnecessary parameters.
Let now $\hat\theta$ denote the least squares estimator of $\theta$, which is an appropriate solution of the least squares equation
$$\sum_{i=1}^n \dot m(X_i, \theta)\, \big( Y_i - m(X_i, \theta) \big) = 0.$$
Without digressing to an exact justification (which can be found, e.g., in Bates and Watts), assume that the Taylor expansion in $\theta$ is valid and that, together with the normalisation by $\sqrt n$, it leads to
$$\Gamma_n\, \sqrt n\, (\hat\theta - \theta) = \frac{1}{\sqrt n} \sum_{i=1}^n \dot m(X_i, \theta)\, \varepsilon_i + o_P(1),$$
with the non-degenerate $d \times d$-matrix
$$\Gamma_n = \frac{1}{n} \sum_{i=1}^n \dot m(X_i, \theta)\, \dot m(X_i, \theta)^T,$$
and a $d$-dimensional remainder vector which converges to zero in probability. Below, for the terms asymptotically negligible in this sense, we will use the notation $o_P(1)$. From the previous display we obtain the asymptotic representation for $\sqrt n\,(\hat\theta - \theta)$:
$$\sqrt n\,(\hat\theta - \theta) = \Gamma_n^{-1}\, \frac{1}{\sqrt n} \sum_{i=1}^n \dot m(X_i, \theta)\, \varepsilon_i + o_P(1).$$
As the final step, expand the differences $Y_i - m(X_i, \hat\theta)$ in $\hat\theta - \theta$ up to the linear term and substitute the expression for $\sqrt n\,(\hat\theta - \theta)$, to get
$$\hat\varepsilon_i = Y_i - m(X_i, \hat\theta) = \varepsilon_i - \dot m(X_i, \theta)^T\, \Gamma_n^{-1}\, \frac{1}{n} \sum_{j=1}^n \dot m(X_j, \theta)\, \varepsilon_j + \frac{1}{\sqrt n}\, o_P(1).$$
In vector form this becomes
$$\hat\varepsilon = \varepsilon - \frac{1}{n}\, \dot m(X, \theta)\, \Gamma_n^{-1}\, \dot m(X, \theta)^T \varepsilon + o_P(1), \qquad (12)$$
an expression directly analogous to (6). It also describes the vector of residuals as being, asymptotically, the projection of the vector of errors $\varepsilon$ parallel to the $n$-dimensional vectors of derivatives
$$\frac{\partial}{\partial \theta_k}\, m(X, \theta) = \left( \frac{\partial}{\partial \theta_k}\, m(X_i, \theta) \right)_{i=1}^n, \qquad k = 1, \dots, d,$$
and then to the vectors obtained from them by orthonormalisation, which we denote $a_1, \dots, a_d$. Each $a_k$ can be viewed either as a vector in $\mathbb{R}^n$, with coordinates $a_{k,i}$, or as a function of $x$, with $a_{k,i} = a_k(X_i)/\sqrt n$. The two notations are convenient each in its place: the vector in $\mathbb{R}^n$ will be useful in expressions like (14), while the function of $x$ will be useful in integral expressions like (3). Their respective norms are equal:
$$\sum_{i=1}^n a_{k,i}^2 = \int a_k^2(x)\, dF_n(x).$$
Which of these two objects we use will be visible in the notation and clear from the context.
Now we can write (12) as
$$\hat\varepsilon \approx \varepsilon - \sum_{k=1}^d \langle a_k, \varepsilon \rangle\, a_k, \qquad (14)$$
where the leading term on the right hand side is the projection of $\varepsilon$ orthogonal to the vectors $a_1, \dots, a_d$. As a consequence, one can show that the following analogue of the representation (8) is true:
$$\hat v_n(x) \approx w_n(x) - \sum_{k=1}^d \int_{-\infty}^x a_k\, dF_n \int a_k\, dw_n.$$
This, again, describes $\hat v_n$ as, asymptotically, the projection of $w_n$ orthogonal to the functions $a_1, \dots, a_d$. We are now ready to describe the rotation of this projection into another, standard, projection, and of $\hat\varepsilon$ into a vector of other residuals.
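The asymptotic description of the residual vector as a projection of the error vector orthogonal to the span of the derivative vectors can be illustrated numerically. The exponential regression function below is a hypothetical example, and the least squares fit uses plain Gauss–Newton iterations:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
x = rng.uniform(0.0, 1.0, size=n)

def m(x, th):                        # hypothetical regression function
    return th[0] * np.exp(th[1] * x)

def jac(x, th):                      # its n x 2 matrix of partial derivatives
    return np.column_stack([np.exp(th[1] * x),
                            th[0] * x * np.exp(th[1] * x)])

th0 = np.array([1.0, 0.5])
eps = 0.02 * rng.normal(size=n)
y = m(x, th0) + eps

# Gauss-Newton least squares fit, started at the true parameter
th = th0.copy()
for _ in range(20):
    th = th + np.linalg.lstsq(jac(x, th), y - m(x, th), rcond=None)[0]

resid = y - m(x, th)

# projection of the errors orthogonal to the derivative vectors at th0
Q, _ = np.linalg.qr(jac(x, th0))
proj = eps - Q @ (Q.T @ eps)

# the two agree up to an asymptotically negligible term
assert np.max(np.abs(resid - proj)) < 0.01
```

With the small error variance chosen here, the discrepancy between the exact residuals and the linearised projection is of second order in the estimation error, well inside the stated tolerance.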
With some freedom of speech, we say that one can choose the new residuals in any way we wish; for example, choose them independent of any covariates. In particular, let $\phi_1$ be the function on $[0, 1]$ identically equal to 1, and with this let the vectors $b_1, \dots, b_d$ be defined as $b_{k,i} = \phi_k(i/n)/\sqrt n$, where the system of functions $\phi_1, \dots, \phi_d$ is such that
$$\frac{1}{n} \sum_{i=1}^n \phi_j\!\left(\frac{i}{n}\right) \phi_k\!\left(\frac{i}{n}\right) = \delta_{jk}.$$
If we derive a unitary operator $U$ which maps the orthonormal vectors $a_k$ into the vectors $b_k$, then this operator will map $\hat\varepsilon$ into $\hat e = U \hat\varepsilon$, and the covariance matrix of these new residuals will be defined solely by the $b_k$, or the $\phi_k$.
As a side and rather inconsequential remark, we note that it would be immediate to choose orthonormal polynomials on $[0, 1]$, i.e. such that
$$\int_0^1 \phi_j(t)\, \phi_k(t)\, dt = \delta_{jk},$$
which are continuous and bounded functions. Such polynomials will not satisfy the orthogonality condition in the previous display exactly, but will require small corrections, asymptotically negligible for large $n$. Inserting these corrections in our notation would make the text more complicated without revealing any new feature of the transformation we want to discuss. Therefore, in our notation we will identify orthogonal polynomials in continuous time with those orthonormal on the grid $i/n$, $i = 1, \dots, n$.
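For instance, the first three shifted Legendre polynomials are orthonormal in $L_2[0,1]$, and on the grid $i/n$ their Gram matrix deviates from the identity only by small terms — exactly the negligible corrections just mentioned. A hypothetical illustration:

```python
import numpy as np

# first three shifted Legendre polynomials, orthonormal in L2[0,1]
phi = [
    lambda t: np.ones_like(t),
    lambda t: np.sqrt(3.0) * (2.0 * t - 1.0),
    lambda t: np.sqrt(5.0) * (6.0 * t**2 - 6.0 * t + 1.0),
]

n = 10_000
t = np.arange(1, n + 1) / n             # the grid i/n
G = np.array([[np.mean(phi[j](t) * phi[k](t)) for k in range(3)]
              for j in range(3)])

# the discrete Gram matrix is the identity up to O(1/n) corrections
assert np.allclose(G, np.eye(3), atol=1e-3)
```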
It is essential that the structure of $U$ allows convenient handling. We present it here as a product of one-dimensional unitary operators. This allows the coding of $U$ in a loop, and this was tried for the case of contingency tables with an about 30-dimensional parameter in Nguyen.
Suppose in the one-dimensional unitary operator we choose $a = a_1$ and $b = b_1$, and apply the resulting operator $U_1 = U_{a_1, b_1}$ to the vector $a_2$:
$$a_2^{(1)} = U_1 a_2.$$
Then the product
$$U_2 U_1, \qquad \text{with } U_2 = U_{a_2^{(1)}, b_2},$$
is a unitary operator which maps the vectors $a_1, a_2$ to the vectors $b_1, b_2$ and vice versa, and leaves vectors orthogonal to these four vectors unchanged. For a general $d$, define $U_k$ as
$$U_k = U_{a_k^{(k-1)}, b_k}, \qquad a_k^{(k-1)} = U_{k-1} \cdots U_1 a_k.$$
Then $U = U_d \cdots U_1$ is the unitary operator which maps each $a_k$ to $b_k$ and vice versa, and leaves vectors orthogonal to the $a_k$-s and $b_k$-s unchanged.
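The loop structure just described translates into a few lines of code. The sketch below (hypothetical helper names, not the authors' code) builds each elementary factor from the reflection formula for $U_{a,b}$ and multiplies them in order; it checks the forward mapping and the invariance of the orthogonal complement:

```python
import numpy as np

def elementary(a, b):
    """One-dimensional unitary operator U_{a,b} mapping a to b and b to a."""
    if np.allclose(a, b):
        return np.eye(len(a))
    d = a - b
    return np.eye(len(a)) - np.outer(d, d) / (1.0 - a @ b)

def product_rotation(A, B):
    """Product U = U_d ... U_1 mapping the orthonormal columns a_k of A
    to the orthonormal columns b_k of B, built in a loop."""
    U = np.eye(A.shape[0])
    for k in range(A.shape[1]):
        U = elementary(U @ A[:, k], B[:, k]) @ U
    return U

rng = np.random.default_rng(3)
n, d = 12, 4
A, _ = np.linalg.qr(rng.normal(size=(n, d)))   # orthonormal a_1, ..., a_d
B, _ = np.linalg.qr(rng.normal(size=(n, d)))   # orthonormal b_1, ..., b_d

U = product_rotation(A, B)
assert np.allclose(U @ A, B)             # maps each a_k to b_k
assert np.allclose(U.T @ U, np.eye(n))   # and is unitary

# a vector orthogonal to all a_k and b_k stays unchanged
M = np.column_stack([A, B])
v = rng.normal(size=n)
v -= M @ np.linalg.lstsq(M, v, rcond=None)[0]
assert np.allclose(U @ v, v)
```

The key point of the construction is that each new factor fixes the already matched vectors $b_1, \dots, b_{k-1}$, because they are orthogonal to both arguments of the $k$-th elementary operator.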
The proof of this lemma was given, e.g., in Khmaladze, Section 3.4. It may be of independent interest for statistics of directional data, where explicit expressions for rotations are needed. Therefore, for the reader's convenience, at the end of this section we give an essentially shorter proof.
Thus, in the proposition below we denote
$$\hat e = U \hat\varepsilon, \qquad U = U_d \cdots U_1, \qquad (16)$$
and recall that the $X_i$-s are numbered in increasing order. We also write, for two random $n$-vectors $\xi_n$ and $\eta_n$,
$$\xi_n \approx \eta_n$$
in the sense that, for any sequence of $n$-vectors $c_n$ such that $\|c_n\|$ remains bounded, $\langle c_n, \xi_n - \eta_n \rangle \to 0$ in probability. This notion of equivalence is used in the proposition below.
Suppose the regression function $m$ is regular, in the sense that, for every $\theta$, the matrix $\Gamma_n$ is of full rank and converges to a matrix of full rank, and (14) is true. Suppose the functions $\phi_1, \dots, \phi_d$ are continuous and bounded on $[0, 1]$. Then
(i) for the covariance matrix of the residuals $\hat e$ the following is true:
$$E \hat e \hat e^T = I - \sum_{k=1}^d b_k b_k^T + o(1);$$
(ii) for the empirical regression process based on the residuals $\hat e$ of (16),
$$v_n(x) = \frac{1}{\sqrt n} \sum_{i=1}^n \hat e_i \, \mathbb{1}\{X_i \le x\},$$
the following convergence of the covariance function is true:
$$E\, v_n(x)\, v_n(y) \to F(x \wedge y) - \sum_{k=1}^d \int_0^{F(x)} \phi_k(s)\, ds \int_0^{F(y)} \phi_k(s)\, ds;$$
(iii) the process $v_n$, with the time change $t = F(x)$, converges in distribution to the projection of standard Brownian motion on $[0, 1]$ orthogonal to the functions $\phi_1, \dots, \phi_d$.
To prove (i) we do not need the explicit form of the operator $U$; instead, note that according to (14), up to an asymptotically negligible term, $\hat\varepsilon$ is the projection of $\varepsilon$ orthogonal to the collection of $n$-vectors $a_1, \dots, a_d$. According to the lemma above, these vectors are mapped by the operator $U$ to the $n$-vectors $b_1, \dots, b_d$, and the operator is unitary. Therefore the vector $\hat\varepsilon$ will be mapped into the vector $\hat e$ which, up to an asymptotically negligible term, will be the projection of $U\varepsilon$ orthogonal to $b_1, \dots, b_d$:
$$\hat e \approx U\varepsilon - \sum_{k=1}^d \langle b_k, U\varepsilon \rangle\, b_k. \qquad (17)$$
And the covariance matrix of this vector is the expression given in (i).
To prove (ii), replace $\hat e$ by its main term in (17) in the expected value $E\, v_n(x)\, v_n(y)$. Here, since every $\phi_k$ is continuous and bounded,
$$\frac{1}{\sqrt n} \langle \mathbb{1}_x, b_k \rangle = \frac{1}{n} \sum_{i \le n F_n(x)} \phi_k\!\left(\frac{i}{n}\right) \to \int_0^{F(x)} \phi_k(s)\, ds.$$
Statement (iii) of convergence in distribution follows not from the unitarity property of $U$ as such, but from the simplicity of its structure, reflected in (17). We have
$$v_n(x) = \frac{1}{\sqrt n} \langle \mathbb{1}_x, U\varepsilon \rangle - \sum_{k=1}^d \langle b_k, U\varepsilon \rangle\, \frac{1}{\sqrt n} \langle \mathbb{1}_x, b_k \rangle + o_P(1).$$
The first inner product on the right hand side, analogous to the process denoted $w_n$ in (7), converges in distribution to an $F$-Brownian motion. The expression $\frac{1}{\sqrt n} \langle \mathbb{1}_x, b_k \rangle$ we considered above, while
Thus, overall representation of