Regression with genuinely functional errors-in-covariates

12/12/2017
by   Anirvan Chakraborty, et al.
EPFL

Contamination of covariates by measurement error is a classical problem in multivariate regression, where it is well known that failing to account for this contamination can result in substantial bias in the parameter estimators. The nature and degree of this effect on statistical inference is also understood to crucially depend on the specific distributional properties of the measurement error in question. When dealing with functional covariates, measurement error has thus far been modelled as additive white noise over the observation grid. Such a setting implicitly assumes that the error arises purely at the discrete sampling stage, otherwise the model can only be viewed in a weak (stochastic differential equation) sense, white noise not being a second-order process. Departing from this simple distributional setting can have serious consequences for inference, similar to the multivariate case, and current methodology will break down. In this paper, we consider the case when the additive measurement error is allowed to be a valid stochastic process. We propose a novel estimator of the slope parameter in a functional linear model, for scalar as well as functional responses, in the presence of this general measurement error specification. The proposed estimator is inspired by the multivariate regression calibration approach, but hinges on recent advances in matrix completion methods for functional data in order to handle the nontrivial (and unknown) error covariance structure. The asymptotic properties of the proposed estimators are derived. We probe the performance of the proposed estimator of the slope using simulations and observe that it substantially improves upon the spectral truncation estimator based on the erroneous observations, i.e., ignoring measurement error. We also investigate the behaviour of the estimators on a real dataset on hip and knee angle curves during a gait cycle.



1 Introduction

Background

Measurement error at the level of a covariate can adversely affect the quality of a regression estimator, if not properly accounted for, and this problem is by now well studied and understood in the context of multivariate designs (see [10] and [4] for detailed expositions). The difficulties associated with measurement error are typically subtle and can introduce identifiability issues, which may partly explain why [4] refer to the inferential/statistical problems arising from measurement errors in covariates as a ‘double/triple whammy’ of measurement error. Specifically, for multivariate linear regression, it is well known that classical additive measurement error typically biases the estimator of the slope towards zero – a phenomenon called attenuation to the null. The subtlety arises from the well-documented fact that the specific distribution of the measurement error determines the effects it will have, and thus one has to properly account for this error distribution and devise appropriate methods (see p. 41 and also Ch. 3 of [4]).

Functional regression considers stochastic processes as covariates, and is a contemporary topic with considerable activity (see [20], [19], [23], [30], [14], [3], [25], [6], [31], [11], [7], [18] to mention only a few). We refer to [21] and [24] for informative and extensive reviews of such functional linear models. Some of these papers assume that the functional covariates are fully observed in the continuum and without any error, while others do indeed consider functional covariates measured on a grid and contaminated by measurement error: the functional covariate is observed on a discrete grid, and the measurements obtained are the true values corrupted by additive errors. The measurement errors are then usually assumed to be i.i.d. zero-mean random variables with common variance, i.e., the error structure is a classical, additive and homoscedastic one.

This assumption allows one to circumvent measurement error via smoothing techniques, and two of the most prominent such approaches involve: (a) obtaining an estimate of the uncorrupted covariate by spline smoothing of its corrupted version [23], and (b) smoothing the covariance of the corrupted observations and then using this to compute finite-rank approximations of the true covariate via best linear prediction (see, e.g., [29], [15]). Approach (a) is used when the functional data are observed on a dense grid, while approach (b) is preferred when the observations are sparse, e.g., in the case of longitudinal covariates. These estimates may then be used to carry out further statistical analysis. [19] considered natural cubic spline smoothing of the observed covariates and imposed distributional assumptions to derive a maximum likelihood estimator of the discretised version of the parameters in generalized linear models; this can be viewed as an adaptation of the likelihood-based approach for measurement error models (see, e.g., Ch. 8 in [4]). [30] used the smoothed covariance estimates developed in [29] to study estimators of the slope parameter based on functional principal components for linear regression with longitudinal covariates. In their paper on a penalization-criterion-based approach for estimating the slope parameter, [18] advocate the use of the general approaches (a) or (b), depending on the denseness/sparsity of the observation grid, when the covariates are measured with error. A different method for dealing with measurement error was studied by [3], who extended the total least squares approach (see, e.g., [13], [27]) for the multivariate errors-in-variables model to the functional setting. [6] studied an estimator of the slope function based on smoothing splines after correcting for measurement errors using an approach similar to the total least squares method in [3]. A two-step approach was considered by [11], where the authors first estimate the covariates, express these estimates and the slope parameter in a common spline basis, and then use a mixed effects model to estimate the slope parameter; they propose to estimate the covariates by first finding a smooth estimator of their covariance using a method from [9] (similar to the approach of [29]), followed by spectral projection onto the first few components. [12] considered a variational Bayes approach for functional linear regression, which can be seen as an extension of the Bayesian method for measurement error models with multivariate data (see, e.g., Ch. 9 in [4]). What is common to all these approaches is that they assume an i.i.d. structure for the measurement error, and this ingredient is crucial to their success.
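To make approach (a) concrete, here is a minimal sketch (in Python, with hypothetical simulated data) of pre-smoothing a corrupted curve by a smoothing spline; the smoothing level `s` is set from the assumed noise scale, but would in practice be chosen by cross-validation.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 100)                    # dense observation grid
x_true = np.sin(2 * np.pi * grid)                    # hypothetical true covariate curve
w = x_true + rng.normal(scale=0.2, size=grid.size)   # classical i.i.d. measurement error

# Approach (a): smooth the corrupted curve; s ~ (number of points) * (noise variance)
x_hat = UnivariateSpline(grid, w, s=grid.size * 0.2**2)(grid)
print("MSE of smoothed curve:", np.mean((x_hat - x_true) ** 2))
```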

An exception is the recent Ph.D. thesis [2], in which the author studies an extension of the well-known SIMEX (simulation extrapolation; see, e.g., [5] and Ch. 5 in [4]) technique for multivariate regression with measurement errors to the functional setting with scalar response. The author does also consider the case when the measurement errors may be heteroscedastic and/or autocorrelated, which appears to be the first such attempt in the literature. However, [2] imposes a parametric specification of the error covariance, and, furthermore, the method and the asymptotic studies are confined to a fixed discretised grid setting. This effectively reduces the problem to the classical multivariate context with a parametrically specified measurement error covariance. A genuinely functional setup would need to consider a nonparametrically specified measurement error covariance structure and allow for theory and good practical performance as the observation grid becomes dense.

Our Setup and Contributions

Our purpose is to introduce the first genuinely functional method for handling regression with covariate contamination by measurement error that is not white noise. By ‘genuinely functional’ we mean that the measurement error covariance should be nonparametric and that the observation grid should not be restricted to a finite size. Particularly in the functional data setting, it is natural to expect additional (dependence) structure in the measurement error. The error variances may not only be heteroscedastic, but there may well be propagation of error, so that errors associated with adjacent observation points need not be uncorrelated. One can then propose the more general contamination model in which the observed covariate equals the true covariate plus a measurement error stochastic process that is a valid random element of the underlying function space, possessing a trace-class covariance and uncorrelated with the true covariate. We call this a functional measurement error, to be contrasted with the white noise setup, where the error would be a generalised function interpretable only in a weak (SDE) sense.

Such a general specification of error inevitably leads to identifiability issues which cannot be alleviated without additional restrictions (see, e.g., [22]), not unlike the double/triple whammy mentioned earlier. This is perhaps a key reason why an i.i.d. error structure is typically assumed in FDA. Nevertheless, recent work by [8] established a key nonparametric identifiability result that we will rely on: the covariance operators of the true covariate and of the error are identifiable provided the covariance kernel of the true covariate is real analytic and that of the error is banded, i.e., vanishes whenever its two arguments are more than some fixed distance (called the bandwidth) apart. The required analyticity of the covariance kernel of the true covariate is ensured, in particular, if the covariate is a process with a finite Karhunen-Loève expansion, i.e., it has finite rank, and its eigenfunctions are real analytic. Assuming a banded error covariance and analytic eigenfunctions of the covariate essentially translates to asking that the variability of the measurement error process is at a scale purely finer than the scale of variability of the true covariate.
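Since the inline formulas were lost here, the following display restates the contamination model and the identifiability conditions in generic notation of our own choosing (a hedged reconstruction, not the paper's verbatim symbols):

```latex
% Hedged reconstruction in generic notation:
% observed covariate = true covariate + functional measurement error,
W(t) = X(t) + U(t), \qquad t \in [0,1], \qquad \mathrm{Cov}(X, U) = 0 .
% Identifiability of the covariance kernels c_X and c_U holds when c_X is
% real analytic and c_U is banded with some bandwidth \delta > 0:
c_U(s,t) = 0 \quad \text{for all } |s - t| > \delta .
```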

Given this identifiability result, we are naturally led to consider the following setup if we are to entertain a genuinely functional non-white measurement error framework. The true regression model relates the response, which is either a random scalar or a random function, to the true covariate through a bounded linear operator from the underlying function space to the space of the response variable, plus an error term. The true covariate is of some finite rank, which beyond being finite can otherwise be unrestrictedly large. Instead of the true covariate, we only get to observe a version corrupted by a functional measurement error process with a banded but otherwise nonparametric covariance structure, as described in the previous paragraph.

In the context of this model, we propose a novel estimator of the slope parameter in the functional linear model, when the erroneous covariate is observed on a discrete grid of points. The estimator is motivated by the regression calibration approach (see, e.g., Ch. 4 in [4]) for measurement error models in the multivariate setting. As part of this approach, one needs to construct a consistent estimator of the generalised inverse of the covariance operator of the true covariate, on the basis of observing only its corrupted version. This is a severely ill-posed problem in itself (both due to the measurement error and due to having to estimate an inverse) and constitutes the core challenge that one has to overcome. The crucial insight that allows us to overcome this challenge is to use low-rank matrix completion to separate the covariance of the true covariate from that of the error (à la [8]), combined with a novel grid-subsampling approach to estimate the true rank and allow for consistent inversion.

Our estimator is defined and studied in Section 2 for both the scalar and the functional response case. In either case, the proposed estimators are shown to be consistent, and their rates of convergence are derived. Perhaps surprisingly, it is shown that parametric rates are entirely feasible (as the grid becomes dense), despite the convoluted nature of the problem. We discuss the practical implementation of the proposed estimators in Section 2.3. In Section 3, we extend our methodology and asymptotic results to the case of quadratic functional regression with measurement error in the covariate. Simulation studies are conducted in Section 4, where the proposed estimator is compared with the spectral truncation estimator (see [14]) constructed using the erroneous covariate. In Subsection 4.1, we conduct simulations in the setting where the measurement errors are indeed i.i.d. over the observation grid, and compare the proposed estimator with some of the other procedures designed for this setup. In Section 5, we discuss the situation when the true covariate is not truly of finite rank but is essentially so, i.e., its eigenvalues decay fast but are not exactly eventually null. Here, we slightly modify the estimator of the covariance of the true covariate, but show that performance does not suffer. Finally, we illustrate the proposed estimator using a real data set on hip and knee angle measurements in Section 6.

2 Methodology, Theoretical Properties and Implementation

As discussed in the Introduction, the true regression model links the response to the true covariate, but we only get to observe the contaminated covariate instead. Throughout this paper, we will assume for simplicity that the processes involved are centred. We will also assume that the regression error has zero mean conditionally on both the true and the observed covariate. Note that this assumption is automatically guaranteed if the measurement error is non-differential (see, e.g., Sec. 2.5 in [4]), i.e., the conditional distribution of the response given both the true and the observed covariate is the same as that given the true covariate alone. This is in turn guaranteed if we assume that the measurement error in the response is independent of the measurement error in the covariate, conditional on the true covariate. Under non-differential measurement error, the conditional expectation of the response given the observed covariate equals the regression operator applied to the conditional expectation of the true covariate given the observed one, since the term inside the braces in the corresponding iterated expectation vanishes. Consequently, the regression of the response on the calibrated covariate coincides with the original regression, i.e., one can fit the linear model with the conditional expectation of the true covariate given its observed version in place of the true covariate. These arguments provide the justification for using the regression calibration approach (see, e.g., Ch. 4 in [4]).
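In generic notation (again a reconstruction of the omitted display, not the paper's verbatim symbols), the iterated-expectation argument reads:

```latex
% Under non-differential error, E[eps | X, W] = 0, so the braces term vanishes:
\mathbb{E}[Y \mid W]
   = \mathbb{E}\{\mathbb{E}[Y \mid X, W] \mid W\}
   = \mathbb{E}\{\beta(X) + \mathbb{E}[\varepsilon \mid X, W] \mid W\}
   = \beta(\mathbb{E}[X \mid W]) .
```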

2.1 Scalar-on-function model

We first consider the case when the response is scalar. In this situation, the Riesz representation theorem implies that the regression operator acts as an inner product with a unique slope function (see, e.g., [17]). In the multivariate setting, with random vectors possessing full-rank covariances, the regression calibration least squares solution solves the normal equation equating the covariance of the true covariate applied to the slope with the cross-covariance of the observed covariate and the response. Straightforward algebra shows that the least squares solution is then obtained by applying the (generalised) inverse of this covariance to the cross-covariance.
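In generic multivariate notation (a hedged reconstruction of the lost displays): with Cov(W) = Σ_X + Σ_U and Cov(W, Y) = Σ_X β, the calibration normal equation and its solution read

```latex
\Sigma_X \beta = \sigma_{WY}
\quad\Longrightarrow\quad
\beta = \Sigma_X^{\dagger} \, \sigma_{WY},
```

where the Moore-Penrose inverse reduces to the ordinary inverse when Σ_X has full rank.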

It is thus observed that estimation of the slope necessitates the construction of a suitable estimator of the (generalised) inverse of the covariance of the true covariate. In the functional measurement error model proposed in the Introduction of this paper, the true covariate is assumed to have finite rank, i.e., its covariance operator has finite rank. Even if we had a consistent estimator of this covariance, that would not immediately guarantee that its Moore-Penrose generalised inverse (where, for any compact self-adjoint operator, the Moore-Penrose generalized inverse is defined as in, e.g., pp. 106-107 in [17]) would be a consistent estimator of the true generalised inverse. Indeed, the generalised inverse map is continuous along a sequence of finite-rank self-adjoint operators converging to a finite-rank limit only if the ranks of the operators in the sequence eventually equal the rank of the limit. Consistent estimation of the covariance itself therefore does not suffice: we must additionally be able to consistently estimate its rank, so that the rank of the estimator eventually coincides with the true rank. Note that despite being finite, this rank is highly non-trivial to determine in the presence of measurement error, since the potentially infinite rank of the measurement error process is confounded with the finite rank of the true covariate (in the absence of measurement error, estimating the rank would be a trivial problem once the number of observations and the grid size exceeded the true rank).
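The discontinuity of the generalised inverse under rank over-estimation is easy to see numerically. A self-contained sketch with hypothetical numbers: a perturbation of size 1e-8 that inflates the rank from 2 to 3 makes the pseudo-inverse blow up by eight orders of magnitude, while truncating the spectrum at the correct rank keeps it stable.

```python
import numpy as np

A = np.diag([1.0, 0.5, 0.0])              # "true" covariance: rank 2
A_noisy = A.copy()
A_noisy[2, 2] = 1e-8                       # tiny estimation error inflates rank to 3

print(np.linalg.pinv(A)[2, 2])             # 0.0   : correct generalised inverse
print(np.linalg.pinv(A_noisy)[2, 2])       # 1e+08 : pseudo-inverse blows up

# Truncating at the (consistently estimated) rank r = 2 restores stability:
vals, vecs = np.linalg.eigh(A_noisy)
top = np.argsort(vals)[::-1][:2]           # keep the top two eigenpairs
pinv_trunc = (vecs[:, top] / vals[top]) @ vecs[:, top].T
print(np.allclose(pinv_trunc, np.linalg.pinv(A)))   # True
```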

With these requirements in mind, we now develop our methodology. Suppose that we observe each covariate curve over a grid of points, with one grid node lying in each cell of a partition of the domain into intervals of equal length; we assume the grid nodes to be random. Define the discretely observed covariate vectors accordingly, along with the analogous (unobservable) vectors of true covariate and error values. In this setting, [8] proposed a consistent estimator of the covariance of the true covariate, defined as follows. Writing the covariance matrix of the discretised true covariate as the estimand, its estimator is obtained by minimizing

(2.1)

Here, the optimisation variable ranges over the set of positive definite matrices; the norm is the Frobenius norm of a matrix; the data enter through the empirical covariance matrix of the discretised observations; the mask matrix deletes the entries lying within the band; ‘∘’ denotes the element-wise (Hadamard) matrix product; and a tuning parameter controls the fit. The estimator of the covariance operator is then defined as the integral operator associated with the kernel obtained from the minimising matrix. In plain words, one estimates the covariance of the true covariate by a low-rank completion of the band-deleted covariance of the observations (the band deletion in principle decontaminating the covariance from the measurement error process). Under suitable conditions, this estimator can be shown to be consistent. However, as is observed from the asymptotic study in Theorem 3 in [8], one cannot guarantee that both the estimator itself and its rank will be consistent for their population counterparts (in fact, for dense grids, the estimated rank may be inconsistent), which implies that we cannot naively use the same estimator in our context, as its generalised inverse may dramatically fail to be consistent for the true generalised inverse (as per the earlier discussion).
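A hedged sketch of the band-deleted low-rank fit in (2.1), parametrising the candidate matrix as Γ = LLᵀ and using plain gradient descent (numpy only; the actual estimator in [8] involves a more careful optimisation and tuning, and the step size here is heuristic):

```python
import numpy as np

def banded_lowrank_fit(S, delta, r, grid, n_iter=4000, lr=0.1):
    """Find a rank-r PSD matrix matching the empirical covariance S outside
    a band of half-width delta around the diagonal (inside the band, S may
    be contaminated by the banded measurement error covariance)."""
    K = len(grid)
    mask = (np.abs(grid[:, None] - grid[None, :]) > delta).astype(float)
    L = 0.01 * np.random.default_rng(0).standard_normal((K, r))
    for _ in range(n_iter):
        R = mask * (L @ L.T - S)      # residuals, band entries excluded
        L -= (lr / K) * (R @ L)       # gradient direction: R is symmetric, grad ~ R @ L
    return L @ L.T                    # PSD and rank <= r by construction
```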

Since it is imperative that the rank be estimated as accurately as the operator itself in our case, we introduce a modified two-step estimation procedure, estimating the rank separately from the operator itself: (1) in the first step, we minimise (2.1) on a subset of the grid evaluation points, with the purpose of estimating the rank rather than the covariance itself; (2) in the second step, we minimise the same objective on the entire grid, over matrices whose rank is fixed at the estimate obtained in step (1). To explain the logic behind this strategy, we make the following technical observations:

  1. It follows from Theorem 2 in [8] that the covariance matrices of the discretised true covariate and of the discretised error themselves (not just the continuum kernels they arise from) are identifiable, provided that the grid size exceeds a critical value and that each grid evaluation point is sampled from a continuous distribution supported on its corresponding partition element. Call such a grid an adequate grid.

  2. Therefore, the estimator of the rank obtained by calculating the rank of the solution of (2.1) on any adequate grid of fixed size is consistent as the sample size grows to infinity. Note that this would not yield a consistent estimator of the covariance operator itself, though, since that would require the grid size to grow to infinity as well. See Theorem 3 in [8].

  3. So, if we can subsample our original grid to obtain an adequate subgrid and keep the size of this subgrid fixed as the sample size grows, we will be able to estimate the rank consistently. See Section 2.3 for more details on how to choose this adequate grid.

  4. The reasoning behind the two-step method now becomes clear: we use the subgrid of fixed size in order to get a consistent estimator of the rank, and we use the complete grid (whose size is in principle growing with the sample size) to obtain a consistent estimator of the covariance operator itself.

In summary, the proposed estimation procedure is as follows.

  • Extract an adequate subgrid of fixed resolution, as described above, to get an estimator of the rank. Specifically, the rank estimate is defined as the rank of the minimizer of (2.1) based on the subsampled grid.

  • Next, use the full grid to find an intermediate estimator of the covariance matrix, with rank equal to the estimate obtained in Step 1, as follows:

    (2.2)

    where the minimisation in (2.2) is over positive definite matrices of the rank estimated in Step 1, with the mask entries defined as before. Construct the estimator of the covariance kernel from the minimising matrix, and consider the eigenvalues and eigenfunctions of the integral operator associated with this kernel.

  • Define the estimator of the covariance operator through the eigenvalues and eigenfunctions just obtained, and use its Moore-Penrose inverse as the estimator of the generalised inverse of the true covariance.
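Combining the steps, here is a hedged end-to-end sketch of the two-step estimator, reusing `banded_lowrank_fit` from the previous sketch. The rank rule inside `estimate_rank` (thresholding the eigenvalues of the subgrid fit) is an assumption made for illustration; the text defines the rank estimate as the rank of the subgrid minimiser of (2.1).

```python
def estimate_rank(S_sub, delta, grid_sub, r_max, tol=1e-3):
    # Assumed rule: fit on the subgrid with a generous rank bound r_max,
    # then count eigenvalues above a small relative threshold.
    G = banded_lowrank_fit(S_sub, delta, r_max, grid_sub)
    ev = np.linalg.eigvalsh(G)
    return int(np.sum(ev > tol * ev.max()))

def calibrated_covariance_pinv(W, grid, delta, K0, r_max, seed=1):
    """Two-step estimator: (1) rank from a fixed-size K0 subgrid,
    (2) rank-constrained fit on the full grid, (3) truncated Moore-Penrose
    inverse of the resulting covariance matrix."""
    S = np.cov(W, rowvar=False)                    # empirical covariance of the rows of W
    sub = np.sort(np.random.default_rng(seed).choice(len(grid), K0, replace=False))
    r_hat = estimate_rank(S[np.ix_(sub, sub)], delta, grid[sub], r_max)
    G = banded_lowrank_fit(S, delta, r_hat, grid)  # full grid, rank fixed at r_hat
    vals, vecs = np.linalg.eigh(G)
    top = np.argsort(vals)[::-1][:r_hat]           # top r_hat eigenpairs
    pinv = (vecs[:, top] / vals[top]) @ vecs[:, top].T
    return pinv, r_hat
```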

Remark 1.

Strictly speaking, we should speak of a minimiser rather than the minimiser of (2.1) or of (2.2), since the minima of these objectives may not be unique. However, it can be shown that both objectives have a unique minimum for all sufficiently large sample sizes, which is why we do not insist on making the pedantic distinction.

The regression calibration estimator of the slope parameter in the functional linear regression setting is now defined by applying the Moore-Penrose inverse constructed above to the empirical covariance between the observed covariates and the responses.
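In matrix form, the slope estimate is one line on top of the previous sketch; a hedged version for the scalar-response case (W an n × K data matrix, y a length-n response vector), where the simple Riemann quadrature weight `h` used to pass from matrices to integral operators is our assumption:

```python
def slope_estimate(W, y, grid, delta, K0, r_max):
    """Regression calibration slope for a scalar response: apply the truncated
    pseudo-inverse of the estimated covariance of X to the empirical
    cross-covariance between W and y."""
    n = W.shape[0]
    pinv, _ = calibrated_covariance_pinv(W, grid, delta, K0, r_max)
    sigma_Wy = (W - W.mean(axis=0)).T @ (y - y.mean()) / n
    h = grid[1] - grid[0]                 # Riemann weight (equispaced grid assumed)
    return pinv @ sigma_Wy / h            # values of the slope function on the grid
```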

The following theorem provides the asymptotic behaviour of the estimators of the rank, the covariance operator (together with its generalised inverse), and the slope.

Theorem 1.

Suppose that , , , , and let be a fixed integer. Suppose that , and as . Then, the following hold as .

  • .

  • and , where is the Hilbert-Schmidt norm.

  • .

Remark 2.

Part (b) of Theorem 1 shows that, by utilizing the grid subsampling technique, we have been able to achieve parametric rates of convergence for the estimator of the covariance operator and its generalised inverse. Part (a) shows that we have achieved consistency of the estimator of the rank even when the grid size is very large compared to the sample size. This is unlike what is obtained in [8]: under the growth condition on the grid that is also required to ensure the parametric rate of convergence of their estimator of the true covariance, it is unknown whether their estimator of the rank would even be consistent. This is because the consistency of the rank estimator in [8] is ensured under precisely the opposite condition, namely that the grid size does not grow any faster than the sample size.

Proof of Theorem 1.

(a) It follows from Theorem 3 in [8] that the subgrid minimiser is consistent. Since the subgrid size is fixed, this implies that its rank converges in probability to the true rank as the sample size grows. Hence, the probability that the estimated and the true rank differ by more than any fixed tolerance vanishes asymptotically; since both take integer values, part (a) of the theorem follows upon choosing any tolerance smaller than one.
(b) Fix any constant; then

(2.3)

where is the estimator obtained in Steps 2 and 3 of the previous algorithm when the chosen rank is equal to the true rank . Denote the matrix form of the minimizer in this situation by , i.e.,

Let us denote the eigenvalues and the eigenfunctions of by ’s and ’s for .
In order to show that , it is enough to show that the first probability in the right hand side of (2.3) is small if we choose a large enough , i.e., . This is because the second probability in (2.3), namely, converges to zero by part (a) of the theorem. Now, observe that

(2.4)

where the last inequality follows using standard results in perturbation theory of operators (see, e.g., [17]) and . So, if we can show that , then using the fact that along with (2.4), it will follow that . Consequently, from (2.3), we will get that .
We will proceed along lines similar to the proof of Theorem 3 in [8]. It then follows that we only need to show that . Define the functionals

(2.5)

where is the covariance matrix of . Also, define . This distance as well as the functionals in (2.5) are defined over the space of matrices of rank equal to .
First, observe that it follows from Proposition 2 in [8] that is the unique minimizer of almost surely over the grid provided that . Also, this minimal value is zero. Let and consider the following Taylor expansion:

(2.6)

where for some . Note that and , where and are arbitrary matrices of rank . Observe that . So, (2.6) yields

(2.7)

Next, define . A first order Taylor expansion yields the following simplification of with for some .

(2.8)

Next, it can be shown that for a constant , and the finiteness of this constant is a consequence of the assumption . Thus,

(2.9)

It now follows from Theorem 3.4.1 in [26] that the minimizer of satisfies

(2.10)

as , where the term is uniform in . This and the fact that completes the proof of the fact that . Hence, from the arguments given towards the beginning, it follows that both and are as .
For proving the second part of (b) of this theorem, note the decomposition of the slope estimator into the Moore-Penrose inverse applied to the empirical cross-covariance. It follows from the central limit theorem in separable Hilbert spaces (see, e.g., [1]) that the empirical cross-covariance converges to its population counterpart at the parametric rate, and from the weak law of large numbers in a separable Hilbert space (see, e.g., [1]) that the remaining empirical quantities are consistent. Now, recalling the regression calibration identity from the discussion in the Introduction, and using the earlier statement together with the first part of (b) of this theorem, we obtain the stated rate for the slope estimator as the sample size grows. This completes the proof of the theorem. ∎

2.2 Function-on-function model

We now consider the case when the response variable is also functional. In this case, the true function-on-function linear regression model features a response that is a random element of the underlying separable Hilbert space and an unknown bounded linear slope operator. Otherwise, we still only get to observe the corrupted proxy instead of the true covariate, with the covariate and the error satisfying the same assumptions as before. If the covariance of the true covariate were known, the least squares estimator of the slope operator would be given by the solution of the corresponding normal equation, namely the cross-covariance composed with the generalised inverse of that covariance. This is due to the identifiability constraint that the slope operator annihilates the orthogonal complement of the range of the covariance, enforced via the projection operator onto that range.

This motivates the regression calibration estimator in the functional response case, defined by composing the empirical cross-covariance operator between the observed covariates and the responses with the Moore-Penrose inverse constructed exactly as in the scalar response case considered earlier. We then have:

Theorem 2.

Under the assumptions of Theorem 1 with replaced by , we have as .

Proof of Theorem 2.

Note that . By the central limit theorem for separable Hilbert spaces (see, e.g., [1]) applied to the space of Hilbert-Schmidt operators, it follows that as . So,

The right hand side of the last inequality above is by the earlier fact and part (b) of Theorem 1. This completes the proof of the theorem. ∎
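For completeness, a hedged sketch of the functional-response analogue, differing from the scalar case only in that the cross-covariance is now a matrix (responses observed on their own grid, stored as rows of an n × K_Y array; helpers reused from the earlier sketches):

```python
def slope_operator_estimate(W, Y, grid, delta, K0, r_max):
    """Regression calibration for a functional response: estimated slope
    kernel evaluated on the product grid."""
    n = W.shape[0]
    pinv, _ = calibrated_covariance_pinv(W, grid, delta, K0, r_max)
    Sigma_WY = (W - W.mean(axis=0)).T @ (Y - Y.mean(axis=0)) / n
    h = grid[1] - grid[0]                 # Riemann weight (equispaced grid assumed)
    return (pinv @ Sigma_WY / h).T        # kernel values, rows indexing the response argument
```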

2.3 Practical implementation

To implement the grid subsampling technique, it is useful in practice to compute several rank estimates in Step 1 of the earlier algorithm, each corresponding to a different random selection of subgrid points from the original grid. This helps reduce sampling bias and thus results in a more stable estimator. Suppose we compute a number of such separate estimators, one per iteration. The final estimator is given by the mode of the empirical distribution of these estimates, namely,

Note that the upper bound on the range of values is enforced by the identifiability condition discussed earlier.
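A hedged sketch of this stabilisation, reusing `estimate_rank` from the earlier sketch: repeat the subgrid fit over B independently drawn subgrids and report the mode of the resulting rank estimates.

```python
from collections import Counter

def stabilized_rank(S, grid, delta, K0, r_max, B=50, seed=2):
    """Mode of B rank estimates, each from an independent random subgrid."""
    rng = np.random.default_rng(seed)
    ranks = []
    for _ in range(B):
        sub = np.sort(rng.choice(len(grid), K0, replace=False))
        ranks.append(estimate_rank(S[np.ix_(sub, sub)], delta, grid[sub], r_max))
    return Counter(ranks).most_common(1)[0][0]   # empirical mode of the B estimates
```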

Proposition 1.

For any , converges in probability to as .

Remark 3.

The above proposition shows that even if we use the stabilised rank estimator instead of the single-subgrid one in the definition of the covariance estimator, the resulting estimator, and consequently that of the slope (in either the scalar or the function-on-function setting), will retain the same rates of consistency. To see this, note that the proofs of Theorems 1 and 2 only utilize the fact that the rank estimator is consistent and that the subgrid size is fixed. Thus, the proofs go through unchanged by Proposition 1.

Proof of Proposition 1.

Fix . Since for each , converges to by part (a) of Theorem 1, it follows that as . Also, as . Thus, for any , we have in probability as . So, in probability as . But this is the same as saying , i.e., converges to in probability as . ∎

With this in place, the practical algorithm for constructing the grid-subsampling estimator is as follows:

  • For the th iteration (), randomly select a sub-grid