Variables measured with error often pose a significant concern for valid inference. The use of error-prone variables impacts analyses in a manner which depends both on the degree of measurement error present, and on the analysis technique that is used. A large number of techniques have been proposed in order to correct for the effects of measurement error. We generalize two correction techniques, regression calibration [Carroll1990, Gleser1990] and simulation extrapolation (SIMEX, [SIMEX]), to relax one of their common assumptions: the availability of independent and identically distributed replicate measurements.
The focus on regression calibration and SIMEX is due to their widespread use owing, in part, to their approximately consistent nature. Approximately consistent methods trade rigorous mathematical guarantees for more general utility and ease of implementation. Both of these methods make consistency guarantees in only a limited number of scenarios, but are useful to reduce the impact of measurement error in a wider variety of settings. They afford comparatively straightforward implementation, and can be applied directly to most available estimation techniques, making them attractive in applied research. However, these techniques often require restrictive assumptions about the availability and structure of the data. Our proposed methods allow for a generalized measurement error model whereby each repeated measurement is assumed to be an error-prone observation of the truth, potentially subject to different measurement error models.
By contrast, many methods focus on measurement error in specific models with an emphasis on producing consistent estimators. Consistency is a desirable property which often comes at the cost of specific model assumptions and complex implementation, thus limiting the scope of application. Violations of these assumptions may leave the methods with limited utility or render them entirely useless [CarrollBook, SIM_tutorial, YiBook]
. Certain consistent methods, such as moment reconstruction[MomentReconstruction], make similar assumptions regarding replication, and we extend our techniques to them as well.
As a motivating example, we consider the Framingham Heart Study [Framingham]. The Framingham Heart Study is a large cohort study investigating coronary heart disease. Following previous analyses in the measurement error literature [CarrollBook], our interest concerns the impact of long-term average systolic blood pressure on the development of coronary heart disease, while controlling for other risk factors. Systolic blood pressure is considered error-prone since it is a single time point measurement which may differ substantially from the long-term average blood pressure. The study reports four separate measurements of systolic blood pressure, two at each clinical visit, from two visits four years apart. Analyses of these data often consider these four measurements as replicated measurements of the same underlying value [CarrollBook]. That is, the systolic blood pressure measurements are considered independent and identically distributed realizations of the same variable. While this assumption is a reasonable approximation, CarrollBook note that these sets of measurements deviate substantially from one another. This is the setting our techniques are developed for.
2.1 Measurement Error Background
We take to index over subjects, and for the th subject define to be the true value for our variate of interest, to be its observed version or surrogate value, to be a numeric outcome, and to represent all other covariates which are measured without error.
Measurement error is often assumed to follow an additive model. The classical additive model posits that we observe , where , is constant with respect to , and is independent of and . The surrogate is thus an unbiased measurement of , in that . Often,
is assumed to follow a normal distribution.
We may also consider multiplicative models which typically assume the form , with , and constant variance given by , with independent of both and . Thus, is unbiased for . Many correction techniques in the literature assume an additive structure, and it is often advised to transform a measurement that has multiplicative error onto a scale where the error becomes additive [CarrollBook, Multiplicative_Transformations].
Our notation implies that the random variables are scalar valued. Vector valued random variables can be accommodated by allowing each component to follow analogous expressions. For some results, it will be convenient to unify this framework notationally. We can writeor , where is a mean-zero, unit variance random variable, independent of , and , and is constant.
In order to correct for the effects of measurement error, its size must be assessed, and additional information is required to do so. In addition to the assumed measurement error model, applicable methods are often dictated by what auxiliary data are available. These may come in the form of validation data, where for some individuals both and are available, the aforementioned replicate data, where repeated independent and identically distributed measurements of are taken, or instrumental data. Instrumental data involve the measurement of and an additional variable , called the instrumental variable, which is related to and only through the true value, . Replicate data are a particular form of instrumental data.
Validation data are typically preferred, but are often unavailable or impossible to collect. There is literature which uses instruments for the correction of measurement error effects, some of which extends the idea of regression calibration [InstrumentalBackground, CarrollBook]. However, the conditions required for a variable to be viewed as an instrumental variable are often difficult to verify in practice, and falsely assuming that a factor is an instrument can lead to large biases, even when the bias introduced from the measurement error itself is small [CarrollBook]. Consequently, most of the literature has focused on the use of replicate measurements. When using replicate data, we assume that we have a sample in which we observe , where are independent and identically distributed.
2.2 Regression Calibration (Best Linear Unbiased Prediction Technique)
Regression calibration [Carroll1990, Gleser1990] models the relationship between and based on an assumed model between and and a model between and . While there exist many specific implementations of regression calibration, we consider the best linear unbiased prediction (BLUP) technique [CarrollBook].
The BLUP technique assumes that there is a standard model, informed by the underlying subject matter, which relates to , and that is well represented by a linear structure. Then, we replace in the standard model with , representing an estimate of . If we assume that replicate measurements are available, we define the BLUP of , as the linear quantity which minimizes the mean squared error, when viewed as a predictor of . That is, we find such that is minimized, with respect to , , and . Taking and to represent the mean of and the covariance between and , we get
Under the assumption that replicate data are available, we usually take . All quantities in equation (1) can be estimated using the replicate measurements. In the event that and are jointly normally distributed, this method consistently estimates the conditional mean
. If the joint distribution ofand is non-normal, but is sufficiently small, it has been shown that this technique can provide an acceptable prediction of the conditional mean of given and [Carroll1990].
Once we have values for , we fit the model for given , using instead. Our interest is in a vector of parameters, , which are consistently estimated as the solution to an estimating equation, say, , where when . We also call an M-estimator. While the M-estimator approach is not strictly necessary, it is sufficiently general to cover most estimators of interest. The regression calibration estimator, , is then the solution to .
If the model for in terms of is linear, and the BLUP consistently estimates the conditional mean function, , then the resultant estimator will be consistent, provided mild regularity conditions. If the model for
is non-linear, or the mean function is not consistently estimated, we cannot generally guarantee consistency. In certain non-linear models, however, specific claims can be made. In a log-linear model, with the correct mean function, all slope parameters will be consistently estimated. In a logistic regression model bias will generally be reduced and, if interest is in the estimated probabilities rather than the parameters, the approximation is often quite accurate[CarrollBook]. Thus the regression calibration estimators are regarded as approximately consistent [CarrollBook]. Under regularity conditions, after certain transformations, they are asymptotically normal [Carroll1990].
2.3 Simulation Extrapolation
Simulation extrapolation [SIMEX] is divided into two steps: simulation and extrapolation. In the simulation step, data subject to larger measurement errors are simulated, so that the analyst can see the impact of this error on the bias of the estimated parameters. Then, in the extrapolation step, this relationship is extrapolated back to the case where no measurement error is present. We will continue to assume the parameters of interest, , are consistently estimated in the error-free setting using an M-estimator. Our presentation of SIMEX will assume that is a scalar, but the method applies in higher dimensions. Suppose that where is zero-mean with unit variance, and is known. For some positive constant , we construct the quantity , where the are generated by the analyst to be independent and identically distributed pseudo-random variables, independent of the .
For , the variance of conditional on is . If , then our measured values would not deviate from the truth, and taking reflects this situation. For any , we can generate as the solution to . Averaging the results over independent simulations produces an arbitrarily precise estimator for the quantity . Repeating this process for many values of , say , generates a sequence of estimators for .
We can model these values as a function of , parametrically as . Using least squares estimation we can estimate as . Extrapolation occurs by taking . Generally, one of three extrapolants are used: a linear extrapolant, , a quadratic extrapolant, , or a nonlinear extrapolant, , where , , and are regression coefficients [SIMEX].
If the errors are normally distributed, then for a correctly specified, sufficiently smooth [SIMEX_jackknife], consistently estimates . More broadly, SIMEX can be seen as an approximately consistent estimation technique. It consistently estimates , which itself approximates the true value when , assuming correct specification of . The quality of this approximation will dictate the quality of the estimator.
This discussion has assumed that the error variance is known, which is seldom the case. When replicate data are available, and if error variances are assumed to be homogeneous, then it is possible to use the replicate measurements to estimate the error variance, and use this estimated quantity in place of . Under regularity conditions, the resultant estimators are asymptotically normal [SIMEX_asym]. When replicates are available, but error variances are heterogeneous, a modified version of SIMEX, known as the empirical SIMEX, can be used [SIMEX_empirical]. We focus on the standard SIMEX.
3 Generalizations of the Methods
3.1 Data Structure and Identification
We present a model which relaxes the assumption of identically distributed replicate measurements, which we refer to as the generalized measurement error model. We will assume that, for the th subject, proxies, , are observed. We assume that each proxy is either additive, , or multiplicative, , where is a zero-mean, unit-variance random quantity, and , and are, possibly unknown, constants. All error processes are assumed to be independent of each other and of . The multiplicative error case can be made multivariate using Hadamard products, denoted . On occasion, the and terms may be more clearly expressed as single random error terms, with zero-mean and variance or unit-mean and variance, respectively.
In this setup, taking to be the indicator function, we have: ; ; ; and , where , indexes proxy measurements and indexes the individuals. We let represent the diagonal matrix with the elements of along its diagonal. Further, , and , where, and are matrices that capture the variance of the assumed error model, taking the form of if an additive structure is assumed, and or otherwise.
3.2 Parameter Identification
We must impose restrictions on some model parameters in order to render the model identifiable. We will assume, for some set of , that (1) , (2) , or (3) both and . These assumptions also capture the case where, for instance, for any constant . If is non-zero, we can work with , leaving us with a measurement satisfying assumption (1). When and for all , our model reduces to that of having unbiased measurements of , from possibly different distributions. Other assumptions are also feasible.
We define , , and to be the index sets for the proxies corresponding to assumptions (1), (2), and (3) respectively. We assume that and , which is not necessary, but will suffice for the identification of the parameters. We denote the parameters given by and as and , respectively.
Assuming that for all , we take ; ; ; and . If is not observable, then we can take and . If is observable then we take ; ; and . With observable, is permissable. We collectively refer to these estimators as the correction parameter estimators. The parameters needed to compute these correction parameter estimators will not typically be known, however, they can be consistently estimated with the observed data.
The correction parameter estimators can be expressed as the solution to unbiased estimating equations subject to standard asymptotic theory. We can write that
The correction parameter estimators can be expressed as the solution to unbiased estimating equations subject to standard asymptotic theory. We can write thatis given by the solution to , and for the true value , we have . The form of is given in the supplementary material as equation (S5). Proof: See the supplementary material.
By altering the form of , this result applies if is available. It provides a mechanism to assess the performance of the correction parameter estimators. Standard asymptotic results demonstrate that converges in distribution to , as . Here and , and both are estimable consistently from the data. As outlined in Section 2, we focus on estimation techniques which can be framed as M-estimators. Using Lemma 1 we can derive the asymptotic distribution for estimators derived from any correction methods that use the contents of and M-estimation.
Assume that solves the empirical estimating equation from Lemma 1, denoted , and that is a solution to the empirical estimating equation , where and are estimating and , respectively. Then we have that converges in distribution to , as , where , for , is upper-triangular, and is symmetric. Here is the dimension of , and is the dimension of . Proof: See the supplementary material.
3.3 Regression Calibration
Regression calibration using the BLUP technique functions almost equivalently in the case of our generalized error models as in a standard error model. We modelas a linear function and then estimate by solving . When assuming an additive measurement error model with replicate observations, we take the mean of the replicates as . In the case of non-identically distributed measurements, it is unlikely that this will be the most efficient combination. Intuitively, measurements with lower variance ought to contribute more to our proxy measure than those with high variance. We define , for some set of weights such that , with for . Taking , , and , simplifies to the standard estimators [CarrollBook].
This suggests that the standard regression calibration technique is resilient to the assumption of independent and identically distributed replicates, without modification, when . However, in the standard measurement error setting, a single error variance term, , is estimated, based on the fact that . If the replicates are selected to be unbiased from our general model, then this estimator, denoted , is better viewed as an estimator of . This renders the parameters in the standard implementation hard to interpret.
Additionally, this may cause issues for future investigations. Consider the Framingham Heart Study, where replicate measures of blood pressure are taken at two clinic visits four years apart. Using the standard implementation for regression calibration, our estimate is consistent for , where we are averaging over the four separate measurements, two at each clinic visit. The standard correction is then only valid for individuals with all of , , , and measured, which may not always be possible. In our framework we can proceed with the correction using only a subset of these measurements, allowing for patients outside the sample to be accommodated in the model. In the standard procedure, this is not possible as .
A final concern is that the standard procedure forces the weights chosen to be , which are unlikely to be optimal. We advocate for adding as parameters to the BLUP before minimizing the mean squared error between and . This provides the set of optimal weights in the same sense that the BLUP provides the optimal set of . It is not possible to derive a closed form expression for the set of weights. Instead, we use numerical optimization. In Section 3.5 we discuss the setting when for some , where the standard estimators are not consistent.
The BLUP is only a consistent estimator for the conditional expectation, , when the conditional expectation is linear in the conditioning variables. In the supplementary material we provide Lemma S1, generalizing Lemma A.1 from Carroll1990 which uses their notation for matrix derivatives and the trace operator, and specifies the inverse of as . With this we characterize the linearity of the BLUP.
Under the generalized error models presented, assuming that , we have that
when . Proof: See the supplementary material.
The term is linear in if and only if is normally distributed [Carroll1990]. Since we are conditioning on , we can exclude values of this ratio which are unobservable almost surely. As a result, domain indicators can be dropped. Then, for the case of additive measurement error, the conditional mean (2) will be approximately linear if either is linear and is constant, or if is constant and is linear. Linearity in the multiplicative mean (3) is more restrictive. Here, due to the additional multiplicative term, we need both to be constant and . If is constant, then the first term of equation (3) will be . The second term is only if or is constant. As a result, it is sufficient to have uniformly distributed, and for to be constant. This illustrates the caveats with applying this method to multiplicative errors. In many situations, the linear approximation for the additive case will be sufficiently good. However, in the multiplicative case the expectation will be non-linear under most assumed models.
Consider the case when no is measured, and is a scalar. In order for to be linear, additive models require to be linear, and multiplicative models require to be constant. If we have two observations from the generalized error model, and , then direct calculations show that will be linear under the same conditions that render linear. This applies symmetrically to . Checking the goodness of fit of a linear model between any two proxies in turn checks the adequacy of a linear approximation. This highlights the relationship between our methodology and instrumental variable approaches based on regressing a measurement of the truth on an instrument [CarrollBook]. These results justify both the theoretical conditions under which a linear model is warranted, and a technique for checking whether linearity holds approximately. The modified regression calibration procedure also produces asymptotically normal estimators.
Under standard regularity conditions, the estimated parameters using the regression calibration correction, , are consistent for the parameters , and are asymptotically normally distributed, such that, as , converges in distribution to , where , for matrices analogous to those in Lemma 2. Proof: See the supplementary material.
Importantly, this result shows asymptotic normality, not around the true value, , but around , the solution to . Under regularity conditions, is the probability limit of . The asymptotic bias of the regression calibration correction is thus determined by the difference between and , with consistency achieved when . The previously discussed results regarding consistency, and approximate consistency, of regression calibration methods apply to the modified technique, whenever is consistent for .
3.4 Simulation Extrapolation
To modify the SIMEX , it is insufficient to add additional error with variance , as will not be constant across . Instead, we rely on and to motivate the modified versions of SIMEX. The strategy will be to match the first two moments of and , as . For fixed , we take
where is an appropriately sized standard normal random variable, independent of all covariates. Given , we find that , and . Thus, as , the first two moments of and agree.
We do not typically have , , or available, and as a result we will estimate them from the proxy observations. In the standard measurement error setting, if homoscedacticity is assumed, then SIMEX progresses using and . As with regression calibration, if , the standard SIMEX applies to non-independent and identically distributed data, with the same caveats. The empirical SIMEX relies more heavily on the independent and identically distributed property, and while it is not strictly necessary, it is not readily facilitated in our more general case. As a result, we continue to focus on homoscedastic errors.
We consider two ways of combining the replicate measurements based on equation (4) either (a) averaging estimates of or (b) averaging the proxies. For (a), we find for each , and combine these estimates. For (b) we use (4), where in place of we take for some set of weights .
The SIMEX correction is only approximately consistent. The quality of the approximation is determined by (1) the quality of the extrapolant, and (2) how well the matching of the first two moments of and approximates the matching of their distributions. As a result, if the extrapolant is correctly specified and, for instance, is normally distributed, then SIMEX is consistent. The estimator is generally asymptotically normal and is consistent for , the limit of as .
Under standard regularity conditions, the estimated parameters using the SIMEX correction, , are consistent for the parameters , and as , converges in distribution to , where is estimable through sandwich estimation techniques. Proof: See the supplementary material.
3.5 Weighting and Incomplete Replication
We introduced , a set of weights which take advantage of the unequal variances to more efficiently combine proxies. In our generalized regression calibration, we consider including them as parameters in the BLUP, as it is straightforward to implement. In our simulations, we found the choice of weights seems to impact the estimates less than other factors, although the added flexibility may be useful in some settings. For instance, when the SIMEX estimator averages after extrapolation, we can view the set of estimates obtained prior to averaging as separate estimates of . These estimators are jointly asymptotically normal. This allows us to apply results regarding the efficiency of combinations of estimators. If , then will be the optimal weight. If there is a large difference in the variances, or if the covariance is fairly small, then the optimal combination deviates largely from equal weighting.
The development until now has assumed that for every individual we have proxies. If this is not the case, and only some subset of subjects have particular proxies observed, then the standard methodology will not provide valid estimators. If we take to be the mean of observations for individual , then under the independent and identically distributed assumption will have variance . However, if these replicates are not independent and identically distributed this no longer holds, and the variance will depend on which the individual has measured. Because of this, whenever the set of proxies available for each individual is not constant across all individuals, the standard implementations will fail.
If we assume that the set of replicates available for each individual are independent of the measured variables, we can continue in our modified framework. Under this assumption of ignorable missingness, the th proxy’s parameters are consistently estimable using only the subjects which have the th proxy available. The function that the M-estimators are based on, , can be modified to include observation indicators of . Once the correction parameters are estimated, the BLUP formulation modifies equation (1), noting that for each , will have different distributions. This summation is a slight abuse of notation, which we take to mean the summation of the available proxies for individual .
For SIMEX, if averaging the proxies directly, we group individuals based on the set of proxies that they have available. Then, in each group, we average the proxies and use equation (4), and the standard SIMEX procedure, to derive separate estimates for each group, which can be combined. If averaging the estimates instead, we compute the SIMEX estimate for each of the proxies, ignoring the individuals who do not have a measurement for that value. This produces a set of estimates, which can then be averaged. If the proxies are not independent and identically distributed, there is incomplete replication, and if this missingness is not ignorable, then alternative methods for correction will need to be explored.
4 Simulation Studies
4.1 Linear Regression Models
To investigate the behavior of the proposed methods in our generalized measurement error framework, we compare them with the standard implementation of these techniques in three simulated scenarios. First we consider a linear regression. We take, with , , and to be the true covariate vector, where all components are assumed to be independent. The outcome is taken to be , where . We generate three error-prone proxies, , where , with , and , and where , and . We select of and of to be missing completely at random. All error terms are generated independently of each other and of all other quantities.
We estimate model parameters using (1) standard regression calibration, (2) standard SIMEX, (3) empirical SIMEX, (4) generalized regression calibration using equal weighting, (5) generalized regression calibration solving for optimal weights, (6) generalized SIMEX where proxies are combined, and (7) generalized SIMEX where the estimates are combined. These simulations were repeated times with a sample size of . Results are summarized in Fig. 1.
The generalized procedures tend to correctly estimate the parameter values, producing similar results. As expected, the standard regression calibration and SIMEX procedures perform very poorly, despite this being a linear model with normal errors.
4.2 Log-Linear Regression Models
Next, we consider a log-linear model. We generate and . The outcome is a gamma random variable with . We generate three error-prone proxies, where , and and are independently generated as , and the errors are all independent. For we selected of the observations to be missing completely at random. We compare nine methods, the seven from the linear regression simulations, as well as both the weighted and unweighted regression calibration correction where the correction parameters are computed ignoring .
The results are summarized in Fig. 2. The generalized regression calibration methods perform well for the slope parameters, and the generalized SIMEX methods perform well for all three parameters. We also emphasize that these results are based on a multiplicative error model, showcasing the capacity of both methods to function in this scenario.
4.3 Logistic Regression Models
Finally, we consider a logistic regression model. We take the true covariate , with , where . We generate three error-prone proxies where both and are given, independently, by , and where . We select of to be missing. In these simulations we compare the results of the generalized estimators, using either all of or only the independent and identically distributed replicates for the corrections, labeled all and IID respectively. We differentiate between the weighted generalized regression calibration and the standard version, as well as the SIMEX estimator that averages proxies versus the one which averages estimates.
In Fig. 3 we summarize the results of the parameter estimates and in Fig. 4 we summarize the results of the estimated probabilities. These demonstrate the bias reduction and effective probability estimates of both techniques in logistic regression. Moreover, these simulations demonstrate how biased, both using and , proxies can stabilize estimators.
5 Extensions to Other Methodologies
Moment reconstruction [MomentReconstruction] is a method that was developed in a spirit similar to regression calibration, but which sought to overcome some of its shortcomings. In particular, analysis is conducted using substituted for , where is an estimated version of selected such that the joint distribution . Unlike regression calibration, moment reconstruction involves solving for the forms of parameters based on the assumed underlying model. We present the results of moment reconstruction in a logistic regression, a case where the moment reconstruction estimators are consistent whereas the regression calibration corrections are not. The primary motivation for this presentation is to demonstrate how the identification of parameters as in Section 3.2, and the related results, can be extended to some exact correction methods, using problem-specific derivations.
Assume that, conditional on , follows a distribution. Then, for each subject, we take , where , for , and . This results in having the same first two conditional moments, given , as does. This setup readily presents M-estimators, extending the quantities in Section 3. These are presented explicitly in the supplementary material.
Denoting the probability , the logistic regression estimators for and are and , with these estimators simultaneously solving and . The moment reconstruction procedure replaces in the above estimating equations with,