Agreement indices are developed to assess the agreement among different raters, or among repeated measurements from the same rater, on the same subjects. Barnhart et al. categorized existing approaches for evaluating agreement into descriptive tools, scaled summary indices attaining values between -1 and 1, and unscaled summary indices [barnhart2007overview]. Barnhart further argued that unscaled indices, such as the coverage probability (CP) and the total deviation index (TDI), are preferred for assessing agreement in a core lab setting for the following reasons: (1) they are simple to implement; (2) they can be interpreted intuitively in terms of the original measurement unit; (3) the CP can provide actionable results that guide readers to identify the source of disagreement and improve their measurements [barnhart2016choice]. The CP quantifies the chance that the difference between two measurements from two given raters on the same subject is less than a pre-specified acceptable boundary. The TDI, as the counterpart of CP, is the boundary within which the difference between two measurements falls with a pre-specified probability. In practice, it may be desirable to set more than one acceptable boundary, up to a maximum acceptable difference, with corresponding acceptable CPs. An example is the British Hypertension Society protocol (BHSP) for the evaluation of blood pressure measuring devices [o1993british], shown in Table 1. This protocol grades blood pressure measuring devices by specifying the satisfactory CPs for multiple pre-specified differences (Table 1). It is therefore useful to summarize agreement through a coverage probability curve, defined as the curve of coverage probabilities over a range of differences. The relative area under the coverage probability curve (RAUCPC) was introduced as a summary measure of agreement [raucpc].
Table 1: The British Hypertension Society protocol for grading blood pressure measuring devices.

| Grade | Pre-specified Difference (mmHg) | Pre-specified Coverage Probability |
| --- | --- | --- |
| D | | Fail to achieve C |

Note: 20 mmHg is not included in the original protocol but is added here as the maximum acceptable difference.
CP, TDI and RAUCPC are all originally defined for two raters. However, several competing new raters may be developed at the same time and need to be compared with each other or with an existing rater. Here we are interested in the interchangeability among more than two raters. For example, in the data set published in Bland and Altman's paper [bland1986], the blood pressures of 85 patients were measured by two human observers and one device, and we are interested in whether these three raters can be used interchangeably. It is therefore desirable to extend CP, TDI and RAUCPC to measure agreement among multiple raters while preserving the intuitive interpretation of the pairwise versions. Lin et al.[lin2007unified] first extended the concepts of CP and TDI to multiple raters using a two-way mixed model, and later Jang et al.[jang2018overall] proposed a new set of definitions based on the root mean square of pairwise differences (RMSPD). Although these overall agreement indices give some insight into the closeness of the tested raters, they have limitations due to assumptions that may not hold in practice. First, the overall unscaled indices proposed by Lin et al. are approximate measures that are valid only when the following assumptions hold[lin2007unified]: (1) the relative bias squared is small; (2) the measurements follow normal distributions; (3) there is homogeneity across raters. The overall indices proposed by Jang et al.[jang2018overall] relax these assumptions but still require that the difference between two measurements on the same subject be normally distributed. We demonstrate in Section 4 that these assumptions do not hold for the blood pressure measurements[bland1986].
In addition to the distributional assumption, the overall indices proposed by Jang et al.[jang2018overall] are difficult to interpret intuitively in practice. For example, for their overall CP (OCP), the satisfactory boundary is specified on the root mean square of all pairwise differences among the raters. While an acceptable difference between two measurements can be chosen based on clinical considerations, it is not easy to choose an acceptable root mean square of differences, because its magnitude is difficult to interpret in terms of clinical judgement. Moreover, satisfactory agreement by this OCP does not guarantee that the raters are interchangeable: if one rater departs substantially from the rest and is thus not interchangeable with the others, averaging the squared pairwise differences can still yield an acceptable RMSPD, leading to the claim that all raters are interchangeable. Similar problems can arise for their overall TDI and RAUCPC. Last but not least, Jang's method[jang2018overall] cannot be applied to measurements with replications and must instead be applied by restricting to one measurement per rater on each subject.
To address the aforementioned issues, we propose new overall CP, TDI and RAUCPC indices based on the maximum pairwise difference (MPD) to assess overall agreement among all considered raters. The new indices have an intuitive interpretation in terms of the original measurement unit and can be estimated by a unified non-parametric, distribution-free paradigm based on generalized estimating equations (GEE). The GEE approach can assess inter- and intra-rater agreement simultaneously without normality or homogeneity assumptions. Moreover, under a minimal set of assumptions, we show that the estimator achieves the semi-parametric efficiency bound using the working independence covariance matrix. The paper is organized as follows. In Section 2, we first propose a new estimator of the pairwise RAUCPC. We then introduce the new definitions of overall CP, TDI and RAUCPC and, with the new estimator of RAUCPC, develop a unified GEE approach for estimation and inference for all overall indices. We provide simulation results assessing the performance of the unified approach and illustrate the method with the blood pressure data from Bland and Altman[bland1986] in Sections 3 and 4, respectively. Finally, we draw conclusions and provide some discussion in Section 5.
We are interested in whether raters can be used interchangeably for making the same type of measurement in a given population. For a subject randomly sampled from the population, we denote by Y_j the measurement taken by rater j, j = 1, ..., J. The interchangeability among the raters can be based on a distance metric D, which reflects the closeness among the measurements given by the raters. For J = 2 raters, the distance metric is defined as the absolute difference, D = |Y_1 - Y_2|.
It is intuitive that a smaller distance implies better agreement between two raters. To use two raters interchangeably, one would like a high probability that this distance falls within an acceptable difference. Therefore, the concept of coverage probability is used; it is defined as the probability that the distance falls within a boundary d,

CP(d) = Pr(D <= d) = F(d),

where F is the cumulative distribution function of D over the target population. The higher the CP(d), the better the agreement. To use CP to claim satisfactory agreement, we need to pre-define a clinically acceptable boundary d_0 and the corresponding satisfactory probability p_0. If CP(d_0) is greater than or equal to p_0, we can claim that the two raters are interchangeable. The TDI, as the counterpart of CP, is defined as the boundary within which a given proportion of the distances falls,

TDI(p) = F^{-1}(p).
The smaller the TDI, the better the agreement. To use TDI to claim satisfactory agreement, we need to pre-define a clinically acceptable probability p_0 and a satisfactory boundary d_0. If TDI(p_0) is less than or equal to d_0, then we claim that the two raters are interchangeable.
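As a concrete sketch, the pairwise CP and TDI have simple empirical (non-parametric) counterparts: CP(d) is the fraction of subjects whose absolute paired difference is within d, and TDI(p) is the empirical p-quantile of those differences. The function names and the toy data below are illustrative, not from the paper.

```python
import numpy as np

def coverage_probability(x, y, d):
    """Empirical CP(d): fraction of subjects whose absolute
    paired difference |x_i - y_i| is within the boundary d."""
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    return float(np.mean(diff <= d))

def total_deviation_index(x, y, pi0):
    """Empirical TDI(pi0): a boundary containing a proportion pi0
    of the absolute paired differences (empirical quantile)."""
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    return float(np.quantile(diff, pi0))

# toy data: systolic BP (mmHg) on 5 subjects by two raters
x = [120, 130, 125, 140, 135]
y = [122, 128, 127, 139, 140]
cp = coverage_probability(x, y, 5)      # all |differences| are <= 5
tdi = total_deviation_index(x, y, 0.8)  # 80% of differences within this
```

A larger `d` can only increase `cp`, and `tdi` is the inverse of the empirical CP curve, mirroring the duality between CP and TDI described above.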
In practice, it is sometimes desirable to control quality at more than one pre-specified difference, or we may simply want to summarize agreement over differences up to a maximum acceptable difference. This leads us to consider the coverage probability curve CP(d). The relative area under the CP curve (RAUCPC) was proposed [barnhart2016choice] as a summary index obtained by scaling the area under the coverage probability curve up to a maximum acceptable difference d_max, where d_max is often chosen as the maximum clinically acceptable difference. Specifically, RAUCPC is defined as

RAUCPC = (1/d_max) * integral from 0 to d_max of CP(t) dt.
RAUCPC ranges from 0 to 1, and a greater value indicates better agreement. To use RAUCPC to claim satisfactory agreement, we need to pre-specify an acceptable value, RAUCPC_0. If the RAUCPC is greater than or equal to RAUCPC_0, then we claim that the two raters are interchangeable. It is not obvious how to choose RAUCPC_0, and the British Hypertension Society protocol illustrates one way of doing so. As shown in Table 1, we can set d_max = 20 mmHg. For a grade C device, satisfactory coverage probabilities are specified for the absolute differences of 5, 10, 15, and 20 mmHg. Linearly connecting these points yields a satisfactory coverage probability curve for a grade C device. The curves corresponding to grades A, B and C are shown in Figure 1. The shaded area for grade C equals 11.8, so the corresponding RAUCPC is 0.59. Thus, one can use RAUCPC_0 = 0.59 as the criterion for a satisfactory grade C device. Similar values can be computed for claiming a grade A or grade B device.
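The satisfactory RAUCPC can be obtained from any piecewise-linear satisfactory coverage probability curve with the trapezoid rule. The sketch below uses illustrative CP values, not the exact BHSP table entries:

```python
import numpy as np

def raucpc_from_curve(d_grid, cp_grid, d_max):
    """Relative area under a piecewise-linear coverage probability
    curve up to d_max, computed with the trapezoid rule."""
    d = np.asarray(d_grid, dtype=float)
    cp = np.asarray(cp_grid, dtype=float)
    area = 0.5 * np.sum((cp[1:] + cp[:-1]) * np.diff(d))
    return float(area / d_max)

# hypothetical satisfactory CPs at differences 0, 5, 10, 15, 20 mmHg
d_grid = [0, 5, 10, 15, 20]
cp_grid = [0.0, 0.40, 0.65, 0.85, 1.00]  # illustrative values only
raucpc0 = raucpc_from_curve(d_grid, cp_grid, d_max=20)
```

The exact satisfactory value depends on the published protocol percentages and on how the curve is anchored at 0 and d_max; the code only shows the mechanics of the area computation.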
Jang et al.[jang2018overall] extended the pairwise CP, TDI and RAUCPC to multiple raters by defining the distance metric as the root mean square of the pairwise differences,

RMSPD = sqrt( (2 / (J(J-1))) * sum over j < k of (Y_j - Y_k)^2 ).

Their overall CP based on the RMSPD was defined as the probability that the RMSPD falls within an acceptable boundary.
However, this definition may claim satisfactory overall agreement yet fail to detect an outlying rater, because of the averaging. For example, suppose there are four competing raters where the first three give identical measurements and the fourth gives measurements greater than the other three by 5 on all subjects, and suppose the pre-defined clinically meaningful boundary is 5. Then RMSPD = sqrt(3 * 5^2 / 6) ~ 3.54 < 5 for all subjects, and the agreement would be deemed satisfactory even though the fourth rater differs from each of the other three raters by the full clinically meaningful amount on every subject. Thus there is a need for a better distance metric for assessing agreement among multiple raters.
2.1 Proposed Overall Agreement Indices
For two raters, the aforementioned agreement indices define interchangeability in the sense that whichever rater measures the subject, the results are clinically similar. We want to extend this intuition of interchangeability to situations with more than two raters. As an example, suppose a patient walks into a local clinic and has his or her blood pressure measured by one of several nurses. The patient would expect the blood pressure measurement to be similar no matter which nurse takes it, in the sense that the nurses are interchangeable; hence the largest difference between any two nurses should be clinically negligible. Following this idea, we define a new distance metric among J raters, the maximum pairwise difference (MPD),

MPD = max over 1 <= j < k <= J of |Y_j - Y_k|.
The maximum pairwise difference avoids the pitfalls of averaging all pairwise differences and reduces to the pairwise distance when J = 2. The new overall coverage probability (OCP) and overall total deviation index (OTDI) based on the MPD are defined as

OCP(d) = Pr(MPD <= d) = G(d) and OTDI(p) = G^{-1}(p),

where G is the cumulative distribution function of the MPD. The OCP measures the probability that the maximum pairwise difference among the raters is less than a given clinically meaningful acceptable boundary. To use OCP to claim satisfactory agreement, we need to pre-define a clinically acceptable boundary d_0 and the corresponding satisfactory probability p_0; since the MPD is in the same unit as the pairwise distance, d_0 can be chosen in the same way as for the pairwise CP. If OCP(d_0) >= p_0, then the chance that the measurements given by the J raters on the same subject all lie within d_0 of one another is at least p_0. As the counterpart of OCP, OTDI captures the boundary such that, for a proportion p_0 of subjects, all pairwise differences among the raters fall into (-OTDI(p_0), OTDI(p_0)); if OTDI(p_0) is smaller than the pre-set satisfactory boundary, then we can claim that the raters are interchangeable.
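With one measurement per rater, the MPD of a set of scalar measurements equals their range (maximum minus minimum), so OCP and OTDI have direct empirical estimates. A minimal sketch with hypothetical data (names are illustrative):

```python
import numpy as np
from itertools import combinations

def mpd(measurements):
    """Maximum pairwise difference among one measurement per rater;
    for scalar measurements this equals max - min."""
    return max(abs(a - b) for a, b in combinations(measurements, 2))

def overall_cp(Y, d):
    """Empirical OCP(d): fraction of subjects whose MPD across raters
    is <= d. Y is an (n_subjects, J_raters) array."""
    m = Y.max(axis=1) - Y.min(axis=1)  # MPD equals the range
    return float(np.mean(m <= d))

def overall_tdi(Y, pi0):
    """Empirical OTDI(pi0): pi0-quantile of the subject-level MPDs."""
    m = Y.max(axis=1) - Y.min(axis=1)
    return float(np.quantile(m, pi0))

# hypothetical data: 3 subjects measured by J = 3 raters
Y = np.array([[120, 122, 121],
              [130, 128, 135],
              [125, 127, 140]])
ocp = overall_cp(Y, 5)       # subject-level MPDs are 2, 7, 15
otdi = overall_tdi(Y, 0.9)
```

Note how the third subject's single large discrepancy (MPD of 15) drags down OCP, which is exactly the sensitivity to outlying raters that the RMSPD-based definition lacks.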
When more than one acceptable boundary is of interest, or when we want an aggregated measure of agreement up to a maximum boundary d_max, the relative area under the overall coverage probability curve (RAUOCPC), as an extension of RAUCPC, is defined as

RAUOCPC = (1/d_max) * integral from 0 to d_max of OCP(t) dt.
Like RAUCPC, RAUOCPC ranges from 0 to 1, and a higher value indicates better agreement. Since the MPD preserves the original unit of the measurement, we can use the clinical information about acceptable boundaries from the pairwise setting when setting multiple boundaries for the MPD. In this way, the satisfactory RAUOCPC can be set in the same way as the pairwise satisfactory RAUCPC.
2.2 Estimation and Inference
We propose a unified distribution-free GEE approach to estimate and make inference on OCP, OTDI and RAUOCPC. Minimal assumptions are made: (1) measurements from different subjects are independent; (2) replicated measurements are i.i.d. given the same rater and the same subject. Since the current estimators for RAUCPC proposed by Barnhart[raucpc] cannot be expressed as a sum of functions of individual subjects, they do not fit into the GEE framework. Therefore, in this section, in order to have a unified GEE model for RAUOCPC together with OCP and OTDI, we first propose a new unbiased non-parametric estimator for the pairwise RAUCPC and for RAUOCPC. Second, we present the unified GEE model and the inference approach.
2.2.1 Unbiased Non-parametric Estimator for RAUCPC
Barnhart[raucpc] proposed both parametric and non-parametric approaches for estimating RAUCPC. The parametric estimator is computed from the estimated density function of the distance, with the measurements assumed to follow a normal distribution. The non-parametric estimator connects the empirical coverage probabilities at the distinct observed distances and computes the area under the connected line by the trapezoid rule. Clearly, neither estimator can be expressed as a sum of independent functions of individual subjects. Therefore, a new unbiased non-parametric estimator is developed below for our unified GEE framework.
The RAUCPC is defined as the scaled integral of the cumulative distribution function of the distance metric from 0 to a maximum acceptable boundary d_max. By integration by parts,

(1/d_max) * integral from 0 to d_max of F(t) dt = F(d_max) - (1/d_max) E[D 1{D <= d_max}] = E[(1 - D/d_max) 1{D <= d_max}].

We note that equation (11) has the form of an expectation of the following new random variable,

W = (1 - D/d_max) 1{D <= d_max} = max(0, 1 - D/d_max),

so the relative area under the coverage probability curve can be expressed as RAUCPC = E[W].
Similarly, for RAUOCPC with multiple raters, the same identity holds with D replaced by the MPD. By Lemma 2.1, we can use the moment estimator, the sample mean of the transformed distances, for RAUCPC and RAUOCPC when there are no replications. This form of estimator is easily incorporated into the GEE framework when there are replicates.
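Concretely, integration by parts gives RAUCPC = E[max(0, 1 - D/d_max)], so without replications the estimator is simply a sample mean of transformed distances. A sketch under that identity (names illustrative):

```python
import numpy as np

def raucpc_moment(dist, d_max):
    """Moment estimator of RAUCPC: the sample mean of
    W = max(0, 1 - D/d_max), which is unbiased for
    (1/d_max) * integral of F(t) dt over [0, d_max]."""
    w = np.maximum(0.0, 1.0 - np.asarray(dist, dtype=float) / d_max)
    return float(np.mean(w))

# distances beyond d_max contribute zero, as the indicator requires
est = raucpc_moment([1.0, 2.0, 3.0, 10.0], d_max=4.0)
```

Because the estimator is a plain average of per-subject terms, it slots directly into an estimating-equation framework, unlike the trapezoid-rule estimator.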
The performance of this new non-parametric estimator of RAUCPC is assessed via simulation, and the results are presented in Appendix F. In general, for both normally and non-normally distributed data, the newly proposed non-parametric estimator performs similarly to or better than the previous estimators[raucpc] in terms of bias and mean square error (MSE).
2.2.2 Unified Generalized Estimation Equation Approach
Let Y_ijk be the kth successive replicate measured by the jth rater on the ith subject. It is reasonable to assume that successive replicates measured by the same rater on the same subject are exchangeable, i.e., identically and independently distributed conditional on the ith subject and the jth rater. This assumption implies that the unconditional distribution of the replicates has an exchangeable correlation structure within each subject-rater combination. For simplicity, we assume that the number of replicates equals m for all raters and subjects; the development is easily extended to unbalanced designs. Since the MPD is defined over a collection of J measurements, one from each rater on the same subject, when there are m replications per rater we obtain m^J distinct collections of measurements per subject. On each collection we compute the observed MPD and index it by l. The MPD of the lth collection is expressed as
If J = 2, this distance reduces to the two-rater distance defined above. For a random subject, we thus have a random vector of MPDs whose elements share the same marginal distribution.
To develop a unified form, we first denote the agreement index of interest (OCP, OTDI or RAUOCPC) by eta and link it to the parameter for estimation, theta, in the GEE model through a link function theta = g(eta). For OCP and RAUOCPC, which range from 0 to 1, the logit transformation is used; for OTDI, which is positive, the natural log transformation is used. After transformation, the parameter of interest theta can range over the whole real line. Under the standard GEE framework[liang1986longitudinal], we need to find a function h whose expectation equals eta. Letting 1{.} denote the indicator function, we choose the following h corresponding to the different agreement parameters of interest
Now we construct the generalized estimating equations based on h as follows,

with working correlation matrix R(alpha), where V denotes the within-subject covariance matrix. For OCP and RAUOCPC, the required derivative term has a closed form that does not depend on any nuisance parameter.
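Under a logit link and working independence, the GEE estimator of OCP reduces to the mean of subject-level averages of MPD indicators, with a robust sandwich standard error carried to the logit scale by the delta method. The following is a simplified sketch of that computation, not the authors' implementation; the function name and array layout are assumptions.

```python
import numpy as np
from itertools import product

def ocp_gee(Y, d):
    """OCP estimate and robust (sandwich) SE on the logit scale under
    working independence. Y has shape (n subjects, J raters,
    m replicates); each of the m**J collections takes one replicate
    per rater, and the MPD is the range of the collection."""
    n, J, m = Y.shape
    u = np.empty(n)
    for i in range(n):
        vals = []
        for idx in product(range(m), repeat=J):
            meas = [Y[i, j, idx[j]] for j in range(J)]
            vals.append(max(meas) - min(meas))  # MPD of this collection
        u[i] = np.mean(np.asarray(vals) <= d)   # subject-level average
    p = u.mean()                                # OCP estimate
    se_p = np.sqrt(np.sum((u - p) ** 2)) / n    # robust sandwich SE
    theta = np.log(p / (1 - p))                 # logit link
    se_theta = se_p / (p * (1 - p))             # delta method
    return p, theta, se_theta

Y = np.array([[[120.0], [122.0]],
              [[130.0], [140.0]]])  # 2 subjects, 2 raters, 1 replicate
p, theta, se_theta = ocp_gee(Y, d=5.0)
```

Clustering the m**J indicators by subject is what the within-subject correlation structure accounts for; the sandwich form keeps the standard error valid regardless of that structure.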
For OTDI, we note that h is not differentiable with respect to theta, so a different definition of the derivative term in equation (19) is needed. Rather than differentiating h, we differentiate its expectation, which can be written in terms of the marginal cumulative distribution function of the MPD; the resulting derivative involves the marginal density evaluated at the OTDI. The marginal density is a nuisance parameter and must be estimated in order to solve equation (19). A smoothed kernel density estimate[duong2019package] can be used, implemented with the R function kde in R 3.1.1. With this new derivative term in (19), we show in Appendix B that the limiting distribution of the estimator still follows the general GEE results.
Therefore, across the three kinds of agreement indices, the estimating function can be expressed in a common form, where the derivative term is a constant for OCP and RAUOCPC and involves the estimated marginal density for OTDI.
Now we consider the specification of the working correlation matrix with nuisance parameter alpha. The optimal asymptotic efficiency is achieved when the working correlation matrix coincides with the true correlation matrix[wang2005effects]. A misspecified working correlation matrix can greatly compromise efficiency, especially when the sample size is small and the design is unbalanced[wang2003working]. Therefore, a working correlation matrix resembling the truth is generally desirable. For our model, however, the independent working correlation matrix attains the same efficiency as the true correlation matrix (see the theorem below). This may seem surprising, but it follows from the unique structure of the true correlation matrix for agreement data shown in the lemma below. We therefore use the independent working correlation matrix in equation (19), without needing to estimate the nuisance parameter in the working correlation matrix.
Given that the replicates are i.i.d., the sums of the elements in each row of the true correlation matrix of the within-subject MPD vector are equal.
The GEE estimator obtained from equation (19) achieves the same asymptotic statistical efficiency under either the true correlation matrix or the independent working correlation matrix. The limiting distribution of the estimator is
where is the row sum of , and .
The proof is in Appendix D. The variance form in equation (21) is a simplification of the robust sandwich estimator obtained by applying equation (20) and Lemma 2.2. The estimation in equation (19) and the inference in (21) can be carried out with standard statistical software that implements GEE. Moreover, the results are easily extended to unbalanced designs in which the number of replicates differs across raters.
Our main interest is to determine whether the considered raters can be used interchangeably. This can be determined by performing a hypothesis test on one of the three indices depending on the nature of the question. The hypothesis can be formed as one of the following,
If the null hypothesis is rejected, we can claim that the considered raters are interchangeable. Note that we use a one-sided test rather than a two-sided test, since the primary interest is in determining whether the raters can be used interchangeably. For example, suppose the satisfactory OCP is 0.9 for an acceptable difference d_0, meaning we claim the raters are interchangeable if more than 90% of the MPDs fall within d_0. One would not want to frame the hypothesis as a two-sided test of OCP = 0.9, because we would then reject the null hypothesis for either a low or a high OCP, and it would not make sense to claim interchangeability when the OCP is well below 0.9. Therefore, hypotheses like (22), (23) or (24) make more sense for agreement studies.
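The one-sided decision rule for OCP can be sketched as follows: on the logit scale, reject H0: OCP <= p0 when the standardized statistic exceeds the upper 5% normal quantile, or equivalently when the one-sided 95% lower confidence bound exceeds p0. The function and the input estimates below are illustrative.

```python
import math

Z_95 = 1.6449  # upper 5% quantile of the standard normal

def claim_interchangeable(p_hat, se_logit, p0, z=Z_95):
    """One-sided test of H0: OCP <= p0 vs H1: OCP > p0 on the logit
    scale; rejecting H0 supports interchangeability."""
    theta_hat = math.log(p_hat / (1 - p_hat))
    theta0 = math.log(p0 / (1 - p0))
    reject = (theta_hat - theta0) / se_logit > z
    # equivalent view: back-transformed one-sided 95% lower bound
    lower = 1.0 / (1.0 + math.exp(-(theta_hat - z * se_logit)))
    return reject, lower

# hypothetical estimates: OCP-hat = 0.95 with logit-scale SE 0.10
reject, lower = claim_interchangeable(p_hat=0.95, se_logit=0.10, p0=0.90)
```

Rejecting H0 here is equivalent to the lower bound exceeding p0, which matches the confidence-interval reasoning used in the data example.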
We assess the performance of the newly proposed overall indices using simulated data from both normal and log-normal distributions. Suppose each subject is measured by J = 3 raters. Let Y_ijk be the kth replicate measured by rater j on subject i and Y_i be the vector of measurements on subject i. Without loss of generality, we simulate data so that the mean and covariance matrix of Y_i have the following forms
where the diagonal terms are the intra-rater variances, the off-diagonal terms represent the correlations between raters, and a systematic shift may be present between raters. Log-normal data are generated by exponentiating a random vector from a multivariate normal distribution specified in Appendix E so that the resulting data have the above mean and covariance matrix.
The agreement indices and the correlation matrix from equation (15) cannot be expressed algebraically as functions of the mean and covariance parameters of the specified normal and log-normal distributions. Therefore, numerical approximation is used to obtain the true values of the agreement indices and of the correlation matrix given the true parameters. Specifically, one simulated data set with a very large sample size of 100,000 is generated to represent the true population. The true agreement indices are obtained by applying the corresponding GEE estimators to this large data set, and the true correlation matrix is obtained as the observed sample correlation matrix.
We set the acceptable difference in OCP, the acceptable probability in OTDI, and the maximum acceptable difference in RAUOCPC separately for the normal and log-normal data. To set the parameters of the normal and log-normal distributions, we vary the intra- and inter-rater correlations and the systematic shift between raters to achieve the following four agreement scenarios; the resulting true values of OCP, OTDI and RAUOCPC are shown in Tables 2-4 for each scenario.
High agreement: no systematic shift in means and high inter-rater correlations
Moderate agreement: no systematic shift in means and low inter-rater correlations
Mild agreement: systematic shift in means and high inter-rater correlations
Low agreement: systematic shift in means and low inter-rater correlations
For all four parameter scenarios, we set the intra-rater variabilities to differ across raters to represent some heterogeneity. Designs without replicates and with 3 replicates are simulated, and we consider sample sizes of 20, 30, 100, and 500. Together with the four parameter settings, this results in a total of 48 simulation scenarios. For each scenario, a total of 10,000 simulated data sets are generated.
The performance of the proposed GEE approach is evaluated by reporting the bias of the estimated agreement indices, the mean square error (MSE), the standard deviation (SD) of the 10,000 estimates, and the coverage rate (CR), defined as the percentage of estimated one-sided 95% confidence intervals covering the true value. Moreover, the standard errors of the estimators are calculated using both the true correlation matrix and the independent working correlation matrix to confirm the theoretical result in Theorem 2.3. The results are shown in Tables 2, 3 and 4 for OCP, OTDI and RAUOCPC, respectively.
In general, the simulation results show that the bias is negligible for both normal and non-normal data even when the sample size is as small as 20, since the proposed approach is unbiased and does not rely on normality or homogeneity assumptions. The results for data with replicates outperform those for data without replicates in terms of bias and MSE. When the sample size is small, the CR is closer to the nominal 95% level for data with replicates than for data without. Moreover, for all three indices and all sample sizes, the absolute difference between the robust sandwich standard error using the independent working correlation matrix and that using the true correlation matrix is at most 0.001, which confirms Theorem 2.3.
For OCP, as shown in Table 2, the true OCP varies from 50% to over 90% across the different combinations of correlation and mean values. For data sets without replicates, some OCP estimates are unidentifiable when the true OCP exceeds 80% and the sample size is under 30, because all MPDs fall below the pre-determined satisfactory boundary. A reasonable CR of around 94% can be achieved for such data sets at sample sizes of 100 or larger, while for data sets with replicates all OCP estimates are well defined and a CR of 94.3% is achieved at a sample size of 20. As shown in Table 3, the true OTDI varies from 1.3825 to 4.0512 across the different combinations of correlation and mean values. When the sample size is small, the CR exceeds 96% for data sets without replicates, which may be due to the inaccuracy of kernel density estimation with limited samples, while a reasonable CR of around 92% to 96% is achieved for data sets with replicates across all sample sizes. For RAUOCPC, the maximum acceptable differences were set separately for the normal and log-normal data. As shown in Table 4, the true RAUOCPC varies from 0.3101 to 0.7719 across the different combinations of correlation and mean values. The RAUOCPC does not encounter the small-sample problems seen for OCP and OTDI, and the CR is between 94% and 96% for all parameter scenarios.
Standard deviation of 10,000 estimated OCPs which should be close to the true standard error of the estimator
Mean estimated standard error of estimators from GEE with independent correlation matrix
Mean estimated standard error of estimators from GEE with true correlation matrix (This value is left blank when there is no replicates because it is the same as )
Number of simulations with estimated OCP equal to 100%, which leads to an undefined value in the logit function; the reported results are based on the simulations without this issue
Standard deviation of 10,000 estimated OTDIs which should be close to the true standard error of the estimator
Mean estimated standard error of estimators from GEE with independent correlation matrix
Mean estimated standard error of estimators from GEE with true correlation matrix (This value is left blank when there is no replicates because it is the same as )
Standard deviation of 10,000 estimated RAUOCPCs, which is expected to approximate the true standard error of the estimator for a very large number of simulations
Mean estimated standard error of estimators from GEE with independent correlation matrix
Mean estimated standard error of estimators from GEE with true correlation matrix (This value is left blank when there is no replicates due to no difference from )
4 BP Example
The proposed indices and inference approach are illustrated with the systolic blood pressure data from Bland and Altman's paper [bland1986]. In this data example, the blood pressures of 85 patients were measured by three raters: two human observers, J and R, and one device, S. Each rater measured every patient three times in succession, and these measurements can be treated as replicates. We assess the overall agreement among the three raters, along with the intra-rater agreement within each rater and the pairwise inter-rater agreement, using OCP, OTDI and RAUOCPC, with estimation and inference conducted by the proposed unified GEE approach.
The descriptive statistics of the BP data are listed in Table 5. We summarize the data by the mean and standard deviation within each rater, and assess the normality of the replicates from the same rater and of the pairwise differences between any two raters by the Doornik-Hansen test, where a p-value less than 0.05 indicates a significant departure from a multivariate normal distribution. As shown in Table 5, the device S tends to give higher BP measurements, with an average of 143.04 mmHg, than the other two raters, whose averages are around 127 mmHg. Moreover, based on the standard deviations, rater S has larger within-rater variability than raters J and R, which implies heterogeneity among the raters. Furthermore, the p-values of the Doornik-Hansen test for the measurements from each rater and for the differences between raters are all less than 0.05, indicating that the normality assumptions required for the estimation and inference approaches of the unscaled indices proposed by Lin et al.[lin2007unified] and Jang et al.[jang2018overall] do not hold, so their approaches are unsuitable for the BP data set.
To assess the overall agreement among the three raters, the new OCP, OTDI and RAUOCPC are applied to the BP data set. The satisfactory agreement criteria are set based on the British Hypertension Society protocol (BHSP) for the evaluation of blood pressure measuring devices [o1993british] shown in Table 1. For OCP, we set the pre-determined clinically meaningful acceptable difference to 15 mmHg; based on the criteria for a grade C device, the corresponding satisfactory OCP should be 0.85 or higher. For OTDI, the pre-determined acceptable probability is set to 0.85, and the satisfactory OTDI for a grade C device is 15 mmHg. For RAUOCPC, we let the maximum acceptable difference be 20 mmHg; the satisfactory RAUOCPC is 0.59, computed from the overall coverage probability curve that connects the points formed by the absolute differences of 0, 5, 10, 15, and 20 mmHg with the corresponding coverage probabilities for a grade C BP device.
The estimated overall coverage probability curve is shown in Figure 2. The estimated OCP is 0.41 with a one-sided 95% CI of (0.35, 1) for the three raters. Since the CI contains the satisfactory OCP, we cannot reject the null hypothesis, and thus there is insufficient evidence to claim that the three raters can be used interchangeably. We reach the same conclusion with OTDI and RAUOCPC: the estimated OTDI is 30 with a one-sided 95% CI of (0, 34.5), which contains the satisfactory OTDI of 15 mmHg, and the estimated RAUOCPC is 0.258 with a one-sided 95% CI of (0.25, 1), which contains the satisfactory RAUOCPC of 0.59. Therefore, based on the proposed overall agreement indices, the three raters may not be used interchangeably, in the sense that we are not confident that the measurements taken by the three raters on the same patient are clinically similar.
To understand the source of disagreement and provide actionable results that guide readers to improve quality, we examine the pairwise inter-rater and intra-rater agreement between and within the three raters. The results listed in Table 5 show that both the intra-rater agreement of the human observers J and R and the inter-rater agreement between them are satisfactory. This implies that the two human observers can be used interchangeably, and that measurements from different observers, or different replicates from the same observer, are unlikely to be clinically different. However, the agreement between the human observers and the device S is less satisfactory, with inter-rater OCPs (one-sided 95% CIs) of 0.51 (0.45, 1) and 0.51 (0.45, 1). Moreover, the repeatability of device S itself is only moderate, with an estimated intra-rater OCP of 0.84 (0.78, 1) and OTDI of 15 (0, 17.32). These results indicate that not only is the device S not in satisfactory agreement with the other raters, but its own replicates also tend to have larger variability.
| Rater | Mean | SD | P-Value for Normality Test |
| | | OCP | | OTDI | | RAUOCPC | |
| --- | --- | --- | --- | --- | --- | --- | --- |
| | | Estimate | 95% CI | Estimate | 95% CI | Estimate | 95% CI |
| Overall | | 0.41 | (0.35, 1) | 30 | (0, 34.46) | 0.26 | (0.25, 1) |
| Inter | J&R | 0.94 | (0.91, 1) | 10 | (0, 10.89) | 0.76 | (0.74, 1) |
| | J&S | 0.51 | (0.45, 1) | 28 | (0, 32.47) | 0.34 | (0.33, 1) |
| | R&S | 0.51 | (0.45, 1) | 28 | (0, 32.31) | 0.35 | (0.34, 1) |
| Intra | J | 0.91 | (0.87, 1) | 12 | (0, 13.48) | 0.67 | (0.65, 1) |
| | R | 0.92 | (0.88, 1) | 13 | (0, 14.21) | 0.66 | (0.65, 1) |
| | S | 0.84 | (0.78, 1) | 15 | (0, 17.32) | 0.60 | (0.59, 1) |
We have proposed a set of new indices (OCP, OTDI and RAUOCPC) for assessing overall agreement among multiple raters. As extensions of the pairwise unscaled indices, the proposed indices are defined through a new distance metric, the maximum pairwise difference among the raters. This metric allows the overall indices to preserve the intuitive interpretation of the pairwise versions and to employ clinical information directly when setting satisfactory criteria. For example, we can carry over a clinically meaningful difference from the grading system for blood pressure devices as the pre-determined boundary for OCP, since both quantify the acceptable difference between two BP measurements. The OCP can then be interpreted as the probability that there is no clinically meaningful difference among the measurements from all raters on the same subject.
The newly proposed inference approach does not require distributional or homogeneity assumptions and can therefore be applied to many kinds of continuous measurements. As discussed in Section 4, the BP data set [bland1986] is neither homogeneous nor normally distributed, and these are assumptions underlying the previously proposed inference approaches. Moreover, the unified GEE approach accommodates data with replicates and can easily be modified to carry out estimation and inference on pairwise, inter-rater and intra-rater agreement, as we did in the BP example. A design with replicates is preferable because it provides information on the repeatability of the raters. When agreement is unsatisfactory, intra-rater variability is a crucial source of disagreement, and such information can guide future improvement of the tested raters. Besides providing additional information, adding replicates also improves the performance of the estimators in terms of bias and CR, as shown in our simulation studies (Tables 2, 3 and 4). In practice, it tends to be easier and less costly to add replicates than to enroll more subjects.
All proposed estimation and inference approaches can easily be applied with standard software, and we also provide an R package for implementation. Based on the simulation results, the proposed approaches have limitations when the sample size is small and no replicates are available. For such scenarios, parametric approaches can be an alternative after carefully verifying their assumptions. Moreover, it is of future interest to design agreement studies based on the newly proposed indices, especially designs with replicates. As discussed above, adding replicates provides information on intra-rater agreement and improves the performance of the estimators.
Appendix A Proof of Lemma 2.1
Let W = (1 - D/d_max) 1{D <= d_max}, and let the cumulative distribution functions of D and W be F and H, with density functions f and h, respectively. Then, since F is continuous at the point d_max, it follows that E[W] = (1/d_max) * integral from 0 to d_max of F(t) dt, and therefore RAUCPC can be expressed in terms of E[W].
Appendix B Proof of asymptotic distribution in estimating OTDI
In equation (19) for estimating OTDI, we propose to use . Let where with working correlation matrix . Then the left hand side of (19) is where . Let where . Then
Under mild regularity conditions, by the uniform strong law of large numbers[jung1996quasi], we have
Then we can write as
Suppose is the solution of such that . With from (47), then