1 Introduction
Agreement indices are developed to assess the agreement among different raters, or among different measurements made by one rater, on the same subjects. Barnhart et al. categorized existing approaches for the evaluation of agreement into descriptive tools, scaled summary indices attaining values between -1 and 1, and unscaled summary indices[barnhart2007overview]. Barnhart et al. further argued that unscaled indices, such as coverage probability (CP) and total deviation index (TDI), are preferred for assessing agreement in a core lab setting because of the following advantages: (1) they are simple to implement; (2) they can be interpreted intuitively in terms of the original measurement unit; (3) the CP can provide actionable results that guide readers to identify the source of disagreement and improve their measurements[barnhart2016choice]. CP quantifies the probability that the difference between two measurements made by two given raters on the same subject is less than a prespecified acceptable boundary. TDI, as a counterpart of CP, is the boundary within which the difference between two measurements falls with a prespecified probability. In practice, it may be desirable to set more than one acceptable boundary, up to a maximum acceptable difference, each with a corresponding acceptable CP. An example is the British Hypertension Society protocol (BHSP) for the evaluation of blood pressure measuring devices [o1993british], shown in Table 1. This protocol grades blood pressure measuring devices by specifying the satisfactory CPs for multiple prespecified differences (Table 1). Therefore, it is useful to summarize agreement with a coverage probability curve, defined as the coverage probability plotted over a range of differences. The relative area under the coverage probability curve (RAUCPC) was introduced as a summary index of agreement based on this curve[raucpc].

Table 1: BHSP grading criteria. Prespecified coverage probability (%) by prespecified difference (mmHg)
Grade  5  10  15  20*
A  60  85  95  100
B  50  75  90  95
C  40  65  85  90
D  Fail to achieve C
* 20 mmHg is not included in the original protocol but is added here as the maximum acceptable difference.
CP, TDI and RAUCPC were all originally defined for two raters. However, several competing new raters may be developed at the same time and need to be compared with each other or with an existing rater. Here we are interested in the interchangeability among more than two raters. For example, in the data set published by Bland and Altman [bland1986], the blood pressures of 85 patients were measured by two human observers and one device. We are interested in whether these three raters (two human observers and one device) can be used interchangeably. Therefore, it is desirable to extend the CP, TDI and RAUCPC to measure agreement among multiple raters while preserving the intuitive interpretation of the pairwise versions. Lin et al.[lin2007unified] first extended the concepts of CP and TDI to multiple raters using a two-way mixed model, and later Jang et al.[jang2018overall] proposed a new set of definitions based on the root mean square of pairwise differences (RMSPD). Although these overall agreement indices give some insight into the closeness of the tested raters, they are limited by assumptions that may not hold in practice. First, the overall unscaled indices proposed by Lin et al. are approximate measures that perform well only when the following assumptions hold[lin2007unified]: (1) the relative bias squared is small; (2) the measurements follow normal distributions; (3) there is homogeneity across different raters. For the overall indices proposed by Jang et al.[jang2018overall], although the assumptions are relaxed, the difference between two measurements on the same subject is still required to be normally distributed. We will demonstrate in Section 4 that these assumptions do not hold for the blood pressure measurements[bland1986].

In addition to the distributional assumption, the overall indices proposed by Jang et al.[jang2018overall] are difficult to interpret intuitively in practice. For example, for the overall CP (OCP) proposed by Jang et al., the satisfactory boundary is specified in terms of the root mean square of all pairwise differences among the raters. While an acceptable difference between two measurements can be chosen based on clinical implications, it is not easy to choose an acceptable root mean square of differences, because its magnitude is difficult to interpret in terms of clinical judgement. Moreover, satisfactory agreement according to this OCP cannot guarantee that the raters are interchangeable. For example, if one rater departs substantially from the remaining raters and is thus not interchangeable with them, averaging the squared pairwise differences can still yield an acceptable RMSPD and lead to the claim that all raters are interchangeable. Similar problems can arise for their overall TDI and RAUCPC as well. Last but not least, the method of Jang et al.[jang2018overall] cannot be applied to measurements with replications and has to be restricted to one measurement per rater on each subject.
To address the aforementioned issues, we propose new sets of overall CP, TDI and RAUCPC based on the maximum pairwise difference (MPD) to assess overall agreement among all considered raters. The new indices have an intuitive interpretation in terms of the original measurement unit and can be estimated by a unified nonparametric, distribution-free paradigm based on generalized estimating equations (GEE). The GEE approach can assess inter- and intra-rater agreement simultaneously without normality and homogeneity assumptions. Moreover, under a minimal set of assumptions, we show that the estimator achieves the semiparametric efficiency bound using the working independence covariance matrix.

The paper is organized as follows. In Section 2, we first propose a new estimator of the pairwise RAUCPC. We then introduce the new definitions of overall CP, TDI and RAUCPC; with the new estimator of RAUCPC, we are able to develop a unified GEE approach for estimation and inference for all overall indices. We present simulation results assessing the performance of the unified approach in Section 3 and illustrate the method with the blood pressure data from Bland and Altman[bland1986] in Section 4. Finally, we draw conclusions and provide some discussion in Section 5.
2 Methods
We are interested in whether raters can be used interchangeably for making the same type of measurement in a given population. For a subject randomly sampled from the population, we denote by $Y_j$ the measurement taken by rater $j$, $j = 1, \dots, J$. The interchangeability among the raters can be based on a distance metric, $D$, which reflects the closeness among the measurements given by the raters. For $J = 2$, the distance metric is defined as
$$D = |Y_1 - Y_2|. \tag{1}$$
Intuitively, a smaller distance implies better agreement between two raters. To use the two raters interchangeably, one would like a high probability that this distance is within an acceptable difference. Therefore, the concept of coverage probability is used; it is defined as the probability that the distance falls within a boundary $d$,
$$\mathrm{CP}(d) = \Pr(D \le d) = F_D(d), \tag{2}$$
where $F_D$ is the cumulative distribution function of $D$ over the target population. The higher the CP(d), the better the agreement. To use CP to claim satisfactory agreement, we need to predefine a clinically acceptable boundary $d_0$ and the corresponding satisfactory probability $\pi_0$. If $\mathrm{CP}(d_0)$ is greater than or equal to $\pi_0$, we can claim that the two raters are interchangeable. The TDI, as a counterpart of CP, is defined as the boundary within which a proportion $\pi$ of the distances falls,
$$\mathrm{TDI}(\pi) = F_D^{-1}(\pi). \tag{3}$$
The smaller the TDI, the better the agreement. To use TDI to claim satisfactory agreement, we need to predefine a clinically acceptable probability $\pi_0$ and a satisfactory boundary $d_0$. If $\mathrm{TDI}(\pi_0)$ is less than or equal to $d_0$, then we claim that the two raters are interchangeable.
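As an illustration of the two pairwise definitions, the following minimal sketch computes the empirical CP and TDI from paired readings. The function names, the quantile convention (the smallest observed distance whose empirical distribution function reaches the target proportion) and the readings themselves are assumptions made for this example, not part of the method described here.

```python
import math

def coverage_probability(y1, y2, d):
    """Empirical CP(d): share of subjects whose absolute difference is <= d."""
    diffs = [abs(a - b) for a, b in zip(y1, y2)]
    return sum(x <= d for x in diffs) / len(diffs)

def total_deviation_index(y1, y2, pi):
    """Empirical TDI(pi): smallest observed distance with empirical CDF >= pi."""
    diffs = sorted(abs(a - b) for a, b in zip(y1, y2))
    # Small tolerance guards against floating-point error in pi * n.
    rank = max(1, math.ceil(pi * len(diffs) - 1e-12))
    return diffs[rank - 1]

# Hypothetical readings (mmHg) from two raters on ten subjects.
rater1 = [120, 135, 110, 142, 128, 117, 150, 133, 125, 139]
rater2 = [122, 131, 113, 150, 127, 119, 148, 140, 126, 136]

cp_5 = coverage_probability(rater1, rater2, 5)      # share of |differences| <= 5
tdi_90 = total_deviation_index(rater1, rater2, 0.9)  # boundary covering 90%
print(cp_5, tdi_90)
```

With these made-up readings, eight of the ten absolute differences are at most 5, and nine of the ten are at most 7, so the two indices answer the same question from opposite directions.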
In practice, it is sometimes desirable to control the quality at more than one prespecified difference, or we may simply want to summarize the agreement based on differences up to a maximum acceptable difference. This leads us to consider the coverage probability curve $\mathrm{CP}(d)$. The relative area under the CP curve (RAUCPC) was proposed [barnhart2016choice] as a summary index of the coverage probability curve: it is the area under the curve up to a maximum acceptable difference $d_{\max}$, scaled by $d_{\max}$. The $d_{\max}$ is often chosen as . Specifically, RAUCPC is defined as
$$\mathrm{RAUCPC}(d_{\max}) = \frac{1}{d_{\max}} \int_0^{d_{\max}} \mathrm{CP}(d)\, \mathrm{d}d. \tag{4}$$
RAUCPC ranges from 0 to 1, and a greater value indicates better agreement. To use RAUCPC to claim satisfactory agreement, we need to prespecify an acceptable RAUCPC, $\mathrm{RAUCPC}_0$. If RAUCPC is greater than or equal to $\mathrm{RAUCPC}_0$, then we claim that the two raters are interchangeable. It is not obvious how to choose $\mathrm{RAUCPC}_0$, and we use the British Hypertension Society protocol to illustrate one way of choosing it. As shown in Table 1, we can set $d_{\max} = 20$. For a grade C device, satisfactory coverage probabilities are specified for the absolute differences of 5, 10, 15, and 20. Linearly connecting these points, starting from $\mathrm{CP}(0) = 0$, yields a satisfactory coverage probability curve for a grade C device. Curves corresponding to Grades A, B and C are shown in Figure 1. The shaded area for Grade C is equal to 11.8 and the RAUCPC is 0.59. Thus, one can use $\mathrm{RAUCPC}_0 = 0.59$ as the criterion for a satisfactory Grade C device. A similar $\mathrm{RAUCPC}_0$ can be computed for claiming a Grade A or Grade B device.
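The Grade C figures quoted above can be reproduced with a few lines of arithmetic. The sketch below applies the trapezoidal rule to the piecewise-linear curve through the Table 1 points; the assumption that the curve starts at the origin is ours, and it is the choice that recovers the stated area of about 11.8. The function name is hypothetical.

```python
def raucpc_from_points(differences, coverage, d_max):
    """Scaled trapezoidal area under the piecewise-linear CP curve on [0, d_max].

    The curve is assumed to start at (0, 0) and to end at d_max.
    """
    assert differences[-1] == d_max, "last prespecified difference must equal d_max"
    xs = [0.0] + list(differences)
    ys = [0.0] + list(coverage)
    area = sum((x1 - x0) * (y0 + y1) / 2.0
               for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:]))
    return area / d_max

# Satisfactory coverage probabilities from Table 1, as proportions.
differences = [5, 10, 15, 20]
grades = {
    "A": [0.60, 0.85, 0.95, 1.00],
    "B": [0.50, 0.75, 0.90, 0.95],
    "C": [0.40, 0.65, 0.85, 0.90],
}
thresholds = {g: raucpc_from_points(differences, cp, 20) for g, cp in grades.items()}
for grade, value in thresholds.items():
    print(grade, round(value, 4))
```

For Grade C the area is 11.75, which gives 0.5875 after scaling by 20, consistent with the rounded values 11.8 and 0.59 in the text.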
Jang et al.[jang2018overall] extended the pairwise CP, TDI and RAUCPC to multiple raters by defining the distance metric as
$$D_{\mathrm{RMSPD}} = \sqrt{\frac{2}{J(J-1)} \sum_{1 \le j < j' \le J} (Y_j - Y_{j'})^2}. \tag{5}$$
Their overall CP based on the RMSPD was defined as
$$\Pr(D_{\mathrm{RMSPD}} \le d).$$
However, this definition may indicate satisfactory overall agreement and fail to detect outlying raters because of the averaging. For example, suppose there are four competing raters, where the first three raters give identical measurements and the fourth rater gives measurements greater than those of the other three raters by 5 on all subjects. Suppose the predefined clinically meaningful boundary is . Then the RMSPD is within this boundary for all subjects, implying an overall CP of 1, and thus one would claim satisfactory agreement, even though the fourth rater gave clinically different outcomes from the other three raters on every subject. Thus there is a need for a better distance metric for assessing agreement among multiple raters.
2.1 Proposed Overall Agreement Indices
For two raters, the aforementioned agreement indices define interchangeability in the sense that, whichever rater measures the subject, the results are clinically similar. We want to extend this notion of interchangeability to the situation with more than two raters. As an example, a patient walks into a local clinic and has their blood pressure measured by one of several nurses there. The patient would expect the blood pressure measurement to be similar no matter which nurse takes it, in the sense that the nurses are interchangeable. Therefore, the largest difference between any two nurses should be clinically negligible. Following this idea, we define a new distance metric, the maximum pairwise difference (MPD), among $J$ raters as
$$\mathrm{MPD} = \max_{1 \le j < j' \le J} |Y_j - Y_{j'}|. \tag{6}$$
The maximum pairwise difference avoids the pitfalls of averaging all pairwise differences and reduces to the pairwise difference when $J = 2$. The new overall coverage probability (OCP) and overall total deviation index (OTDI) based on MPD are defined as
$$\mathrm{OCP}(d) = \Pr(\mathrm{MPD} \le d) = F_{\mathrm{MPD}}(d), \qquad \mathrm{OTDI}(\pi) = F_{\mathrm{MPD}}^{-1}(\pi), \tag{7}$$
where $F_{\mathrm{MPD}}$ is the cumulative distribution function of MPD. The OCP measures the probability that the maximum pairwise difference among the raters is within a given clinically meaningful acceptable boundary. To use OCP to claim satisfactory agreement, we need to predefine a clinically acceptable boundary $d_0$ and the corresponding satisfactory probability $\pi_0$. Since the MPD is in the same unit as the pairwise distance, $d_0$ can be chosen in the same way as for the pairwise CP. If $\mathrm{OCP}(d_0) \ge \pi_0$, then the chance that the measurements given by the $J$ raters on the same subject are within distance $d_0$ of each other is at least $\pi_0$. As a counterpart of OCP, OTDI captures the boundary such that, for $100\pi\%$ of subjects, all possible pairwise differences among the raters fall into $(-\mathrm{OTDI}(\pi), \mathrm{OTDI}(\pi))$; if $\mathrm{OTDI}(\pi_0)$ is smaller than the preset satisfactory boundary $d_0$, then we can claim that the raters are interchangeable.
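The four-rater scenario used to motivate the MPD can be checked numerically. The sketch below is illustrative only: it assumes the RMSPD is the square root of the mean squared difference over all unordered rater pairs, uses a hypothetical boundary of 4, and uses made-up baseline readings. All function names are ours.

```python
import math
from itertools import combinations

def pairwise_abs_diffs(readings):
    """Absolute differences over all unordered pairs of raters for one subject."""
    return [abs(a - b) for a, b in combinations(readings, 2)]

def mpd(readings):
    """Maximum pairwise difference among the raters."""
    return max(pairwise_abs_diffs(readings))

def rmspd(readings):
    """Root mean square of pairwise differences (assumed: mean over rater pairs)."""
    diffs = pairwise_abs_diffs(readings)
    return math.sqrt(sum(x * x for x in diffs) / len(diffs))

def ocp(subjects, d, distance=mpd):
    """Empirical overall CP: share of subjects whose distance is <= d."""
    return sum(distance(r) <= d for r in subjects) / len(subjects)

def otdi(subjects, pi):
    """Empirical OTDI: smallest observed MPD whose empirical CDF is >= pi."""
    values = sorted(mpd(r) for r in subjects)
    rank = max(1, math.ceil(pi * len(values) - 1e-12))
    return values[rank - 1]

# Three identical raters and a fourth that always reads 5 units higher.
baseline = [118, 124, 131, 137, 142, 150]        # hypothetical readings
subjects = [[y, y, y, y + 5] for y in baseline]
d0 = 4                                           # hypothetical boundary

print(rmspd(subjects[0]))            # about 3.54, below the boundary
print(ocp(subjects, d0, rmspd))      # RMSPD-based overall CP
print(ocp(subjects, d0, mpd))        # MPD-based overall CP
print(otdi(subjects, 0.9))
```

Under these assumptions the RMSPD-based index reports perfect coverage while the MPD-based OCP is zero, because every subject has a pairwise difference of 5 that exceeds the boundary.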
When more than one acceptable boundary is of interest, or when we are interested in an aggregated measure of agreement up to a maximum boundary $d_{\max}$, the relative area under the overall coverage probability curve (RAUOCPC), as an extension of RAUCPC, is defined as
$$\mathrm{RAUOCPC}(d_{\max}) = \frac{1}{d_{\max}} \int_0^{d_{\max}} \mathrm{OCP}(d)\, \mathrm{d}d. \tag{8}$$
Like RAUCPC, RAUOCPC ranges from 0 to 1, and higher values indicate better agreement. Since the MPD preserves the original unit of the measurement, we can use the clinical information about the acceptable boundary from the pairwise version to set multiple boundaries for the MPD. In this way, we can set the satisfactory RAUOCPC in the same way as for the pairwise RAUCPC.
2.2 Estimation and Inference
We propose a unified distribution-free GEE approach for estimation and inference on OCP, OTDI and RAUOCPC. Only minimal assumptions are made: (1) measurements on different subjects are independent; (2) replicated measurements are i.i.d. given the same rater and the same subject. Since the current estimators of RAUCPC proposed by Barnhart[raucpc] cannot be expressed as a sum of functions of individual subjects, they do not fit into the GEE framework. Therefore, in this section, in order to have a unified GEE model for RAUOCPC together with OCP and OTDI, we first propose a new unbiased nonparametric estimator for the pairwise RAUCPC and for RAUOCPC. We then present the unified GEE model and the inference approach.
2.2.1 Unbiased Nonparametric Estimator for RAUCPC
Barnhart[raucpc] proposed both parametric and nonparametric approaches for estimating RAUCPC. The parametric RAUCPC estimator is calculated from the estimated density function of where the measurements are assumed to follow a normal distribution. For the nonparametric estimator, suppose all distinct observations of are with the corresponding estimated values of . Let and ; the estimated RAUCPC is then equal to the area under the line connecting these points, computed by the trapezoidal rule. Neither estimator can be expressed as a sum of independent functions of individual subjects. Therefore, a new unbiased nonparametric estimator is developed below for our unified GEE framework.
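The RAUCPC is defined as the scaled integral of the cumulative distribution function of the distance metric $D$ from 0 to a maximum acceptable boundary $d_{\max}$. By the rule of integration by parts,
$$\mathrm{RAUCPC}(d_{\max}) = \frac{1}{d_{\max}} \int_0^{d_{\max}} F_D(x)\, \mathrm{d}x \tag{9}$$
$$= \frac{1}{d_{\max}} \left\{ \Big[ x F_D(x) \Big]_0^{d_{\max}} - \int_0^{d_{\max}} x\, \mathrm{d}F_D(x) \right\} \tag{10}$$
$$= \int_0^{d_{\max}} \left( 1 - \frac{x}{d_{\max}} \right) \mathrm{d}F_D(x). \tag{11}$$
We note that equation (11) has the form of an expectation of the following new random variable,
$$D^{*} = 1 - \frac{D}{d_{\max}} \quad \text{if } D \le d_{\max}, \tag{12}$$
$$D^{*} = 0 \quad \text{if } D > d_{\max}. \tag{13}$$
Then equation (11) can be expressed as $E(D^{*})$, which is proved in Appendix A. This leads to the following lemma.
Lemma 2.1
The relative area under the coverage probability curve can be expressed as the expectation of $D^{*}$, i.e.,
$$\mathrm{RAUCPC}(d_{\max}) = E(D^{*}). \tag{14}$$
Similarly, for RAUOCPC with multiple raters, $\mathrm{RAUOCPC}(d_{\max}) = E(\mathrm{MPD}^{*})$, where $\mathrm{MPD}^{*}$ is defined as in (12) and (13) with $D$ replaced by MPD. Following Lemma 2.1, we can use the moment estimator $\frac{1}{n} \sum_{i=1}^{n} D_i^{*}$, where $n$ is the number of subjects, for RAUCPC/RAUOCPC when there are no replications. This form of estimator can be easily incorporated into the GEE framework when there are replicates.

The performance of this new nonparametric estimator of RAUCPC is assessed via simulation and the results are presented in Appendix F. In general, for both normally and non-normally distributed data, the newly proposed nonparametric estimator performs similarly to or better than the previous estimators[raucpc] in terms of bias and mean square error (MSE).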
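The identity behind Lemma 2.1 can be verified on a sample: the average of the truncated, rescaled distances equals the scaled area under the empirical distribution function. The sketch below is a check under our own naming and with made-up distances; note that it integrates the step-function empirical CDF exactly, which is not the same as the earlier nonparametric estimator that connects the estimated points linearly.

```python
def raucpc_moment(distances, d_max):
    """Moment estimator: sample mean of max(0, 1 - D / d_max)."""
    return sum(max(0.0, 1.0 - x / d_max) for x in distances) / len(distances)

def raucpc_ecdf_area(distances, d_max):
    """Exact area under the empirical CDF (a step function) on [0, d_max],
    divided by d_max. Distances are assumed to be non-negative."""
    n = len(distances)
    area, height, left = 0.0, 0.0, 0.0
    for x in sorted(v for v in distances if v < d_max):
        area += height * (x - left)   # rectangle up to the next jump
        height += 1.0 / n             # the ECDF jumps by 1/n at each observation
        left = x
    area += height * (d_max - left)   # last rectangle up to d_max
    return area / d_max

# Hypothetical absolute differences for ten subjects.
distances = [2, 4, 3, 8, 1, 2, 2, 7, 1, 3]
d_max = 5

moment = raucpc_moment(distances, d_max)
area = raucpc_ecdf_area(distances, d_max)
print(moment, area)
```

Because the estimator is a plain average of one term per subject, it can be written as a sum of per-subject contributions, which is the property the estimating-equation framework requires.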
2.2.2 Unified Generalized Estimation Equation Approach
Let $Y_{ijk}$ be the $k$th successive replicate measured by the $j$th rater on the $i$th subject. It is reasonable to assume that successive replicates measured by the same rater on the same subject are equivalent, i.e., $Y_{ijk}$, $k = 1, \dots, K_j$, are independent and identically distributed conditional on the $i$th subject and the $j$th rater. This assumption implies that the unconditional distribution of has a distribution with an exchangeable correlation matrix where we denote , , is a matrix with 1 as elements and $K_j$ is the number of replicates for the $j$th rater. For simplicity, we assume that the number of replicates is equal to $K$ for all raters and subjects; the approach can be easily extended to an unbalanced design. Since MPD is defined over a collection of $J$ measurements, one from each rater on the same subject, when each rater has $K$ replications we obtain $K^J$ distinct collections. On each collection, we compute the observed MPD and index it by $l$, $l = 1, \dots, K^J$. The MPD at the $l$th collection, $D_{il}$, is expressed as
(15) 
If $J = 2$, then this distance reduces to the distance for two raters mentioned above. For a random subject , we have a random vector where has the same marginal distribution .

To develop a unified form, we first denote the agreement index of interest, OCP, OTDI or RAUOCPC, as and link it to the parameter for estimation, , in the GEE model through the link function . For OCP and RAUOCPC, which range from 0 to 1, the logit transformation is used. For OTDI, since it is greater than 0, the natural log transformation is used. After transformation, the parameter of interest, , in the GEE model can range from $-\infty$ to $\infty$. Under the standard GEE framework[liang1986longitudinal], we need to find a function such that . Let $I(\cdot)$ be the indicator function; we choose the following corresponding to the different agreement parameters of interest
(16)  
(17)  
(18) 
Now we construct the generalized estimating equations system based on as follow,
(19) 
, , with working correlation matrix and where is the covariance matrix of . For OCP and RAUOCPC, we have which does not depend on .
For OTDI, we note that is not differentiable with regard to and a different definition of in equation (19) is needed. Rather than differentiating , we differentiate its expectation, , where is the marginal cumulative distribution function of . Then, where is the marginal density function of at point . The marginal distribution, , is a nuisance parameter and needs to be estimated in order to solve equation (19). A smoothed kernel density estimate[duong2019package] can be used; it is implemented in the R function kde (we used R 3.1.1). With this new in (19), we show in Appendix B that the limiting distribution of still follows the general results of GEE.
Therefore, across three kinds of agreement indices, can be expressed as
(20) 
where for OCP and RAUOCPC and for OTDI.
Now we consider the specification of the working correlation matrix with nuisance parameter . The optimal asymptotic efficiency of is achieved when coincides with the true correlation matrix of [wang2005effects]. A misspecified working correlation matrix can greatly compromise the efficiency, especially when the sample size is small and the design is unbalanced[wang2003working]. Therefore, a working correlation matrix resembling the truth is generally desirable. However, for our model, using the independent working correlation matrix attains the same efficiency as the true correlation matrix (see the theorem below). This may seem surprising, but it is due to the unique structure of the true correlation matrix for agreement data, as shown in the lemma below. Therefore, we will use the independent working correlation matrix in equation (19) without the need to estimate the nuisance parameter in the working correlation matrix.
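To make the working-independence choice concrete: for an intercept-only model, the estimating equation under an independent working correlation reduces to setting the sum of residuals over all subjects and collections to zero, so the fitted mean is a pooled average. The sketch below illustrates this for OCP with the logit link. The function names and the toy data are assumptions for illustration; a real analysis would use a GEE routine in standard software.

```python
import math
from itertools import combinations, product

def mpd_collections(subject):
    """subject: list of J lists, one list of replicates per rater.
    Returns the MPD of every collection that takes one replicate per rater."""
    return [max(abs(a - b) for a, b in combinations(collection, 2))
            for collection in product(*subject)]

def ocp_gee_independence(data, d):
    """Solve sum_i sum_l (I(D_il <= d) - mu) = 0 for mu; return (mu, logit(mu))."""
    indicators = [float(v <= d)
                  for subject in data for v in mpd_collections(subject)]
    mu = sum(indicators) / len(indicators)
    # The logit is undefined when the pooled proportion is exactly 0 or 1.
    theta = math.log(mu / (1.0 - mu)) if 0.0 < mu < 1.0 else float("nan")
    return mu, theta

# Hypothetical data: 2 subjects, J = 3 raters, K = 2 replicates per rater,
# giving K**J = 8 collections per subject.
data = [
    [[120, 122], [121, 123], [119, 120]],
    [[130, 136], [131, 130], [133, 134]],
]
mu_hat, theta_hat = ocp_gee_independence(data, d=3)
print(mu_hat, theta_hat)
```

The guard on the logit mirrors the situation in which every observed MPD falls within the boundary, so that the estimate on the link scale is not defined.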
Lemma 2.2
Given are i.i.d, the sum of elements in each row of correlation matrix of , , are equal.
The proof of lemma 2.2 is in Appendix C. Based on this lemma, we have the following theorem on efficiency.
Theorem 2.3
The GEE estimator obtained from equation (19) achieves the same asymptotic statistical efficiency under either the true correlation matrix or the independent working correlation matrix. The limiting distribution of is
(21) 
where is the row sum of , and .
The proof is in Appendix D. The variance form in equation (21) is a simplification of the robust sandwich estimator obtained by using equation (20) and Lemma 2.2. The estimates from equation (19) and the inference based on (21) can be obtained with standard statistical software that implements GEE. Moreover, the results can be easily extended to the unbalanced design with where is the number of replicates for the $j$th rater.
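For readers who want to see what the robust sandwich variance amounts to in the simplest case, the sketch below computes the generic cluster-robust standard error for an intercept-only model under working independence, then maps it to the logit scale by the delta method and forms a one-sided lower confidence bound. This is the textbook sandwich form, not the simplified expression of equation (21); the function name and the toy indicator data are hypothetical.

```python
import math
from statistics import NormalDist

def cluster_robust_logit(indicator_clusters, level=0.95):
    """indicator_clusters[i] holds the 0/1 indicators of subject i.
    Returns (mu, se_mu, theta, se_theta, lower bound for mu).
    Assumes the pooled proportion is strictly between 0 and 1."""
    total = sum(len(c) for c in indicator_clusters)
    mu = sum(sum(c) for c in indicator_clusters) / total
    # Sandwich "meat": squared sum of residuals within each subject.
    score_sq = sum(sum(f - mu for f in c) ** 2 for c in indicator_clusters)
    se_mu = math.sqrt(score_sq) / total
    theta = math.log(mu / (1.0 - mu))
    se_theta = se_mu / (mu * (1.0 - mu))        # delta method for the logit
    z = NormalDist().inv_cdf(level)
    lower_theta = theta - z * se_theta          # one-sided lower bound
    lower_mu = 1.0 / (1.0 + math.exp(-lower_theta))
    return mu, se_mu, theta, se_theta, lower_mu

# Hypothetical indicators I(MPD <= d) for 4 subjects with 4 collections each.
clusters = [[1, 1, 1, 1], [1, 1, 0, 1], [1, 0, 0, 1], [1, 1, 1, 0]]
mu, se_mu, theta, se_theta, lower_mu = cluster_robust_logit(clusters)
print(mu, se_mu, lower_mu)
```

Treating each subject as one cluster is what allows the correlated collections within a subject to be pooled without modelling their correlation explicitly.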
Our main interest is to determine whether the considered raters can be used interchangeably. This can be determined by performing a hypothesis test on one of the three indices depending on the nature of the question. The hypothesis can be formed as one of the following,
(22)  
(23)  
(24) 
If the null hypothesis is rejected, we can claim that the considered raters are interchangeable. Note that we use a one-sided test instead of a two-sided test, since the primary interest is to determine whether the raters can be used interchangeably. For example, suppose the satisfactory OCP is 0.9 for an acceptable difference , which means we will claim the raters are interchangeable if more than 90% of the MPDs are within . One would not want to frame the hypothesis as , because we would reject the null hypothesis with either a low or a high OCP, e.g., when or . It would not make sense to claim interchangeability among the raters when by rejecting the null. Therefore, hypotheses like (22), (23) or (24) make more sense for agreement studies.
3 Simulation
We assess the performance of the newly proposed overall indices using simulated data from both normal and lognormal distributions. Suppose each subject is measured by three raters with . Let be the replicates measured by rater on subject and be the vector of measurements of subject . Without loss of generality, we simulated data so that the mean and covariance matrix of have the following forms
(25)  
(26)  
(27) 
where is the intra-rater variance, represents the correlation between rater and and . Lognormal data are generated by taking the exponential transformation of a random vector from a multivariate normal distribution specified in Appendix E so that the resulting has the above form of mean and covariance matrix.
The agreement indices and the correlation matrix of from equation (15) cannot be expressed algebraically as functions of the mean and covariance matrix parameters of the specified normal and lognormal distributions. Therefore, numerical approximation is used to obtain the true values of the agreement indices and the correlation matrix of given the true parameters of the normal and lognormal distributions. Specifically, one simulated data set with a very large sample size of 100,000 is generated to represent the true population. The true agreement indices are obtained by applying the corresponding GEE estimators to this large data set, and the true correlation matrix of is obtained as the sample correlation matrix of the observed .
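The idea of approximating a true index value from one very large simulated population can be sketched as follows. The data-generating model here (a shared normal subject effect plus independent rater errors), the parameter values and the function names are all assumptions chosen for illustration; they do not reproduce the mean and covariance structure used in this simulation study.

```python
import random
from itertools import combinations

def simulate_subject(rng, means, sd_subject, sd_error):
    """One reading per rater: rater mean + shared subject effect + rater error."""
    subject_effect = rng.gauss(0.0, sd_subject)
    return [m + subject_effect + rng.gauss(0.0, s)
            for m, s in zip(means, sd_error)]

def monte_carlo_ocp(d, n_subjects, means, sd_subject, sd_error, seed=1):
    """Approximate OCP(d) by the share of simulated subjects with MPD <= d."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_subjects):
        y = simulate_subject(rng, means, sd_subject, sd_error)
        hits += max(abs(a - b) for a, b in combinations(y, 2)) <= d
    return hits / n_subjects

# Hypothetical settings: three raters, no systematic shift versus a shift of 2.
n = 20000  # kept moderate here; a larger value gives a closer approximation
no_shift = monte_carlo_ocp(3.0, n, [0.0, 0.0, 0.0], 3.0, [1.0, 1.0, 1.0])
shifted = monte_carlo_ocp(3.0, n, [0.0, 0.0, 2.0], 3.0, [1.0, 1.0, 1.0])
tight = monte_carlo_ocp(1.0, n, [0.0, 0.0, 0.0], 3.0, [1.0, 1.0, 1.0])
print(no_shift, shifted, tight)
```

Because the shared subject effect cancels in every pairwise difference, the approximated OCP in this toy model depends only on the rater means and error variances, which is why a systematic shift lowers it.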
We set an acceptable difference of and for normal and lognormal data respectively in OCP, an acceptable probability of in OTDI, and a maximum acceptable difference of and for normal and lognormal data respectively in RAUOCPC. To set the parameters for the normal and lognormal distributions, we vary the intra- and inter-rater correlations and the systematic shift between raters to achieve the following four agreement scenarios; the resulting true values of OCP, OTDI and RAUOCPC are shown in Tables 2-4 for each scenario.

High agreement: no systematic shift in means and high correlation and for

Moderate agreement: no systematic shift in means and low correlation and for

Mild agreement: systematic shift in means and high correlation and for

Low agreement: systematic shift in means and low correlation and for
For all four parameter scenarios, we set the intra-rater variability to be and to represent some heterogeneity across the raters. Designs without replicates and with 3 replicates are simulated, and we consider sample sizes of 20, 50, 100, and 500. Together with the four different parameter settings, this resulted in a total of 48 simulation scenarios. For each scenario, a total of 10,000 simulated data sets are generated.
The performance of the proposed GEE approach is evaluated by reporting the bias of the estimated agreement indices, the mean square error (MSE), the standard deviation (SD) of the 10,000 estimated agreement indices, and the coverage rate (CR), where CR is defined as the percentage of estimated one-sided 95% confidence intervals that cover the true value. Moreover, the standard errors of the estimators are calculated using the true correlation matrix, , and the independent working correlation matrix, , to confirm our theoretical result in Theorem 2.3. The results are shown in Table 2, Table 3 and Table 4 for OCP, OTDI and RAUOCPC respectively.

In general, the simulation results show that the bias is negligible for both normal and non-normal data sets even when the sample size is as small as 20, since the proposed approach is unbiased and does not rely on the normality and homogeneity assumptions. The results for data sets with replicates outperform those for data without replicates in terms of bias and MSE. When the sample size is small, the CR is closer to the nominal 95% for data with replicates than for data without replicates. Moreover, for all three indices and all sample sizes, the absolute difference between the robust sandwich estimator with the independent working correlation matrix and the one with the true correlation matrix is less than or equal to 0.001, which confirms Theorem 2.3.

For OCP, as shown in Table 2, the true OCP varies from 50% to over 90% for different combinations of correlation and mean values. For data sets without replicates, some OCP estimates are unidentifiable when the true OCP exceeds 80% and the sample size is under 30, since all MPDs are smaller than the predetermined satisfactory boundary . A reasonable CR of around 94% can be achieved for such data sets with a sample size of 100 or larger. For the data sets with replicates, all OCP estimates are well defined and a CR of 94.3% can be achieved with a sample size of 20. As shown in Table 3, the true TDI varies from 1.3825 to 4.0512 for different combinations of correlation and mean values. When the sample size is small, the CR is over 96% for data sets without replicates, which may be due to the inaccuracy of kernel density estimation with a limited sample. A reasonable CR of around 92% to 96% can be achieved for the data sets with replicates across all sample sizes. For RAUOCPC, we set and for normal and lognormal data respectively. As shown in Table 4, the true RAUOCPC varies from 0.3101 to 0.7719 for different combinations of correlation and mean values. The RAUOCPC does not encounter the same small-sample problems as the OCP and OTDI. The CR is between 94% and 96% for all parameter scenarios.
Table 2: Simulation results for OCP
nsub  nrep  Corr  Shift  Normal: True OCP  Bias  SD^{a}  SE_ind^{b}  SE_true^{c}  MSE  CR  nmiss^{d}  Lognormal: True OCP  Bias  SD^{a}  SE_ind^{b}  SE_true^{c}  MSE  CR  nmiss^{d}  
20  1  high  No  0.9412  0.0008  0.4976  0.8738  0.0027  100.0%  3000  0.9444  0.0007  0.4949  0.8787  0.0026  100.0%  3284  
20  3  high  No  0.9412  0.0003  0.6777  0.5875  0.5730  0.0010  90.4%  0  0.9444  0.0054  0.7850  0.7597  0.7418  0.0015  95.8%  0 
50  1  high  No  0.9412  0.0003  0.5869  0.6520  0.0011  92.3%  464  0.9444  0.0001  0.5826  0.6673  0.0011  93.8%  576  
50  3  high  No  0.9412  0.0001  0.4071  0.3731  0.3696  0.0004  93.0%  0  0.9444  0.0001  0.5821  0.5071  0.5031  0.0007  93.5%  0 
100  1  high  No  0.9412  0.0000  0.4770  0.4553  0.0006  92.8%  24  0.9444  0.0002  0.4860  0.4695  0.0005  94.9%  39  
100  3  high  No  0.9412  0.0000  0.2724  0.2634  0.2623  0.0002  94.0%  0  0.9444  0.0003  0.3653  0.3489  0.3479  0.0003  95.2%  0 
500  1  high  No  0.9412  0.0001  0.1953  0.1925  0.0001  95.4%  0  0.9444  0.0001  0.1988  0.1977  0.0001  95.1%  0  
500  3  high  No  0.9412  0.0001  0.1169  0.1177  0.1176  0.0000  95.3%  0  0.9444  0.0001  0.1547  0.1534  0.1536  0.0001  95.1%  0 
20  1  low  No  0.8066  0.0013  0.6119  0.6091  0.0080  98.5%  147  0.8970  0.0003  0.6022  0.7696  0.0046  100.0%  1154  
20  3  low  No  0.8066  0.0004  0.3386  0.3279  0.3201  0.0026  93.9%  0  0.8970  0.0003  0.5979  0.5398  0.5267  0.0022  93.3%  0 
50  1  low  No  0.8066  0.0005  0.3855  0.3696  0.0031  95.1%  0  0.8970  0.0001  0.5236  0.5011  0.0019  93.0%  33  
50  3  low  No  0.8066  0.0001  0.2080  0.2065  0.2048  0.0010  94.4%  0  0.8970  0.0001  0.3481  0.3316  0.3286  0.0009  94.6%  0 
100  1  low  No  0.8066  0.0004  0.2601  0.2568  0.0016  93.7%  0  0.8970  0.0002  0.3509  0.3409  0.0009  95.4%  0  
100  3  low  No  0.8066  0.0000  0.1469  0.1459  0.1454  0.0005  94.7%  0  0.8970  0.0002  0.2357  0.2320  0.2311  0.0005  95.2%  0 
500  1  low  No  0.8066  0.0000  0.1147  0.1136  0.0003  95.0%  0  0.8970  0.0000  0.1473  0.1480  0.0002  94.6%  0  
500  3  low  No  0.8066  0.0000  0.0651  0.0651  0.0651  0.0001  95.0%  0  0.8970  0.0000  0.1029  0.1030  0.1030  0.0001  95.0%  0 
20  1  high  Yes  0.6457  0.0015  0.4974  0.4845  0.0112  96.3%  3  0.7898  0.0007  0.5991  0.5904  0.0083  99.0%  96  
20  3  high  Yes  0.6457  0.0005  0.3494  0.3480  0.3395  0.0060  95.0%  0  0.7898  0.0000  0.4748  0.4452  0.4344  0.0051  94.9%  0 
50  1  high  Yes  0.6457  0.0001  0.3040  0.2997  0.0046  95.2%  0  0.7898  0.0000  0.3649  0.3570  0.0033  95.4%  1  
50  3  high  Yes  0.6457  0.0005  0.2171  0.2170  0.2150  0.0024  95.0%  0  0.7898  0.0004  0.2794  0.2744  0.2719  0.0020  95.1%  0 
100  1  high  Yes  0.6457  0.0004  0.2102  0.2104  0.0023  95.3%  0  0.7898  0.0001  0.2509  0.2487  0.0016  94.3%  0  
100  3  high  Yes  0.6457  0.0003  0.1522  0.1527  0.1521  0.0012  95.2%  0  0.7898  0.0005  0.1935  0.1925  0.1917  0.0010  95.1%  0 
500  1  high  Yes  0.6457  0.0000  0.0924  0.0936  0.0004  94.7%  0  0.7898  0.0000  0.1105  0.1100  0.0003  95.7%  0  
500  3  high  Yes  0.6457  0.0000  0.0675  0.0681  0.0681  0.0002  95.3%  0  0.7898  0.0000  0.0852  0.0856  0.0856  0.0002  95.0%  0 
20  1  low  Yes  0.5397  0.0011  0.4706  0.4615  0.0122  95.5%  0  0.6802  0.0012  0.5230  0.5004  0.0108  97.5%  6  
20  3  low  Yes  0.5397  0.0012  0.2922  0.2935  0.2863  0.0050  95.4%  0  0.6802  0.0001  0.3632  0.3532  0.3445  0.0057  94.6%  0 
50  1  low  Yes  0.5397  0.0005  0.2903  0.2868  0.0050  93.6%  0  0.6802  0.0002  0.3094  0.3078  0.0043  95.3%  0  
50  3  low  Yes  0.5397  0.0004  0.1843  0.1841  0.1825  0.0021  95.2%  0  0.6802  0.0004  0.2235  0.2204  0.2183  0.0023  95.1%  0 
100  1  low  Yes  0.5397  0.0003  0.2022  0.2017  0.0025  95.3%  0  0.6802  0.0002  0.2186  0.2161  0.0022  94.6%  0  
100  3  low  Yes  0.5397  0.0003  0.1298  0.1297  0.1292  0.0010  94.7%  0  0.6802  0.0005  0.1545  0.1550  0.1543  0.0011  95.2%  0 
500  1  low  Yes  0.5397  0.0001  0.0899  0.0898  0.0005  94.9%  0  0.6802  0.0001  0.0958  0.0960  0.0004  95.2%  0  
500  3  low  Yes  0.5397  0.0001  0.0573  0.0579  0.0579  0.0002  95.3%  0  0.6802  0.0001  0.0685  0.0691  0.0691  0.0002  95.2%  0 

a  Standard deviation of the 10,000 estimated OCPs, which should be close to the true standard error of the estimator
b  Mean estimated standard error of the estimator from GEE with the independent working correlation matrix
c  Mean estimated standard error of the estimator from GEE with the true correlation matrix (left blank when there are no replicates, because it is then the same as the value in column b)
d  Number of simulations with an estimated OCP equal to 100%, which leads to an undefined value of the logit function; the results are based on the simulations without this issue
Table 3: Simulation results for OTDI
nsub  nrep  Corr  Shift  Normal: True OTDI  Bias  SD^{a}  SE_ind^{b}  SE_true^{c}  MSE  CR  Lognormal: True OTDI  Bias  SD^{a}  SE_ind^{b}  SE_true^{c}  MSE  CR  
20  1  High  No  2.2455  0.0852  0.1380  0.1657  0.0948  91.6%  1.3808  0.0658  0.3113  0.3401  0.1853  87.8%  
20  3  High  No  2.2455  0.0003  0.0896  0.0938  0.0938  0.0405  93.3%  1.3808  0.0690  0.2554  0.2670  0.2671  0.1547  93.2% 
50  1  High  No  2.2455  0.0328  0.0901  0.1004  0.0406  93.0%  1.3808  0.0222  0.2044  0.2134  0.0811  90.7%  
50  3  High  No  2.2455  0.0014  0.0587  0.0595  0.0595  0.0173  93.7%  1.3808  0.0137  0.1673  0.1654  0.1655  0.0562  92.1% 
100  1  High  No  2.2455  0.0155  0.0626  0.0693  0.0197  94.2%  1.3808  0.0126  0.1468  0.1497  0.0411  91.7%  
100  3  High  No  2.2455  0.0016  0.0407  0.0419  0.0419  0.0084  94.6%  1.3808  0.0093  0.1185  0.1173  0.1173  0.0278  93.2% 
500  1  High  No  2.2455  0.0020  0.0283  0.0297  0.0040  95.2%  1.3808  0.0017  0.0651  0.0662  0.0081  93.8%  
500  3  High  No  2.2455  0.0020  0.0182  0.0186  0.0186  0.0017  95.4%  1.3808  0.0017  0.0523  0.0526  0.0526  0.0052  94.6% 
20  1  Low  No  2.9660  0.1058  0.1404  0.1677  0.1702  91.4%  2.0300  0.0935  0.2820  0.3068  0.3296  87.8%  
20  3  Low  No  2.9660  0.0016  0.0813  0.0838  0.0839  0.0581  92.8%  2.0300  0.0272  0.2060  0.2069  0.2072  0.1883  90.9% 
50  1  Low  No  2.9660  0.0430  0.0896  0.1008  0.0703  92.5%  2.0300  0.0390  0.1820  0.1923  0.1374  91.1%  
50  3  Low  No  2.9660  0.0002  0.0512  0.0529  0.0530  0.0230  94.1%  2.0300  0.0119  0.1340  0.1316  0.1317  0.0769  92.3% 
100  1  Low  No  2.9660  0.0201  0.0628  0.0696  0.0346  94.1%  2.0300  0.0198  0.1305  0.1351  0.0703  91.9%  
100  3  Low  No  2.9660  0.0010  0.0366  0.0373  0.0373  0.0118  94.1%  2.0300  0.0090  0.0932  0.0933  0.0934  0.0366  93.7% 
500  1  Low  No  2.9660  0.0024  0.0285  0.0299  0.0071  95.0%  2.0300  0.0024  0.0588  0.0595  0.0142  93.7%  
500  3  Low  No  2.9660  0.0024  0.0163  0.0165  0.0166  0.0024  95.2%  2.0300  0.0024  0.0412  0.0417  0.0418  0.0070  94.7% 
20  1  High  Yes  3.5219  0.0995  0.1009  0.1224  0.1275  92.2%  3.0327  0.0703  0.0973  0.1059  0.0904  87.9%  
20  3  High  Yes  3.5219  0.0089  0.0726  0.0750  0.0751  0.0646  92.7%  3.0327  0.0029  0.0807  0.0785  0.0785  0.0613  89.9% 
50  1  High  Yes  3.5219  0.0395  0.0639  0.0734  0.0508  93.3%  3.0327  0.0292  0.0630  0.0660  0.0371  90.6%  
50  3  High  Yes  3.5219  0.0006  0.0459  0.0474  0.0474  0.0260  94.1%  3.0327  0.0034  0.0506  0.0500  0.0501  0.0238  92.2% 
100  1  High  Yes  3.5219  0.0195  0.0454  0.0503  0.0257  94.2%  3.0327  0.0133  0.0455  0.0463  0.0192  91.7%  
100  3  High  Yes  3.5219  0.0004  0.0327  0.0333  0.0333  0.0133  94.5%  3.0327  0.0034  0.0355  0.0355  0.0355  0.0116  93.3% 
500  1  High  Yes  3.5219  0.0020  0.0204  0.0216  0.0052  95.2%  3.0327  0.0016  0.0202  0.0205  0.0038  93.8%  
500  3  High  Yes  3.5219  0.0020  0.0145  0.0148  0.0148  0.0026  95.3%  3.0327  0.0016  0.0157  0.0159  0.0159  0.0023  94.7% 
20  1  Low  Yes  4.0533  0.1252  0.1152  0.1385  0.2171  92.0%  3.4661  0.0987  0.1201  0.1317  0.1786  88.0%  
20  3  Low  Yes  4.0533  0.0090  0.0720  0.0744  0.0745  0.0845  93.1%  3.4661  0.0038  0.0899  0.0882  0.0882  0.0991  90.4% 
50  1  Low  Yes  4.0533  0.0509  0.0719  0.0823  0.0848  93.4%  3.4661  0.0386  0.0774  0.0820  0.0731  90.8%  
50  3  Low  Yes  4.0533  0.0009  0.0456  0.0470  0.0470  0.0340  94.1%  3.4661  0.0039  0.0569  0.0562  0.0563  0.0393  92.5% 
100  1  Low  Yes  4.0533  0.0242  0.0510  0.0566  0.0428  94.3%  3.4661  0.0198  0.0556  0.0572  0.0373  91.7%  
100  3  Low  Yes  4.0533  0.0004  0.0324  0.0330  0.0330  0.0172  94.5%  3.4661  0.0043  0.0400  0.0399  0.0399  0.0193  93.5% 
500  1  Low  Yes  4.0533  0.0021  0.0232  0.0242  0.0089  95.0%  3.4661  0.0020  0.0249  0.0254  0.0075  94.1%  
500  3  Low  Yes  4.0533  0.0021  0.0144  0.0147  0.0147  0.0034  95.2%  3.4661  0.0020  0.0177  0.0178  0.0179  0.0038  94.7% 

^{a} Standard deviation of the 10,000 estimated OTDIs, which should be close to the true standard error of the estimator

^{b} Mean estimated standard error of the estimators from GEE with an independent working correlation matrix

^{c} Mean estimated standard error of the estimators from GEE with the true correlation matrix (left blank when there are no replicates because it is then the same as ^{b})
nsub  nrep  Correlation  Shift  Normal  Lognormal  
True RAUOCPC  Bias  SD^{a}  SE^{b}  SE^{c}  MSE  CR  True RAUOCPC  Bias  SD^{a}  SE^{b}  SE^{c}  MSE  CR  
20  1  High  No  0.6084  0.0004  0.1923  0.1910  0.0021  95.1%  0.7720  0.0002  0.2996  0.2835  0.0027  96.2%  
20  3  High  No  0.6084  0.0008  0.1397  0.1368  0.1477  0.0011  93.5%  0.7720  0.0058  0.2460  0.2543  0.2549  0.0020  95.1% 
50  1  High  No  0.6084  0.0002  0.1219  0.1215  0.0008  95.4%  0.7720  0.0002  0.1859  0.1839  0.0011  96.0%  
50  3  High  No  0.6084  0.0002  0.0902  0.0886  0.0981  0.0005  93.7%  0.7720  0.0001  0.1679  0.1620  0.1627  0.0009  92.5% 
100  1  High  No  0.6084  0.0001  0.0855  0.0860  0.0004  95.1%  0.7720  0.0000  0.1323  0.1309  0.0005  95.6%  
100  3  High  No  0.6084  0.0001  0.0633  0.0631  0.0710  0.0002  94.1%  0.7720  0.0004  0.1179  0.1159  0.1166  0.0004  93.9% 
500  1  High  No  0.6084  0.0000  0.0382  0.0385  0.0001  95.6%  0.7720  0.0000  0.0588  0.0588  0.0001  95.4%  
500  3  High  No  0.6084  0.0000  0.0281  0.0284  0.0331  0.0000  95.1%  0.7720  0.0000  0.0522  0.0523  0.0526  0.0001  94.5% 
20  1  Low  No  0.4911  0.0006  0.2288  0.2280  0.0032  95.6%  0.6784  0.0001  0.2790  0.2718  0.0036  96.0%  
20  3  Low  No  0.4911  0.0005  0.1448  0.1403  0.1409  0.0013  93.5%  0.6784  0.0003  0.2172  0.2083  0.2090  0.0022  92.2% 
50  1  Low  No  0.4911  0.0004  0.1424  0.1436  0.0013  95.3%  0.6784  0.0003  0.1721  0.1733  0.0014  95.8%  
50  3  Low  No  0.4911  0.0002  0.0907  0.0903  0.0908  0.0005  94.1%  0.6784  0.0000  0.1390  0.1346  0.1350  0.0009  93.3% 
100  1  Low  No  0.4911  0.0002  0.1009  0.1014  0.0006  95.2%  0.6784  0.0001  0.1238  0.1227  0.0007  95.5%  
100  3  Low  No  0.4911  0.0001  0.0645  0.0643  0.0646  0.0003  94.2%  0.6784  0.0003  0.0972  0.0959  0.0962  0.0004  94.4% 
500  1  Low  No  0.4911  0.0000  0.0450  0.0453  0.0001  95.4%  0.6784  0.0000  0.0547  0.0549  0.0001  95.5%  
500  3  Low  No  0.4911  0.0000  0.0288  0.0289  0.0290  0.0001  94.9%  0.6784  0.0000  0.0427  0.0431  0.0433  0.0001  94.7% 
20  1  High  Yes  0.3583  0.0005  0.2347  0.2348  0.0028  95.0%  0.3943  0.0002  0.1724  0.1716  0.0017  96.3%  
20  3  High  Yes  0.3583  0.0004  0.1936  0.1886  0.1912  0.0019  94.5%  0.3943  0.0001  0.1504  0.1444  0.1457  0.0013  92.7% 
50  1  High  Yes  0.3583  0.0001  0.1489  0.1479  0.0012  95.0%  0.3943  0.0000  0.1083  0.1087  0.0007  96.0%  
50  3  High  Yes  0.3583  0.0002  0.1230  0.1210  0.1227  0.0008  94.5%  0.3943  0.0004  0.0947  0.0930  0.0940  0.0005  94.0% 
100  1  High  Yes  0.3583  0.0001  0.1038  0.1044  0.0006  95.1%  0.3943  0.0002  0.0760  0.0769  0.0003  95.9%  
100  3  High  Yes  0.3583  0.0001  0.0862  0.0859  0.0872  0.0004  94.5%  0.3943  0.0003  0.0668  0.0662  0.0669  0.0003  94.2% 
500  1  High  Yes  0.3583  0.0000  0.0462  0.0466  0.0001  95.4%  0.3943  0.0000  0.0343  0.0344  0.0001  95.2%  
500  3  High  Yes  0.3583  0.0000  0.0383  0.0385  0.0391  0.0001  94.8%  0.3943  0.0000  0.0298  0.0297  0.0300  0.0001  94.7% 
20  1  Low  Yes  0.3100  0.0003  0.2772  0.2774  0.0033  95.3%  0.3559  0.0000  0.2174  0.2166  0.0024  96.0%  
20  3  Low  Yes  0.3100  0.0005  0.1967  0.1917  0.1921  0.0017  94.6%  0.3559  0.0001  0.1731  0.1661  0.1667  0.0015  93.6% 
50  1  Low  Yes  0.3100  0.0001  0.1746  0.1740  0.0014  95.0%  0.3559  0.0001  0.1351  0.1363  0.0009  95.5%  
50  3  Low  Yes  0.3100  0.0001  0.1249  0.1230  0.1232  0.0007  94.5%  0.3559  0.0005  0.1090  0.1066  0.1070  0.0006  94.3% 
100  1  Low  Yes  0.3100  0.0000  0.1222  0.1226  0.0007  95.0%  0.3559  0.0003  0.0956  0.0963  0.0005  95.5%  
100  3  Low  Yes  0.3100  0.0001  0.0876  0.0873  0.0875  0.0003  94.4%  0.3559  0.0003  0.0765  0.0759  0.0761  0.0003  94.5% 
500  1  Low  Yes  0.3100  0.0000  0.0552  0.0547  0.0001  95.0%  0.3559  0.0000  0.0429  0.0430  0.0001  95.2%  
500  3  Low  Yes  0.3100  0.0000  0.0389  0.0391  0.0392  0.0001  94.9%  0.3559  0.0000  0.0339  0.0340  0.0341  0.0001  94.8% 

^{a} Standard deviation of the 10,000 estimated RAUOCPCs, which is expected to approach the true standard error of the estimator for a very large number of simulations

^{b} Mean estimated standard error of the estimators from GEE with an independent working correlation matrix

^{c} Mean estimated standard error of the estimators from GEE with the true correlation matrix (left blank when there are no replicates because it is then the same as ^{b})
4 BP Example
The proposed indices and inference approach are illustrated with the systolic blood pressure data from Bland and Altman's paper [bland1986]. In this example, the blood pressures of 85 patients were measured by three raters (two human observers, J and R, and one device, S). Each rater measured every patient three times in succession, and these measurements can be treated as replicates. We assess the overall agreement among the three raters, the intrarater agreement within each rater, and the pairwise interrater agreement using OCP, OTDI and RAUOCPC, with estimation and inference conducted by the proposed unified GEE approach.
The descriptive statistics of the BP data are listed in Table 5. We summarize the data by the mean and standard deviation within each rater and assess the normality assumption for the replicates from the same rater and for the pairwise differences between any two raters by the Doornik-Hansen test, where a p-value less than 0.05 indicates a significant departure from a multivariate normal distribution. As shown in Table 5, the device S tends to give higher BP measurements, with an average of 143.04 mmHg, than the other two raters, whose averages are around 127 mmHg. Moreover, based on the within-rater standard deviations, rater S has larger within-rater variability than raters J and R, which implies heterogeneity among the raters. Furthermore, the p-values of the Doornik-Hansen test for the measurements from each rater and for the differences between raters are all less than 0.05, indicating that the normality assumption required by the estimation and inference approaches for unscaled indices proposed by Lin [lin2007unified] and Jang et al. [jang2018overall] does not hold, so their approaches are unsuitable for the BP dataset.

To assess the overall agreement among the three raters, the new OCP, OTDI and RAUOCPC are applied to the BP dataset. Satisfactory agreement is defined using the British Hypertension Society protocol (BHSP) for the evaluation of blood pressure measuring devices [o1993british], shown in Table 1. For OCP, we set the predetermined clinically meaningful acceptable difference to 15 mmHg; based on the criteria for a grade C device, the corresponding satisfactory OCP should be 0.85 or higher. For OTDI, the predetermined acceptable probability is set to 0.85, and the satisfactory OTDI for a grade C device is 15 mmHg. For RAUOCPC, we let the maximum acceptable difference be 20 mmHg; the satisfactory RAUOCPC is 0.59, computed from the overall coverage probability curve that connects the points formed by the absolute differences of 0, 5, 10, 15, and 20 mmHg and the corresponding coverage probabilities for a grade C BP device.
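The satisfactory RAUOCPC of 0.59 can be checked by hand with the trapezoidal rule. The sketch below is only an illustration: it takes the grade C coverage probabilities from the BHSP table (40%, 65%, 85%, 90%), pairs them with differences of 5, 10, 15 and 20 mmHg, and assumes the curve starts at the point (0, 0). The function name `relative_area` is hypothetical.

```python
# Grade C points of the BHSP coverage probability curve: (difference in mmHg, CP).
# The starting point (0, 0.0) is an assumption of this sketch.
points = [(0, 0.00), (5, 0.40), (10, 0.65), (15, 0.85), (20, 0.90)]

def relative_area(points):
    """Trapezoidal area under the piecewise-linear curve, divided by the x-range."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area / (points[-1][0] - points[0][0])

satisfactory_rauocpc = relative_area(points)
print(round(satisfactory_rauocpc, 4))  # 0.5875, which rounds to 0.59
```

The four trapezoids have areas 1.0, 2.625, 3.75 and 4.375, which sum to 11.75; dividing by the 20 mmHg range gives 0.5875.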
The estimated coverage probability curve is shown in Figure 2. The estimated OCP is 0.41 with a 95% one-sided CI of (0.35, 1) for the three raters. Since the CI contains the satisfactory value of 0.85, we cannot reject the null hypothesis, and thus there is insufficient evidence to claim that the three raters can be used interchangeably. We come to the same conclusion with OTDI and RAUOCPC: the estimated OTDI is 30 with a 95% one-sided CI of (0, 34.5), which contains the satisfactory value of 15 mmHg, and the estimated RAUOCPC is 0.258 with a 95% one-sided CI of (0.25, 1), which contains the satisfactory value of 0.59. Therefore, based on the proposed overall agreement indices, the three raters may not be used interchangeably, in the sense that we are not confident that the measurements taken by the three raters on the same patient are clinically similar.
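The decision rule used in this paragraph can be written out as a small sketch. It assumes that satisfactory agreement is claimed only when the one-sided confidence bound clears the satisfactory value: the lower bound must exceed it for OCP and RAUOCPC, and the upper bound must fall below it for OTDI. The helper names are hypothetical, and the OCP threshold of 0.85 is an assumption read from the grade C entry of the BHSP table at 15 mmHg.

```python
def lower_bound_exceeds(lower_bound, satisfactory):
    """For indices where larger is better (OCP, RAUOCPC)."""
    return lower_bound > satisfactory

def upper_bound_below(upper_bound, satisfactory):
    """For indices where smaller is better (OTDI)."""
    return upper_bound < satisfactory

# Confidence bounds reported for the overall agreement among the three raters.
decisions = {
    "OCP": lower_bound_exceeds(0.35, 0.85),      # 0.85 is an assumed threshold
    "OTDI": upper_bound_below(34.5, 15.0),       # mmHg
    "RAUOCPC": lower_bound_exceeds(0.25, 0.59),
}
print(decisions)  # every entry is False: agreement cannot be claimed
```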
To understand the source of disagreement and provide actionable results that guide readers to improve quality, we look into the pairwise interrater and intrarater agreement between and within the three raters, respectively. The results listed in Table 5 show that both the intrarater agreement of human raters J and R and the interrater agreement between them are satisfactory. This implies that the two human raters can be used interchangeably and that the measurements from different observers, or different replicates from the same observer, are not likely to be clinically different. However, the agreement between the human observers and the device S is less satisfactory: the interrater OCPs (one-sided 95% CI) for J&S and R&S are 0.51 (0.45, 1) and 0.51 (0.45, 1). Moreover, the repeatability of device S itself is only moderate, with an estimated intrarater OCP of 0.84 (0.78, 1) and OTDI of 15 (0, 17.32). These results indicate that device S is not in satisfactory agreement with the other raters and that its own replicates also tend to have larger variability.
Rater  Mean  SD  P-value for Normality Test  
J  127.4  5.3  0.004 
R  127.3  5.4  0.008 
S  143.0  7.0  <0.001 
Agreement  Raters  CP Estimate  CP 95% CI  TDI Estimate  TDI 95% CI  RAUCPC Estimate  RAUCPC 95% CI  
Overall  J, R, S  0.41  (0.35, 1)  30  (0, 34.46)  0.26  (0.25, 1)  
Inter  J&R  0.94  (0.91, 1)  10  (0, 10.89)  0.76  (0.74, 1)  
Inter  J&S  0.51  (0.45, 1)  28  (0, 32.47)  0.34  (0.33, 1)  
Inter  R&S  0.51  (0.45, 1)  28  (0, 32.31)  0.35  (0.34, 1)  
Intra  J  0.91  (0.87, 1)  12  (0, 13.48)  0.67  (0.65, 1)  
Intra  R  0.92  (0.88, 1)  13  (0, 14.21)  0.66  (0.65, 1)  
Intra  S  0.84  (0.78, 1)  15  (0, 17.32)  0.60  (0.59, 1)  
5 Discussion
We have proposed a set of new indices (OCP, OTDI and RAUOCPC) for assessing overall agreement among multiple raters. As an extension of the pairwise version of the unscaled indices, the proposed indices are defined based on a new distance metric that measures the maximum pairwise difference among the raters. This metric allows the overall indices to preserve the intuitive interpretation of the pairwise version and to directly employ clinical information about satisfactory criteria. For example, we can take the clinically meaningful difference from the grading system for blood pressure devices as the predetermined boundary for OCP, since both quantify the acceptable difference between two BP measurements. The OCP can be interpreted as the probability that there is no clinically meaningful difference among the measurements from all raters on the same subject.
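To make the distance metric concrete, the following sketch computes naive empirical versions of the three indices from the maximum pairwise difference on each subject. It is not the GEE-based estimator proposed in the paper; it assumes one reading per rater per subject, the function names are hypothetical, and the readings are made-up numbers used only for illustration.

```python
import math

def max_pairwise_difference(readings):
    """Largest absolute difference between any two raters' readings on one subject."""
    return max(readings) - min(readings)

def empirical_ocp(distances, delta):
    """Share of subjects whose maximum pairwise difference is below delta."""
    return sum(d < delta for d in distances) / len(distances)

def empirical_otdi(distances, prob):
    """Smallest observed distance with at least a fraction `prob` of subjects at or below it."""
    ordered = sorted(distances)
    k = math.ceil(prob * len(ordered))
    return ordered[max(k, 1) - 1]

def empirical_rauocpc(distances, delta_max):
    """Area under the empirical OCP curve on [0, delta_max], divided by delta_max."""
    total = sum(max(delta_max - d, 0.0) for d in distances)
    return total / (len(distances) * delta_max)

# Hypothetical readings (mmHg) for five subjects, three raters each.
readings = [[120, 122, 131], [135, 134, 138], [110, 118, 129],
            [142, 140, 141], [128, 133, 150]]
distances = [max_pairwise_difference(r) for r in readings]  # [11, 4, 19, 2, 22]

print(empirical_ocp(distances, 15))       # 0.6
print(empirical_otdi(distances, 0.85))    # 22
print(empirical_rauocpc(distances, 20))   # 0.44
```

The RAUOCPC line uses the fact that the area under the indicator curve for a single subject with distance d is max(delta_max - d, 0), so no numerical integration is needed.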
The newly proposed inference approach does not require distributional or homogeneity assumptions and can therefore be applied to various kinds of continuous measurements. As discussed in Section 4, the BP data set [bland1986] is neither homogeneous nor normally distributed, both of which are assumptions of the previously proposed inference approaches. Moreover, the unified GEE approach can accommodate data with replicates and can be easily modified to carry out estimation and inference on pairwise, interrater and intrarater agreement, as we did in the BP example. A design with replicates is preferable since it provides information on the repeatability of the raters. When the agreement is not satisfactory, intrarater variability is a crucial source of disagreement, and such information could provide guidance for future improvement of the raters being tested. Besides providing additional information, adding replicates could also improve the performance of the estimator in terms of bias and CR, as shown in our simulation studies (Tables 2, 3 and 4). In practice, adding replicates tends to be easier and less costly than enrolling more subjects.
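As a rough illustration of how a GEE with an independent working correlation handles replicates, the sketch below fits an intercept-only model with a logit link to 0/1 coverage indicators and uses a subject-clustered sandwich variance to form a one-sided lower confidence bound. It is a simplified stand-in under stated assumptions, not the authors' R package: the function name, the input layout (one list of indicators per subject) and the critical value 1.645 for a one-sided 95% bound are choices made for this example.

```python
import math

def ocp_lower_bound(indicators_by_subject, z=1.645):
    """Return (estimate, one-sided lower bound) for a coverage probability.

    indicators_by_subject: one list of 0/1 values per subject, where 1 means
    the difference for that replicate was below the acceptable boundary.
    """
    total = sum(len(ys) for ys in indicators_by_subject)
    p_hat = sum(sum(ys) for ys in indicators_by_subject) / total
    if p_hat in (0.0, 1.0):
        # The logit transform is undefined at the boundary.
        raise ValueError("logit undefined when the estimate is 0 or 1")
    # Sandwich variance of the logit-scale intercept, clustering on subject:
    # bread = sum of p(1-p) over all observations, meat = sum of squared
    # per-subject residual totals.
    bread = total * p_hat * (1.0 - p_hat)
    meat = sum(sum(y - p_hat for y in ys) ** 2 for ys in indicators_by_subject)
    se_logit = math.sqrt(meat) / bread
    logit = math.log(p_hat / (1.0 - p_hat))
    lower = 1.0 / (1.0 + math.exp(-(logit - z * se_logit)))
    return p_hat, lower

# Hypothetical indicators: four subjects with three replicates each.
estimate, lower = ocp_lower_bound([[1, 1, 0], [1, 1, 1], [0, 0, 1], [1, 1, 1]])
print(estimate, lower)
```

Building the interval on the logit scale keeps the bound inside (0, 1), but it also explains why simulated data sets with an estimate of exactly 100% have to be set aside.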
All proposed estimation and inference approaches can be easily applied with standard software, and we also provide an R package for implementation. Based on the simulation results, the proposed approaches have limitations when the sample size is small and no replicates are available. In such scenarios, parametric approaches can be an alternative after carefully verifying their assumptions. Moreover, it is of future interest to design agreement studies based on the newly proposed indices, especially designs with replicates. As discussed above, adding replicates can provide information on intrarater agreement and improve the performance of the estimators.
References
Appendix A Proof of Lemma 2.1
Let , and the cumulative distribution functions of and be and and the density functions be and respectively. Then,
(28)  
(29)  
(30)  
(31) 
Thus,
(32)  
(33) 
Therefore,
(34)  
(35)  
(36) 
Since is continuous at point , then
(38)  
(39)  
(40)  
(41) 
This implies that and therefore RAUCPC can be expressed in terms of .
(42)  
(43)  
(44) 
Appendix B Proof of asymptotic distribution in estimating OTDI
In equation (19) for estimating OTDI, we propose to use . Let where with working correlation matrix . Then the left hand side of (19) is where . Let where . Then
(45)  
(46) 
Under mild regularity conditions, by the uniform strong law of large numbers [jung1996quasi], we have
(47) 
Then we can write as
(48)  
(49)  
(50) 
Suppose is the solution of such that . With from (47), then
(51) 
where,