Consider the simple linear regression model. Compared to ordinary least squares (OLS) regression, non-parametric estimation of the regression coefficients can be more robust: deviations from normally distributed error terms, such as heavy tails or outliers, do not disturb the estimation as strongly as they do for OLS. Good introductions to robust estimation are available in the literature.
Here we consider the non-parametric Theil-Sen regression (TSR) and Passing-Bablok regression (PBR), which are both based on Kendall’s rank correlation and provide a robust estimate of the slope. Neither PBR nor TSR relies on the assumption of normally distributed errors. The Theil-Sen estimate [11, 10] of the slope is given by the median of the slopes of all connecting lines between pairs of measurements. If measurement errors occur in both $x$ and $y$, the TSR is biased towards zero. This phenomenon is also known from OLS and is called regression dilution or attenuation. For least squares, Deming regression can be used instead of OLS to account for errors in both variables. In addition, it has been suggested to use repeated measurements to correct for regression dilution if the errors are normally distributed. A variation of the Theil-Sen regression that accounts for errors in both variables is the Passing-Bablok estimator [8, 9]. The Passing-Bablok estimate is also given by the median of all slopes, but shifted by an offset to ensure that $x$ and $y$ are interchangeable.

Interestingly, PBR and TSR seem to be popular in separate fields. While TSR is popular in metrology and environmental science, PBR is mainly used in clinical biochemistry, pharmacology and laboratory medicine to compare two alternative measurement methods. This might be because PBR is outlined in a guideline of the Clinical and Laboratory Standards Institute. A protocol for method comparison studies in clinical laboratories, including a paragraph on PBR, has also been suggested. Furthermore, there have been attempts to use PBR for batch effect removal in gene expression analysis. Throughout this manuscript we refer to the Passing-Bablok regression, but note that setting the offset to zero in the PBR yields the TSR, so the presented results naturally hold for both methods.
We consider the case where repeated measurements from two methods are available for each pair of true values. The measurements of the two methods are given as data points. This scenario includes simultaneous repeated measurements of the same sample with two methods, as well as multiple measurements of the same sample measured separately for each method. In the latter case, measurements need to be combined randomly into observation pairs. If the data set for a PBR contains repeated measurements, it is important to account for this. In contrast to the expected slope between independent measurements, the expected slope between repeated measurements is zero. Hence, the slopes between points that correspond to repeated measurements of the same underlying true value are meaningless; if included, they would distort the PB estimates and lower the power of the associated statistical test. Here we describe how the variance changes if we omit the meaningless slopes between repeated measurements, and we provide the resulting test for the equivalence of two methods.
Section 3 states our results, including the asymptotic confidence interval for the estimated slope in Corollary 3.4. The statistical test for the equivalence of two methods with repeated measurements is given in Corollary 3.6. In Section 4 we discuss the implications of the suggested Block-Passing-Bablok procedure and compare it empirically to the PBR without repeated measurements. Finally, proofs are given in Section 5.
2 The Passing-Bablok regression for repeated measurements
2.1 The standard Passing-Bablok regression for independent measurements
Under the hypothesis of a structural linear relationship between two measurement methods, where $x$ corresponds to one method and $y$ to the other, Passing and Bablok considered the following assumptions:
Assumptions 2.1 (standard Passing-Bablok regression).
(i) All points are of the form
with the (non-random) ’true’ values of the measurements and error terms.
(ii) All error terms come from an arbitrary and continuous distribution with mean zero, and
(iii) for the variances of the error terms, holds.
2.2 Block-Passing-Bablok regression for multiple dependent measurements
Sometimes the data at hand are not independent, as required in Assumption 2.1(ii), but grouped into dependent sets of measurements. Common examples are measurements that have been repeated multiple times for the same sample, or a series of measurements carried out under the same conditions. In this case, the Passing-Bablok regression cannot be used straight away. Here, we extend the Passing-Bablok regression such that it treats the data as being available in groups of several members. We index our data points by group and by individual within the group. Each data point represents a measurement of the true values, and we again assume a linear relationship
We make the following assumptions for repeated measurement data:
Assumptions 2.2 (Passing-Bablok regression for grouped data).
(i) All points are of the form
with the ’true’ values of the measurements in each group and error terms indexed by group and by individual within the group.
(ii) All error terms come from an arbitrary and continuous distribution with mean zero, and
(iii) for the variances of the error terms, holds.
(non-overlapping groups)
All groups are strictly separated on the x-axis, i.e. almost surely
Estimating the regression parameter from grouped data
The Passing-Bablok regression for independent measurements makes use of the slopes between all pairs of points. Under Assumptions 2.1, it is easy to see that the expected slope between any two distinct points reflects the regression slope. In contrast, under Assumptions 2.2 the expected slope between two points equals the regression slope only if the points belong to different groups, and is zero otherwise. To include only the meaningful slopes in the estimation of the regression slope, we have to exclude all pairs of repeated measurements within each group from the analysis.
Definition 2.5 (Block-Passing-Bablok regression for grouped data).
For the estimation of the regression parameters and , compute the slopes of the connecting lines between any pair of points from different groups. The slopes are given by
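Discard identical measurements as well as all slopes with a value of $-1$. This yields the set of slopes used for the estimation.
The slope parameter is estimated by the shifted median of the slopes, with the offset (equation (1)) defined as the number of slopes with a value smaller than $-1$.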
I.e. if the ranked sequence of slopes is given by , is estimated by
The intercept parameter is estimated by
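The estimation steps of Definition 2.5 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' script: the function name `block_pb_fit`, its argument layout and the treatment of vertical connecting lines (counted as plus or minus infinity) are choices made here. The shifted-median indexing and the intercept as the median of y minus slope times x follow the usual Passing-Bablok convention, which is assumed to carry over to the grouped setting.

```python
from itertools import combinations
from statistics import median


def block_pb_fit(x, y, groups):
    """Sketch of a Block-Passing-Bablok fit; returns (slope, intercept).

    x, y   : sequences of measurements from the two methods
    groups : group label of each data point (hypothetical argument)
    """
    slopes = []
    for i, j in combinations(range(len(x)), 2):
        if groups[i] == groups[j]:
            continue  # skip slopes between repeated measurements
        dx, dy = x[j] - x[i], y[j] - y[i]
        if dx == 0 and dy == 0:
            continue  # identical measurements are discarded
        if dx == 0:
            slope = float("inf") if dy > 0 else float("-inf")  # assumed convention
        else:
            slope = dy / dx
        if slope == -1:
            continue  # slopes of exactly -1 are discarded
        slopes.append(slope)

    slopes.sort()
    n_slopes = len(slopes)
    offset = sum(1 for s in slopes if s < -1)  # number of slopes below -1

    # Shifted median (1-based ranks translated to 0-based list indices).
    if n_slopes % 2 == 1:
        b = slopes[(n_slopes + 1) // 2 + offset - 1]
    else:
        b = 0.5 * (slopes[n_slopes // 2 + offset - 1] + slopes[n_slopes // 2 + offset])

    a = median(yi - b * xi for xi, yi in zip(x, y))  # assumed intercept estimate
    return b, a


# Three groups with two repeated measurements each, lying exactly on y = 1 + 2x.
x = [0, 1, 4, 5, 8, 9]
y = [1, 3, 9, 11, 17, 19]
groups = [0, 0, 1, 1, 2, 2]
print(block_pb_fit(x, y, groups))
```

The sketch does not guard against an empty slope list or against an offset so large that the shifted rank leaves the list; both would need handling in real use.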
Naturally, the description of the groupwise method in Definition 2.5 is close to the description of the original method introduced by Passing and Bablok. If every group consists of a single measurement, we regain the classical estimator for the regression slope. In this case, we compute the slopes of the connecting lines between any pair of points, which are given by
Without considering groups, there are $\binom{n}{2}$ possible lines connecting any two points of a data set with $n$ points. Again, two identical measurements, i.e. points that coincide in both coordinates, are not considered for the estimation. Further, any slopes with a value of $-1$ are disregarded, such that we have at most $\binom{n}{2}$ slopes to consider.
Note that setting the offset to zero in Definition 2.5 results in the Theil-Sen estimator. To obtain an estimator for which the methods can be used interchangeably, Passing and Bablok defined the estimate via the offset. The definition of the offset as the number of slopes with a value smaller than $-1$ in equation (1) corresponds to the null hypothesis of a slope of $1$ and needs to be adapted for other hypotheses about the value of the slope. If the null hypothesis is not true, setting the offset in this way introduces a bias towards higher estimates of the slope. With the offset, the median slope between independent measurements within one group is no longer zero. This effect can be seen in Table 1 for both the classic and the Block-PBR. Since the offset drives the median of the meaningless slopes between repeated measurements towards $1$, it can also lead to overconfidence in a slope of $1$ if the classic PBR is used, as Table 1 illustrates.
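The effect of the offset on slopes that scatter symmetrically around zero can be illustrated with a small toy example. The helper `shifted_median` and the hand-picked slope values below are purely illustrative assumptions; they are not taken from the simulations of this paper.

```python
def shifted_median(slopes, use_offset=True):
    """Median of the slopes, shifted by the number of slopes below -1."""
    s = sorted(slopes)
    n = len(s)
    k = sum(1 for v in s if v < -1) if use_offset else 0
    if n % 2 == 1:
        return s[(n + 1) // 2 + k - 1]
    return 0.5 * (s[n // 2 + k - 1] + s[n // 2 + k])


# Hand-picked slopes, symmetric around zero, mimicking within-group slopes.
toy_slopes = [-3.0, -1.5, -0.5, -0.2, 0.2, 0.5, 1.5, 3.0]

print(shifted_median(toy_slopes, use_offset=False))  # 0.0 (plain median)
print(shifted_median(toy_slopes))                    # 1.0 (two slopes lie below -1)
```

In this toy set the plain median is zero, while the shift by the two slopes below $-1$ moves the estimate to $1$, which is the mechanism described above.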
Theorem 1 (Variance of ).
If Assumptions 2.2 (i)-(iii) hold and
is the expected fraction of triplets with one point from group and two points from group with and
(a) if the group sizes are given by ,
(b) Consequently, if the group sizes are equal, i.e. for ,
Remark 3.2 (classical Passing-Bablok regression).
Theorem 2 (asymptotic normality of ).
Let the number of groups be fixed and consider the limit of a large sample size.
(a) If the group sizes are given by and Assumptions 2.2 (i)-(iii) hold, is asymptotically normally distributed with mean zero and variance given by
where is the expected fraction of triplets with one point from group and two points from group with .
Corollary 3.4 (confidence interval for ).
Let denote the -quantile of the standardized normal distribution.
Let further,
The asymptotic confidence interval for the slope at the chosen significance level is then given by
Remark 3.5 (confidence interval for ).
Consider the lower and the upper limit of the confidence interval for the slope in equation (5). Then,
are the corresponding limits for the intercept.
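How intercept limits can be derived from slope limits may be sketched as follows. The sketch assumes the convention of the original Passing-Bablok procedure, in which the lower intercept limit is the median of the residual offsets computed with the upper slope limit and vice versa; the function name `intercept_limits` and its arguments are hypothetical.

```python
from statistics import median


def intercept_limits(x, y, slope_lower, slope_upper):
    """Intercept limits from slope limits (assumed Passing-Bablok convention)."""
    # A steeper line through the data needs a smaller intercept, so the
    # upper slope limit yields the lower intercept limit and vice versa.
    a_lower = median(yi - slope_upper * xi for xi, yi in zip(x, y))
    a_upper = median(yi - slope_lower * xi for xi, yi in zip(x, y))
    return a_lower, a_upper


x = [0, 1, 2, 3, 4]
y = [1, 3, 5, 7, 9]
print(intercept_limits(x, y, slope_lower=1.5, slope_upper=2.5))  # (0.0, 2.0)
```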
Corollary 3.6 (Statistical test for the equivalence of two methods with repeated measurements).
Consider two measurement methods with measurements $x$ and $y$.
The equivalence of both methods can be concluded if the confidence interval for the intercept contains $0$ and the confidence interval for the slope contains $1$.
An intercept different from $0$ indicates a constant systematic difference between the two measurement methods.
A slope different from $1$ indicates a proportional systematic difference between the measurement methods.
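The decision rule of the test can be written down in a few lines. The sketch assumes that the confidence intervals for intercept and slope have already been computed; the function name `methods_equivalent` and the tuple layout of the intervals are illustrative assumptions.

```python
def methods_equivalent(ci_intercept, ci_slope):
    """Return True if 0 lies in the intercept CI and 1 lies in the slope CI."""
    a_lower, a_upper = ci_intercept
    b_lower, b_upper = ci_slope
    no_constant_difference = a_lower <= 0.0 <= a_upper
    no_proportional_difference = b_lower <= 1.0 <= b_upper
    return no_constant_difference and no_proportional_difference


print(methods_equivalent((-0.5, 0.3), (0.9, 1.1)))  # True
print(methods_equivalent((0.2, 0.8), (0.9, 1.1)))   # False: constant difference
```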
Remark 3.7 (non-overlapping groups).
And consequently for equal group sizes
Note that (8) equals the correction for tied ranks in one variable. To see this, consider the fact that setting each $x$-value to the mean of its group does not change the sign of slopes between separated groups. Compared to the variance for non-overlapping groups, the variance for overlapping groups is always smaller. For overlapping groups, the test from Corollary 3.6 based on equation (8) is thus a conservative test for the equivalence of the measurement methods, but the power of the test could in principle be improved if a reliable estimate of the expected fraction of overlapping triplets is available.
4 Discussion and empirical comparisons of regular and Block-Passing-Bablok regression
Passing-Bablok regression is recommended as a robust method, such that extreme values can be included and the errors do not have to be normally distributed. However, the assumption of independent measurements in the classical Passing-Bablok regression is frequently not fulfilled. Duplicated and repeated measurements are often an important part of studies, since they allow the variance of the measurements to be assessed and help to identify outliers. As one example among the various studies that include repeated measurements, consider a study (Figure 2 therein) in which measurements of different patients have been repeated a different number of times. But to what degree does this harm the results of a classical Passing-Bablok regression?
Recall that the Block-Passing-Bablok regression only considers slopes between pairs of points from different groups when estimating the regression slope. Since the regression parameters are estimated as medians, a relatively small difference in the number of considered slopes does not have a large influence on the estimates. Hence, if the group sizes are equal, we utilize only the between-group slopes instead of all pairwise slopes of the original method. As shown in Theorem 2 (a), the discrepancy between the unadapted and the Block-Passing-Bablok regression quickly vanishes for a large number of separated and equally sized groups.
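The number of slopes that remain after dropping within-group pairs follows from simple counting, before any identical points or slopes of $-1$ are discarded. The sketch below uses the hypothetical helper `pair_counts`; for $g$ groups of equal size $m$ (notation introduced here for illustration only) the between-group count reduces to $m^2 g (g-1)/2$.

```python
from math import comb


def pair_counts(group_sizes):
    """Return (all pairs, pairs between different groups)."""
    n = sum(group_sizes)
    total = comb(n, 2)                              # classic PBR: all pairs
    within = sum(comb(m, 2) for m in group_sizes)   # pairs of repeated measurements
    return total, total - within


# Ten groups with three repeated measurements each.
total, between = pair_counts([3] * 10)
print(total, between)   # 435 405
print(between / total)  # share of slopes kept by the block procedure
```

With many equally sized groups the share of retained slopes approaches one, in line with the vanishing discrepancy described above.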
In a hypothetical example with strongly varying group sizes, the Block-Passing-Bablok regression yields better results than the original method. The median of all slopes between a very large and a very small group would be a slope within the large group. Since we assume random errors within groups, this median slope would be sampled from a set of meaningless slopes with mean $0$, which transforms into $1$ if the offset is used. This biases the slope estimate towards $1$ and lowers the power to detect a slope different from $1$. The Block-Passing-Bablok regression provides more reliable estimates in this setting.
To illustrate the influence of the group sizes and of the overlap between groups on the slope estimate and on the power of the test in Corollary 3.6, we simulated 16 illustrative scenarios. The Passing-Bablok estimates were computed with the mcr package for the classic PBR and with a custom script based on mcr for the Block-PBR. The script is available from the authors upon request. A subset of the considered settings is illustrated in Figure 1. The results of the simulations are shown in Table 1.
[Table 1: simulation results for the classic PBR and the Block-PBR, including a probability concerning the true slope and the probability to reject the null hypothesis.]
the number of slopes bigger than minus the number of slopes smaller than ,
as introduced in Definition 3.1.
To simplify the notation for the proofs, we will instead consider a transformed dataset, where are divided by such that the errors in both dimensions are identically distributed (see Assumption 2.2(iii)). If we also subtract from , the sign of the slopes in the transformed datasets determines whether a pair of points contributes to or from Definition 3.1. In particular, all figures in the proof section are plotted with transformed coordinates, i.e. .
5.1 Proof of Theorem 1
as the sign of the slope in the transformed dataset, it follows that
We will hence compute
Since for all , the second term on the right hand side vanishes and we have
We split the right-hand side into the sum over the squares and the sum over the remaining mixed terms:
(a) The first term on the right hand side is not hard to compute since for all with holds. We get
(b) For the computation of the mixed terms, we only need to consider those with a common pair of indices, i.e. one of the four cases and . Ignoring group assignments there are such combinations. All other combinations vanish in expectation, due to independence. For the expectation of the sum of mixed terms, we look at the covariances of the with each other. Since all cases lead to the same result, we can assume without loss of generality and multiply the result by four. We will combine triplets of mixed terms to simplify the calculations. By distinction of cases as sketched in Figure 2, we obtain for the expectation
considering that the cases occur with probability and , respectively.
Furthermore, we need to consider that we counted each pair of slopes three times due to the arrangement in triples. Consequently, we have to divide by $3$ and get the interim result
So far we did not consider the groups. Since the method from Definition 2.5 does not consider any pairs of data points within the same group, we have to subtract those triples of indices in which two or all three points belong to the same group.
(c) There are
possible combinations of three elements from two different groups. Let’s say . In that case, the only remaining term in (12) is and we have to subtract the other two terms from our computation of (10) because we wrongfully included them in step (b). In case of strictly separated groups, however, those combinations vanish as can be seen below in equation (15).
In case of two equal indices there are three different possibilities for how the points are located relative to each other. Looking at the single point from group , it can be located above, in between or below the two points from group (relating to the y-axis). Those arrangements occur with probability each and are illustrated in Figure 3 if and in Figure 4 for . Summed up, those arrangements yield an expectation of
(d) Further, in (13) we wrongfully included the case . This means, we have to subtract
the desired formula for the variance. Naturally, this formula holds likewise for the less complex settings of equally sized non-overlapping groups with for all , i.e.
and for non-grouped data with and we regain the classic result
5.2 Proof of Theorem 2
To show the asymptotic normality, we will show that the moments of the distribution of the statistic tend to those of the normal distribution [1, Thm 30.2].
Under the null hypothesis we have that . Hence the moments of odd order vanish. For moments of even order, we need to compute
Consider the expansion of the sum above, which consists of summands with factors each. Since and the are independent if each summand with an independent factor will vanish in the expectation. Let us consider the number of summands where the factors are pairwise linked by exactly one suffix. Each of these summands will look like this:
which due to independence could be split up into the form
For each in each summand as shown in equation (21) there are two other combinations that are exactly the same except that there is an or instead of , respectively. Therefore we define for each triplet of points the pairings
For each summand with pairwise tied indices let be the triplet of indices used in the -th pairing. Let us then consider the set of sums which is represented by Now we see that
if the points of are a disjoint and hence independent set of points.
The remaining questions are:
(a) How many summands with pairwise tied indices are there in the expanded sum, ignoring group associations?
(b) Which of these need to be omitted due to group associations for non-overlapping groups?
(c) What changes for overlapping groups?
Answer to (a):
There are ways to choose the first factors of pairings and ways to assign the remaining factors. But now we have counted some combinations twice so we have to divide by to get the final number of ways to choose pairings. Given such a combination there are now different indices to choose, so possibilities.
Thus we end up with
combinations with pairs with tied indices. Finally since we counted each summand times we would get for (20)
if we ignore group associations.
Answer to (b):
For there are 3 different scenarios.
- Scenario 1: all groups are different ()
No difference to the above calculation
- Scenario 2: all groups are equal ()
The corresponding summand is not included in (20) which we mimic by setting
- Scenario 3: two groups are equal ( or or )
Although some summands are not included in this case, we do not have to change anything since the effect vanishes for separated groups as shown in equation (15).
For separated groups only scenario 2 alters the above calculation. So we have to subtract the number of combinations with at least one factor where all indices are from the same group. The probability to choose 3 indices from group is given by and thus the fraction of all sums that have no from scenario 2 tends to
which simplifies for groups with equal sizes to
Answer to (c):
If groups can overlap, let be the number of triplets where groups do not overlap (Scenario 3.1) and the number of triplets where groups overlap (Scenario 3.2).
In this case we get
So for each summand with an odd number of triplets with we have to subtract just as in the proof of Theorem 1. We set
The probability to get a summand with odd given there is no triplet from scenario 2 is now given by This leads us to