# Statistical Analysis of Data Repeatability Measures

The advent of modern data collection and processing techniques has seen the size, scale, and complexity of data grow exponentially. A seminal step in leveraging these rich datasets for downstream inference is understanding which characteristics of the data are repeatable, that is, which aspects of the data can be identified under a duplicated analysis. However, the utility of traditional repeatability measures, such as the intraclass correlation coefficient, is limited in these settings. In recent work, novel data repeatability measures have been introduced in the context where a set of subjects are measured twice or more, including: fingerprinting, rank sums, and generalizations of the intraclass correlation coefficient. However, the relationships between, and the best practices among, these measures remain largely unknown. In this manuscript, we formalize a novel repeatability measure, discriminability. We show that it is deterministically linked with the intraclass correlation coefficient under univariate random effect models, and has the desired property of optimal accuracy for inferential tasks using multivariate measurements. Additionally, we overview and systematically compare repeatability statistics using both theoretical results and simulations. We show that the rank sum statistic is deterministically linked to a consistent estimator of discriminability. The power of permutation tests derived from these measures is compared numerically under Gaussian and non-Gaussian settings, with and without simulated batch effects. Motivated by both theoretical and empirical results, we provide methodological recommendations for each benchmark setting to serve as a resource for future analyses. We believe these recommendations will play an important role towards improving repeatability in fields such as functional magnetic resonance imaging, genomics, pharmacology, and more.


## 1 Introduction

Data repeatability is defined as consistency or similarity across technical replicates of a measurement. To avoid ambiguity, we use the term without assuming that any one of the replicates is the correct, true measurement. The same definition is often referred to as test-retest reliability, or the reproducibility of a measurement procedure, where the consistency of repeated measurements is being emphasized [muller1994critical]. However, the general concepts of reliability and reproducibility are often applied beyond the consistency of repeated measurements, depending on the context. General reviews of the concept of research reproducibility, with comparison to replicability, can be found in [goodman2016does, patil2016statistical]. A rich literature exists for other related, but distinct, types of reliability, such as inter-rater reliability (an overview can be found in [gwet2014handbook]). In summary, we selectively focus on the evaluation of data repeatability, as a crucial starting point for evaluating measurement validity.

Data repeatability reflects the stability of the whole data generating process, which often creates inevitable noise and variability [bach2012knowing, beck2012not], or potentially involves complex steps of data collection and data preprocessing, especially for studies of big data [labrinidis2012challenges, garcia2016big, lichtman2014big]. It can be considered as the counterpart of the stability property of statistical methodology [yu2013stability]; both are cornerstones for the reproducibility of scientific research. This critical role is highlighted as the reproducibility crisis becomes a concern in many scientific domains [Baker2016-BAKSL-2, open2015estimating, button2013power]. Repeatability is also used as a key tool to detect likely irreproducible findings and statistical errors. For example, a recent outcry over issues in repeated use of data in the field of cognitive neuroscience [vul2009puzzlingly] relied on the absence of the required repeatability level as proof of the issue. Some have also argued that the misinterpretation of repeatability can result in false confidence in a study’s reproducibility and subsequently lead to the neglect of important design issues [turner2018small]. A thorough investigation and accurate interpretation of data repeatability is crucial for a better understanding of existing issues of reproducibility and for working towards better future practices.

The intraclass correlation coefficient (ICC) is a commonly used metric for data repeatability or test-retest reliability. However, the ICC is limited in several ways when applied to multivariate and complex big data. First, it was developed for univariate data, and there is no consensus on how one should synthesize multiple ICCs over each dimension of the measurement, or for measurements with different dimensions. Second, the definition and inference of ICC are based on a relatively strict parametric ANOVA model assuming separability and additivity. Third, Gaussian assumptions are often applied for inference, and these are often suspect in practice.

Recently, several novel data repeatability measures have been proposed, including fingerprinting, which is based on the idea of subject identification [finn2015functional, finn2017can, wang2018statistical], rank sums [airan2016factors], and the image intraclass correlation coefficient (I2C2) [shou2013quantifying], which is a generalization of the classical univariate ICC. Unlike univariate methods, such as ICC, these newly proposed methods can handle high-dimensional complexity and scale computationally. Because they are built on rank transformations, the nonparametric methods (fingerprinting, rank sums) are robust to model violations.

However, the relations between, and the best practices among, these methods remain largely unknown. Furthermore, clear relationships in interpretation and performance are lacking. Thus, less effective or less robust measures of data quality are often used, potentially leading to worse study practices, worse processing pipelines and sub-optimal performance of prediction algorithms.

In this manuscript, we particularly focus on discriminability [bridgeford2019optimal], a new data repeatability measure. It is defined upon a general repeated measurement model that is free of parametric assumptions, yet remains deterministically linked to ICC for univariate measurement, when ANOVA assumptions are met. It has been proved that (under assumptions) the most discriminable measurements are optimally accurate in the Bayes error rate of subsequent inferential tasks, regardless of what the actual task is.

We focused on discriminability by investigating its mathematical relationships with other multivariate repeatability measures. This resembles the relation between optimal intra- and inter-subject correlations and ICC of the measurements under univariate scenarios, an idea that has been recently studied and discussed in neuroimaging [vul2009puzzlingly, bennett2010reliable, zuo2014test].

In addition, we numerically compared the methods in terms of their ability to detect significance in permutation tests that were specifically designed for discovering the existence of data repeatability. To summarize, our results illustrate the general power advantages of discriminability when compared to other nonparametric methods, and its robustness advantages against the violation of Gaussian assumptions when compared to parametric methods. Of course, parametric methods may be more powerful when distributional assumptions are satisfied. In addition, the rank sum method shows additional robustness against mean shift batch effects compared to discriminability.

The field of functional magnetic resonance imaging (fMRI) is a concrete example where data repeatability is of key interest, because the data quality can be impacted by noise, biological confounds, and complex acquisition and processing choices. The results discussed in this manuscript will potentially improve the evaluation and optimization of fMRI data repeatability. General reviews of fMRI reliability can be found in [bennett2010reliable, frohner2017addressing], although their emphasis was on the reliability of results, not necessarily restricted to measurement. For example, popular cluster-overlap-based reliability measures, such as the Dice coefficient and Jaccard index, are restricted to the similarity analysis of graphs and sets and are not applicable to other types of data. Some concepts similar to inter-rater reliability in fMRI, such as inter-site, inter-scanner or inter-technologist reliability, are not discussed in detail, but the measures discussed in this manuscript can be applied.

The quantification of data repeatability is even more crucial for fMRI-based functional connectivity (FC), where second order statistics (usually correlations) or various network-based graph metrics are the object of study. Resting state correlations are particularly sensitive to biological confounds, in contrast to task based fMRI, where the confound is often not correlated with the task. Variability can be induced by changes in the physiological and cognitive status of a subject, within a single scan session or between two sessions that are hours, days, or months apart. In addition, common practices in the field can raise questions about data quality too [zuo2014test, jiang2015toward]. For example, auto-correlations in the BOLD time series might violate independence and parametric assumptions in correlation analyses. Averaging the time series over a large region may involve voxels with low functional homogeneity and introduce spurious variability. It is also a concern when, as is typical, a number of reasonable preprocessing options are available that produce varying measurement outcomes. Processing choices can be particularly difficult to generalize across studies, since target measurements can be on different scales or formed with a different data reduction strategy (seed-to-voxels, voxel-by-voxel, region-by-region, etc.). In all these scenarios, understanding data repeatability is a prerequisite for any meaningful scientific discovery or clinical application. Objective repeatability measures, preferably nonparametric and able to accommodate varying data dimensions (such as the nonparametric measures discussed in this manuscript), are needed. Moreover, the relevance of data repeatability goes beyond questions regarding the quality of correlation-based FC measurements [noble2019decade] and extends to a broad range of applications. Some examples include: selecting best practices for data acquisition and preprocessing [pervaiz2019optimising], identifying FC biomarkers [gabrieli2015prediction, castellanos2013clinical, kelly2012characterizing], optimizing FC-based prediction models [svaldi2019optimizing], and evaluating the accuracy of multi-class prediction algorithms [zheng2018extrapolating].

## 2 Review of Existing Data Repeatability Measures

In this section, we will define several measures of data repeatability under their associated statistical models. We define the measures as population quantities; we subsequently give the natural estimators for each.

### 2.1 Intraclass Correlation Coefficients

We consider two types of intraclass correlations, ICC and I2C2 [shou2013quantifying]. Without modifications, ICC is designed for evaluating the repeatability of one dimensional measurements, such as expert ratings or composite mental health scores. It can also be utilized in various ways for multivariate measurements, for example, by averaging ICCs over each of the dimensions or by counting the percentage of dimensions that pass a threshold on ICC, say, being greater than 0.4. However, for the latter scenario there is no consensus on the best practice, and the interpretation may be subject to the researcher’s choices. ICC can be generalized to higher dimensions provided a multivariate model that decomposes variation into a sum of intra- and inter-subject levels and a definition of the fraction of variation that is inter-subject. I2C2 is one such generalization of ICC that was designed for high dimensional settings.

Other generalizations of ICC are outside the setting of interest for this paper. For example, intraclass correlations can also be defined under various two-way ANOVA models [shrout1979intraclass], which are suitable for the evaluation of inter-rater reliability or internal consistency. However, these measures are not relevant for the evaluation of test-retest reliability [rousson2002assessing, bruton2000reliability]. Other popular reliability measures, such as variations on the Alpha and Kappa statistics are not covered, for the same reason of being less relevant to the study of data repeatability.

To elaborate on models, for ICC, suppose that we have $n$ subjects, each with $s$ measurements. A univariate Analysis of Variance (ANOVA) model with Gaussian random effects is specified as:

$$x_{it} = \mu + \mu_i + e_{it}, \tag{1}$$

where $\mu_i \sim N(0, \sigma_\mu^2)$ and $e_{it} \sim N(0, \sigma^2)$ are mutually independent.

For $l$-dimensional measurements, (1) is generalized as a Multivariate Analysis of Variance (MANOVA) model with Gaussian random effects:

$$x_{it} = \mu + \mu_i + e_{it}, \tag{2}$$

where $\mu_i \sim N_l(0, \Sigma_\mu)$, $e_{it} \sim N_l(0, \Sigma)$, independently. All the vectors are $l$-dimensional.

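In the univariate case (1), ICC is defined as:

$$\lambda = \operatorname{corr}(x_{it}, x_{it'}) = \frac{\sigma_\mu^2}{\sigma_\mu^2 + \sigma^2},$$

for all $t' \neq t$. Assuming the measurements of the same subject form a class, then $x_{it}$ and $x_{it'}$ are both from the $i$-th class, hence the name ("intra-class").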

For the multivariate case (2), a popular generalization of ICC using matrix determinants is

$$\Lambda = \frac{\det(\Sigma_\mu)}{\det(\Sigma_\mu) + \det(\Sigma)},$$

commonly known as Wilks’ lambda ($\Lambda$). Using matrix traces, the generalization becomes

$$\Lambda_{tr} = \frac{\operatorname{tr}(\Sigma_\mu)}{\operatorname{tr}(\Sigma_\mu) + \operatorname{tr}(\Sigma)}.$$

This repeatability measure is particularly useful for high-dimensional imaging settings and was utilized in the image intraclass correlation coefficient (I2C2) [shou2013quantifying]. Recall that the trace of the covariance matrix captures the total variability of the random quantity of interest. Then, intuitively, $\Lambda_{tr}$ represents the fraction of the variability in the observed data due to the subject effect $\mu_i$.

For the univariate case, the estimation of ICC is often conducted through one-way ANOVA. It is also well known that the resulting estimate, $\hat{\lambda}$, is a non-decreasing function of the F statistic for a given number of measurements per subject, $s$.
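As a minimal sketch of the one-way ANOVA estimate, the snippet below computes the between- and within-subject mean squares and the textbook estimator $(MSB - MSW)/(MSB + (s-1)\,MSW)$, which can equivalently be written as $(F-1)/(F+s-1)$. The function name `icc_oneway` and the toy data are illustrative assumptions, not taken from the manuscript.

```python
from statistics import mean

def icc_oneway(x):
    """One-way ANOVA estimate of ICC for x[i][t]: n subjects, s measurements each.

    Hypothetical helper for illustration; assumes a balanced design.
    """
    n, s = len(x), len(x[0])
    grand = mean(v for row in x for v in row)
    subj_means = [mean(row) for row in x]
    # Between-subject and within-subject mean squares.
    msb = s * sum((m - grand) ** 2 for m in subj_means) / (n - 1)
    msw = sum((v - m) ** 2 for row, m in zip(x, subj_means) for v in row) / (n * (s - 1))
    f_stat = msb / msw
    icc = (msb - msw) / (msb + (s - 1) * msw)
    return icc, f_stat

# Toy test-retest data: 4 subjects, 2 measurements each (made-up values).
x = [[1.0, 1.2], [2.0, 1.8], [3.1, 2.9], [4.0, 4.3]]
icc, f_stat = icc_oneway(x)
```

Writing the estimate as $(F-1)/(F+s-1)$ makes the monotone dependence on the F statistic explicit for fixed $s$.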

I2C2 was estimated using a hierarchical generalization of principal components called multilevel functional principal components analysis (MFPCA) [di2009multilevel]. The MFPCA algorithm utilizes a moment based approach to separate variability into inter- and intra-subject components in a method similar to Henderson’s equations in mixed models [henderson1959estimation]. Singular value decomposition tricks can be used to make calculations tractable in high dimensions [zipunnikov2011multilevel]. In principle, other multivariate approaches can be used to estimate $\Sigma_\mu$ and $\Sigma$. For example, it would be a straightforward change in I2C2 to estimate $\Lambda$ instead of $\Lambda_{tr}$. In addition, latent Gaussian models [chib1998analysis] can extend these approaches to binary data and graphs [yue2015estimating].

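One of the commonly discussed properties of ICC is its relation with the optimal correlation between two univariate outcomes [vul2009puzzlingly, bennett2010reliable, zuo2014test]. It states:

$$\operatorname{corr}\left(x^{(1)}_{it}, x^{(2)}_{it}\right) = \operatorname{corr}\left(\mu^{(1)}_i, \mu^{(2)}_i\right) \sqrt{\mathrm{ICC}\left(x^{(1)}_{it}\right) \cdot \mathrm{ICC}\left(x^{(2)}_{it}\right)},$$

where $x^{(1)}_{it}$ and $x^{(2)}_{it}$ follow the ANOVA model with subject random effects $\mu^{(1)}_i$ and $\mu^{(2)}_i$, respectively, without the requirement of Gaussian distributions.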
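To make the attenuation concrete, here is a short worked example with made-up numbers (the values 0.8, 0.5 and 0.6 are assumptions chosen only for illustration). Even when the underlying subject effects are strongly correlated, moderate ICCs cap the observable correlation well below that level:

```latex
\operatorname{corr}\left(\mu^{(1)}_i, \mu^{(2)}_i\right) = 0.8,\quad
\mathrm{ICC}\left(x^{(1)}_{it}\right) = 0.5,\quad
\mathrm{ICC}\left(x^{(2)}_{it}\right) = 0.6
\;\Longrightarrow\;
\operatorname{corr}\left(x^{(1)}_{it}, x^{(2)}_{it}\right)
  = 0.8 \times \sqrt{0.5 \times 0.6}
  = 0.8 \times \sqrt{0.30}
  \approx 0.44 .
```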

### 2.2 Fingerprinting

As its name suggests, fingerprinting is the idea of matching subjects to themselves in repeated measurements, where errors could potentially occur through mismatches with other subjects [wang2018statistical]. The count or proportion of matches for a matching scheme represents an intuitive summary of data repeatability. This measure has become especially popular in neuroimaging due to a few highly visible articles [anderson2011reproducibility, finn2015functional, xu2016assessing].

We first formalize the idea of a population-level fingerprinting measure for repeated measurements. It is assumed that each subject is measured twice, and that the measurement is possibly multivariate. Then each subject, $i$, at time point, $t$, has measurement, $x_{it}$, $i = 1, \dots, n$, $t = 1, 2$. Suppose there exists a distance metric, $\delta(\cdot, \cdot)$, defined between measurements, $x_{it}$ and $x_{i't'}$, and write $\delta_{i,1,2} = \delta(x_{i1}, x_{i2})$ and $\delta_{i,i',1,2} = \delta(x_{i1}, x_{i'2})$. Define the population level fingerprint index as:

$$F_{index} = \mathbb{P}\left(\delta_{i,1,2} < \delta_{i,i',1,2};\ \forall\, i' \neq i\right), \tag{3}$$

where the probability is calculated over a random sample of $n$ subjects. This is the population probability that a random subject matches themselves over any other in the sample.

Implicitly, such a measure is defined under a much more flexible model. For (3) to be a meaningful population quantity, it is only required that the resulting $F_{index}$ is equal for all $i$, which covers the (M)ANOVA models (1) and (2) with Gaussian random effects as special cases. However, the relationship between ICC and the fingerprinting index is unknown.

The natural estimate of (3) is the proportion of correct matches in a group of subjects. This requires assuming a matching strategy, such as whether matching is done with or without replacement [wang2018statistical]. Almost all fingerprint index studies use matching with replacement, as follows. The total number of correct matches (with replacement) is:

$$T_n = \sum_{i=1}^{n} \mathbb{I}\left\{\delta_{i,1,2} < \delta_{i,i',1,2};\ \forall\, i' \neq i\right\},$$

where $\mathbb{I}\{\cdot\}$ is the indicator function. Then, the fingerprint index estimator is simply the proportion of correct matches:

$$\hat{F}_{index} = \frac{T_n}{n}. \tag{4}$$
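The estimator in (4) can be sketched in a few lines, assuming Euclidean distance as the metric and matching with replacement. The function names and toy data below are illustrative assumptions rather than the authors' implementation.

```python
import math

def euclid(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def fingerprint_index(x1, x2, dist=euclid):
    """Proportion of subjects whose own second measurement is strictly the
    closest to their first measurement (matching with replacement)."""
    n = len(x1)
    matches = 0
    for i in range(n):
        d_self = dist(x1[i], x2[i])
        if all(d_self < dist(x1[i], x2[j]) for j in range(n) if j != i):
            matches += 1
    return matches / n

# Toy data: 3 subjects, 2-dimensional measurements at two time points.
# Subjects 2 and 3 have their second measurements swapped on purpose.
x1 = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
x2 = [[0.1, 0.0], [4.0, 4.0], [1.2, 1.0]]
f_hat = fingerprint_index(x1, x2)  # only the first subject matches
```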

### 2.3 Rank Sums

In the test-retest setting with $s = 2$, the fingerprint statistic can be generalized as a Mann-Whitney style statistic. Instead of counting the events where $x_{i2}$ is the closest to $x_{i1}$ among all $x_{i'2}$ with $i' = 1, \dots, n$, consider calculating the rank. Formally, the rank sum statistic is defined by summing up the $r_{ii}$'s, where $r_{ii}$ is the rank of $\delta_{i,1,2}$ among the $n$ distances from $x_{i1}$ to the second measurements $x_{i'2}$, $i' = 1, \dots, n$. Assuming that there are no ties (or that the max ranks are assigned), the rank sum statistic is:

$$R_n = \sum_{i=1}^{n} r_{ii} = n + \sum_{i=1}^{n} \sum_{i' \neq i} \mathbb{I}\left\{\delta_{i,i',1,2} \le \delta_{i,1,2}\right\}. \tag{5}$$

Notice that $T_n = \sum_{i=1}^{n} \mathbb{I}\{r_{ii} = 1\}$; thus the ranks are sufficient for determining the fingerprint index. Of course, the fingerprinting statistic ignores the information contained in the ranks, other than the number of within-subject ranks equal to $1$. Thus, it may seem obvious that the rank sum statistic is superior to the fingerprint statistic in some sense. However, it should also be noted that the rank sum statistic lacks an intuitive relationship with a population quantity, like the fingerprint statistic has with the fingerprint index. In addition, both the fingerprint and rank sum statistics lack an obvious generalization to more than two repeated measurements, as they were developed for comparing paired measurements.
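The following sketch computes the within-subject ranks, the rank sum, and the number of fingerprint matches from a precomputed distance matrix. It assumes ascending ranks, so that a rank of 1 means the subject's own second measurement is the closest, with the maximum rank assigned to ties. The function name and the matrix entries are made up for illustration.

```python
def rank_sum(dist_matrix):
    """dist_matrix[i][j]: distance between subject i at time 1 and subject j at time 2.

    Returns the within-subject ranks r_ii, their sum R_n, and the number of
    fingerprint matches T_n (ranks equal to 1).
    """
    n = len(dist_matrix)
    ranks = []
    for i in range(n):
        row = dist_matrix[i]
        # Ascending rank with max rank for ties: entries <= the self-distance.
        ranks.append(sum(1 for d in row if d <= row[i]))
    r_n = sum(ranks)
    t_n = sum(1 for r in ranks if r == 1)
    return ranks, r_n, t_n

# Hypothetical 3-subject distance matrix (rows: time 1, columns: time 2).
d = [[0.1, 5.7, 1.6],
     [1.3, 4.2, 0.2],
     [7.0, 1.4, 5.5]]
ranks, r_n, t_n = rank_sum(d)
```

Here only the first subject has rank 1, so the fingerprint count uses a single bit of information per subject, while the rank sum also records how badly the other two subjects are mismatched.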

## 3 Discriminability as a Repeatability Measure

In this section, we will formally define the concept of discriminability under a flexible model of repeated measurements. We will then prove that discriminability is indeed a repeatability measure, as it is deterministically related to ICC when the Gaussian ANOVA assumptions are met. Notably, an optimal accuracy property of discriminability in the Bayes error rate is applicable for multivariate measurements, whereas this property has only been shown under univariate measurements for ICC. We will also investigate the relation between discriminability and the other aforementioned measures with the goal of increasing interpretability across studies when using different repeatability measures.

### 3.1 General Model of Repeated Measurements

Let $v_i$ be a true physical property of interest for subject $i$. Without the ability to directly observe $v_i$, we instead observe $w_{it} = f_\phi(v_i, t)$, for some random measurement process $f_\phi$, where $\phi$ characterizes the measurement process, and $w_{it}$ is the observed measurement of property $v_i$. As $f_\phi$ is a random process, the index, $t$, is used to emphasize that the observation using process $f_\phi$ may differ across repeated trials, typically performed sequentially in time.

In many settings, the measurement process may suffer from known or unknown confounds created in the process of measurement. For example, when taking a magnetic resonance image (MRI) of a brain, the MRI may be corrupted by motion (movement) or signal intensity artifacts. The observed data, $w_{it}$, may therefore be unsuitable for direct inference, and instead is pre-processed via the random process $g_\psi$ to reduce measurement confounds. Here, $\psi$ characterizes the pre-processing procedure chosen, such as motion or other artifact correction in our MRI example. We define $x_{it} = g_\psi(w_{it})$ as the pre-processed measurement of $v_i$ for subject $i$ from measurement index $t$. Let $\delta$ be a metric. We use simplified notations such as $\delta_{i,t,t'} = \delta(x_{it}, x_{it'})$ and $\delta_{i,i',t,t''} = \delta(x_{it}, x_{i't''})$.

Data repeatability can be considered as a function of the combination of an acquisition procedure, $\phi$, and a chosen pre-processing procedure, $\psi$. Of course, it can be defined exclusively for a subset of the data generating procedure. For instance, when the data has already been collected, the researchers may only be able to manipulate pre-processing, $\psi$, and not acquisition, $\phi$, procedures. Then, one intended use of the repeatability measure is to optimize over those aspects of the measurement process the researcher is able to manipulate: $\psi^* = \arg\max_{\psi} u(\psi, \phi)$, where $u$ is an unspecified repeatability measure.

Although we will define discriminability with the general framework above, the following additive noise model is a useful special case that maintains tractability:

$$x_{it} = v_i + \epsilon_{it}, \tag{6}$$

where $\epsilon_{it} \overset{ind}{\sim} f_\epsilon$, and $\operatorname{var}(\epsilon_{it}) < \infty$ with $E[\epsilon_{it}] = c$. This model still contains the (M)ANOVA scenarios as special cases and is free of parametric assumptions; both the fingerprinting index and discriminability are well-defined under it. This model will be revisited as we discuss the permutation tests in the next section.

### 3.2 Definition of Discriminability

If the measurement procedure is effective, we would anticipate that our physical property of interest for any subject $i$, $v_i$, would differ from that of another subject $i'$, $v_{i'}$. Thus, an intuitive notion of reliability would expect that subjects would be more similar to themselves than to other subjects. Specifically, we would expect in a good measurement that $x_{it}$ is more similar to $x_{it'}$ (a repeated measurement on subject $i$) than to $x_{i't''}$ (a measurement on subject $i'$ at time $t''$).

Discriminability is defined as:

$$D(\psi, \phi) = \mathbb{P}\left(\delta_{i,t,t'} < \delta_{i,i',t,t''}\right).$$

Similar to the fingerprinting index, discriminability is well defined as long as $D(\psi, \phi)$ is equal for all $i, i', t, t', t''$ (such that $i \neq i'$, $t \neq t'$). That is, this definition assumes that discriminability does not depend on the specific subjects and measurements being considered. This can be considered a form of exchangeability. Subsequently, we consider models that are consistent with this definition in the Gaussian (M)ANOVA models (1), (2). One could consider a form of population averaged discriminability if $D(\psi, \phi)$ does depend on subjects. However, this is outside of the scope of this manuscript.

To estimate discriminability, assume that for each individual, $i$, we have $s$ repeated measurements. Sample discriminability is then defined as:

$$\hat{D} = \frac{\sum_{i=1}^{n} \sum_{t=1}^{s} \sum_{t' \neq t} \sum_{i' \neq i} \sum_{t''=1}^{s} \mathbb{I}\left\{\delta_{i,t,t'} < \delta_{i,i',t,t''}\right\}}{n \cdot s \cdot (s-1) \cdot (n-1) \cdot s}, \tag{7}$$

where $n$ is the total number of subjects. Then $\hat{D}$ represents the fraction of observations where $x_{it}$ is more similar to $x_{it'}$ than to the measurement $x_{i't''}$ of another subject $i'$, for all pairs of subjects $i \neq i'$ and all pairs of time points $t \neq t'$.

Under the additive noise model (6), it can be proven that $\hat{D}$ is unbiased and consistent for discriminability (Appendix A).
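A direct, unoptimized sketch of the estimator in (7) is given below, assuming Euclidean distance. It loops over every within-subject pair and every measurement of every other subject, so it is meant to mirror the formula rather than to be efficient. The function names and toy data are assumptions for illustration.

```python
import math

def euclid(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def discriminability(x, dist=euclid):
    """Sample discriminability for x[i][t]: measurement t of subject i.

    Counts how often a within-subject distance is strictly smaller than a
    between-subject distance, then divides by the number of comparisons.
    """
    n, s = len(x), len(x[0])
    hits = 0
    for i in range(n):
        for t in range(s):
            for t2 in range(s):
                if t2 == t:
                    continue
                d_within = dist(x[i][t], x[i][t2])
                for j in range(n):
                    if j == i:
                        continue
                    for t3 in range(s):
                        if d_within < dist(x[i][t], x[j][t3]):
                            hits += 1
    return hits / (n * s * (s - 1) * (n - 1) * s)

# Toy data: 3 subjects, 2 one-dimensional measurements each (made-up values).
# The third subject is deliberately unrepeatable.
x = [[[0.0], [0.2]], [[1.0], [1.1]], [[0.3], [3.0]]]
d_hat = discriminability(x)  # 17 of 24 comparisons favour the within-subject pair
```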

### 3.3 Discriminability is Deterministically Linked with ICC

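Interestingly, under the ANOVA model (1), discriminability is deterministically linked to ICC. The argument is relatively easy and instructive on the relationship between these constructs, and therefore we present it here. Considering Euclidean distance as the metric, discriminability ($D$) is:

$$\begin{aligned}
D &= \mathbb{P}\left(|x_{it} - x_{it'}| < |x_{it} - x_{i't''}|\right) \\
  &= \mathbb{P}\left(|e_{it} - e_{it'}| < |\mu_i - \mu_{i'} + e_{it} - e_{i't''}|\right) \\
  &\overset{def}{=} \mathbb{P}\left(|A| < |B|\right)
\end{aligned}$$

for $A = e_{it} - e_{it'}$, $B = \mu_i - \mu_{i'} + e_{it} - e_{i't''}$. Then $(A, B)^\top$ follows a joint normal distribution, with mean vector $0$ and covariance matrix $\begin{pmatrix} 2\sigma^2 & \sigma^2 \\ \sigma^2 & 2\sigma_\mu^2 + 2\sigma^2 \end{pmatrix}$. Hence:

$$D = 1 - \frac{1}{\pi}\arctan\left(\frac{\sqrt{\sigma^2(3\sigma^2 + 4\sigma_\mu^2)}}{\sigma_\mu^2}\right) = \frac{1}{2} + \frac{1}{\pi}\arctan\left(\frac{\mathrm{ICC}}{\sqrt{(1 - \mathrm{ICC})(\mathrm{ICC} + 3)}}\right).$$

Therefore, $D$ and ICC are deterministically linked through a non-decreasing transformation under the ANOVA model with Gaussian random effects. Figure 1 shows a plot of the non-linear relationship. For an ICC of roughly 0.68, the two measures are equal, with discriminability being smaller for ICCs larger than 0.68 and larger for ICCs lower. It is perhaps useful to consider $2D - 1$, which rescales discriminability to range between $0$ and $1$, similar to ICC.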
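The closed-form link can be evaluated directly. The sketch below implements both forms of the expression (in terms of ICC and in terms of the variance components) so they can be checked against each other; the function names are assumptions for illustration.

```python
import math

def disc_from_icc(icc):
    """Discriminability implied by ICC under the Gaussian ANOVA model (1)."""
    if icc >= 1.0:
        return 1.0  # limiting value as the noise variance goes to zero
    return 0.5 + math.atan(icc / math.sqrt((1 - icc) * (icc + 3))) / math.pi

def disc_from_variances(sigma2_mu, sigma2):
    """Same quantity written with the variance components (sigma2_mu > 0)."""
    ratio = math.sqrt(sigma2 * (3 * sigma2 + 4 * sigma2_mu)) / sigma2_mu
    return 1.0 - math.atan(ratio) / math.pi

# Evaluate the link on a small grid of ICC values.
grid = [0.0, 0.2, 0.4, 0.68, 0.9]
curve = [disc_from_icc(v) for v in grid]
```

Discriminability starts at 0.5 when ICC is 0 (within- and between-subject distances are then exchangeable) and crosses the identity line near an ICC of 0.68.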

Recall that the optimal correlation between two univariate measurements is a non-decreasing function of the ICC of each of the measurements. Since discriminability is deterministically linked to ICC via a strictly increasing function, this property also holds for discriminability.

Another scenario where the repeatability measure may become critical is the prediction problem with multivariate predictors. In such a scenario, the optimal prediction error, in terms of the Bayes error rate of a classification task, can be bounded by a decreasing function of the discriminability of the multivariate predictors [bridgeford2019optimal]. Thus, it is interesting to note that ICC inherits this property exactly, as it holds for any one-to-one transformation of discriminability.

### 3.4 Relation with Other Repeatability Measures

#### 3.4.1 Fingerprinting

In the test-retest setting where the fingerprint index is defined, we can prove that the fingerprint index has the following relationship with the discriminability, $D$ (Appendix C):

$$F_{index} = \rho D + (1 - \rho) D^{n-1},$$

so long as the correlation, $\rho$, is non-negative.

The non-negativity condition can be checked with simulation or numerical integrals when a parametric model is posited. For example, under the Gaussian ANOVA model (1), where the univariate ICC is defined, the aforementioned correlation, $\rho$, is positive across all the simulated parameter values.

When the non-negativity condition holds true, the fingerprint index decreases to a limit of $\rho D$ as the sample size, $n$, increases. However, the diminishing term, $(1 - \rho) D^{n-1}$, may not be negligible with large enough $D$ and small enough $n$. This illustrates the fact that the fingerprint index may not be invariant for different sample sizes that are below 10 to 15, even when the discriminability holds constant.
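The sample-size dependence is easy to see numerically. The sketch below evaluates the relationship for a fixed discriminability and a fixed correlation; both values (0.9 and 0.5) are made-up assumptions, and treating the correlation as constant in the sample size is a simplification for illustration only.

```python
def fingerprint_from_disc(d, rho, n):
    """Fingerprint index implied by discriminability d, correlation rho
    (assumed non-negative), and sample size n."""
    return rho * d + (1 - rho) * d ** (n - 1)

d, rho = 0.9, 0.5  # hypothetical values
f5 = fingerprint_from_disc(d, rho, 5)
f15 = fingerprint_from_disc(d, rho, 15)
f100 = fingerprint_from_disc(d, rho, 100)
limit = rho * d
```

With these inputs the index drops noticeably between 5 and 15 subjects and is close to its limit by 100 subjects, even though discriminability is held fixed throughout.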

#### 3.4.2 Rank Sums

The relationship between population discriminability and the fingerprint index relies on a data dependent correlation value, $\rho$, and there is no direct relationship between their estimators. However, interestingly, the sample discriminability can be rewritten as a function of a form of rank sums. In addition, the specific form of the rank sum statistic, $R_n$, can be transformed to a consistent estimator of discriminability. These results suggest that the estimator of discriminability retains the rank information that the fingerprint statistic discards. Below we demonstrate this relationship.

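Denote the $n$ by $n$ inter-measurement distance sub-matrix as $D^{t,t'} = (\delta_{i,i',t,t'})_{i,i'=1,\dots,n}$. Let the combined $ns$ by $ns$ distance matrix be $\mathbf{D}$, which consists of $s$ by $s$ blocks where the $(t,t')$ block is $D^{t,t'}$. Let $r^{tt'}_{ii'}$ denote the ranking within rows in the combined distance matrix $\mathbf{D}$. We assign the maximum ranks for ties.

It can be shown (Appendix B) that another consistent estimator of discriminability in the rank form is

$$\tilde{D} = \frac{n^2 s^2 (s-1) - \sum_{t=1}^{s} \sum_{t' \neq t} \sum_{i=1}^{n} r^{tt'}_{ii}}{n s (s-1)(n-1) s}. \tag{8}$$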

This representation highlights the close relation between discriminability and rank sums.
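The rank form in (8) can be sketched as follows, assuming Euclidean distance and the maximum rank for ties. For each measurement, the code builds the corresponding row of the combined distance matrix and ranks each within-subject distance inside that row. Function names and data are illustrative assumptions.

```python
import math

def euclid(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def disc_rank_form(x, dist=euclid):
    """Rank-based discriminability estimate for x[i][t]."""
    n, s = len(x), len(x[0])
    total_rank = 0
    for i in range(n):
        for t in range(s):
            # Row of the combined ns-by-ns distance matrix for measurement (i, t),
            # including the zero distance of the measurement to itself.
            row = [dist(x[i][t], x[j][u]) for u in range(s) for j in range(n)]
            for t2 in range(s):
                if t2 == t:
                    continue
                d_within = dist(x[i][t], x[i][t2])
                # Max rank for ties: number of row entries <= d_within.
                total_rank += sum(1 for d in row if d <= d_within)
    numerator = n ** 2 * s ** 2 * (s - 1) - total_rank
    return numerator / (n * s * (s - 1) * (n - 1) * s)

# Toy data: 3 subjects, 2 one-dimensional measurements each (made-up values).
x = [[[0.0], [0.2]], [[1.0], [1.1]], [[0.3], [3.0]]]
d_tilde = disc_rank_form(x)
```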

In fact, the specific form of the rank sum statistic, (5), can be transformed to another estimator of discriminability. In a test-retest setting with $s = 2$, instead of ranking the combined distance matrix, $\mathbf{D}$, let $r_{ii}$ be the rank of $\delta_{i,1,2}$ among $\delta_{i,i',1,2}$, $i' = 1, \dots, n$, which ranks the $i$-th row of the inter-measurement distance sub-matrix $D^{1,2}$. If ties occur, the max ranks are assigned.

This transformation of the rank sum statistic, $R_n$, forms an unbiased and consistent estimator of $D$:

$$\hat{D}_{rs} = \frac{\sum_{i=1}^{n} (n - r_{ii})}{n(n-1)} = \frac{n^2 - R_n}{n(n-1)}. \tag{9}$$
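As a small sketch of (9), the code below ranks each subject's self-distance within its row of the time-1 by time-2 distance matrix (ascending, maximum rank for ties) and applies the transformation. The function name and matrix entries are made up for illustration.

```python
def disc_from_rank_sum(dist_matrix):
    """Rank sum estimator of discriminability for a test-retest design.

    dist_matrix[i][j]: distance between subject i at time 1 and subject j at time 2.
    """
    n = len(dist_matrix)
    # r_ii: number of entries in row i that are <= the self-distance.
    ranks = [sum(1 for d in dist_matrix[i] if d <= dist_matrix[i][i]) for i in range(n)]
    r_n = sum(ranks)
    return (n ** 2 - r_n) / (n * (n - 1))

# Hypothetical 3-subject distance matrix.
d = [[0.1, 5.7, 1.6],
     [1.3, 4.2, 0.2],
     [7.0, 1.4, 5.5]]
d_rs = disc_from_rank_sum(d)
```

In this example, 3 of the 6 between-subject distances exceed the corresponding within-subject distance, which is exactly the fraction the estimator returns.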

If there exist multiple measurements for each subject, then for all the pairs of distinct $t$ and $t'$, the rank sum statistic and estimation can be calculated between the $t$-th measurements and the $t'$-th measurements. Compared to $\hat{D}$ and $\tilde{D}$, the rank sum statistic does not involve any ranking information from the diagonal blocks, $D^{t,t}$, of the combined distance matrix. This may result in a larger standard error for estimation and a lower power for inference using the rank sums. However, it provides some robustness against mean shift batch effects, as demonstrated in Section 5.3.

#### 3.4.3 I2C2

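Under the $l$-dimensional MANOVA model specified in (2), again considering the Euclidean distances, discriminability becomes:

$$\begin{aligned}
D &= \mathbb{P}\left(\|x_{it} - x_{it'}\| - \|x_{it} - x_{i't''}\| < 0\right) \\
  &= \mathbb{P}\left(\|e_{it} - e_{it'}\| - \|e_{it} - e_{i't''} + \mu_i - \mu_{i'}\| < 0\right) \\
  &\overset{def}{=} \mathbb{P}\left(\|A\| - \|B\| < 0\right),
\end{aligned}$$

where $A$ and $B$ are jointly multivariate normal with means $0$, variances $2\Sigma$ and $2\Sigma_\mu + 2\Sigma$, respectively, and covariance, $\Sigma$. Note that $Z = A^\top A - B^\top B$ is an indefinite quadratic form of the vector $(A^\top, B^\top)^\top$ (around a matrix whose block diagonal entries are an identity matrix and the negative of an identity matrix). Thus, $Z$ can be decomposed as a linear combination of independent $\chi^2$ variables [provost1996exact]:

$$Z \overset{D}{=} \sum_{u=1}^{r} \lambda_u U_u - \sum_{u=r+1}^{r+w} \lambda'_u U_u, \tag{10}$$

where $\lambda_1, \dots, \lambda_r$ are the positive eigenvalues of $\Sigma_{AB}^{1/2}\,\mathrm{diag}(I, -I)\,\Sigma_{AB}^{1/2}$, with $\Sigma_{AB}$ the joint covariance matrix of $(A^\top, B^\top)^\top$, $\lambda'_{r+1}, \dots, \lambda'_{r+w}$ are the absolute values of the negative eigenvalues of the same matrix, and the $U_u$ are IID $\chi^2$ variables with one degree of freedom.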

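Although this does not result in a deterministic link between $D$ and I2C2, it can be shown that there exist approximations matching the first two moments of $\sum_{u=1}^{r} \lambda_u U_u$ and $\sum_{u=r+1}^{r+w} \lambda'_u U_u$. Furthermore, the approximation of $D$ can be bounded by two non-decreasing functions of I2C2 (Appendix D). Specifically, the resulting discriminability approximation has the form of a CDF value of an F-distribution,

$$D = \mathbb{P}(Z \le 0) \approx F_{F\left(\frac{V_1^2}{W_1}, \frac{V_2^2}{W_2}\right)}\left(\frac{V_2}{V_1}\right), \tag{11}$$

where $V_1, W_1$ (or $V_2, W_2$) are the sum and the sum of squares of the absolute values of the positive (or negative) eigenvalues. Moreover, when the degrees of freedom $V_1^2/W_1$ and $V_2^2/W_2$ are constant, the approximation is bounded by a non-decreasing interval of I2C2 (Figure 2):

$$F_{F\left(\frac{V_1^2}{W_1}, \frac{V_2^2}{W_2}\right)}\left(f_1(\Lambda_{tr})\right) \le F_{F\left(\frac{V_1^2}{W_1}, \frac{V_2^2}{W_2}\right)}\left(\frac{V_2}{V_1}\right) \le F_{F\left(\frac{V_1^2}{W_1}, \frac{V_2^2}{W_2}\right)}\left(f_2(\Lambda_{tr})\right),$$

where $f_1$ and $f_2$ are both non-decreasing functions.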
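One way to see where the degrees of freedom and the argument of the F-distribution come from is a standard two-moment (Satterthwaite-type) matching argument. The sketch below is an illustration under the assumption that each $U_u$ is a $\chi^2_1$ variable; the symbols $P$, $N$, $g_k$ and $h_k$ are introduced here only for this illustration, and the derivation in Appendix D may proceed differently.

```latex
% Split Z = P - N into its positive and negative parts (independent by construction).
P = \sum_{u=1}^{r} \lambda_u U_u, \qquad
N = \sum_{u=r+1}^{r+w} \lambda'_u U_u, \qquad
E[P] = V_1,\ \operatorname{var}(P) = 2 W_1, \qquad
E[N] = V_2,\ \operatorname{var}(N) = 2 W_2 .

% Match the first two moments of each part with a scaled chi-square, g_k \chi^2_{h_k}:
g_k h_k = V_k, \quad 2 g_k^2 h_k = 2 W_k
\;\Longrightarrow\;
g_k = \frac{W_k}{V_k}, \quad h_k = \frac{V_k^2}{W_k}, \qquad k = 1, 2 .

% Then P / V_1 \approx \chi^2_{h_1} / h_1 and N / V_2 \approx \chi^2_{h_2} / h_2, so
\mathbb{P}(Z \le 0)
  = \mathbb{P}\left( \frac{P / V_1}{N / V_2} \le \frac{V_2}{V_1} \right)
  \approx \mathbb{P}\left( F_{h_1, h_2} \le \frac{V_2}{V_1} \right).
```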

## 4 Permutation Tests of Repeatability

Permutation tests for extremely large values can be conducted using the repeatability statistics described in Section 2. Essentially, such permutation tests are constructed based on a distributional exchangeability null hypothesis on the permuted statistics. That is, under the null, the distribution of the repeatability statistic is assumed to be invariant under permutation of the subject labels. For repeated measurements with multiple time points, the subject labels are permuted within each of the time points.

In practice, a nonparametric approximation of the null distribution of the test statistic can be obtained by permuting the observed sample. To perform the test, Monte Carlo resampling [good2013permutation] is used to reduce the computational burden of looping over every possible permutation, which can be up to scenarios for subjects measured at time points. Using the approximated null distribution, the test rejects the null when the observed value of the repeatability statistic is more extreme than would be expected under the null at the given significance level.
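The procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `permutation_pvalue` and its arguments are hypothetical names, `statistic` is any function mapping the data to a repeatability statistic for which larger values indicate more repeatability, and the `+1` correction in the p-value is a common convention that the text does not specify.

```python
import random

def permutation_pvalue(data, statistic, n_perm=200, seed=0):
    """Monte Carlo permutation test of exchangeability.

    data[i][t] holds the measurement of subject i at time point t.
    Subject labels are shuffled independently within each time point.
    """
    rng = random.Random(seed)
    n, s = len(data), len(data[0])
    observed = statistic(data)
    exceed = 0
    for _ in range(n_perm):
        columns = []
        for t in range(s):
            col = [data[i][t] for i in range(n)]
            rng.shuffle(col)  # permute subject labels within time point t
            columns.append(col)
        permuted = [[columns[t][i] for t in range(s)] for i in range(n)]
        if statistic(permuted) >= observed:
            exceed += 1
    # one-sided p-value for extremely large values of the statistic
    return (1 + exceed) / (1 + n_perm)
```

With `n_perm` random permutations the smallest attainable p-value is `1 / (1 + n_perm)`, so the number of Monte Carlo draws should be chosen with the intended significance level in mind.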

Under the additive noise setting (6) for the general model of repeated measurements, implies , which guarantees the exchangeability of any repeatability statistic defined in the previous sections. Thus, if the associated model is correctly specified, rejection in the permutation test using any of the aforementioned statistics implies the existence of dependence between a subject’s unobserved true subject-specific effect, , and its observed measurement, . Therefore, permutation tests with the weaker null of exchangeability are conducted for the purpose of confirming repeatability. The resulting test significance provides evidence against the absence of repeatability, in which the measurement reveals no information on differences in subject-specific effects.

However, the properties of these repeatability statistics under model settings other than the ANOVA model are less clear mathematically. In Section 5 we present numerical results, including deviations from the ANOVA model.

## 5 Numerical Experiments

### 5.1 Univariate ANOVA Simulations

We first evaluate the estimations and testing powers under the ANOVA model (1) or when its Gaussian assumptions are violated. , . The number of subjects, , ranges from to .

In addition to the correct Gaussian model, consider the following lognormal misspecification:

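$$\mu_i \sim \mathrm{Lognormal}(0, \sigma_\mu^2); \quad \log(\mu_i) \sim \mathcal{N}(0, \sigma_\mu^2), \qquad e_{it} \sim \mathrm{Lognormal}(0, \sigma^2); \quad \log(e_{it}) \sim \mathcal{N}(0, \sigma^2),$$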

where we still define , but now . Note that the relation between discriminability and ICC does not hold in this setting.

For , iterations, estimates of discriminability (using in Equation 8), the rank sum estimator ( in Equation 9), estimates of ICC using one-way ANOVA, and estimates of the fingerprint index (using in Equation 4) were recorded and compared to their theoretical true values (for discriminability and ICC) or to the simulated average value (for the fingerprint index, with , simulations).

Within each iteration, we also conduct permutation tests against exchangeability, each with Monte Carlo simulations, using the aforementioned types of estimators. F-tests using the ICC F-statistics were also conducted. The proportion of rejections across iterations (power curves) was plotted.
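One iteration of this simulation can be sketched as below. The sketch assumes the usual one-way random effects form of the ANOVA model, with a normal subject effect plus independent normal noise, and uses the standard one-way ANOVA estimator of ICC; the function names and parameter values are illustrative and are not taken from the original experiments.

```python
import random

def simulate_anova(n, s, sigma_mu, sigma, seed=0):
    """Draw n subjects with s repeated measurements: x_it = mu_i + e_it."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        mu = rng.gauss(0, sigma_mu)  # subject-specific effect
        data.append([mu + rng.gauss(0, sigma) for _ in range(s)])
    return data

def icc_oneway(data):
    """One-way ANOVA estimator of ICC from between/within mean squares."""
    n, s = len(data), len(data[0])
    grand = sum(sum(row) for row in data) / (n * s)
    means = [sum(row) / s for row in data]
    msb = s * sum((m - grand) ** 2 for m in means) / (n - 1)
    msw = sum((x - m) ** 2 for row, m in zip(data, means) for x in row)
    msw /= n * (s - 1)
    return (msb - msw) / (msb + (s - 1) * msw)
```

Under these assumptions the population ICC is `sigma_mu**2 / (sigma_mu**2 + sigma**2)`, so the estimate from a large simulated sample should fall close to that value.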

When the parametric assumption is satisfied, all estimators are distributed around their true values (Figure 3). Note that the distribution of the fingerprint index is skewed. In addition, a higher fingerprint index estimate with fewer subjects does not imply better repeatability compared to a lower estimate with more subjects. Of note, the true ICC and discriminability remain constant as sample size increases in the simulation setup. Thus, insofar as these measures summarize repeatability, this emphasizes that the fingerprint index is not directly comparable across sample sizes. In terms of testing power, as expected, tests using statistics associated with the ICC produce higher power, as the Gaussian model is correctly specified. The discriminability estimator using the whole combined ranking matrix shows a slight advantage in power compared to the rank sum estimator, which only uses rank sums within a submatrix of the combined distance matrix. Lastly, switching to fingerprinting results in a loss in testing power.

We repeated the simulation in an otherwise similar setting where normality does not hold: , , and is around . Because of model misspecification, ICC is overestimated with relatively large variation. As for testing power, the discriminability estimator, the rank sum estimator, and the fingerprint index estimator outperform the ICC-based tests because their nonparametric framework does not rely on Gaussian assumptions. The discriminability estimator again has higher power than the rank sum estimator because it includes more ranking information. The fingerprint index loses power relative to discriminability or rank sums, but now outperforms the tests using parametric estimates of ICC or F-statistics.

### 5.2 MANOVA Simulations

Next, we consider the MANOVA model (2) and a similar misspecification with element-wise log-transformations on the subject mean vectors, , and the noise vectors, . . ranges from to .

We simulate data with , , and (an exchangeable correlation matrix, with off diagonals ). Let . For , iterations, the estimations and the permutation test (each performed with Monte Carlo simulations) power were compared for discriminability, the rank sum estimator of discriminability, the fingerprint index, the sample ICC, , calculated with the first principal components from the measurements, and I2C2.

When the Gaussian assumption is satisfied, I2C2 outperforms the other statistics, and most statistics produce higher testing power than the fingerprint index by a large margin (Figure 4). Note that the strategy of conducting PCA before ICC also shows an advantage over discriminability in power when the sample size is as small as , but power converges with larger sample sizes.

When normality is violated, the nonparametric statistics (discriminability, rank sums, and fingerprinting) outperform the parametric methods in power for any sample size greater than . The discriminability estimator provides the best power under the multivariate lognormal assumptions.

### 5.3 Batch Effects

Consider the ANOVA model (1) where each subject is remeasured for times, . We evaluate two types of batch effects, mean shifts and scaling factors [johnson2007adjusting].

For the mean shifts, we replace the subject means, ’s, with the batch specific means ’s defined as:

$$\mu_{i1} \sim \mathcal{N}(0, \sigma_\mu^2), \qquad \mu_{it} = \mu_{i1} + t, \quad t = 2, \dots, s.$$

Without loss of generality, consider the first batch as a reference batch, where ’s follow the same distribution as the previous ’s. For the -th batch, there exists a mean shift, , from the reference batch for all subjects. The scaling effects were applied on the noise variances as:

$$e_{i1} \sim \mathcal{N}(0, \sigma^2), \qquad e_{it} \sim \mathcal{N}(0, t\sigma^2), \quad t = 2, \dots, s.$$
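The two batch effects defined above can be simulated with a short sketch such as the following. The function name, its flags, and the parameter values are hypothetical; the only ingredients taken from the text are the mean shift of $t$ for batch $t \ge 2$ and the noise variance of $t\sigma^2$ for batch $t \ge 2$, with the first batch serving as the reference.

```python
import random

def simulate_batches(n, s, sigma_mu, sigma, mean_shift=True, scaling=True, seed=0):
    """Simulate n subjects over s batches with optional batch effects.

    Batch 1 is the reference. For batch t >= 2, the subject mean is shifted
    by t (mean shift) and the noise variance is multiplied by t (scaling).
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        mu1 = rng.gauss(0, sigma_mu)  # subject mean in the reference batch
        row = []
        for t in range(1, s + 1):
            shift = t if (mean_shift and t >= 2) else 0
            var = t * sigma ** 2 if (scaling and t >= 2) else sigma ** 2
            row.append(mu1 + shift + rng.gauss(0, var ** 0.5))
        data.append(row)
    return data
```

Setting `mean_shift=False` or `scaling=False` isolates one type of batch effect, which mirrors the separate "mean shift only" and "scaling only" comparisons.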

Note that by default does not handle multiple repeated measurements. In order to thoroughly compare the original discriminability estimator, (8), and the rank sum based estimator, (9), at each time point , we considered the following different repeatability estimators. First, we considered only the first and the -th batches (first-last) and computed and directly. Secondly, we used all measurements up to the -th time point (all batches). The estimator can be directly calculated, whereas the can be generalized by averaging on all pairs of time points. Lastly, we considered a special case where we averaged over only the pairs of time points between the first and the rest (first-rest) for both and . In total, six multi-time-point discriminability estimators were considered, where three of them are -based and the other three are -based.

We simulated batches in total with and let the number of subjects, , range from to . For , iterations, the estimations and the permutation test (each with Monte Carlo iterations) power of the six estimators described above are plotted.

For the mean shift only batch effects, the rank sum estimator outperforms discriminability in power with the highest power achieved using all time point pairs (Figure 5). The estimation from rank sums is also closer to the batch-effect-free true discriminability, . The rank sum method may benefit from the fact that, whenever , it avoids averaging over indicators

$$\mathbb{I}\{\delta_{i,t,t'} < \delta_{i,i',t,t''}\} = \mathbb{I}\{|(t - t') + (e_{it} - e_{it'})| < |(\mu_{it} - \mu_{i't}) + (e_{it} - e_{i't})|\},$$

where the batch difference, , if large enough, may force the indicator to be with high probability, regardless of the true batch-effect-free discriminability level. For example, for the all pairs from initial scenario, rank sums outperform discriminability by a large margin, since batch differences become larger when later batches are compared to the reference batch.

For the scaling only batch effects, discriminability now outperforms rank sums, regardless of the strategy used. (Using all time points produces the highest power.) This is similar to the case with no batch effects, where having more repeated measurements increases testing power, and the advantage of discriminability over rank sums and the advantage of using all time points are attained.

## 6 Discussion

One of our major findings is the relationship between discriminability and ICC or I2C2 at the population level. Note that this is different from the non-decreasing relation between the ICC estimate and the F statistic, which guarantees the same ordering and power in the permutation test. The fact that ICC and I2C2 may still have higher power when parametric assumptions are satisfied hints at the potential for improving the current discriminability estimation. Another potential improvement is the approximation (11) of the weighted sum of ’s, as it tends to underestimate with larger within-measurement correlations (Figure 6). However, even with the current approximation the error is within and the non-decreasing relation holds true in the simulations with larger values. Another limitation is the lack of analysis for fixed effects, as we focus on random effect models for cleaner illustration. Lastly, in practice dissimilarity (pseudo)distances such as one minus Pearson correlation may be applied instead of the Euclidean distance; this does not impact testing results if measurements are standardized with mean and variance , and if measurements are non-negatively correlated.

On the other hand, the relation we found with rank sums and fingerprinting is between the test statistics; based on the simulations, we argue that discriminability should be preferred in practice unless there are concerns about mean shift batch effects.

##### Author Information and Acknowledgements

Johns Hopkins University, Progressive Learning.

## Appendix A Unbiasedness and Consistency of $\hat{D}$

Assume that for each individual $i$, we have $s$ repeated measurements. We define the local discriminability:

$$\hat{D}^n_{i,t,t'} = \frac{\sum_{i' \neq i} \sum_{t''=1}^{s} \mathbb{I}\{\delta_{i,t,t'} < \delta_{i,i',t,t''}\}}{s \cdot (n-1)} \tag{12}$$

where $\mathbb{I}\{\cdot\}$ is the indicator function and $n$ is the total number of subjects. Then $\hat{D}^n_{i,t,t'}$ represents the fraction of observations from other subjects that are more distant from $x_{it}$ than $x_{it'}$ is, or a local estimate of the discriminability for individual $i$ between measurements $t$ and $t'$. The sample discriminability estimator is:

$$\hat{D}_n = \frac{\sum_{i=1}^{n} \sum_{t=1}^{s} \sum_{t' \neq t} \hat{D}^n_{i,t,t'}}{n \cdot s \cdot (s-1)} \tag{13}$$

where $\hat{D}^n_{i,t,t'}$ is the local discriminability. We first establish unbiasedness for the local discriminability under the additive noise setting:

$$x_{it} = v_i + \epsilon_{it} \tag{14}$$

where , and with . That is, our additive noise can be characterized by bounded variance and fixed expectation, and our noise is independent across subjects.
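The estimators in Equations 12 and 13 can be computed directly. The following is a minimal sketch for scalar measurements, assuming that the distance between two observations is the absolute difference (the one-dimensional Euclidean distance); the function name and data layout are illustrative choices rather than the authors' code.

```python
def discriminability(data):
    """Sample discriminability (Equation 13) for scalar measurements.

    data[i][t] is the measurement of subject i at time point t.
    Distances are absolute differences (an assumption for this sketch).
    """
    n, s = len(data), len(data[0])
    total = 0.0
    for i in range(n):
        for t in range(s):
            for t2 in range(s):
                if t2 == t:
                    continue
                within = abs(data[i][t] - data[i][t2])
                count = 0
                for i2 in range(n):
                    if i2 == i:
                        continue
                    for t3 in range(s):
                        # indicator: within-subject distance is strictly smaller
                        if within < abs(data[i][t] - data[i2][t3]):
                            count += 1
                # local discriminability (Equation 12)
                total += count / (s * (n - 1))
    return total / (n * s * (s - 1))
```

For example, `discriminability([[0.0, 0.1], [10.0, 10.1], [20.0, 20.1]])` returns `1.0`, because every within-subject distance is smaller than every between-subject distance. The nested loops cost on the order of $n^2 s^3$ comparisons, which is acceptable for a sketch but would usually be vectorized for larger samples.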

[local discriminability is unbiased for discriminability]   For fixed :

$$\mathbb{E}\left[\hat{D}^n_{i,t,t'}\right] = D \tag{15}$$

that is, the local discriminability is unbiased for the true discriminability.

###### Proof.
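$$
\begin{aligned}
\mathbb{E}\left[\hat{D}^n_{i,t,t'}\right]
&= \mathbb{E}\left[\frac{\sum_{i' \neq i} \sum_{t''=1}^{s} \mathbb{I}\{\delta_{i,t,t'} < \delta_{i,i',t,t''}\}}{s \cdot (n-1)}\right] \\
&= \frac{\sum_{i' \neq i} \sum_{t''=1}^{s} \mathbb{E}\left[\mathbb{I}\{\delta_{i,t,t'} < \delta_{i,i',t,t''}\}\right]}{s \cdot (n-1)} && \text{(linearity of expectation)} \\
&= \frac{\sum_{i' \neq i} \sum_{t''=1}^{s} \mathbb{P}\left(\delta_{i,t,t'} < \delta_{i,i',t,t''}\right)}{s \cdot (n-1)} \\
&= \frac{\sum_{i' \neq i} \sum_{t''=1}^{s} D}{s \cdot (n-1)} \\
&= \frac{s \cdot (n-1) \cdot D}{s \cdot (n-1)} \\
&= D.
\end{aligned}
$$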

Without knowledge of the distribution of , we can instead estimate the discriminability via , the observed sample discriminability. Consider the additive noise case. Recall that , the sample discriminability for a fixed number of individuals . We consider the following two lemmas:

[Unbiasedness of Sample Discriminability]   For fixed :

$$\mathbb{E}\left[\hat{D}_n\right] = D$$

that is, the sample discriminability is an unbiased estimate of discriminability.

###### Proof.

The proof of this lemma is a rather trivial application of the result in Lemma (A).

Recall that sample discriminability is as-defined in Equation (13). Then:

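$$
\begin{aligned}
\mathbb{E}\left[\hat{D}_n\right]
&= \mathbb{E}\left[\frac{\sum_{i=1}^{n} \sum_{t=1}^{s} \sum_{t' \neq t} \hat{D}^n_{i,t,t'}}{n \cdot s \cdot (s-1)}\right] \\
&= \frac{\sum_{i=1}^{n} \sum_{t=1}^{s} \sum_{t' \neq t} \mathbb{E}\left[\hat{D}^n_{i,t,t'}\right]}{n \cdot s \cdot (s-1)} \\
&= \frac{\sum_{i=1}^{n} \sum_{t=1}^{s} \sum_{t' \neq t} D}{n \cdot s \cdot (s-1)} && \text{(by Equation 15)} \\
&= \frac{n \cdot s \cdot (s-1) \cdot D}{n \cdot s \cdot (s-1)} \\
&= D.
\end{aligned}
$$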

[Consistency of Sample Discriminability]   As $n \to \infty$:

$$\hat{D}_n \xrightarrow[n \to \infty]{P} D$$

that is, the sample discriminability is a consistent estimate of discriminability.

###### Proof.

Recall that Chebyshev’s inequality gives that:

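$$
\begin{aligned}
\mathbb{P}\left(\left|\hat{D}_n - \mathbb{E}\left[\hat{D}_n\right]\right| \geq \epsilon\right)
&= \mathbb{P}\left(\left|\hat{D}_n - D\right| \geq \epsilon\right) && \text{(}\hat{D}_n\text{ is unbiased)} \\
&\leq \frac{\operatorname{Var}\left(\hat{D}_n\right)}{\epsilon^2}.
\end{aligned}
$$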

To show convergence in probability, it suffices to show that $\operatorname{Var}(\hat{D}_n) \to 0$ as $n \to \infty$. Then:

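$$
\begin{aligned}
\operatorname{Var}\left(\hat{D}_n\right)
&= \operatorname{Var}\left(\frac{\sum_{i=1}^{n} \sum_{t=1}^{s} \sum_{t' \neq t} \hat{D}^n_{i,t,t'}}{n \cdot s \cdot (s-1)}\right) \\
&= \frac{1}{m_*^2} \operatorname{Var}\left(\sum_{i=1}^{n} \sum_{t=1}^{s} \sum_{t' \neq t} \sum_{i' \neq i} \sum_{t''=1}^{s} \mathbb{I}\{\delta_{i,t,t'} < \delta_{i,i',t,t''}\}\right), \quad m_* = n \cdot s \cdot (s-1) \cdot (n-1) \cdot s \\
&= \frac{1}{m_*^2} \sum_{i,i',t,t',t''} \sum_{j,j',r,r',r''} \operatorname{Cov}\left(\mathbb{I}\{\delta_{i,t,t'} < \delta_{i,i',t,t''}\}, \mathbb{I}\{\delta_{j,r,r'} < \delta_{j,j',r,r''}\}\right).
\end{aligned}
$$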

Note that there are, in total, $m_*^2$ covariance terms in the sums. For each term, by Cauchy-Schwarz:

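$$
\begin{aligned}
\left|\operatorname{Cov}\left(\mathbb{I}\{\delta_{i,t,t'} < \delta_{i,i',t,t''}\}, \mathbb{I}\{\delta_{i,t,t'} < \delta_{i,j',t,r''}\}\right)\right|
&\leq \sqrt{\operatorname{Var}\left(\mathbb{I}\{\delta_{i,t,t'} < \delta_{i,i',t,t''}\}\right) \cdot \operatorname{Var}\left(\mathbb{I}\{\delta_{i,t,t'} < \delta_{i,j',t,r''}\}\right)} \\
&\leq \sqrt{\frac{1}{4} \cdot \frac{1}{4}} = \frac{1}{4}.
\end{aligned}
$$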
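The bound of $\frac{1}{4}$ on each indicator variance used in the last step follows because an indicator is a Bernoulli variable. As a brief worked step, with $A$ denoting the event inside the indicator and $p = \mathbb{P}(A)$ introduced here only as shorthand:

$$\operatorname{Var}\left(\mathbb{I}\{A\}\right) = p(1-p) = \frac{1}{4} - \left(p - \frac{1}{2}\right)^2 \leq \frac{1}{4}.$$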

Furthermore, note that . Under the assumption of between-subject independence, then , as it will be independent of any function of subjects other than and . Then as long as ,