# Ranks and Pseudo-Ranks - Paradoxical Results of Rank Tests

Rank-based inference methods are applied in various disciplines, typically when procedures relying on standard normal theory are not justifiable, for example when data are not symmetrically distributed, contain outliers, or responses are even measured on ordinal scales. Various specific rank-based methods have been developed for two or more samples, and also for general factorial designs (e.g., Kruskal-Wallis test, Jonckheere-Terpstra test). It is the aim of the present paper (1) to demonstrate that traditional rank procedures for several samples or general factorial designs may lead to paradoxical results in case of unbalanced samples, (2) to explain why this is the case, and (3) to provide a way to overcome these disadvantages of traditional rank-based inference. Theoretical investigations show that the paradoxical results can be explained by carefully considering the non-centralities of the test statistics, which may be non-zero for the traditional tests in unbalanced designs. These non-centralities may even become arbitrarily large for increasing sample sizes in the unbalanced case. A simple solution is the use of so-called pseudo-ranks instead of ranks. As a special case, we illustrate the effects in sub-group analyses which are often used when dealing with rare diseases.


## 1 Introduction

If the assumptions of classical parametric inference methods are not met, the usual recommendation is to apply nonparametric rank-based tests. Here, the Wilcoxon-Mann-Whitney and Kruskal-Wallis (1952) tests are among the most commonly applied rank procedures, often utilized as replacements for the unpaired two-sample $t$-test and the one-way ANOVA, respectively. Other popular rank methods include the Hettmansperger-Norton (1987) and Jonckheere-Terpstra (1952, 1954) tests for ordered alternatives, and the procedures by Akritas et al. (1997) for two- or higher-way designs. In statistical practice, these procedures are usually appreciated as robust and powerful inference tools when standard assumptions are not fulfilled. For example, Whitley and Ball (2002) conclude that “Nonparametric methods require no or very limited assumptions to be made about the format of the data, and they may, therefore, be preferable when the assumptions required for parametric methods are not valid.” In line with this statement, Bewick et al. (2004) also state that

“the Kruskal-Wallis, Jonckheere-Terpstra (..) tests can be used to test for differences between more than two groups or treatments when the assumptions for analysis of variance are not held.”

These descriptions are slightly over-optimistic since nonparametric methods also rely on certain assumptions. In particular, the Wilcoxon-Mann-Whitney and Kruskal-Wallis tests postulate homoscedasticity across groups under the null hypothesis, and they were originally developed only for continuous outcomes. In case of doubt, it is nevertheless expected that rank procedures are more robust and lead to more reliable results than their parametric counterparts. While this is true for deviations from normality, and while by now it is widely accepted that ordinal data should rather be analyzed using adequate rank-based methods than using normal theory procedures, we illustrate in various instances that nonparametric rank tests for more than two samples possess one noteworthy weakness. Namely, they are generally non-robust against changes from balanced to unbalanced designs. In particular, keeping the data generating processes fixed, we provide paradigms under which commonly used rank tests surprisingly yield completely opposite test decisions when rearranging group sample sizes. These examples are in general not artificially generated to obtain paradoxical results, but even include homoscedastic normal models. This effect is completely undesirable, leading to the somewhat heretical question

• Are nonparametric rank procedures useful at all to handle questions for more than two groups?

In order to comprehensively answer this question, we carefully analyze the underlying nonparametric effects of the respective rank procedures. From this, we develop detailed guidelines for an adequate application of rank-based procedures. Moreover, we even state a simple solution for all these problems: substituting ranks by closely related quantities, the so-called pseudo-ranks, which have already been considered by Kulle (1999), Gao and Alvo (2005), Gao, Alvo, Chen, and Li (2008), and in more detail by Thangavelu and Brunner (2007), and by Brunner, Konietschke, Pauly and Puri (2017). It should be noted that the motivation in these references was different, and that their authors had not been aware of the striking paradoxical properties that may arise when using classical rank tests. These surprising paradigms only appear in case of unbalanced designs since all rank procedures discussed below coincide with their respective pseudo-rank analogs in case of equal sample sizes. Pseudo-ranks are easy to compute, share the advantageous properties of ranks, and lead to reliable and robust inference procedures for a variety of factorial designs. Moreover, we can even obtain confidence intervals for (contrasts of) easy to interpret reasonable nonparametric effects, thus resolving the commonly raised disadvantage that

“nonparametric methods are geared toward hypothesis testing rather than estimation of effects”

(Whitley and Ball, 2002).

The paper is organized as follows. Notations are introduced in Section 2. Then in Section 3 some paradoxical results are presented in the one-way layout for the Kruskal-Wallis test and for the Hettmansperger-Norton trend test by means of certain tricky (non-transitive) dice. In the two-way layout, a paradoxical result for the Akritas-Arnold-Brunner test in a simple $2 \times 2$-design is presented in Section 4 using a homoscedastic normal shift model. The theoretical background of the paradoxical results is discussed in Section 5 and a solution of the problem by using pseudo-ranks is investigated in detail. Moreover, the computation of confidence intervals is discussed and applied to the data in Section 4. Section 6 provides a cautionary note for the problem of sub-group analysis where typically unequal sample sizes appear.

The paper closes with some guidelines for adequate application of rank procedures in the discussion and conclusions section. There, it is also briefly discussed that the use of pairwise and stratified rankings would make matters potentially worse.

## 2 Statistical Model and Notations

For $d$ samples of independent observations $X_{ik} \sim F_i$, $i = 1, \ldots, d$, $k = 1, \ldots, n_i$, with total sample size $N = \sum_{i=1}^{d} n_i$, the nonparametric relative effects which are underlying the rank tests are commonly defined as

$$p_i = \int H \, dF_i, \quad \text{where} \quad H = \frac{1}{N} \sum_{r=1}^{d} n_r F_r \tag{1}$$

denotes the weighted mean distribution of the distributions $F_1, \ldots, F_d$ in the design. Here we use the so-called normalized version $F_i(x) = P(X_{ik} < x) + \frac{1}{2} P(X_{ik} = x)$ of the distribution to cover the cases of continuous, as well as non-continuous distributions in a unified approach. Thus, the case of ties does not require a separate consideration. This idea was first mentioned by Kruskal (1952) and later considered in more detail by Ruymgaart (1980). Akritas, Arnold, and Brunner (1997) extended this approach to factorial designs, while Akritas and Brunner (1997) and Brunner, Munzel, and Puri (1999) applied this technique to repeated measures and longitudinal data.

Easily interpreted, $p_i = P(Z < X_{ik}) + \frac{1}{2} P(Z = X_{ik})$ is the probability that a randomly selected observation $Z$ from the weighted mean distribution $H$ is smaller than a randomly selected observation $X_{ik}$ from the distribution $F_i$ plus $\frac{1}{2}$ times the probability that both observations are equal. Thus, the quantity $p_i$ measures an effect of the distribution $F_i$ with respect to the weighted mean distribution $H$. In the case of two independent random variables $X_1 \sim F_1$ and $X_2 \sim F_2$, Birnbaum and Klose (1957) had called the function $L(t) = F_2[F_1^{-1}(t)]$ the “relative distribution function” of $X_1$ and $X_2$, assuming continuous distributions. Thus, its expectation

$$\int_0^1 t \, dL(t) = \int_{-\infty}^{\infty} F_1(s) \, dF_2(s) = P(X_1 < X_2)$$

is called a “relative effect” with an obvious adaption of the notation. In the same way, the quantity $p_i$ is called a “relative effect” of $X_{ik} \sim F_i$ with respect to the weighted mean $Z \sim H$. This effect is a linear combination of the pairwise effects $w_{ri} = \int F_r \, dF_i$. In vector notation, Equation (1) is written as

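$$p = \int H \, dF = W' n = \begin{pmatrix} w_{11} & \cdots & w_{d1} \\ \vdots & \ddots & \vdots \\ w_{1d} & \cdots & w_{dd} \end{pmatrix} \cdot \begin{pmatrix} n_1/N \\ \vdots \\ n_d/N \end{pmatrix} = \begin{pmatrix} p_1 \\ \vdots \\ p_d \end{pmatrix}. \tag{2}$$

Here, $F = (F_1, \ldots, F_d)'$ denotes the vector of distribution functions, $n = (n_1/N, \ldots, n_d/N)'$ the vector of relative sample sizes, and

$$W = \begin{pmatrix} w_{11} & \cdots & w_{1d} \\ \vdots & \ddots & \vdots \\ w_{d1} & \cdots & w_{dd} \end{pmatrix} \tag{3}$$

is the matrix of pairwise effects $w_{ri} = \int F_r \, dF_i$. Note that $w_{ii} = \frac{1}{2}$ and $w_{ir} = 1 - w_{ri}$, which follows from integration by parts. The relative effects $p_i$ are arranged in the vector $p = (p_1, \ldots, p_d)'$ and can be estimated consistently by the simple plug-in estimator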

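$$\hat{p}_i = \int \hat{H} \, d\hat{F}_i = \frac{1}{N} \left( \bar{R}_{i\cdot} - \frac{1}{2} \right). \tag{4}$$

Here, $\hat{F}_i$ denotes the (normalized) empirical distribution of $X_{i1}, \ldots, X_{in_i}$, $i = 1, \ldots, d$, and $\hat{H} = \frac{1}{N} \sum_{r=1}^{d} n_r \hat{F}_r$ their weighted mean. Finally, $\bar{R}_{i\cdot} = \frac{1}{n_i} \sum_{k=1}^{n_i} R_{ik}$ denotes the mean of the ranks

$$R_{ik} = \frac{1}{2} + N \hat{H}(X_{ik}) = \frac{1}{2} + \sum_{r=1}^{d} \sum_{\ell=1}^{n_r} c(X_{ik} - X_{r\ell}), \tag{5}$$

where the function $c(u) = 0, \frac{1}{2}, 1$ for $u < 0$, $u = 0$, or $u > 0$, respectively, denotes the count function. The estimators $\hat{p}_i$ are arranged in the vector

$$\hat{p} = \int \hat{H} \, d\hat{F} = \frac{1}{N} \left( \bar{R}_{\cdot} - \frac{1}{2} 1_d \right), \tag{6}$$

where $\hat{F} = (\hat{F}_1, \ldots, \hat{F}_d)'$ is the vector of the empirical distributions, $\bar{R}_{\cdot} = (\bar{R}_{1\cdot}, \ldots, \bar{R}_{d\cdot})'$ the vector of the rank means, and $1_d = (1, \ldots, 1)'$ denotes the vector of 1s.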

In the following sections we demonstrate that for $d \geq 3$ groups, rank tests may lead to paradoxical results in case of unequal sample sizes. In particular, for factorial designs involving two or more factors, the nonparametric main effects and interactions (defined by the weighted relative effects $p_i$) may be severely biased.
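The rank definition in (5) and the plug-in estimator in (4) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `count` and `ranks_and_effects` and the toy data are hypothetical and do not come from the paper, and the quadratic-time pairwise comparison is chosen for readability rather than speed.

```python
import numpy as np

def count(u):
    """Count function c(u): 0, 1/2, 1 for u < 0, u = 0, u > 0."""
    return 0.5 * (np.sign(u) + 1.0)

def ranks_and_effects(samples):
    """Mid-ranks R_ik as in Eq. (5) and estimated effects p_hat_i as in Eq. (4)."""
    groups = [np.asarray(s, dtype=float) for s in samples]
    pooled = np.concatenate(groups)
    N = pooled.size
    # R_ik = 1/2 + sum over all N observations of c(X_ik - X_rl)
    ranks = 0.5 + count(pooled[:, None] - pooled[None, :]).sum(axis=1)
    sizes = [g.size for g in groups]
    rank_groups = np.split(ranks, np.cumsum(sizes)[:-1])
    p_hat = np.array([(r.mean() - 0.5) / N for r in rank_groups])
    return rank_groups, p_hat

# Toy data with ties (hypothetical, for illustration only)
samples = [[1.0, 2.0, 2.0], [2.0, 5.0], [3.0, 4.0, 6.0, 7.0]]
rank_groups, p_hat = ranks_and_effects(samples)
print(rank_groups)
print(p_hat)
```

Because the count function assigns 1/2 to ties, tied observations automatically receive mid-ranks, which matches the remark that ties need no separate treatment.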

## 3 Paradoxical Results in the One-Way Layout

To demonstrate some paradoxical results of rank tests for $d \geq 3$ samples in the one-way layout, we consider the vector $p$ in (2) of the nonparametric effects $p_i$, which are all equal to their mean $\bar{p}_{\cdot} = \frac{1}{d} \sum_{i=1}^{d} p_i$ iff $\sum_{i=1}^{d} (p_i - \bar{p}_{\cdot})^2 = 0$ or in matrix notation $p' T_d \, p = 0$. Here, $T_d = I_d - \frac{1}{d} J_d$ denotes the centering matrix, $I_d$ the $d$-dimensional unit matrix, and $J_d = 1_d 1_d'$ the $d \times d$-dimensional matrix of 1s. Let $\hat{p}$ denote the plug-in estimator of $p$ defined in (6). In order to detect whether the $p_i$ are different, we study the asymptotic distribution of $\sqrt{N} T_d \hat{p}$. This is obtained from the asymptotic equivalence theorem (see, e.g., Akritas et al., 1997; Brunner and Puri, 2001, 2002 or Brunner et al., 2017),

$$\sqrt{N} T_d \hat{p} \doteq \sqrt{N} T_d \left[ \bar{Y}_{\cdot} + \bar{Z}_{\cdot} - 2p \right] + \sqrt{N} T_d \, p, \tag{7}$$

where the symbol $\doteq$ denotes asymptotic equivalence. Here, $\bar{Y}_{\cdot}$ and $\bar{Z}_{\cdot}$ are vectors of means of independent random vectors with expectation $E(\bar{Y}_{\cdot}) = E(\bar{Z}_{\cdot}) = p$. It follows from the central limit theorem that $\sqrt{N} T_d [\bar{Y}_{\cdot} + \bar{Z}_{\cdot} - 2p]$ has, asymptotically, a multivariate normal distribution with mean $0$ and covariance matrix $T_d \Sigma_N T_d$, where $\Sigma_N$ has a quite involved structure (for details see Brunner et al., 2017). Obviously, the multivariate distribution is shifted by $T_d \, p$ from the origin $0$. Therefore, we call $T_d \, p$ the “multivariate non-centrality”, and a “univariate non-centrality” may be quantified by the quadratic form $c_{KW} = p' T_d \, p$. In particular, we have $c_{KW} = 0$ iff $T_d \, p = 0$. The actual (multivariate) shift of the distribution, depending on the total sample size $N$, is $\sqrt{N} T_d \, p$, and the corresponding univariate non-centrality (depending on $N$) is then given by $N \cdot c_{KW}$. From these considerations, it should become clear that $N \cdot c_{KW} \to \infty$ as $N \to \infty$ if $T_d \, p \neq 0$. This defines the consistency region of a test based on $\hat{p}$.

Below, we will demonstrate that for the same vector of distributions $F$, the non-centrality $c_{KW} = p' T_d \, p$ may be $0$ in case of equal sample sizes, while $c_{KW}$ may be unequal to $0$ in case of unequal sample sizes. Under $H_0^F: F_1 = \cdots = F_d$, tests based on $\hat{p}$ (such as the Kruskal-Wallis test) reject the hypothesis $H_0^F$ with approximately the pre-assigned type-I error probability $\alpha$. If, however, the strong hypothesis $H_0^F$ is not true then the non-centrality $c_{KW}$ may be $0$ or unequal to $0$ for the same set of distributions $F_1, \ldots, F_d$, since $c_{KW}$ depends on the relative sample sizes $n_i/N$. This means that for the same set of distributions $F_1, \ldots, F_d$ and unequal sample sizes the $p$-value of the test may be arbitrarily small if $N$ is large enough. However, the $p$-value for the same test may be quite large for the same total sample size $N$ in case of equal sample sizes. Some well-known tests which have this paradoxical property are, for example, the Kruskal-Wallis test (1952), the Hettmansperger-Norton trend test (1987), and the Akritas-Arnold-Brunner test (1997).

As an example, consider the case of $d = 3$ distributions where straightforward calculations show that

1. in case of equal sample sizes,

 $$p_1 = p_2 = p_3 \iff w_{21} = w_{32} = 1 - w_{31} = w, \tag{8}$$
2. in general, however,

 $$p_1 = p_2 = p_3 \iff w_{21} = w_{32} = w_{31} = \tfrac{1}{2}. \tag{9}$$

This means that $p' T_3 \, p = 0$ in case of equal sample sizes, but $p' T_3 \, p \neq 0$ in case of unequal sample sizes if $w \neq \frac{1}{2}$.

We note that (9) follows under the strict hypothesis $H_0^F: F_1 = F_2 = F_3$. However, this null hypothesis is not a necessary condition for (9) to hold. For example, if $F_1, F_2, F_3$ are symmetric distributions with the same center of symmetry then $w_{ri} = \frac{1}{2}$ for $r, i = 1, 2, 3$. Thus, in this case, $p' T_3 \, p = 0$ is also true for all sample sizes.

An example of discrete distributions generating the nonparametric effects in (8) is given by the probability mass functions

•  if and otherwise,

•  if and otherwise,

•  if and otherwise,

which are derived from some tricky dice (see, e.g., Peterson, 2002). For the distribution functions  defined by , above, it is easily seen that

 w21 = P(X2

Thus, and the vector of the weighted relative effects is given by

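$$p = \begin{pmatrix} p_1 \\ p_2 \\ p_3 \end{pmatrix} = W' n = \frac{1}{N} \begin{pmatrix} \frac{1}{2} n_1 + n_3 + (n_2 - n_3) w \\ n_1 + \frac{1}{2} n_2 + (n_3 - n_1) w \\ n_2 + \frac{1}{2} n_3 + (n_1 - n_2) w \end{pmatrix}. \tag{13}$$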

The weighted relative effects and the resulting non-centralities are listed in Table 1 for equal and some different unequal sample sizes.
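The dependence of the weighted relative effects on the sample sizes can be checked numerically. The sketch below is an illustration under stated assumptions: the three dice are a well-known non-transitive set chosen here as a stand-in (they are not claimed to be the dice used in the paper), the sample sizes are arbitrary, and all function names are hypothetical. The pairwise effects of these dice satisfy $w_{21} = w_{32} = 1 - w_{31} = 5/9$, so they fall under case (8).

```python
import numpy as np
from itertools import product

def pairwise_effect(a, b):
    """w = P(A < B) + 0.5 * P(A = B) for two fair dice with faces a and b."""
    total = sum((x < y) + 0.5 * (x == y) for x, y in product(a, b))
    return total / (len(a) * len(b))

# Hypothetical non-transitive dice (an assumption, not the paper's example)
dice = [(2, 2, 4, 4, 9, 9), (1, 1, 6, 6, 8, 8), (3, 3, 5, 5, 7, 7)]
d = len(dice)

# W[r, i] = w_{ri} = integral of F_r dF_i, as in Eq. (3)
W = np.array([[pairwise_effect(dice[r], dice[i]) for i in range(d)]
              for r in range(d)])
T = np.eye(d) - np.ones((d, d)) / d  # centering matrix T_d

def weighted_effects(n):
    """p = W' n with relative sample sizes n_i / N, as in Eq. (2)."""
    n = np.asarray(n, dtype=float)
    return W.T @ (n / n.sum())

def noncentrality(n):
    """Quadratic form p' T_d p."""
    p = weighted_effects(n)
    return float(p @ T @ p)

for n in [(10, 10, 10), (5, 5, 20), (20, 5, 5)]:
    print(n, weighted_effects(n).round(4), round(noncentrality(n), 5))
```

With equal sample sizes all three effects equal 1/2 and the quadratic form vanishes; with unequal sample sizes the same distributions give a non-zero value.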

Since for unequal sample sizes one obtains $p' T_3 \, p \neq 0$, it is only a question of choosing the total sample size $N$ large enough to reject the hypothesis $H_0^F: F_1 = F_2 = F_3$ by the Kruskal-Wallis test with a probability arbitrarily close to $1$, while in case of equal sample sizes the probability of rejecting the hypothesis remains constant equal to $\alpha^*$ (close to $\alpha$) since in this case, $p' T_3 \, p = 0$. It may be noted that in general $\alpha^* \neq \alpha$ since the variance estimator of the Kruskal-Wallis statistic is computed under the strong hypothesis $H_0^F$, which is obviously not true here. Thus, the scaling is not correct, and the Kruskal-Wallis test has a slightly different type-I error $\alpha^*$.

For the Hettmansperger-Norton trend test, the situation gets worse since for different ratios of sample sizes the nonparametric effects , , and may change their order. In setting (B) in Table 1 we have , while in setting (C) we have . Now consider the non-centrality of the Hettmansperger-Norton trend test which is a linear rank test. Let denote a vector reflecting the conjectured pattern. Then it follows from (7) that

$$\sqrt{N} c' T_d \hat{p} \doteq \sqrt{N} c' T_d \left[ \bar{Y}_{\cdot} + \bar{Z}_{\cdot} - 2p \right] + \sqrt{N} c' T_d \, p, \tag{14}$$

where is a univariate non-centrality. If then it follows that and . If, however, then indicates a decreasing trend and an increasing trend. In the above discussed example, we obtain for setting (B) and for a conjectured pattern of for an increasing trend the non-centrality , indeed indicating an increasing trend. For setting (C) however, we obtain , indicating a decreasing trend. In case of setting (A) (equal sample sizes), since , and thus indicating no trend. Again it is a question of the total sample size to obtain the decision of a significantly decreasing trend for the first setting (B) of unequal sample sizes and for the second setting (C) the decision of a significantly increasing trend with a probability arbitrarily close to 1 for the same distributions , and . In the first case, for and in the second case, for . In case of equal sample sizes, the hypothesis of no trend is only rejected with a type-I error probability . Regarding , a similar remark applies as above for the Kruskal-Wallis test.

## 4 Paradoxical Results in the Two-Way Layout

In the previous section, paradoxical decisions by rank tests in case of unequal sample sizes were demonstrated for the one-way layout using large sample sizes and particular distributions leading to non-transitive decisions. In this section, we will show that in two-way layouts paradoxical results are already possible with rather small sample sizes and even in simple homoscedastic normal shift models. To this end, we consider the simple $2 \times 2$-design with two crossed factors $A$ and $B$, each with two levels, $i = 1, 2$ for $A$ and $j = 1, 2$ for $B$. The observations $X_{ijk} \sim F_{ij}$, $k = 1, \ldots, n_{ij}$, are assumed to be independent.

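The hypotheses of no nonparametric effects in terms of the distribution functions $F_{ij}$ are expressed as (see Akritas et al., 1997)

1. no main effect of factor $A$ - $H_0^F(A): F_{11} + F_{12} - F_{21} - F_{22} = 0$,

2. no main effect of factor $B$ - $H_0^F(B): F_{11} - F_{12} + F_{21} - F_{22} = 0$,

3. no interaction $AB$ - $H_0^F(AB): F_{11} - F_{12} - F_{21} + F_{22} = 0$,

where in all three cases, $0$ denotes a function which is identically $0$.

Let $F = (F_{11}, F_{12}, F_{21}, F_{22})'$ denote the vector of the distribution functions. Then the three hypotheses formulated above can be written in matrix notation as $H_0^F(c): c'F = 0$, where $c = c_A = (1, 1, -1, -1)'$ generates the hypothesis for the main effect $A$, $c = c_B = (1, -1, 1, -1)'$ for the main effect $B$, and $c = c_{AB} = (1, -1, -1, 1)'$ for the interaction $AB$.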

For testing these hypotheses, Akritas et al. (1997) derived rank procedures based on the statistic

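$$T_N(c) = \sqrt{N} c' \hat{p} = \frac{1}{\sqrt{N}} \, c' \bar{R}_{\cdot}, \tag{15}$$

where $\bar{R}_{\cdot} = (\bar{R}_{11\cdot}, \bar{R}_{12\cdot}, \bar{R}_{21\cdot}, \bar{R}_{22\cdot})'$ denotes the vector of the rank means within the four samples. They showed that under the hypothesis $H_0^F(c): c'F = 0$, the statistic $T_N(c)$ has, asymptotically, a normal distribution with mean $0$ and variance

$$\sigma_0^2 = \sum_{i=1}^{2} \sum_{j=1}^{2} \frac{N}{n_{ij}} \sigma_{ij}^2, \tag{16}$$

where the unknown variances $\sigma_{ij}^2$ (see Akritas et al., 1997, for their explicit form) can be consistently estimated by

$$\hat{\sigma}_{ij}^2 = \frac{1}{N^2} S_{ij}^2 = \frac{1}{N^2 (n_{ij} - 1)} \sum_{k=1}^{n_{ij}} \left( R_{ijk} - \bar{R}_{ij\cdot} \right)^2 \tag{17}$$

and $\hat{\sigma}_0^2 = \sum_{i=1}^{2} \sum_{j=1}^{2} \frac{N}{n_{ij}} \hat{\sigma}_{ij}^2$. For small sample sizes, the null distribution of $T_N(c) / \hat{\sigma}_0$ can be approximated by a $t_{\hat{f}}$-distribution with estimated degrees of freedom

$$\hat{f} = \frac{S_0^4}{\sum_{i=1}^{2} \sum_{j=1}^{2} \left( S_{ij}^2 / n_{ij} \right)^2 / (n_{ij} - 1)}, \tag{18}$$

where $S_0^2 = \sum_{i=1}^{2} \sum_{j=1}^{2} S_{ij}^2 / n_{ij}$. The non-centrality of $T_N(c)$ is given by $c'p$, and under the restrictive null hypothesis $H_0^F(c): c'F = 0$ it follows that $c'p = 0$.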
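As a rough illustration of (15) to (18), the studentized rank statistic and its estimated degrees of freedom might be computed as follows. This is a sketch under assumptions that are not confirmed by the text above: it takes $S_0^2 = \sum_{i,j} S_{ij}^2 / n_{ij}$, in which case $T_N(c)$ divided by the estimated standard deviation reduces to $c'\bar{R}_{\cdot} / S_0$. The function names and the simulated data are hypothetical.

```python
import numpy as np

def midranks(x):
    """Mid-ranks of a pooled sample via the count function of Eq. (5)."""
    x = np.asarray(x, dtype=float)
    return 0.5 + (0.5 * (np.sign(x[:, None] - x[None, :]) + 1.0)).sum(axis=1)

def rank_contrast_test(cells, c):
    """Studentized statistic c'R_bar / S_0 and estimated df as in Eq. (18).

    Assumes S_0^2 = sum of S_ij^2 / n_ij (an assumption of this sketch).
    """
    groups = [np.asarray(s, dtype=float) for s in cells]
    sizes = np.array([g.size for g in groups])
    ranks = np.split(midranks(np.concatenate(groups)), np.cumsum(sizes)[:-1])
    means = np.array([r.mean() for r in ranks])      # rank means per cell
    s2 = np.array([r.var(ddof=1) for r in ranks])    # S_ij^2 from Eq. (17)
    s0_sq = np.sum(s2 / sizes)                       # S_0^2
    stat = float(np.dot(c, means) / np.sqrt(s0_sq))
    df = float(s0_sq ** 2 / np.sum((s2 / sizes) ** 2 / (sizes - 1)))
    return stat, df

# Simulated cells in the order (1,1), (1,2), (2,1), (2,2); sizes are arbitrary
rng = np.random.default_rng(0)
cells = [rng.normal(size=n) for n in (6, 8, 7, 9)]
c_ab = np.array([1.0, -1.0, -1.0, 1.0])  # interaction contrast
stat, df = rank_contrast_test(cells, c_ab)
print(stat, df)
```

The statistic would then be compared with a quantile of the $t$-distribution with `df` degrees of freedom.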

To demonstrate a paradoxical result, we assume that the observations  are coming from the normal distributions

with equal standard deviations

and expectations . From the viewpoint of linear models, there is a main effect of , a main effect of , and no -interaction since . Since this is a homoscedastic linear model, the classical ANOVA should reject the hypotheses  and with a high probability if the total sample size is large enough. In contrast to that, the hypothesis  of no interaction is only rejected with the pre-selected type-I error probability . The non-centralities are given by , , and . The following two settings of samples sizes are considered:

1. ,  - (unbalanced)

2. ,  - (balanced).

First we demonstrate that the empirical characteristics of the two data sets, which are sampled from the same distributions, are nearly identical. Thus, potentially different results could not be explained by substantially different empirical distributions obtained by an “unhappy randomization”. The results of the comparisons are listed in Table 2.

We apply the classical ANOVA -statistic and the rank statistic to the same simulated data sets from Table 2 and compare the results for testing the three hypotheses , , and . We note that and . The decisions for , , and obtained by the ANOVA as well as by the rank tests based on are identical in all cases in the balanced setting. In the unbalanced setting, all decisions obtained by the parametric ANOVA are comparable to those in the balanced setting. However, the decision on the interaction based on the rank test is totally different from that obtained by the parametric ANOVA, as well as that obtained for the rank test in the balanced setting. The results are summarized in Table 3.

On the surface, the difference of the decisions in the unbalanced case could be explained by the fact that the nonparametric hypothesis  and the parametric hypothesis  are not identical and that this particular configuration of normal distributions falls into the gap between and . That is, here is true, but is not. It is surprising, however, that this explanation does not hold for the balanced case. The difference of the two -values and in Table 3 calls for an explanation.

The reason becomes clear when computing the vector in (2) for this particular example of the -design. To avoid fourfold indices we re-label the distributions , and as , and , respectively, and the sample sizes accordingly as , and . In the example, , , and , where . Thus, the probabilities of the pairwise comparisons are

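$$w = w_{12} = w_{13} = w_{24} = w_{34} = \Phi\left( -\frac{1}{\tau \sqrt{2}} \right) = 0.0392, \quad w_{23} = w_{32} = \frac{1}{2}, \quad v = w_{14} = \Phi\left( -\frac{\sqrt{2}}{\tau} \right) \approx 0.$$

Finally, by observing $w_{ri} = 1 - w_{ir}$, we obtain

$$p = W'n = \frac{1}{N} \begin{pmatrix} \frac{1}{2} n_1 + (1 - w)(n_2 + n_3) + (1 - v) n_4 \\ w n_1 + \frac{1}{2} (n_2 + n_3) + (1 - w) n_4 \\ w n_1 + \frac{1}{2} (n_2 + n_3) + (1 - w) n_4 \\ v n_1 + w (n_2 + n_3) + \frac{1}{2} n_4 \end{pmatrix}$$

and the nonparametric $AB$-interaction is described by

$$c_p^{AB} = c_{AB}' \, p = p_1 - p_2 - p_3 + p_4 = \frac{n_1 - n_4}{N} \left( \frac{1}{2} - 2w + v \right).$$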

In this example, we obtain $c_{AB}' \, p = 0$ for equal sample sizes , while for we obtain and . This explains the small $p$-value for unequal sample sizes in the example.
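The interaction effect $p_1 - p_2 - p_3 + p_4$ can be evaluated numerically from the pairwise effects $w$ and $v$. The sketch below is illustrative only: the value of `tau` and the sample sizes are placeholders chosen for this example and are not the values of the paper's data set, and the function names are hypothetical. It builds the matrix of pairwise effects from the relations listed above and compares the result with the closed form $(n_1 - n_4)(\frac{1}{2} - 2w + v)/N$.

```python
from math import erf, sqrt
import numpy as np

def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def interaction_effect(n, tau):
    """Return (c_AB' p, w, v) for the relabeled cells 1..4."""
    w = Phi(-1.0 / (tau * sqrt(2.0)))
    v = Phi(-sqrt(2.0) / tau)
    # W[r, i] = w_{ri}; lower triangle filled via w_{ir} = 1 - w_{ri}
    W = np.array([[0.5,     w,       w,       v],
                  [1.0 - w, 0.5,     0.5,     w],
                  [1.0 - w, 0.5,     0.5,     w],
                  [1.0 - v, 1.0 - w, 1.0 - w, 0.5]])
    n = np.asarray(n, dtype=float)
    p = W.T @ (n / n.sum())                 # weighted relative effects, Eq. (2)
    c_ab = np.array([1.0, -1.0, -1.0, 1.0])
    return float(c_ab @ p), w, v

tau = 0.4                                   # placeholder value (assumption)
balanced, w, v = interaction_effect((25, 25, 25, 25), tau)
unbalanced, _, _ = interaction_effect((10, 20, 20, 50), tau)
print(balanced, unbalanced)
```

Only $n_1$ and $n_4$ enter the closed form, so the interaction effect vanishes whenever $n_1 = n_4$ and changes sign when the two sizes are swapped.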

## 5 Explanation of the Paradoxical Results

### 5.1 Unweighted Effects and Pseudo-Ranks

The simple reason for the paradoxical results is the fact that even when all distribution functions underlying the observations are specified, the consistency regions of the rank tests based on $\hat{p}$ are not fixed. Indeed, the consistency regions are defined by the weighted nonparametric relative effects $p_i$, which are not fixed model quantities by which hypotheses could be formulated or for which confidence intervals could be reasonably computed, since the $p_i$ themselves generally depend on the sample sizes $n_i$.

Thus, it appears reasonable to define different nonparametric effects which are fixed model quantities not depending on sample sizes. To this end let $G = \frac{1}{d} \sum_{r=1}^{d} F_r$ denote the unweighted mean distribution, and let $\psi_i = \int G \, dF_i$. Easily interpreted, this nonparametric effect $\psi_i$ measures an effect of the distribution $F_i$ with respect to the unweighted mean distribution $G$ and is therefore a “fixed relative effect”. As $\psi_i$ is the mean of the pairwise nonparametric effects $w_{ri}$, $r = 1, \ldots, d$, it can be written in vector notation as the vector of row means of $W'$, that is,

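$$\psi = \int G \, dF = W' \cdot \frac{1}{d} 1_d = \begin{pmatrix} w_{11} & \cdots & w_{d1} \\ \vdots & \ddots & \vdots \\ w_{1d} & \cdots & w_{dd} \end{pmatrix} \cdot \frac{1}{d} 1_d = \begin{pmatrix} \psi_1 \\ \vdots \\ \psi_d \end{pmatrix}. \tag{20}$$

The fixed relative effects $\psi_i$ can be estimated consistently by the simple plug-in estimator

$$\hat{\psi}_i = \int \hat{G} \, d\hat{F}_i = \frac{1}{N} \left( \bar{R}^{\psi}_{i\cdot} - \frac{1}{2} \right), \tag{21}$$

where $\hat{G} = \frac{1}{d} \sum_{r=1}^{d} \hat{F}_r$ denotes the unweighted mean of the empirical distributions $\hat{F}_1, \ldots, \hat{F}_d$, and $\bar{R}^{\psi}_{i\cdot} = \frac{1}{n_i} \sum_{k=1}^{n_i} R^{\psi}_{ik}$ the mean of the so-called pseudo-ranks

$$\text{ps-rank}(X_{ik}) = R^{\psi}_{ik} = \frac{1}{2} + N \hat{G}(X_{ik}) = \frac{1}{2} + \frac{N}{d} \sum_{r=1}^{d} \frac{1}{n_r} \sum_{\ell=1}^{n_r} c(X_{ik} - X_{r\ell}). \tag{22}$$

Finally, the estimators $\hat{\psi}_i$ are arranged in the vector

$$\hat{\psi} = \begin{pmatrix} \hat{\psi}_1 \\ \vdots \\ \hat{\psi}_d \end{pmatrix} = \int \hat{G} \, d\hat{F} = \frac{1}{N} \left( \bar{R}^{\psi}_{\cdot} - \frac{1}{2} 1_d \right), \tag{23}$$

where $\bar{R}^{\psi}_{\cdot} = (\bar{R}^{\psi}_{1\cdot}, \ldots, \bar{R}^{\psi}_{d\cdot})'$ is the vector of the pseudo-rank means $\bar{R}^{\psi}_{i\cdot}$.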
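Pseudo-ranks differ from ranks only in how the per-group counts are weighted: each group contributes its average count, scaled by $N/d$, instead of its raw count. A minimal Python sketch of (22), next to the ordinary ranks for comparison, could look as follows; the function names and toy data are hypothetical.

```python
import numpy as np

def count(u):
    """Count function c(u): 0, 1/2, 1 for u < 0, u = 0, u > 0."""
    return 0.5 * (np.sign(u) + 1.0)

def ranks_and_pseudo_ranks(samples):
    """Ranks and pseudo-ranks of all observations in pooled order."""
    groups = [np.asarray(s, dtype=float) for s in samples]
    pooled = np.concatenate(groups)
    N, d = pooled.size, len(groups)
    sizes = np.array([g.size for g in groups])
    # counts[m, r] = sum over l of c(X_m - X_rl) for pooled observation m
    counts = np.column_stack(
        [count(pooled[:, None] - g[None, :]).sum(axis=1) for g in groups]
    )
    ranks = 0.5 + counts.sum(axis=1)                       # weights n_r / N
    pseudo = 0.5 + (N / d) * (counts / sizes).sum(axis=1)  # weights 1 / d
    return ranks, pseudo

# Unequal sample sizes: ranks and pseudo-ranks differ
ranks_u, pseudo_u = ranks_and_pseudo_ranks([[1.0, 2.0, 2.0], [2.0, 5.0],
                                            [3.0, 4.0, 6.0, 7.0]])
# Equal sample sizes: ranks and pseudo-ranks coincide
ranks_e, pseudo_e = ranks_and_pseudo_ranks([[1.0, 4.0], [2.0, 6.0], [3.0, 5.0]])
print(ranks_u, pseudo_u)
print(ranks_e, pseudo_e)
```

In the equal-size call the two outputs are identical, while in the unequal-size call observations from the small group carry more weight in the pseudo-ranks than in the ranks.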

It may be noted that the pseudo-ranks $R^{\psi}_{ik}$ have similar properties as the ranks $R_{ik}$. The properties given below follow from the definitions of the ranks and pseudo-ranks and by some straightforward algebra.

###### Lemma 1

Let denote observations arranged in groups each involving observations. Then, for and ,

1. If then and .

2. If then and .

3. .

4. If is a strictly monotone transformation of then,

1. .

5. .

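6. If $n_i = n$ for all $i = 1, \ldots, d$, then $R_{ik} = R^{\psi}_{ik}$.

7. Let $X_{ik}$, $i = 1, \ldots, d$, $k = 1, \ldots, n_i$, be independent and identically distributed random variables, then

 $$E(R_{ik}) = E(R^{\psi}_{ik}) = \frac{N + 1}{2}.$$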

### 5.2 Consistency Regions of Pseudo-Rank Procedures

As a solution to the paradoxical results discussed in Sections 3 and 4, we demonstrate that replacing the ranks $R_{ik}$ with the pseudo-ranks $R^{\psi}_{ik}$ leads to procedures that do not have these undesirable properties. The main reason is that pseudo-rank procedures are based on the (unweighted) relative effects $\psi_i$, which are fixed model quantities by which hypotheses can be formulated and for which confidence intervals can be derived. In case of equal sample sizes $n_i \equiv n$, $i = 1, \ldots, d$, we do not obtain paradoxical results since in this case ranks and pseudo-ranks coincide, $R_{ik} = R^{\psi}_{ik}$ (see Lemma 1).

Pseudo-rank based inference procedures are obtained in much the same way as the common rank procedures, by using relations (21) and (22), which generally means substi