1 Introduction
Multiple and combined endpoints that also involve non-normal outcomes appear in clinical trials in many areas of medicine, where the outcome is not always observed on a metric scale. In some cases, the outcome can be observed only on an ordinal or even dichotomous scale. The success of two therapies can then only be assessed by comparing the outcomes of two arbitrarily selected patients from the two therapy groups as 'better', 'equal' or 'worse'. Now let $X$ denote the outcome of therapy $A$ and $Y$ the outcome of therapy $B$. Then the three potential results

$A$ is better than $B$,

$A$ is equal or comparable to $B$,

$A$ is worse than $B$

can be quantified by the three probabilities $p^+$, $p^0$ and $p^-$, respectively, where $p^+ + p^0 + p^- = 1$. The outcomes $X$ and $Y$ can be measured or observed on an appropriate metric or ordinal scale.
To compare the underlying distributions of $X$ and $Y$, the Mann-Whitney test (1947) has been established for many decades. This test was developed to test the hypothesis of equal distributions in the case of continuous distributions, i.e. for the case of no ties where $p^0 = P(X = Y) = 0$. The original Mann-Whitney test is consistent against alternatives of the form $p^+ \neq 1/2$, where $p^+$ is the probability that $A$ is better than $B$. Later, Putter (1955) considered the case where ties are also admitted ($p^0 > 0$) and showed that this modified test is based on the quantity
(1.1) $\theta = p^+ + \tfrac{1}{2}\, p^0$
and is consistent against alternatives of the form $\theta \neq 1/2$. Giving credit to Wilcoxon (1945, 1947), this test is also called the Wilcoxon-Mann-Whitney test (WMW test). The quantity $\theta$ can be interpreted as the probability that therapy $A$ is better than $B$ (plus $1/2$ times the probability that the two therapies are comparable). For the clinician, however, it is less comprehensible since it is not obvious what the benefit for a patient would be for a given value of $\theta$. For this reason, Noether (1987) introduced the effect $p^+ / (1 - p^+)$ as a well comprehensible effect assuming continuous distributions ($p^0 = 0$). Unfortunately, this quantity was denoted as 'odds-ratio' in that paper although these are odds, since $1 - p^+ = p^-$ for $p^0 = 0$. Moreover, as that paper appeared in a more theoretically oriented journal, this quantity was not noticed by practitioners and clinicians.
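Fortunately, this idea was taken up again by Pocock et al. (2012) as an intuitive and well comprehensible effect and was denoted as 'win-ratio' (WR), the ratio of the probability $p^+$ of a better result to the probability $p^-$ of a worse result,
(1.2) $\mathrm{WR} = \dfrac{p^+}{p^-}$.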
Later, this quantity was also suggested by Wang and Pocock (2016) for general non-normal outcomes in clinical trials. Unlike Noether (1987), however, Wang and Pocock (2016) explicitly allowed for ties in the data. This means that $p^0 > 0$ is admitted without including the term $p^0$ in the definition of the WR. Motivated by the consideration of effects for ordinal data, O'Brien and Castelloe (2006) suggested the quantity
$\theta / (1 - \theta)$
as a well interpretable effect but did not consider this quantity in more detail. Later, Dong et al. (2019) discussed a statistic $\hat{\theta} / (1 - \hat{\theta})$, where
$\hat{\theta}$ denotes the estimator of the Mann-Whitney effect $\theta = p^+ + \tfrac{1}{2}\, p^0$
in its generalized version including the case of ties (Putter, 1955). Over time, this quantity has been given many different names in the different areas of application. For continuous distributions of $X$ and $Y$, Birnbaum and Klose (1957) considered a function which they called the 'relative distribution' of the two variables. Since the Mann-Whitney effect is the expectation of this function, it is called the 'relative effect' with regard to Birnbaum and Klose (1957); see, for example, Brunner and Puri (1996, 2001) and the references cited therein. This terminology points out that the quantity describes an effect of one distribution with respect to the other. Its extension $\theta$, which is also valid in the case of ties, reduces to $p^+$ for continuous distributions since $p^0 = 0$ in this case.

When comparing two therapies $A$ and $B$, a success of $A$ in relation to $B$ can be described by the probability $\theta$, where $\theta > 1/2$ means a success of $A$ over $B$. Then the quantity $\theta / (1 - \theta)$ is the chance to obtain a better result by applying $A$ instead of $B$. Therefore it shall be called success-odds (SO) and is denoted by
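(1.3) $\mathrm{SO} = \dfrac{\theta}{1 - \theta} = \dfrac{p^+ + \tfrac{1}{2}\, p^0}{p^- + \tfrac{1}{2}\, p^0}$,
relating a success $\theta > 1/2$ to success-odds $\mathrm{SO} > 1$. Basically, it is a simple modification of the win-ratio obtained by adding half of the probability of ties, $\tfrac{1}{2}\, p^0$, to the numerator and the denominator, which extends the win-ratio (and in turn Noether's ratio) to the case of ties. Note that the relative effect $\theta$ quantifies the nonparametric effect of the WMW test in the case of ties and the consistency region of this test is given by $\theta \neq 1/2$, or equivalently $\mathrm{SO} \neq 1$.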
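The relation between the relative effect, the win-ratio and the success-odds can be illustrated by a minimal Python sketch. The function names and the two example distributions are hypothetical and chosen for illustration only; the sketch also assumes that a larger outcome value is the better result.

```python
import math


def pairwise_probabilities(dist_a, dist_b):
    """Return (p_plus, p_zero, p_minus) for two discrete distributions.

    dist_a, dist_b: dicts mapping an outcome value to its probability.
    Assumption: a larger outcome value means a better result.
    """
    p_plus = p_zero = p_minus = 0.0
    for a, prob_a in dist_a.items():
        for b, prob_b in dist_b.items():
            if a > b:
                p_plus += prob_a * prob_b    # A better than B
            elif a == b:
                p_zero += prob_a * prob_b    # tie
            else:
                p_minus += prob_a * prob_b   # A worse than B
    return p_plus, p_zero, p_minus


def relative_effect(p_plus, p_zero):
    """Mann-Whitney effect including ties: theta = p+ + p0 / 2."""
    return p_plus + 0.5 * p_zero


def win_ratio(p_plus, p_minus):
    """Win-ratio p+ / p-; not finite if no pair favours B."""
    if p_minus == 0:
        return math.inf if p_plus > 0 else math.nan
    return p_plus / p_minus


def success_odds(p_plus, p_zero, p_minus):
    """Success-odds theta / (1 - theta) = (p+ + p0/2) / (p- + p0/2)."""
    denominator = p_minus + 0.5 * p_zero
    if denominator == 0:
        return math.inf
    return (p_plus + 0.5 * p_zero) / denominator


# Hypothetical ordinal distributions on the scores 1, 2, 3.
example_a = {1: 0.2, 2: 0.3, 3: 0.5}
example_b = {1: 0.5, 2: 0.3, 3: 0.2}

p_plus, p_zero, p_minus = pairwise_probabilities(example_a, example_b)
theta = relative_effect(p_plus, p_zero)
wr = win_ratio(p_plus, p_minus)
so = success_odds(p_plus, p_zero, p_minus)
print(f"ties={p_zero:.3f} theta={theta:.3f} WR={wr:.3f} SO={so:.3f}")
```

With ties present and $A$ the better therapy, the success-odds in this sketch lies between 1 and the win-ratio.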
It is the aim of this manuscript to investigate the properties of the WR and the SO in the case of ties, since ties are included in the definition of the SO but not in the definition of the WR.
2 Comparison of two Treatments
2.1 Illustration of
First we consider the simple case of two treatments and as explained in the introduction. In general, it holds for and that
(2.4) 
where must be assumed. Equality in (2.4) holds if and only if

either (i.e. no ties)

or ().
Thus, in all other cases, , by definition. In the sequel, the impact of ties on the WR and on the SO shall be demonstrated by means of some examples. In the first example it is demonstrated that invalidates the WR but not the SO .
Example 2.1
(Pairwise comparisons of 3 treatments) In this example, three distributions and defined on an ordinal scale are compared. The ordinal categories are labeled by 1, 2 and 3 where the result is better than the results or and the result is better than . Let denote the distribution of the result for treatment , for treatment , and for treatment . The probabilities for the results and of the discrete distributions , are displayed in Table 1.
Table 1
The results for the treatments , and are described by the distributions , and with probabilities , and for the discrete outcomes , and .
Treatment  Probabilities  

The values of the relative effect , as well as of the effects and for the pairwise comparisons of the treatments , and are listed in Table 2.
Table 2
Pairwise comparisons of the treatments , and by means of the related relative effect , the SO , and the WR .
Comparison  

Obviously, it can be seen from Table 1 that treatment is much better than treatment and also better than treatment while treatment is slightly better than . This is well characterized by the relative effect and by the SO while the WR is not able to reasonably describe the successes of the treatments. It shall be noted that in the present example the pairwise comparisons cannot lead to nontransitive decisions since the three distributions are stochastically ordered. This is immediately seen from Table 1 where for all .
2.2 Metric and Ordinal Data
Example 2.2
(Coarsening of the measurement scale) This example demonstrates that a coarsening of the measurement scale can lead to an increase of the WR while the relative effect and the SO may remain unchanged. When coarsening the measurement scale, the means in the case of metric data, and in turn their differences, as well as the relative effects may change. Therefore, the distributions in this example are chosen in such a way that the means in both treatments remain unchanged in the three steps of the coarsening. In the same way, the relative effects remain unchanged, which implies that the SO also remains unchanged. The proportion of ties, however, increases in the three steps of the coarsening, which leads to an increase of the WR. The coarsening of the measurements in the three steps was performed by rounding the measurements as described in Table 3.
Table 3
Description of rounding measurements in three steps.
Case (1)  The measurements are observed with an accuracy of one decimal place.
Case (2)  The measurements are rounded to integers.
Case (3)  The measurements within the interval [3, 4] are rounded to the mean 3.5 of this interval while the other values remain integers.
Case (4)  The measurements within the interval [2, 5] are rounded to the mean 3.5 of this interval while the other values remain integers.
Neither a parametric nor a nonparametric model yields different treatment effects across the cases, since the means in the two treatments (4 and 3), and in turn their difference, as well as the relative effect remain the same in the three steps of the coarsening. Thus, the SO is identical in all steps. The WR, however, increases from 2.125 to 2.8 and becomes infinite in the last step. The measurements and their coarsening are listed in Table 4 along with the means for the two treatments.
Table 4
Measurements for the two treatments (first row) and the same measurements rounded as described in Table 3 (rows 2 to 4).
Case  Treatment A (five measurements)  Treatment B (five measurements)  Mean A  Mean B
1  1.7  3.3  3.8  4.9  6.3  1.4  1.6  2.7  4.3  5.0  4  3 
2  2  3  4  5  6  1  2  3  4  5  4  3 
3  2  3.5  3.5  5  6  1  2  3.5  3.5  5  4  3 
4  3.5  3.5  3.5  3.5  6  1  3.5  3.5  3.5  3.5  4  3 
The proportion of ties, the difference of the means, the relative effect, the SO and the WR are listed in Table 5.
Table 5
Changes of the WR for the comparison of the two treatments when coarsening the measurement scale, where the proportion of ties is increased while the means as well as the relative effects remain unchanged.
Case  Ties  Diff.  Relative Effect  SO  WR
1  0.00  1  0.68  2.125  2.125
2  0.16  1  0.68  2.125  2.5
3  0.24  1  0.68  2.125  2.8
4  0.64  1  0.68  2.125  ∞
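The entries of Table 5 can be recomputed from the measurements in Table 4 by counting all 25 pairs of observations. The following Python sketch is one way to do this; the function name is hypothetical and a larger measurement is assumed to be the better result.

```python
import math


def effects_from_samples(xs, ys):
    """Return (proportion of ties, relative effect, SO, WR) for two samples.

    Assumption: a larger measurement is the better result for the first sample.
    """
    wins = sum(x > y for x in xs for y in ys)
    ties = sum(x == y for x in xs for y in ys)
    losses = sum(x < y for x in xs for y in ys)
    n_pairs = len(xs) * len(ys)
    theta = (wins + 0.5 * ties) / n_pairs
    so = theta / (1 - theta) if theta < 1 else math.inf
    wr = wins / losses if losses > 0 else math.inf
    return ties / n_pairs, theta, so, wr


# Measurements of both treatments for the four cases of Table 4.
cases = {
    1: ([1.7, 3.3, 3.8, 4.9, 6.3], [1.4, 1.6, 2.7, 4.3, 5.0]),
    2: ([2, 3, 4, 5, 6], [1, 2, 3, 4, 5]),
    3: ([2, 3.5, 3.5, 5, 6], [1, 2, 3.5, 3.5, 5]),
    4: ([3.5, 3.5, 3.5, 3.5, 6], [1, 3.5, 3.5, 3.5, 3.5]),
}

results = {case: effects_from_samples(a, b) for case, (a, b) in cases.items()}
for case, (ties, theta, so, wr) in results.items():
    print(f"case {case}: ties={ties:.2f} theta={theta:.2f} SO={so:.3f} WR={wr:.3f}")
```

In the last case no pair favours the second treatment, so the WR is reported as infinite while the SO is still 2.125.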
Example 2.3
(Combining ordinal categories) In this example it is demonstrated how the WR might change if in an ordinal scale involving 6 ordinal categories 1, 2, 3, 4, 5, 6 the three categories 3, 4, 5 are combined in a new category 4. The relative effect and the SO remain unchanged in this case.
The probabilities of the results for the two treatments are displayed in the upper part of Table 6; the proportion of ties, the relative effect, the SO and the WR are displayed in the lower part of Table 6. It may be noted that here the SO is smaller than the WR by definition, since ties occur with positive probability, according to the explanations in Section 2.1.
Table 6
Probabilities for the ordinal scores 1 to 6 for the two treatments A and B, the proportion of ties, the relative effect, the SO, and the WR.
Treatment  Score 1  Score 2  Score 3  Score 4  Score 5  Score 6
A  0.0  0.1  0.2  0.3  0.2  0.2
B  0.3  0.3  0.1  0.2  0.1  0.0
Ties  Relative Effect  SO  WR
0.13  0.805  4.13  5.69
The probabilities for the results of the two treatments after combining the categories are listed in the upper part of Table 7. Here, the categories 3, 4, 5 are combined into a new category 4. In the lower part of Table 7, the proportion of ties, the relative effect, the SO and the WR are listed for the new categories. Compared with Table 6, the proportion of ties increased from 0.13 to 0.31 while the relative effect and in turn the SO remained unchanged, but the WR increased from 5.69 to 16.25.
Table 7
Probabilities of the combined ordinal scores 1, 2, 4, 6 for the two treatments A and B as well as the proportion of ties, the relative effect, the SO, and the WR.
Treatment  Score 1  Score 2  Score 4  Score 6
A  0.0  0.1  0.7  0.2
B  0.3  0.3  0.4  0.0
Ties  Relative Effect  SO  WR
0.31  0.805  4.13  16.25
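The values in the lower parts of Tables 6 and 7 can be checked with a short Python sketch that works directly on the category probabilities. The helper names are hypothetical, and a higher score is assumed to be the better result.

```python
import math


def effects(dist_a, dist_b):
    """Return (proportion of ties, relative effect, SO, WR).

    dist_a, dist_b: dicts mapping an ordinal score to its probability.
    Assumption: a higher score is the better result.
    """
    pairs = [(a, b, pa * pb) for a, pa in dist_a.items() for b, pb in dist_b.items()]
    p_plus = sum(p for a, b, p in pairs if a > b)
    p_zero = sum(p for a, b, p in pairs if a == b)
    p_minus = sum(p for a, b, p in pairs if a < b)
    theta = p_plus + 0.5 * p_zero
    so = theta / (1 - theta) if theta < 1 else math.inf
    wr = p_plus / p_minus if p_minus > 0 else math.inf
    return p_zero, theta, so, wr


def merge_categories(dist, old_scores, new_score):
    """Combine the categories in old_scores into the single category new_score."""
    merged = {}
    for score, prob in dist.items():
        key = new_score if score in old_scores else score
        merged[key] = merged.get(key, 0.0) + prob
    return merged


# Probabilities from Table 6.
treatment_a = {1: 0.0, 2: 0.1, 3: 0.2, 4: 0.3, 5: 0.2, 6: 0.2}
treatment_b = {1: 0.3, 2: 0.3, 3: 0.1, 4: 0.2, 5: 0.1, 6: 0.0}

original = effects(treatment_a, treatment_b)
combined = effects(
    merge_categories(treatment_a, {3, 4, 5}, 4),
    merge_categories(treatment_b, {3, 4, 5}, 4),
)
for label, (ties, theta, so, wr) in (("Table 6", original), ("Table 7", combined)):
    print(f"{label}: ties={ties:.2f} theta={theta:.3f} SO={so:.2f} WR={wr:.2f}")
```

Combining adjacent categories only moves probability mass from wins and losses into ties in equal amounts, which is why the relative effect and the SO are unaffected while the WR changes.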
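2.3 Dichotomous Data
In the case of binary data for two treatments $A$ and $B$ with success probabilities $q_A$ and $q_B$, the quantities WR and SO are given by
$\mathrm{WR} = \dfrac{q_A (1 - q_B)}{q_B (1 - q_A)}, \qquad \mathrm{SO} = \dfrac{q_A (1 - q_B) + \tfrac{1}{2}\, p^0}{q_B (1 - q_A) + \tfrac{1}{2}\, p^0},$
where the probability of ties is $p^0 = q_A q_B + (1 - q_A)(1 - q_B) > 0$ and thus, by definition, $\mathrm{SO} < \mathrm{WR}$ whenever $q_A > q_B$. In this particular case, the SO may be considerably smaller than the WR which, in the case of dichotomous data, equals the well-known odds-ratio
$\mathrm{OR} = \dfrac{q_A / (1 - q_A)}{q_B / (1 - q_B)},$
which is the ratio of the odds of success of both treatments, while the SO is based on the well-accepted Mann-Whitney effect (relative effect) in its generalized form (Putter, 1955), which includes the case of ties.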
Example 2.4
The aim of this example is to investigate whether the win-ratio (or the odds-ratio, OR) and the success-odds are intuitive and well interpretable quantities to describe a treatment effect of one therapy with respect to another in the case of dichotomous data. The success rates as well as the failure rates of the two therapies are displayed in Figure 1.
Figure 1
The results of the two treatments and with dichotomous endpoints are displayed in the two graphs. The success probabilities are and in the lefthand graph and and int the righthand graph. Obviously, in the lefthand graph a clear difference of the successes of both therapies can be seen while in the righthand graph nearly no difference can be recognized between the two treatments. Moreover, in the righthand graph about of the results (or more precisely, ) are identical. These circumstances, however, are not depicted by the WinRatio since in both cases, . In contrast, the SuccessOdds intuitively depicts this actual situation since in the left graph is larger than in the right graph.
It appears that, for dichotomous data, the win-ratio neither provides an intuitive and well interpretable quantification of a treatment effect nor depicts an intuitive success of one therapy over the other. In the sequel this is demonstrated by another example involving dichotomous data.
Example 2.5
Consider the case where the success rate of therapy A is increased from 0.9 to 0.95 while the success rate of therapy B is kept fixed. Moreover, the percentage of ties remains nearly constant when the success rate of B is fixed. The results are listed in the following table.
Table 8
Comparison between the win-ratio and the success-odds to intuitively depict a superiority of therapy A over therapy B.
Success rate A  Success rate B  Ties  WR  SO
0.9  0.5  0.5  9.0  2.3 
0.95  0.5  0.5  19.0  2.6 
0.9  0.6  0.58  6.0  1.9 
0.95  0.6  0.59  12.7  2.1 
0.9  0.7  0.66  3.9  1.5 
0.95  0.7  0.68  8.1  1.7 
It appears from Table 8 that the win-ratio is approximately doubled, independently of the success rate of therapy B, if the success rate of therapy A is slightly increased from 0.9 to 0.95. In a graphical representation, this difference would hardly be recognized.
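The rows of Table 8 follow directly from the two success probabilities. A minimal Python sketch, with a hypothetical function name and under the assumption that the two outcomes are independent Bernoulli variables, reproduces them:

```python
def dichotomous_effects(q_a, q_b):
    """Return (proportion of ties, WR, SO) for two binary outcomes.

    q_a, q_b: success probabilities of therapies A and B, both strictly
    between 0 and 1 (assumption, so that all ratios are finite).
    """
    p_plus = q_a * (1 - q_b)                     # A succeeds, B fails
    p_minus = q_b * (1 - q_a)                    # B succeeds, A fails
    p_zero = q_a * q_b + (1 - q_a) * (1 - q_b)   # both succeed or both fail
    theta = p_plus + 0.5 * p_zero
    return p_zero, p_plus / p_minus, theta / (1 - theta)


rows = []
for q_b in (0.5, 0.6, 0.7):
    for q_a in (0.9, 0.95):
        ties, wr, so = dichotomous_effects(q_a, q_b)
        rows.append((q_a, q_b, ties, wr, so))
        print(f"{q_a:.2f}  {q_b:.1f}  {ties:.2f}  {wr:.1f}  {so:.1f}")
```

In this setting the relative effect simplifies to $(1 + q_A - q_B)/2$, so the SO depends only on the difference of the success rates, whereas the WR reacts strongly when a success rate approaches 1.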
In conclusion, it appears that in the case of dichotomous data, the win-ratio loses its appealing property to provide an intuitive quantification of a therapy effect as a chance to obtain a better result by applying one therapy instead of the other.
In the next section, the conclusions from the examples presented in the previous sections shall be summarized and discussed.
2.4 Discussion of the WinRatio for Two Samples
Basically, the idea of the win-ratio to provide an intuitive and well interpretable effect when the result of a therapy can only be assessed by 'better', 'worse' or 'comparable' is to be welcomed. However, the proportion of ties (comparable results) must be included in its definition since ties are allowed in the model. Otherwise, this quantity has some undesirable properties.

The computation of the WR breaks down if while the SO depicts this case also and can only break down in the case where and , i.e. in the trivial case of a onepoint distribution (see the discussion in Section 2.1).

In the case of dichotomous data, the win-ratio in general loses its appealing property to provide an intuitive quantification of a therapy effect. This, however, was the basic idea of the win-ratio. An example is discussed in Section 2.3.
Thus, the nice idea of the win-ratio should only be used in its modified or improved form of the success-odds, which already appeared in the literature in the conference paper by O'Brien and Castelloe (2006), unfortunately without any further discussion. It extends Noether's idea to provide an intuitive treatment effect for the Mann-Whitney test to the case of ties. Also, Dong et al. (2019) as well as Gasparyan and Koch (2019) consider the success-odds but did not discuss the drawbacks of the win-ratio in the case of ties. These drawbacks were first considered in detail in the talk by Brunner (2019) at the fall workshop of the working group 'Statistical Methods in Medical Research' of the IBS / DR in Hamburg on November 22, 2019.
In summary, the success-odds can be recommended as an improved version of the win-ratio. Therefore, the next section briefly discusses tests and confidence intervals for the success-odds and, for completeness, also for the win-ratio.
2.5 Tests and Confidence Intervals for the Win-Ratio and the Success-Odds
2.5.1 Win-Ratio
The asymptotic distribution of the estimator of the win-ratio and confidence intervals for the win-ratio have been derived by Bebu and Lachin (2016) and by Dong et al. (2016), where R and SAS programs to perform the computations are also provided.
2.5.2 Success-Odds
Estimators for the relative effect in (1.1) are available from the literature. A test of the hypothesis that the relative effect equals 1/2, in a general model including the case of ties, is considered by Brunner and Munzel (2000), for example. This is known as the nonparametric Behrens-Fisher problem.
It may be noted that this hypothesis is equivalent to the hypothesis that the success-odds equals 1. For more details we refer to Section 3.5 of the textbook by Brunner, Bathke, and Konietschke (2019), where a range-preserving confidence interval for the relative effect is also derived in Section 3.7.2. This can easily be extended to the success-odds by a suitable transformation using Cramér's theorem and then back-transforming to the scale of the success-odds. The R package rankFD (CRAN), which performs the computations of these quantities, is described in Section A.2.2 of this book.
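As an illustration of such a transformation argument, the following Python sketch computes an approximate confidence interval for the success-odds on the logit scale. It is only a sketch under stated assumptions: the logit transformation is one possible choice and is not confirmed to be the one used in the cited textbook, the estimate and its standard error are assumed to be supplied by a separate procedure (for example a Brunner-Munzel type variance estimator), and the function name and the numerical inputs are hypothetical.

```python
import math
from statistics import NormalDist


def success_odds_ci(theta_hat, se_theta, level=0.95):
    """Approximate confidence interval for the success-odds.

    theta_hat: estimated relative effect, strictly between 0 and 1.
    se_theta:  estimated standard error of theta_hat (assumed to be given).
    The delta method (Cramer's theorem) is applied to
    logit(theta) = log(theta / (1 - theta)) = log(SO); the interval is then
    back-transformed with the exponential function, so it stays positive.
    """
    if not 0 < theta_hat < 1:
        raise ValueError("theta_hat must lie strictly between 0 and 1")
    z = NormalDist().inv_cdf(0.5 + level / 2)
    log_so = math.log(theta_hat / (1 - theta_hat))
    # Derivative of logit(theta) is 1 / (theta * (1 - theta)).
    se_log_so = se_theta / (theta_hat * (1 - theta_hat))
    lower = math.exp(log_so - z * se_log_so)
    upper = math.exp(log_so + z * se_log_so)
    return lower, upper


# Hypothetical inputs for illustration only.
lower, upper = success_odds_ci(theta_hat=0.68, se_theta=0.08)
print(f"SO = {0.68 / 0.32:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

Because the interval is built on the log scale, it cannot contain negative values, and the hypothesis that the success-odds equals 1 is rejected exactly when 1 lies outside the interval.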
3 Comparison of Several Distributions
Pairwise comparisons using procedures based on the relative effect may lead to nontransitive decisions. This is well known for the Wilcoxon-Mann-Whitney test, for example, and also holds true for the WR and the SO. This shall be demonstrated by the so-called tricky dice (see, e.g., Peterson, 2002 or Gardner, 1970). For example, the following three dice
D1:  1  4  5  6  7  7 

D2:  3  3  4  5  6  9 
D3:  1  2  2  8  8  9 
lead to paradoxical results when pairwise comparisons are performed:

D1 vs. D2: relative effect 0.57, SO 1.32, WR 1.36,

D2 vs. D3: relative effect 0.57, SO 1.32, WR 1.33,

D3 vs. D1: relative effect 0.57, SO 1.32, WR 1.33,
which means that die D1 is better than D2, die D2 is better than D3, and that finally die D3 is better than D1. A solution of this nontransitivity problem might be to compare each die with a common casino-type die, for example a roller representing a mixture of all three dice. This is basically the principle underlying the Kruskal-Wallis test, which compares each distribution with a weighted mean distribution. In the example presented above one obtains a relative effect of 1/2 for each die, since in all cases the effects against the three dice in the mixture (0.57, 0.43, and 1/2 against the die itself) average to 1/2. For a different common casino-type die, of course, one could obtain a different result.
4 Stratified Designs
When using a stratified version of the Wilcoxon-Mann-Whitney test, for example van Elteren's test (1960), a similar paradoxical decision might occur. An example is given in Thangavelu and Brunner (2007). This is briefly described below.

Stratum  Relative Effect  SO  WR
1  0.57  1.32  1.36
2  0.57  1.32  1.33
3  0.57  1.32  1.33
Means  0.57  1.32  1.34
Since the means for both therapies are averaged over the same three distributions on which the faces of the dice are based, it follows that the averaged distributions of the two therapies coincide and thus the relative effect equals 1/2 and both the SO and the WR equal 1. Thus, both therapies have equal successes. The means of the stratified versions of the relative effects, the success-odds, and the win-ratios averaged over the three strata, however, demonstrate a superiority of the first therapy (0.57, 1.32 and 1.34) over the second. In some sense, this is similar to Simpson's paradox and is explained by the nontransitivity of the pairwise comparisons of the dice. Thus, different procedures must be developed for stratified designs, which are beyond the scope of this manuscript and shall be discussed elsewhere.
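The pairwise comparisons of the three dice from Section 3 and the stratum means reported above can be recomputed with a short Python sketch. The function name is hypothetical, a higher face value is assumed to win, and the assignment of the dice pairs to the three strata (D1 vs. D2, D2 vs. D3, D3 vs. D1) is an assumption that is consistent with the reported values.

```python
from itertools import product

DICE = {
    "D1": [1, 4, 5, 6, 7, 7],
    "D2": [3, 3, 4, 5, 6, 9],
    "D3": [1, 2, 2, 8, 8, 9],
}


def compare(first, second):
    """Return (relative effect, SO, WR) of the first die against the second.

    Assumption: the higher face value wins.
    """
    wins = sum(x > y for x, y in product(first, second))
    ties = sum(x == y for x, y in product(first, second))
    losses = sum(x < y for x, y in product(first, second))
    n_pairs = len(first) * len(second)
    theta = (wins + 0.5 * ties) / n_pairs
    return theta, theta / (1 - theta), wins / losses


# One pairwise comparison per stratum (assumed assignment).
pairs = [("D1", "D2"), ("D2", "D3"), ("D3", "D1")]
strata = [compare(DICE[a], DICE[b]) for a, b in pairs]
for (a, b), (theta, so, wr) in zip(pairs, strata):
    print(f"{a} vs. {b}: theta={theta:.2f} SO={so:.2f} WR={wr:.2f}")

# Unweighted means over the three strata.
means = [sum(column) / len(strata) for column in zip(*strata)]
print("Means: theta={:.2f} SO={:.2f} WR={:.2f}".format(*means))
```

Each comparison favours the first die of the pair, which makes the cycle D1 over D2, D2 over D3, D3 over D1 explicit and shows why averaging the stratum-wise effects suggests a superiority that does not exist for the pooled distributions.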
5 Discussion and Outlook
The idea of the win-ratio to provide a well interpretable and clear effect for the clinician is excellent and to be welcomed. The quantity as it stands, however, has some strange and undesirable properties. Thus, the win-ratio should be slightly modified. Such a modification, called 'success-odds', is suggested here and it has been demonstrated that it does not have the drawbacks of the win-ratio in the case of ties. Moreover, theoretical results are available from the literature by which the asymptotic distribution of an estimator of the success-odds is easily obtained. Thus, a test of the hypothesis that the success-odds equals 1 as well as a confidence interval for the success-odds can be derived using Cramér's theorem (see, e.g., Brunner et al., 2019, Sections 3.5, 3.7.2, 7.4, 7.5, and 7.6.1).
The generalization to several samples and stratified designs, however, is not straightforward since decisions based on the WR or the SO may be nontransitive, as briefly demonstrated by counterexamples in Sections 3 and 4. Reasonable extensions of the success-odds to several samples, stratified and factorial designs are currently under investigation.
6 Acknowledgment and Remarks
The topic of this manuscript was presented in a talk by the author at the fall workshop of the working group on 'Statistical Methods in Medicine' of the IBS/DR in Hamburg on November 22, 2019. The author would like to thank the audience of this workshop for helpful comments and remarks. A handout in German for that talk was available at the workshop. The present English version is based on this handout.
7 References

Bebu, I., Lachin, J.M. (2016). Large sample inference for a win ratio analysis of a composite outcome based on prioritized components. Biostatistics 17, 178–187.

Birnbaum, Z. W. and Klose, O. M. (1957). Bounds for the Variance of the Mann-Whitney Statistic. Annals of Mathematical Statistics 28, 933–945.

Brunner, E. (2019). Win-Ratio und Mann-Whitney-Odds. Talk presented at the fall workshop of the working group 'Statistical Methods in Medical Research' of the IBS / DR in Hamburg on November 22, 2019.
https://www.unimedizinmainz.de/smde/herbstworkshop2019.html

Brunner, E., Bathke, A. C., and Konietschke, F. (2019). Rank and Pseudo-Rank Procedures for Independent Observations in Factorial Designs: Using R and SAS. Springer Series in Statistics, Springer, Heidelberg.

Brunner, E. and Munzel, U. (2000). The Nonparametric Behrens-Fisher Problem: Asymptotic Theory and a Small-Sample Approximation. Biometrical Journal 42, 17–25.

Brunner, E. and Puri, M. L. (1996). Nonparametric methods in design and analysis of experiments. Handbook of Statistics (S. Ghosh and C.R. Rao, Eds.) 13, 631–703.

Brunner, E. and Puri, M. L. (2001). Nonparametric Methods in Factorial Designs. Statistical Papers 42, 1–52.

Dong, G., Li, D., Ballerstedt, S., and Vandemeulebroecke, M. (2016). A generalized analytic solution to the win ratio to analyze a composite endpoint considering the clinical importance order among components. Pharmaceutical Statistics 15, 430–437.

Dong, G., Hoaglin, D.C., Qiu, J., Matsouaka, R.A., Chang, Y.W., Wang, J., Vandemeulebroecke, M. (2019). The Win Ratio: On Interpretation and Handling of Ties. Statistics in Biopharmaceutical Research, DOI: 10.1080/19466315.2019.1575279

Gardner, M. (1970). The paradox of the nontransitive dice and the elusive principle of indifference. Scientific American 223, 110–114.

Gasparyan, S.B., Folkvaljon, F., Bengtsson, O., Buenconsejo, J., Koch, G.G. (2019). Adjusted Win Ratio with Stratification: Calculation Methods and Interpretation. arXiv:1912.09204v1 [stat.ME] 19 Dec 2019.

Mann, H. B. and Whitney, D. R. (1947). On a test of whether one of two random variables is stochastically larger than the other. Annals of Mathematical Statistics 18, 50–60.

Noether, G. E. (1987). Sample Size Determination for Some Common Nonparametric Tests. Journal of the American Statistical Association 82, 645–647.

O'Brien, R. G., and Castelloe, J. M. (2006). Exploiting the Link Between the Wilcoxon-Mann-Whitney Test and a Simple Odds Statistic. In: Proceedings of the Thirty-First Annual SAS Users Group International Conference, Cary, NC: SAS Institute Inc.
http://www2.sas.com/proceedings/sugi31/20931.pdf . 
Peterson, I. (2002). Tricky Dice Revisited. Science News 161,
http://www.sciencenews.org/article/trickydicerevisited. 
Pocock, S.J., Ariti, C.A., Collier, T.J., Wang, D. (2012). The win ratio: a new approach to the analysis of composite endpoints in clinical trials based on clinical priorities. European Heart Journal 33, 176–182.

Putter, J. (1955). The Treatment of Ties in Some Nonparametric Tests. The Annals of Mathematical Statistics 26, 368–386.

Thangavelu, K. and Brunner, E. (2007). Wilcoxon-Mann-Whitney Test for Stratified Samples and Efron's Paradox Dice. Journal of Statistical Planning and Inference 137, 720–737.

Van Elteren, P. H. (1960). On the Combination of Independent Two-Sample Tests of Wilcoxon. Bulletin of the International Statistical Institute 37, 351–361.

Wang, D., Pocock, S.J. (2016). A win ratio approach to comparing continuous non-normal outcomes in clinical trials. Pharmaceutical Statistics 15, 238–245.

Wilcoxon, F. (1945). Individual comparisons by ranking methods. Biometrics Bulletin 1, 80–83.

Wilcoxon, F. (1947). Probability Tables for Individual Comparisons by Ranking Methods. Biometrics 3, 119–122.