 # Success-Odds: An improved Win-Ratio

Multiple and combined endpoints that also involve non-normal outcomes appear in many clinical trials in various areas of medicine. In some cases, the outcome can be observed only on an ordinal or dichotomous scale. Then the success of two therapies is assessed by comparing the outcomes of two randomly selected patients from the two therapy groups as 'better', 'equal' or 'worse'. These outcomes can be described by the probabilities p^-=P(X<Y), p_0=P(X=Y), and p^+ =P(X > Y). For a clinician, however, these quantities are less intuitive. Therefore, Noether (1987) introduced the quantity λ=p^+ / p^- assuming continuous distributions. The same quantity was used by Pocock et al. (2012) and by Wang and Pocock (2016) also for general non-normal outcomes and has been called 'win-ratio' λ_WR. Unlike Noether (1987), Wang and Pocock (2016) explicitly allowed for ties in the data. It is the aim of this manuscript to investigate the properties of λ_WR in case of ties. It turns out that it has the strange property of becoming larger if the data are observed less accurately, i.e. include more ties. Thus, in case of ties, the win-ratio loses its appealing property to describe and quantify an intuitive and well interpretable treatment effect. Therefore, a slight modification of λ_WR is suggested, namely the so-called 'success-odds' λ_SO = θ / (1-θ), where θ=p^+ + 1/2 p_0, and a therapy is called a success if θ>1/2. In the case of no ties, λ_SO is identical to λ_WR. A test for the hypothesis λ_SO=1 and range-preserving confidence intervals for λ_SO are derived. By two counterexamples it is demonstrated that generalizations of both the win-ratio and the success-odds to more than two treatments or to stratified designs are not straightforward and need more detailed considerations.

## Authors


## 1 Introduction

Multiple and combined endpoints that also involve non-normal outcomes appear in many clinical trials in various areas of medicine, where the outcome may not be observed on a metric scale only. In some cases, the outcome can be observed only on an ordinal or even dichotomous scale. The success of two therapies can then only be assessed by comparing the outcomes of two arbitrarily selected patients from the two therapy groups as 'better', 'equal' or 'worse'. Now let $X$ denote the outcome of therapy $A$ and $Y$ the outcome of therapy $B$. Then, for the three potential results

1. $X > Y$ ($A$ better than $B$),

2. $X = Y$ ($A$ equal or comparable to $B$),

3. $X < Y$ ($A$ worse than $B$)

these outcomes can be quantified by the three probabilities $p^+ = P(X>Y)$, $p_0 = P(X=Y)$, and $p^- = P(X<Y)$, where $p^+ + p_0 + p^- = 1$. The outcomes $X$ and $Y$ can be measured or observed on an appropriate metric or ordinal scale.

To compare the underlying distributions of $X$ and $Y$, the Mann-Whitney test (1947) has been established for many decades. This test, which uses the effect $p^+ = P(X>Y)$ to test the hypothesis that both distributions are equal, was developed for the case of continuous distributions, i.e. for the case of no ties where $p_0 = 0$. The original Mann-Whitney test is consistent against alternatives of the form $p^+ \neq 1/2$. Later, Putter (1955) considered the case where ties are also admitted ($p_0 > 0$) and showed that this modified test is based on the quantity

$$\theta = p^+ + \tfrac{1}{2}\,p_0 = P(X>Y) + \tfrac{1}{2}\,P(X=Y) \qquad (1.1)$$

and is consistent against alternatives of the form $\theta \neq 1/2$. Giving credit to Wilcoxon (1945, 1947), this test is also called the Wilcoxon-Mann-Whitney test (WMW-test). The quantity $\theta$ can be well interpreted as the probability that therapy $A$ is better than $B$ (plus $1/2$ times the probability that the two therapies are comparable). For the clinician, however, it is less comprehensible since it is not obvious what the benefit for a patient would be for a given value of $\theta$. For this reason, Noether (1987) introduced the effect $\lambda = p^+/p^-$ as a well comprehensible effect assuming continuous distributions ($p_0 = 0$). Unfortunately, the quantity $\lambda$ was denoted as 'odds-ratio' in that paper although these are odds, since $p^- = 1 - p^+$ for $p_0 = 0$. Moreover, as that paper appeared in a more theoretically oriented journal, this quantity was not noticed by practitioners and clinicians.

Fortunately, this idea was seized again by Pocock et al. (2012) as an intuitive and well comprehensible effect and was denoted as ’win-ratio’ (WR)

$$\lambda_{WR} = \frac{P(X>Y)}{P(X<Y)} = \frac{p^+}{p^-}.$$

Later, this quantity was suggested by Wang and Pocock (2016) also for general non-normal outcomes in clinical trials. Unlike Noether (1987), however, Wang and Pocock (2016) explicitly allowed for ties in the data. This means that ties are admitted, i.e. $p_0 > 0$, without including the term $p_0$ in the definition of the quantity $\lambda_{WR}$. Motivated by the consideration of effects for ordinal data, O’Brien and Castelloe (2006) suggested the quantity

$$\lambda_{WMW} = \frac{P(X>Y) + \frac{1}{2}P(X=Y)}{P(X<Y) + \frac{1}{2}P(X=Y)}$$

as a well interpretable effect but did not consider this quantity in more detail. Later, Dong et al. (2019) discussed a statistic based on $\hat\theta$, the estimator of the Mann-Whitney effect $\theta$ in its generalized version including the case of ties (Putter, 1955).

Since that time, the quantity $\theta$ has been given many different names in the various areas of application. For continuous distributions, Birnbaum and Klose (1957) considered a function which they called the 'relative distribution' of the two random variables. Since $p^+$ is the expectation of this function, it is called 'relative effect' with regard to Birnbaum and Klose (1957), see, for example, Brunner and Puri (1996, 2001) and references cited therein. This terminology points out that $p^+$ describes an effect of $X$ with respect to $Y$. Its extension $\theta$, which is also valid in case of ties, reduces to $p^+$ for continuous distributions since $p_0 = 0$ in this case.

When comparing two therapies $A$ and $B$, a success of $A$ in relation to $B$ can be described by the probability $\theta$, where $\theta > 1/2$ means a success of $A$ over $B$. Then the quantity $\theta/(1-\theta)$ is the chance to obtain a better result by applying $A$ instead of $B$. Therefore it shall be called success-odds (SO) and is denoted by

$$\lambda_{SO} = \frac{\theta}{1-\theta} = \frac{p^+ + \frac{1}{2}p_0}{p^- + \frac{1}{2}p_0} \qquad (1.3)$$

relating a success $\theta$ to the success-odds $\lambda_{SO}$. Basically, it is a simple modification of the win-ratio that adds half of the probability of ties, $p_0/2$, to both the numerator and the denominator, thereby extending the win-ratio (and in turn Noether’s ratio) to the case of ties. Note that $\theta$ quantifies the nonparametric effect of the WMW-test in case of ties and the consistency region of this test is given by $\theta \neq 1/2$.

It is the aim of this manuscript to investigate the properties of $\lambda_{WR}$ and $\lambda_{SO}$ in case of ties, since ties are included in the definition of $\lambda_{SO}$ but not in the definition of $\lambda_{WR}$.
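The definitions of the three probabilities, the relative effect, the win-ratio and the success-odds can be sketched in a few lines of Python for two discrete distributions on a common ordered scale. This is only a minimal illustration: the function name `effects_from_distributions` and the example probabilities are assumptions made here, not part of the original manuscript, and a larger category index is taken to mean a better result.

```python
import math


def effects_from_distributions(prob_x, prob_y):
    """Compute p+, p0, p-, theta, win-ratio and success-odds.

    prob_x[k] and prob_y[k] are the probabilities of the k-th ordered
    category for X and Y (hypothetical input format; a larger index
    is assumed to be the better result).
    """
    if len(prob_x) != len(prob_y):
        raise ValueError("both distributions need the same categories")
    p_plus = p_zero = p_minus = 0.0
    cum_x = cum_y = 0.0  # P(X < current category), P(Y < current category)
    for px, py in zip(prob_x, prob_y):
        p_plus += px * cum_y   # X in this category, Y strictly below
        p_minus += py * cum_x  # Y in this category, X strictly below
        p_zero += px * py      # both in the same category (tie)
        cum_x += px
        cum_y += py
    theta = p_plus + 0.5 * p_zero
    # The win-ratio is undefined for p- = 0; it is reported as infinity here.
    win_ratio = p_plus / p_minus if p_minus > 0 else math.inf
    success_odds = theta / (1.0 - theta) if theta < 1.0 else math.inf
    return {
        "p_plus": p_plus, "p_zero": p_zero, "p_minus": p_minus,
        "theta": theta, "win_ratio": win_ratio, "success_odds": success_odds,
    }


# Hypothetical three-category example (not taken from the manuscript).
example = effects_from_distributions([0.2, 0.3, 0.5], [0.5, 0.3, 0.2])
print(example)
```

Because ties enter only through `theta`, the two returned ratios coincide whenever `p_zero` is zero.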

## 2 Comparison of two Treatments

### 2.1 Illustration of P(X<Y)=0

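First we consider the simple case of two treatments $A$ and $B$ as explained in the introduction. In general, for $p^+ \ge p^-$, it holds for $\lambda_{WR}$ and $\lambda_{SO}$ that

$$\lambda_{WR} = \frac{P(X>Y)}{P(X<Y)} \;\ge\; \frac{P(X>Y) + \frac{1}{2}P(X=Y)}{P(X<Y) + \frac{1}{2}P(X=Y)} = \lambda_{SO}, \qquad (2.4)$$

where $P(X<Y) > 0$ must be assumed. Equality in (2.4) holds if and only if

1. either $p_0 = 0$ (i.e. no ties)

2. or $p^+ = p^-$ ($\lambda_{WR} = \lambda_{SO} = 1$).

Thus, in all other cases, $\lambda_{WR} > \lambda_{SO}$, by definition. In the sequel, the impact of ties on the WR and on the SO shall be demonstrated by means of some examples. In the first example it is demonstrated that $p^- = 0$ invalidates the WR $\lambda_{WR}$ but not the SO $\lambda_{SO}$.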
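The relation between the two quantities can be checked by a short calculation. The following derivation is a sketch added for illustration; the abbreviation $h = p_0/2$ is introduced here and is not part of the original notation. It assumes $p^- > 0$.

```latex
\begin{aligned}
\lambda_{WR} - \lambda_{SO}
  &= \frac{p^+}{p^-} - \frac{p^+ + h}{p^- + h}
   = \frac{p^+(p^- + h) - p^-(p^+ + h)}{p^-(p^- + h)} \\
  &= \frac{h\,(p^+ - p^-)}{p^-(p^- + h)} ,
  \qquad h = \tfrac{1}{2}\,p_0 .
\end{aligned}
```

The difference vanishes exactly if $p_0 = 0$ or $p^+ = p^-$, it is positive for $p^+ > p^-$ with ties present, and it changes sign for $p^+ < p^-$. In both directions the win-ratio lies further away from 1 than the success-odds as soon as ties occur.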

###### Example 2.1

(Pairwise comparisons of 3 treatments)   In this example, three distributions  and defined on an ordinal scale are compared. The ordinal categories are labeled by 1, 2 and 3 where the result is better than the results or and the result is better than . Let denote the distribution of the result for treatment , for treatment , and for treatment . The probabilities for the results and of the discrete distributions , are displayed in Table 1.

###### Table 1

The results for the treatments , and are described by the distributions , and with probabilities , and for the discrete outcomes , and .

Treatment Probabilities

The values of the relative effect , as well as of the effects and for the pairwise comparisons of the treatments , and are listed in Table 2.

###### Table 2

Pairwise comparisons of the treatments , and by means of the related relative effect , the SO , and the WR .

Comparison

Obviously, it can be seen from Table 1 that treatment is much better than treatment and also better than treatment while treatment is slightly better than . This is well characterized by the relative effect and by the SO while the WR is not able to reasonably describe the successes of the treatments. It shall be noted that in the present example the pairwise comparisons cannot lead to non-transitive decisions since the three distributions are stochastically ordered. This is immediately seen from Table 1 where for all .

### 2.2 Metric and Ordinal Data

###### Example 2.2

(Coarsening of the measurement scale)  It shall be demonstrated by this example that a coarsening of the measurement scale can lead to an increase of the WR while the relative effect and the SO may remain unchanged. When coarsening the measurement scale, the means in case of metric data, and in turn their differences, as well as the relative effects may change. Therefore, the distributions in this example are chosen in such a way that the means in treatment $A$ and in treatment $B$ remain unchanged in the three steps of the coarsening. In the same way, the relative effects $\theta$ remain unchanged, which implies that the SO $\lambda_{SO}$ also remain unchanged. The proportion of ties, however, increases in the three steps of the coarsening, which leads to an increase of the WR $\lambda_{WR}$. The coarsening of the measurements in the three steps was performed by rounding the measurements as described in the following table.

###### Table 3

Description of rounding measurements in three steps.
Case (1) The measurements are observed with an accuracy of one place after the decimal point.
Case (2) The measurements are rounded to integers.
Case (3) The measurements within the interval [3, 4] are rounded to the mean 3.5 of this interval while the other values remain integers.
Case (4) The measurements within the interval [2, 5] are rounded to the mean 3.5 of this interval while the other values remain integers.

Neither a parametric nor a nonparametric model yields different treatment effects, since the means in the treatments $A$ and $B$ (and in turn their differences) as well as the relative effects remain the same in the three steps of the coarsening. Thus, the SO are identical in all steps. The WR, however, increases from 2.125 to 2.8 and becomes infinite in the last step. The measurements and their coarsening are listed in Table 4 along with the means for the treatments $A$ and $B$.

###### Table 4

Measurements for the treatments $A$ and $B$ (first row) and the same measurements rounded as described above (rows 2 to 4).

| Case | Treatment A $(x_1,\ldots,x_5)$ | Treatment B $(y_1,\ldots,y_5)$ | Mean A | Mean B |
|---|---|---|---|---|
| 1 | 1.7 3.3 3.8 4.9 6.3 | 1.4 1.6 2.7 4.3 5.0 | 4 | 3 |
| 2 | 2 3 4 5 6 | 1 2 3 4 5 | 4 | 3 |
| 3 | 2 3.5 3.5 5 6 | 1 2 3.5 3.5 5 | 4 | 3 |
| 4 | 3.5 3.5 3.5 3.5 6 | 1 3.5 3.5 3.5 3.5 | 4 | 3 |

The proportion of ties $p_0$, the differences of the means, the relative effects $\theta$, the SO $\lambda_{SO}$ and the WR $\lambda_{WR}$ are listed in Table 5.

###### Table 5

Changes of the WR $\lambda_{WR}$ for the comparison of the treatments $A$ and $B$ when coarsening the measurement scale, where the proportion of ties $p_0$ is increased while the means as well as the relative effects $\theta$ remain unchanged.

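| Case | $p_0$ | Diff. | Relative Effect $\theta$ | SO $\lambda_{SO}$ | WR $\lambda_{WR}$ |
|---|---|---|---|---|---|
| 1 | 0.00 | 1 | 0.68 | 2.125 | 2.125 |
| 2 | 0.16 | 1 | 0.68 | 2.125 | 2.5 |
| 3 | 0.24 | 1 | 0.68 | 2.125 | 2.8 |
| 4 | 0.64 | 1 | 0.68 | 2.125 | $\infty$ |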
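The entries of Table 5 can be recomputed from the measurements in Table 4 by counting wins, ties and losses over all 25 pairs. The following Python sketch does this; the helper names `pairwise_counts` and `sample_effects` are chosen here for illustration, and reporting the win-ratio as infinity when there are no losses is a convention of this sketch.

```python
import math


def pairwise_counts(x, y):
    """Count pairs with x_i > y_j (wins), x_i == y_j (ties), x_i < y_j (losses)."""
    wins = sum(xi > yj for xi in x for yj in y)
    ties = sum(xi == yj for xi in x for yj in y)
    losses = len(x) * len(y) - wins - ties
    return wins, ties, losses


def sample_effects(x, y):
    """Return (proportion of ties, theta, success-odds, win-ratio)."""
    wins, ties, losses = pairwise_counts(x, y)
    n_pairs = len(x) * len(y)
    theta = (wins + 0.5 * ties) / n_pairs
    success_odds = theta / (1.0 - theta) if theta < 1.0 else math.inf
    win_ratio = wins / losses if losses > 0 else math.inf
    return ties / n_pairs, theta, success_odds, win_ratio


# Measurements of Table 4: case -> (treatment A, treatment B)
cases = {
    1: ([1.7, 3.3, 3.8, 4.9, 6.3], [1.4, 1.6, 2.7, 4.3, 5.0]),
    2: ([2, 3, 4, 5, 6], [1, 2, 3, 4, 5]),
    3: ([2, 3.5, 3.5, 5, 6], [1, 2, 3.5, 3.5, 5]),
    4: ([3.5, 3.5, 3.5, 3.5, 6], [1, 3.5, 3.5, 3.5, 3.5]),
}

results = {case: sample_effects(x, y) for case, (x, y) in cases.items()}
for case, (p0, theta, so, wr) in results.items():
    print(f"case {case}: p0={p0:.2f} theta={theta:.2f} SO={so:.3f} WR={wr:.3f}")
```

In case 4 no pair is a loss for treatment A, so the win-ratio cannot be computed as a finite number, whereas the success-odds are unaffected.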
###### Example 2.3

(Combining ordinal categories)  In this example it is demonstrated how the WR might change if in an ordinal scale involving 6 ordinal categories 1, 2, 3, 4, 5, 6 the three categories 3, 4, 5 are combined in a new category 4. The relative effect and the SO remain unchanged in this case.

The probabilities of the results $X$ (treatment $A$) and of the results $Y$ (treatment $B$) are displayed in the upper part of Table 6; the proportion of ties $p_0$, the relative effect $\theta$, the SO $\lambda_{SO}$ as well as the WR $\lambda_{WR}$ are displayed in the lower part of Table 6. It may be noted that here $\lambda_{WR} > \lambda_{SO}$ by definition since $p_0 > 0$, according to the explanations in Section 2.1.

###### Table 6

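Probabilities for the ordinal scores 1 to 6 for the two treatments $A$ and $B$, the proportion of ties $p_0$, the relative effect $\theta$, the SO $\lambda_{SO}$, and the WR $\lambda_{WR}$.

| Treatment | Score 1 | Score 2 | Score 3 | Score 4 | Score 5 | Score 6 |
|---|---|---|---|---|---|---|
| A | 0.0 | 0.1 | 0.2 | 0.3 | 0.2 | 0.2 |
| B | 0.3 | 0.3 | 0.1 | 0.2 | 0.1 | 0.0 |

| $p_0 = P(X=Y)$ | $\theta$ | $\lambda_{SO}$ | $\lambda_{WR}$ |
|---|---|---|---|
| 0.13 | 0.805 | 4.13 | 5.69 |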

The probabilities for the results $X$ (treatment $A$) and for the results $Y$ (treatment $B$) after combining the categories 3, 4, 5 into a new category 4 are listed in the upper part of Table 7. In the lower part of Table 7, the proportion of ties $p_0$, the relative effect $\theta$, the SO $\lambda_{SO}$ and the WR $\lambda_{WR}$ are listed for the new categories. Compared with Table 6, the proportion of ties increased from 0.13 to 0.31 while the relative effect, and in turn the SO, remained unchanged, but the WR increased from 5.69 to 16.25.

###### Table 7

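Probabilities of the combined ordinal scores 1, 2, 4, 6 for the two treatments $A$ and $B$ as well as the proportion of ties $p_0$, the relative effect $\theta$, the SO $\lambda_{SO}$, and the WR $\lambda_{WR}$.

| Treatment | Score 1 | Score 2 | Score 4 | Score 6 |
|---|---|---|---|---|
| A | 0.0 | 0.1 | 0.7 | 0.2 |
| B | 0.3 | 0.3 | 0.4 | 0.0 |

| $p_0 = P(X=Y)$ | $\theta$ | $\lambda_{SO}$ | $\lambda_{WR}$ |
|---|---|---|---|
| 0.31 | 0.805 | 4.13 | 16.25 |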
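The effect of combining categories can be reproduced numerically from the probabilities in Table 6. The sketch below merges the scores 3, 4, 5 and recomputes all quantities; the helper names `ordinal_effects` and `merge_categories` are illustrative choices, and a larger score is taken to be the better result, which is the orientation that reproduces the tabulated values.

```python
def ordinal_effects(prob_x, prob_y):
    """Return (p0, theta, success-odds, win-ratio) for two discrete
    distributions on the same ordered categories."""
    p_plus = p_zero = p_minus = 0.0
    cum_x = cum_y = 0.0
    for px, py in zip(prob_x, prob_y):
        p_plus += px * cum_y   # P(X in category, Y strictly below)
        p_minus += py * cum_x  # P(Y in category, X strictly below)
        p_zero += px * py      # P(tie in this category)
        cum_x += px
        cum_y += py
    theta = p_plus + 0.5 * p_zero
    # Assumes p- > 0 and theta < 1, which holds for the data used below.
    return p_zero, theta, theta / (1.0 - theta), p_plus / p_minus


def merge_categories(probs, groups):
    """Sum the probabilities of the index groups, keeping their order."""
    return [sum(probs[i] for i in group) for group in groups]


# Table 6: scores 1..6 are stored at indices 0..5
prob_a = [0.0, 0.1, 0.2, 0.3, 0.2, 0.2]
prob_b = [0.3, 0.3, 0.1, 0.2, 0.1, 0.0]

# Scores 3, 4, 5 (indices 2, 3, 4) are combined into one category
groups = [[0], [1], [2, 3, 4], [5]]

before = ordinal_effects(prob_a, prob_b)
after = ordinal_effects(merge_categories(prob_a, groups),
                        merge_categories(prob_b, groups))
print("six categories :", [round(v, 3) for v in before])
print("four categories:", [round(v, 3) for v in after])
```

Merging moves probability mass from wins and losses into ties in equal amounts here, which leaves $\theta$ untouched but changes the ratio of wins to losses.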

### 2.3 Dichotomous Data

In case of binary data for the treatments $A$ and $B$ with success probabilities $q_A$ and $q_B$, the quantities WR and SO are given by

$$\lambda_{WR} = \frac{q_A(1-q_B)}{q_B(1-q_A)} \quad\text{and}\quad \lambda_{SO} = \frac{q_A(1-q_B) + p_0/2}{q_B(1-q_A) + p_0/2},$$

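where $p_0 = q_A q_B + (1-q_A)(1-q_B) > 0$ and thus, by definition, $\lambda_{WR} > \lambda_{SO}$ whenever $q_A > q_B$. In this particular case, $\lambda_{SO}$ may be considerably smaller than $\lambda_{WR}$ which, in case of dichotomous data, equals the well-known odds-ratio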

$$OR(A,B) = \frac{q_A}{1-q_A} \Big/ \frac{q_B}{1-q_B} = \frac{q_A(1-q_B)}{q_B(1-q_A)},$$

which is the ratio of the odds of success of both treatments $A$ and $B$, while $\lambda_{SO}$ is based on the well-accepted Mann-Whitney effect (relative effect) $\theta$ in its generalized form (Putter, 1955), which includes the case of ties.

###### Example 2.4

The aim of this example is to investigate whether the Win-Ratio $\lambda_{WR}$ (or the Odds-Ratio OR) and the Success-Odds $\lambda_{SO}$ are intuitive and well interpretable quantities to describe a treatment effect of a therapy $A$ with respect to a therapy $B$ in case of dichotomous data. The success rates $q_A$ and $q_B$ as well as the failure rates $1-q_A$ and $1-q_B$ are displayed in Figure 1.

###### Figure 1

The results of the two treatments $A$ and $B$ with dichotomous endpoints are displayed in two graphs, each with its own pair of success probabilities $q_A$ and $q_B$. In the left-hand graph a clear difference between the successes of both therapies can be seen, while in the right-hand graph nearly no difference between the two treatments can be recognized. Moreover, in the right-hand graph a considerable proportion of the results are identical, i.e. tied. These circumstances, however, are not depicted by the Win-Ratio since $\lambda_{WR}$ takes the same value in both cases. In contrast, the Success-Odds intuitively depicts the actual situation since $\lambda_{SO}$ is larger in the left graph than in the right graph.

It appears that, for dichotomous data, the win-ratio neither provides an intuitive and well interpretable quantification of a treatment effect nor depicts an intuitive success of therapy $A$ over therapy $B$. In the sequel this is demonstrated by another example involving dichotomous data.

###### Example 2.5

Consider the case where the success rate of therapy $A$ is increased from $q_A = 0.9$ to $q_A = 0.95$ while the success rate $q_B$ of therapy $B$ is kept fixed. Moreover, the percentage of ties $p_0$ remains nearly constant when $q_B$ is fixed. The results are listed in the following table.

###### Table 8

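Comparison between the Win-Ratio $\lambda_{WR}$ and the Success-Odds $\lambda_{SO}$ to intuitively depict a superiority of therapy $A$ over therapy $B$.

| $q_A$ | $q_B$ | $p_0$ | $\lambda_{WR}$ | $\lambda_{SO}$ |
|---|---|---|---|---|
| 0.9 | 0.5 | 0.5 | 9.0 | 2.3 |
| 0.95 | 0.5 | 0.5 | 19.0 | 2.6 |
| 0.9 | 0.6 | 0.58 | 6.0 | 1.9 |
| 0.95 | 0.6 | 0.59 | 12.7 | 2.1 |
| 0.9 | 0.7 | 0.66 | 3.9 | 1.5 |
| 0.95 | 0.7 | 0.68 | 8.1 | 1.7 |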

It appears from Table 8 that the Win-Ratio is approximately doubled, independently of the success rate of therapy $B$, if the success rate of therapy $A$ is slightly increased from 0.9 to 0.95. In a graphical representation, this difference would hardly be recognized.
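The values in Table 8 follow directly from the formulas for dichotomous data given at the beginning of this section. A minimal Python sketch, assuming independent binary outcomes with success probabilities $q_A$ and $q_B$ (the function name `binary_effects` is an illustrative choice):

```python
import math


def binary_effects(q_a, q_b):
    """Return (p0, win-ratio, success-odds) for two binary outcomes."""
    wins = q_a * (1.0 - q_b)    # A succeeds, B fails
    losses = q_b * (1.0 - q_a)  # B succeeds, A fails
    p_zero = q_a * q_b + (1.0 - q_a) * (1.0 - q_b)  # both equal
    win_ratio = wins / losses if losses > 0 else math.inf
    denominator = losses + 0.5 * p_zero
    success_odds = (wins + 0.5 * p_zero) / denominator if denominator > 0 else math.inf
    return p_zero, win_ratio, success_odds


for q_b in (0.5, 0.6, 0.7):
    for q_a in (0.9, 0.95):
        p0, wr, so = binary_effects(q_a, q_b)
        print(f"qA={q_a:.2f} qB={q_b:.1f} p0={p0:.2f} WR={wr:.1f} SO={so:.1f}")
```

Since the win-ratio equals the odds-ratio here, it reacts strongly to small changes of $q_A$ close to 1, whereas the success-odds change only moderately.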

In conclusion, it appears that in the case of dichotomous data, the win-ratio loses its appealing property to provide an intuitive quantification of a therapy effect as a chance to obtain a better result by applying therapy $A$ instead of therapy $B$.

In the next section, the conclusions from the examples presented in the previous sections shall be summarized and discussed.

### 2.4 Discussion of the Win-Ratio for Two Samples

Basically, the idea of the win-ratio to provide an intuitive and well-interpretable effect when the result of a therapy can only be assessed by ’better’, ’worse’ or ’comparable’ is to be welcomed. However, the proportion of ties (comparable results) must be included in its definition since ties are allowed in the model. Otherwise, this quantity has some annoying properties.

1. The computation of the WR breaks down if $p^- = 0$, while the SO also covers this case and can only break down if $p^- = 0$ and $p_0 = 0$, i.e. in the trivial case of a one-point distribution (see the discussion in Section 2.1).

2. It is counterintuitive that an effect can increase if the measurements are less precise or the data are observed less accurately. This is demonstrated in Examples 2.2 and 2.3. Such a property would also open the door to manipulation.

3. In case of dichotomous data, the win-ratio loses its appealing property to provide an intuitive quantification of a therapy effect in general. This, however, was the basic idea of the win-ratio. An example is discussed in Section 2.3.

Thus, the nice idea of the win-ratio should only be used in its modified or improved form of the success-odds, which already appeared in the literature in the conference paper by O’Brien and Castelloe (2006), unfortunately without any further discussion. It extends Noether’s idea to provide an intuitive treatment effect for the Mann-Whitney test to the case of ties. Also, Dong et al. (2019) as well as Gasparyan and Koch (2019) considered the success-odds but did not discuss the drawbacks of the win-ratio in case of ties. These drawbacks were first considered in detail in the talk by Brunner (2019) at the fall-workshop of the working-group ’Statistical Methods in Medical Research’ of the IBS / DR in Hamburg on November 22, 2019.

To summarize this discussion, the success-odds can be recommended as an improved version of the win-ratio. Therefore, the next section briefly discusses tests and confidence intervals for the success-odds $\lambda_{SO}$ and, for completeness, also for the win-ratio $\lambda_{WR}$.

### 2.5 Tests and Confidence Intervals for λWR and λSO

#### 2.5.1 Win-Ratio λWR

The asymptotic distribution of the estimator $\hat\lambda_{WR}$ and confidence intervals for $\lambda_{WR}$ have been derived by Bebu and Lachin (2016) and by Dong, Li, Ballerstedt, and Vandemeulebroecke (2016), where R- and SAS-programs to perform the computations are also provided.

#### 2.5.2 Success-Odds λSO

Estimators for the relative effect $\theta$ in (1.1) are available from the literature. A test of the hypothesis $H_0: \theta = 1/2$ in a general model that also includes the case of ties is considered by Brunner and Munzel (2000), for example. This is known as the nonparametric Behrens-Fisher problem.

It may be noted that the hypothesis $H_0: \theta = 1/2$ is equivalent to $H_0: \lambda_{SO} = 1$. For more details we refer to Section 3.5 of the textbook by Brunner, Bathke, and Konietschke (2019), where a range-preserving confidence interval for $\theta$ is also derived in Section 3.7.2. This can easily be extended to the success-odds $\lambda_{SO}$ by applying the transformation $\log[\theta/(1-\theta)]$ using Cramér’s $\delta$-theorem and then back-transforming to $\lambda_{SO}$ by the exponential function. The R-package rankFD (CRAN), which performs the computations of these quantities, is described in Section A.2.2 of this book.
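The construction of a range-preserving interval for the success-odds can be illustrated as follows. This Python sketch is not the rankFD implementation and should be read as an assumption-laden outline: it estimates $\theta$ from all pairwise comparisons, uses a placement-based variance estimate in the spirit of Brunner and Munzel (2000), applies the $\delta$-method on the logit scale, and uses a normal quantile instead of a small-sample approximation. The function name `success_odds_ci` is hypothetical.

```python
import math
from statistics import NormalDist, variance


def success_odds_ci(x, y, level=0.95):
    """Return (theta_hat, so_hat, lower, upper) for the success-odds.

    Sketch only: normal approximation on the logit scale; both samples
    need at least two observations and 0 < theta_hat < 1 is required.
    """
    n_x, n_y = len(x), len(y)

    def score(a, b):
        # 1 for a win of a over b, 1/2 for a tie, 0 for a loss
        return 1.0 if a > b else 0.5 if a == b else 0.0

    # Placements: average score of each observation against the other sample
    place_x = [sum(score(xi, yj) for yj in y) / n_y for xi in x]
    place_y = [sum(score(xi, yj) for xi in x) / n_x for yj in y]

    theta = sum(place_x) / n_x
    if not 0.0 < theta < 1.0:
        raise ValueError("logit scale is undefined for theta_hat equal to 0 or 1")

    # Estimated variance of theta_hat from the two sets of placements
    var_theta = variance(place_x) / n_x + variance(place_y) / n_y

    # Delta method: d/dtheta log(theta / (1 - theta)) = 1 / (theta (1 - theta))
    se_logit = math.sqrt(var_theta) / (theta * (1.0 - theta))
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    logit = math.log(theta / (1.0 - theta))

    # Back-transformation keeps both limits inside (0, infinity)
    return (theta, math.exp(logit),
            math.exp(logit - z * se_logit), math.exp(logit + z * se_logit))


# Measurements of case 1 in Table 4
x = [1.7, 3.3, 3.8, 4.9, 6.3]
y = [1.4, 1.6, 2.7, 4.3, 5.0]
print(success_odds_ci(x, y))
```

With only five observations per group the normal quantile is a crude choice; it is used here solely to keep the sketch short.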

## 3 Comparison of Several Distributions

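Pairwise comparisons using procedures based on the relative effect $\theta$ may lead to non-transitive decisions. This is well-known for the Wilcoxon-Mann-Whitney test, for example, and also holds true for the quantities $\lambda_{WR}$ and $\lambda_{SO}$. This shall be demonstrated by the so-called tricky dice (see, e.g., Peterson, 2002 or Gardner, 1970). For example, the following three dice

| Die | Faces |
|---|---|
| $D_1$ | 1 4 5 6 7 7 |
| $D_2$ | 3 3 4 5 6 9 |
| $D_3$ | 1 2 2 8 8 9 |

lead to the pairwise comparisons

1. $D_1$ versus $D_2$: $\theta = 0.57$, $\lambda_{SO} = 1.32$, $\lambda_{WR} = 1.36$,

2. $D_2$ versus $D_3$: $\theta = 0.57$, $\lambda_{SO} = 1.32$, $\lambda_{WR} = 1.33$,

3. $D_3$ versus $D_1$: $\theta = 0.57$, $\lambda_{SO} = 1.32$, $\lambda_{WR} = 1.33$,

which means that die $D_1$ is better than $D_2$, die $D_2$ is better than $D_3$, and that finally die $D_3$ is better than $D_1$. A solution of this non-transitivity problem might be to compare each die with a common casino-type die, for example a roller with 18 faces showing each of the numbers 1 to 9 twice, representing a mixture of all three dice. This is basically the principle underlying the Kruskal-Wallis test, which compares each distribution with a weighted mean distribution of all distributions involved. In the example presented above one obtains $\lambda_{SO} = 1$ for each die since in all cases $\theta = 1/2$. For a different common casino-type die, of course, one could obtain a different result.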
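The non-transitivity of the three dice, and its disappearance when each die is compared with the mixture of all three, can be verified by enumerating all pairs of faces. The sketch below uses exact fractions; the function name `compare` and the variable `roller` are illustrative, and the code assumes that the second die wins at least once so that the win-ratio is defined.

```python
from fractions import Fraction
from itertools import product

DICE = {
    "D1": (1, 4, 5, 6, 7, 7),
    "D2": (3, 3, 4, 5, 6, 9),
    "D3": (1, 2, 2, 8, 8, 9),
}


def compare(first, second):
    """Return (theta, success-odds, win-ratio) of `first` against `second`."""
    wins = sum(a > b for a, b in product(first, second))
    losses = sum(a < b for a, b in product(first, second))
    n_pairs = len(first) * len(second)
    ties = n_pairs - wins - losses
    theta = Fraction(2 * wins + ties, 2 * n_pairs)
    # Requires losses > 0 and theta < 1, which holds for these dice.
    return theta, theta / (1 - theta), Fraction(wins, losses)


# Cyclic pairwise comparisons: each die beats the next one
for left, right in (("D1", "D2"), ("D2", "D3"), ("D3", "D1")):
    theta, so, wr = compare(DICE[left], DICE[right])
    print(f"{left} vs {right}: theta={float(theta):.2f} "
          f"SO={float(so):.2f} WR={float(wr):.2f}")

# Common reference: an 18-faced roller carrying all faces of the three dice
roller = DICE["D1"] + DICE["D2"] + DICE["D3"]
for name, faces in DICE.items():
    theta, so, wr = compare(faces, roller)
    print(f"{name} vs roller: theta={theta} SO={so}")
```

Against the common roller every die has relative effect 1/2, so no die is preferred, in contrast to the cyclic pairwise results.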

## 4 Stratified Designs

When using a stratified version of the Wilcoxon-Mann-Whitney test, for example van Elteren’s test (1960), a similar paradoxical decision might happen. An example is given in Thangavelu and Brunner (2007). This is briefly described below.

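| Stratum (i) | Therapy A | Therapy B | $\theta$ | $\lambda_{SO}$ | $\lambda_{WR}$ |
|---|---|---|---|---|---|
| 1 | $D_1$ | $D_2$ | 0.57 | 1.32 | 1.36 |
| 2 | $D_2$ | $D_3$ | 0.57 | 1.32 | 1.33 |
| 3 | $D_3$ | $D_1$ | 0.57 | 1.32 | 1.33 |
| Means | $\bar{D}_A = \bar{D}_B$ | | 0.57 | 1.32 | 1.34 |

⇒ Therapy A > B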

Since the means $\bar{D}_A$ and $\bar{D}_B$ are averaged over the same three distributions on which the faces of the dice are based, it follows that $\bar{D}_A = \bar{D}_B$ and thus $\theta = 1/2$ and $\lambda_{SO} = 1$. Thus, both therapies have equal successes. The means of the stratified versions of the relative effects, the success-odds, and the win-ratios averaged over the three strata, however, demonstrate a superiority of therapy $A$ (0.57, 1.32 and 1.34, respectively) over therapy $B$. In some sense, this is similar to Simpson’s paradox and is explained by the non-transitivity of the pairwise comparisons of the dice. Thus, different procedures must be developed for stratified designs, which are beyond the scope of this manuscript and shall be discussed elsewhere.
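The paradox can be made concrete by computing the stratum-wise quantities, their averages, and the comparison of the two therapies after pooling the strata. The following Python sketch uses exact fractions; treating the pooled faces of all strata as the overall distribution of a therapy is an interpretation made here for illustration, and the names `effects` and `average` are hypothetical.

```python
from fractions import Fraction
from itertools import product

D1 = (1, 4, 5, 6, 7, 7)
D2 = (3, 3, 4, 5, 6, 9)
D3 = (1, 2, 2, 8, 8, 9)

# Stratum i -> (distribution under therapy A, distribution under therapy B)
STRATA = [(D1, D2), (D2, D3), (D3, D1)]


def effects(a, b):
    """Return (theta, success-odds, win-ratio) of a against b."""
    wins = sum(x > y for x, y in product(a, b))
    losses = sum(x < y for x, y in product(a, b))
    n_pairs = len(a) * len(b)
    ties = n_pairs - wins - losses
    theta = Fraction(2 * wins + ties, 2 * n_pairs)
    # Requires losses > 0 and theta < 1, which holds for these dice.
    return theta, theta / (1 - theta), Fraction(wins, losses)


def average(values):
    values = list(values)
    return sum(values, Fraction(0)) / len(values)


# Averages of the stratum-wise effects (last row of the table above)
per_stratum = [effects(a, b) for a, b in STRATA]
mean_theta, mean_so, mean_wr = (average(column) for column in zip(*per_stratum))
print("stratum means:", float(mean_theta), float(mean_so), float(mean_wr))

# Pooling the strata: both therapies are based on the same three dice
pool_a = tuple(face for a, _ in STRATA for face in a)
pool_b = tuple(face for _, b in STRATA for face in b)
pooled_theta, pooled_so, pooled_wr = effects(pool_a, pool_b)
print("pooled:", pooled_theta, pooled_so, pooled_wr)
```

The stratum averages all point to therapy A, while the pooled comparison shows no difference at all, which is the contradiction described above.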

## 5 Discussion and Outlook

The idea of the win-ratio to provide a well interpretable and clear effect for the clinician is excellent and to be welcomed. The quantity $\lambda_{WR}$ as it stands, however, has some strange and undesirable properties. Thus, the win-ratio should be slightly modified. Such a modification $\lambda_{SO}$, called ’success-odds’, is suggested here and it has been demonstrated that $\lambda_{SO}$ does not have the drawbacks of the win-ratio in case of ties. Moreover, theoretical results are available from the literature by which the asymptotic distribution of an estimator of the success-odds is easily obtained. Thus, a test of the hypothesis $H_0: \lambda_{SO} = 1$ as well as a confidence interval for $\lambda_{SO}$ can be derived using Cramér’s $\delta$-theorem (see, e.g., Brunner et al., 2019, Sections 3.5, 3.7.2, 7.4, 7.5, and 7.6.1).

The generalization to several samples and stratified designs, however, is not straightforward since decisions based on $\lambda_{WR}$ or $\lambda_{SO}$ may be non-transitive, as briefly demonstrated by counter-examples in Sections 3 and 4. Reasonable extensions of the success-odds to several samples, stratified and factorial designs are currently under investigation.

## 6 Acknowledgment and Remarks

The topic of this manuscript had been presented in a talk by the author at the fall-workshop of the working-group on ’Statistical Methods in Medicine’ of the IBS/DR in Hamburg on November 22 in 2019. The author would like to thank the audience of this workshop for helpful comments and remarks. A handout in German language to that talk was available for the workshop. The present English version is based on this handout.

## 7 References

Bebu, I., Lachin, J.M. (2016). Large sample inference for a win ratio analysis of a composite outcome based on prioritized components. Biostatistics 17, 178–187.

Birnbaum, Z. W. and Klose, O. M. (1957). Bounds for the Variance of the Mann-Whitney Statistic. Annals of Mathematical Statistics 28, 933–945.

Brunner, E. (2019). Win-Ratio und Mann-Whitney-Odds. Talk presented at the fall-workshop of the working-group ’Statistical Methods in Medical Research’ of the IBS / DR in Hamburg on November 22, 2019.

Brunner, E., Bathke, A. C., and Konietschke, F. (2019). Rank- and Pseudo-Rank Procedures in Factorial Designs - Using R and SAS - Independent Observations. Springer Series in Statistics, Springer, Heidelberg.

Brunner, E. and Munzel, U. (2000). The Nonparametric Behrens-Fisher Problem: Asymptotic Theory and a Small-Sample Approximation. Biometrical Journal 42, 17–25.

Brunner, E. and Puri, M. L. (1996). Nonparametric methods in design and analysis of experiments. Handbook of Statistics (S. Ghosh and C.R. Rao, Eds.) 13, 631–703.

Brunner, E. and Puri, M. L. (2001). Nonparametric Methods in Factorial Designs. Statistical Papers 42, 1–52.

Dong, G., Li, D., Ballerstedt, S., and Vandemeulebroecke, M. (2016). A generalized analytic solution to the win ratio to analyze a composite endpoint considering the clinical importance order among components. Pharmaceutical Statistics 15, 430–437.

Dong, G., Hoaglin, D.C., Qiu, J., Matsouaka, R.A., Chang, Y.-W., Wang, J., Vandemeulebroecke, M. (2019). The Win Ratio: On Interpretation and Handling of Ties. Statistics in Biopharmaceutical Research, DOI: 10.1080/19466315.2019.1575279

Gardner, M. (1970). The paradox of the nontransitive dice and the elusive principle of indifference. Scientific American 223, 110–114.

Gasparyan, S.B., Folkvaljon, F., Bengtsson, O., Buenconsejo, J., Koch, G.G. (2019). Adjusted Win Ratio with Stratification: Calculation Methods and Interpretation. arXiv:1912.09204v1 [stat.ME] 19 Dec 2019.

Mann, H. B. and Whitney, D. R. (1947). On a test of whether one of two random variables is stochastically larger than the other. Annals of Mathematical Statistics 18, 50–60.

Noether, G. E. (1987). Sample Size Determination for Some Common Nonparametric Tests. Journal of the American Statistical Association 82, 645–647.

O’Brien, R. G., and Castelloe, J. M. (2006). Exploiting the Link Between the Wilcoxon-Mann-Whitney Test and a Simple Odds Statistic. In: Proceedings of the Thirty-First Annual SAS Users Group International Conference, Cary, NC: SAS Institute Inc.

Peterson, I. (2002). Tricky Dice Revisited. Science News 161,
http://www.sciencenews.org/article/tricky-dice-revisited.

Pocock, S.J., Ariti, C.A., Collier, T.J., Wang, D. (2012). The win ratio: a new approach to the analysis of composite endpoints in clinical trials based on clinical priorities. European heart journal 33, 176–182.

Putter, J. (1955). The Treatment of Ties in Some Nonparametric Tests. The Annals of Mathematical Statistics 26, 368–386.

Thangavelu, K. and Brunner, E. (2007). Wilcoxon Mann-Whitney Test for Stratified Samples and Efron’s Paradox Dice. Journal of Statistical Planning and Inference 137, 720–737.

Van Elteren, P. H. (1960). On the Combination of Independent Two-Sample Tests of Wilcoxon. Bulletin of the International Statistical Institute 37, 351–361.

Wang, D., Pocock, S.J. (2016). A win ratio approach to comparing continuous non-normal outcomes in clinical trials. Pharmaceutical Statistics 15, 238–245.

Wilcoxon, F. (1945). Individual comparisons by ranking methods. Biometrics Bulletin 1, 80–83.

Wilcoxon, F. (1947). Probability Tables for Individual Comparisons by Ranking Methods. Biometrics 3, 119–122.