# p-Value as the Strength of Evidence Measured by Confidence Distribution

The notion of p-value is a fundamental concept in statistical inference and has been widely used for reporting outcomes of hypothesis tests. However, the p-value is often misinterpreted, misused or miscommunicated in practice. Part of the issue is that existing definitions of the p-value are often derived from constructions under specific settings, and a general definition that directly reflects the evidence for the null hypothesis is not yet available. In this article, we first propose a general and rigorous definition of the p-value that fulfills two performance-based characteristics. The performance-based definition subsumes all existing construction-based definitions of the p-value and justifies their interpretations. The paper further presents a specific approach based on confidence distribution to formulate and calculate p-values. This way of computing p-values has two main advantages. First, it is applicable to a wide range of hypothesis testing problems, including the standard one- and two-sided tests, tests with interval-type null, intersection-union tests, multivariate tests and so on. Second, it naturally leads to a coherent interpretation of the p-value as evidence in support of the null hypothesis, as well as a meaningful measure of the degree of such support. In particular, it gives meaning to a large p-value, e.g. a p-value of 0.8 indicates more support than one of 0.5. Numerical examples are used to illustrate the wide applicability and computational feasibility of our approach. We show that our proposal is effective and can be applied broadly, without further consideration of the form/size of the null space. For such problems, solutions from existing testing methods have either not been available or cannot be easily obtained.


## 1 Introduction

The p-value is one of the most popular statistical inference tools. It is widely used in decision-making processes concerning data analysis in many domains. For example, Chavalarias et al. (2016) identified 4,572,043 p-values in 1,608,736 MEDLINE abstracts and 3,438,299 p-values in 385,393 PubMed Central full-text articles between 1990 and 2015. However, the p-value is frequently misused and misinterpreted in practice. For instance, the p-value is often misinterpreted either as the probability that the null hypothesis holds, or as the error rate that the null hypothesis is falsely rejected; cf. Berger (2003) and references therein. Many concerns have been raised about the practical issues of using p-values (see, e.g., Nuzzo (2014); Baker (2016); Wasserstein & Lazar (2016); Benjamin et al. (2017); Chawla (2017), among many others).

We speculate that the issues of the p-value may be partially due to the facts that the traditional p-value definitions are not rigorous (the desired features of the performance of the p-value are not clearly and mathematically presented) and that their interpretations are often not straightforward (i.e., the p-value is not interpreted as a measure of strength of the evidence obtained from the data in support of the null hypothesis). There is no clarification in the literature on whether a p-value provides any evidence for “accepting” the null, and the actual meaning of a non-small p-value is always missing. In particular, under the same settings, how do we interpret a p-value of 0.80 compared to another one, say, 0.50? So far, no precise answer is given. This is an important aspect for making inferences in practice, because most people rely on 0.05 as the threshold to make decisions, but many have also argued that the threshold should be a different value and given by domain experts (cf., e.g., Adibi et al., 2019).

The goal of this paper is to provide a broader perspective of the p-value that:

• gives us a more comprehensive understanding of the data, not only restricted to standard hypothesis testing problems (one- and two-sided tests);

• allows us to extract relevant information from the given dataset in terms of its evidence in support of a target hypothesis;

• can be readily used as a decision tool by comparing the p-value with a given significance level ($\alpha$), when a decision is needed.

For theoretical justifications, we propose a general and rigorous definition of the p-value characterized by two performance assessments. This formal definition directly relates to the logic behind the p-value development and subsumes almost all existing definitions. We then propose a concrete approach based on the concept of confidence distribution (CD) (cf., Xie & Singh (2013) and references therein), to formulate and calculate such p-values. The p-value calculated by CD satisfies the general performance-based definition. We show that this CD-based approach has several advantages:

• it provides an intuitive and meaningful interpretation of the p-value as the strength of evidence from the given data in support of a hypothesis, as well as a meaningful level of degree of such support (e.g. a p-value of 0.8 has more support than 0.5);

• it is applicable to a wide range of testing problems beyond the standard one- and two-sided tests, such as tests with odd-shaped null spaces;

• it enables us to obtain test results directly, without having to explicitly construct a test statistic and evaluate its sampling distribution under the null hypothesis.

### 1.1 A brief review of the p-value

While computations of p-values date back to the 1770s (Stigler, 1986), the p-value was first formally introduced by Pearson (1900) in Pearson’s chi-squared test for two by two tables. Starting in the 1920s, Fisher (1925, 1935a, 1935b) popularized it and made it an influential statistical inference tool.

The logic – The p-value is often used to quantify the statistical significance for “rejecting” the targeted statement (null hypothesis). The logic behind the p-value development is proof by “contradiction”:

• [L] Assuming the statement is true, the p-value is designed as an assessment to evaluate the chance, or how likely, the observed data is “compatible” with this statement. If the p-value is small, we “consider” there is a conflict (contradiction), which indicates that the statement fails to account for the whole of the facts (Fisher, 1925).

Unlike the usual proof by contradiction in (nonrandom) math problems, statistics deals with random phenomena and we rarely have 100% certainty that a decision (reject or do not reject the statement) is correct. To overcome this obstacle, the frequency (frequentist) argument is often adopted: rejecting a correct statement should be avoided the majority of the time. Here, the actual meaning of “majority” is linked to the chosen threshold value (significance level) $\alpha$, which is considered “small”. For instance, suppose a statement is correct; we hope that, in 100 tries, at least 95 times we make the correct decision of not rejecting the statement. Then we choose $\alpha = 0.05$ as the threshold and reject the statement if the p-value $\le 0.05$.

Textbook definitions – There are two standard ways of defining the p-value. Both are tied to a hypothesis testing problem (say, $H_0: \theta \in \Theta_0$ versus $H_A: \theta \in \Theta \setminus \Theta_0$) and a test statistic (say, $T(X)$), where $X$ denotes observable data having distribution indexed by $\theta \in \Theta$. Suppose $X = x$ is observed. The first way defines the p-value as an “upper bound probability” (cf., e.g., Abell et al. (1999)):

$$\mathrm{pval}_1(x) = \sup_{\theta \in \Theta_0} P_\theta\{T(X) \ge T(x)\}; \tag{1}$$

while the second way is based on the rejection region $R_\alpha$ of level $\alpha$ (cf., e.g., Lehmann & Romano (2005)):

$$\mathrm{pval}_2(x) = \inf\{\alpha : T(x) \in R_\alpha\}. \tag{2}$$

Both definitions have achieved successes in computing p-values in many real applications. However, several issues still exist (e.g., Sellke et al. (2001); Goodman (2008)). First, as a probability statement in appearance, (1) can easily lead to a widespread misunderstanding that the p-value is the probability that $H_0$ is true; while (2), which is based on significance levels (error rates), can cause a common confusion between the p-value and an error rate. Second, neither of them provides a direct connection or a clear evidence-based interpretation to the logic outlined in [L]. Specifically, (1) is often interpreted as: assuming $H_0$ is true, the probability that $T(X)$ is “at least as extreme (inconsistent with $H_0$) as” its observed value $T(x)$. Although logically correct, this interpretation is based on results that did not occur and are somehow inconsistent with $H_0$, which is indirect and antagonistic to our interest. In addition, the connection between (2) and [L] is vague and indirect, since the conditioning on $H_0$ is hidden. Furthermore, both definitions require the specific $T$ and/or $R_\alpha$ and often limit the constructions of p-values to the standard one- or two-sided tests. These constructions can be complicated or difficult when $T$ and/or $R_\alpha$ are difficult to get, or when the distribution of $T$ is not of standard form.

Performance-based characteristics – Directly following [L], we notice that two characteristics of the performance of the p-value are important, and also meet the common understandings in the literature. The first characteristic is that, when a statement (null hypothesis $H_0$) is true, the corresponding p-value is stochastically equal to or larger than Uniform[0,1], which is a formal statement of the consensus that the p-value typically follows a uniform distribution under $H_0$ (cf., e.g., Berger & Boos (1994); Liu & Singh (1997); Shafer et al. (2011)). This Uniform[0,1]-distributed characteristic is perfectly in line with [L] and suggests that if we repeatedly use the defined p-value as a tool and reject $H_0$ when the calculated value is smaller than $\alpha$, the probability of mistakenly rejecting a correct $H_0$ will be at most $100\alpha\%$. The second characteristic is that, when a statement is false, the corresponding p-value should get closer and closer to zero as the sample size increases. It can also be rephrased as: given a significance level $\alpha$, the probability of correctly rejecting a false $H_0$ (when p-value $\le \alpha$) will be close to one as long as the sample size is sufficient. This characteristic ensures that we will be able to tell apart $H_0$ and its complement as more and more information is collected. Indeed, the two characteristics above ensure the performance of a test using the p-value, in controlling Type-I error under the null and ensuring testing power under the alternative.

In the literature, there are several different definitions and interpretations of p-values other than the textbook versions. For example, Schervish (1996) discussed a unified version of the p-value for one-sided and two-sided tests in certain scenarios. Mudholkar & Chaubey (2009) introduced a generalized p-value definition which depends on a partial ordering of random variables and constructed p-values using the results under the Neyman-Pearson framework. Martin & Liu (2014) gave an interpretation by plausibility function under the framework of inferential model. See also Bickel & Doksum (1977); Tsui & Weerahandi (1989); Couso & Sanchez (2008); Patriota (2013) for other developments from different approaches. However, in all the above cases, neither an evidence-based interpretation of the p-value as the strength of evidence in support of the statement, nor a unified and rigorous formulation of the p-value, is given.

### 1.2 Arrangements of the paper

In Section 2, we propose a formal and performance-based definition of the p-value, directly linked with the key logic [L]. Our proposal subsumes the textbook definitions as well as the so-called limiting p-value defined in (but not limited to) the bootstrap literature (Beran, 1986; Singh & Berk, 1994; Liu & Singh, 1997). Based on this definition, we are able to broaden the concept of p-value to a mapping that assesses the strength of evidence obtained from data supporting a statement. In Section 3, we propose a concrete approach using the confidence distribution (CD) (cf., e.g., Xie & Singh (2013); Schweder & Hjort (2016)). Under CD, we formulate and interpret the p-value as a support of the null space $\Theta_0$. Specifically, in Section 3.1, we first give a brief review of the CD concept, and then introduce direct support and indirect support under CD, which provide evidence-based interpretations of the p-value for the standard one- and two-sided tests, respectively. Furthermore, to pursue a potential unification, we propose full support by combining direct and indirect support. In Section 3.2, we present a unified construction of the p-value for univariate cases, based on the supports under CD. We show that our proposal satisfies the performance-based definition, and typically agrees with the textbook p-values. More importantly, we show that the proposal is also applicable to a wide range of hypothesis testing problems including: i) tests with interval null hypotheses, which are motivated by many practical problems (Hodges & Lehmann, 1954; Balci & Sargent, 1981, 1982; Freedman et al., 1984); ii) the intersection-union test, of which a special case is the widely-used bio-equivalence test (cf. Schuirmann (1981, 1987); Anderson & Hauck (1983); Berger & Hsu (1996)). In Section 3.3, we discuss the general guidelines of our CD-based construction of p-value mappings.
In Section 4, we extend our proposal to tackle multivariate hypothesis testing problems, where the form/shape of the null space can vary. In such cases, we formulate the supports based on the limiting p-values given by Liu & Singh (1997) using data depth (cf., e.g., Liu (1990)) and bootstrap. Numerical examples are conducted in Section 5 to illustrate the broad applicability and computational feasibility of this approach. We show that our proposal is a safe and universally effective approach that can be applied broadly, without further consideration of the form/size of the null space. In particular, beyond standard tests, we consider the situations where the null space is a small interval, a union of small intervals or a small region. For such problems, solutions from existing testing methods have either not been available or cannot be easily obtained.

## 2 A General Definition of p-Value Based on Performance

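Let $X_n$ denote the random sample of size $n$ from a distribution indexed by a parameter $\theta \in \Theta$. Let $\mathcal{B}$ be the Borel algebra of the parameter space $\Theta$, and $\mathcal{X}$ be the sample space corresponding to observed sample data $x_n$. Consider the statement of interest $\theta \in \Theta_0$, where $\Theta_0 \in \mathcal{B}$. Let $p(\cdot,\cdot)$ be a mapping: $\mathcal{X} \times \mathcal{B} \to [0,1]$. We propose a performance-based definition of p-value as follows.

###### Definition 1

(A) The value of $p(x_n, \Theta_0)$ is called a p-value for the statement $\theta \in \Theta_0$, if $p(X_n, \Theta_0)$, as a function of the random sample $X_n$, satisfies the following conditions for any $\alpha \in (0,1)$,

(i) $P_\theta\{p(X_n, \Theta_0) \le \alpha\} \le \alpha$,  for all  $\theta \in \Theta_0$;

(ii) $P_\theta\{p(X_n, \Theta_0) \le \alpha\} \to 1$, as $n \to \infty$,  for all  $\theta \in \Theta \setminus \Theta_0$.

(B) The value of $p(x_n, \Theta_0)$ is called a limiting p-value (LP) for $\theta \in \Theta_0$, if condition (i) is replaced by the following asymptotic condition:

(i’) $\limsup_{n \to \infty} P_\theta\{p(X_n, \Theta_0) \le \alpha\} \le \alpha$, for all $\theta \in \Theta_0$.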

The conditions (i) and (ii) above highlight the performance-based characteristics of the p-value directly linked to the key logic [L]. Given a significance level $\alpha$, we require that: 1) the probability of mistakenly rejecting a correct $\Theta_0$ be at most $\alpha$; 2) the probability of correctly rejecting a false $\Theta_0$ get closer and closer to one as the sample size increases. For the hypothesis testing problem with the null hypothesis $H_0: \theta \in \Theta_0$ versus the alternative (say, $H_A: \theta \in \Theta \setminus \Theta_0$), conditions (i) and (ii) specify the performance of a test by controlling Type-I error under $H_0$ and ensuring power under $H_A$, respectively.
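Conditions (i) and (ii) can be checked numerically. The following is a minimal Monte Carlo sketch, not part of the original development: it assumes a normal model with known unit variance and a null of the form (−∞, θ0], uses the one-sided z-test p-value, and the helper names (`z_pvalue`, `rejection_rate`), the seed, the sample sizes and the parameter values are all illustrative choices.

```python
import random
from statistics import NormalDist, fmean

def z_pvalue(sample, theta0):
    """One-sided z-test p-value for H0: theta <= theta0 (sigma = 1 assumed known)."""
    n = len(sample)
    return NormalDist().cdf(n ** 0.5 * (theta0 - fmean(sample)))

def rejection_rate(theta, theta0, n, alpha, reps, rng):
    """Monte Carlo estimate of P_theta{ p(X_n, Theta_0) <= alpha }."""
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(theta, 1.0) for _ in range(n)]
        hits += z_pvalue(sample, theta0) <= alpha
    return hits / reps

rng = random.Random(0)          # fixed seed, illustrative only
alpha, theta0 = 0.05, 0.0

# Condition (i): under the null the rejection rate should not exceed alpha.
size_boundary = rejection_rate(0.0, theta0, 30, alpha, 4000, rng)
size_interior = rejection_rate(-0.3, theta0, 30, alpha, 4000, rng)

# Condition (ii): under the alternative the rejection rate should approach one.
power_small_n = rejection_rate(0.3, theta0, 10, alpha, 4000, rng)
power_large_n = rejection_rate(0.3, theta0, 200, alpha, 4000, rng)

print(size_boundary, size_interior, power_small_n, power_large_n)
```

Under these assumptions, the estimated rejection rate should be close to α on the boundary of the null, well below α in its interior, and should increase towards one under the alternative as the sample size grows.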

Typically, to show that $p(\cdot,\cdot)$ is a p-value mapping, one needs to show that $p(X_n, \Theta_0)$ is stochastically equal to or larger than Uniform[0,1] for all $\theta \in \Theta_0$, and degenerates to 0 for all $\theta \in \Theta \setminus \Theta_0$. The following proposition indicates that the textbook approaches (1) and (2) provide such mappings; as a consequence, the textbook p-values satisfy Definition 1. A rigorous proof is given in Appendix A.

###### Proposition 1

Suppose $T$ is a test statistic constructed so that a large value of $T$ contradicts $H_0$, and that for any given $\theta$, the exact cumulative distribution function (c.d.f.) of $T$ exists. Then, $\mathrm{pval}_1$ in (1) and $\mathrm{pval}_2$ in (2) satisfy Definition 1.

Many p-values in real applications are derived from the limiting null distributions of test statistics (i.e., the null distribution of $T$ is only available asymptotically), and they are approximations of the “exact” p-values. These approximations are often limiting p-values (LPs), which are also defined in Definition 1. In the following, Example 1 presents a frequently-used LP for testing about a normal mean, and Example 2 (cf., e.g., Liu & Singh (1997)) discusses a more general situation where the exact computation of the p-value is extremely difficult.

###### Example 1

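Consider sample data $y_n = (y_1, \ldots, y_n)$ from $N(\theta, \sigma^2)$ with both $\theta$ and $\sigma^2$ unknown. Then for the left one-sided test

$$H_0: \theta \le \theta_0 \quad \text{versus} \quad H_A: \theta > \theta_0, \tag{3}$$

an LP based on the z-test is

$$p(y_n, (-\infty, \theta_0]) = \Phi\left(\sqrt{n}\,(\theta_0 - \bar{y}_n)/s_n\right), \tag{4}$$

where $\Phi$ is the c.d.f. of the standard normal, $\bar{y}_n$ is the sample mean and $s_n$ is the sample standard deviation.

Instead, letting $F_{t_{n-1}}$ be the c.d.f. of the $t_{n-1}$-distribution, we have $p(y_n, (-\infty, \theta_0]) = F_{t_{n-1}}\left(\sqrt{n}\,(\theta_0 - \bar{y}_n)/s_n\right)$, which is the (exact) p-value obtained by the classical t-test.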
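As a small numerical illustration of the z-based formula (4), the sketch below evaluates it with the Python standard library only. The function name `limiting_pvalue_left` and the data values are hypothetical; the exact t-based version is not shown because the standard library does not provide a t-distribution c.d.f.

```python
from math import sqrt
from statistics import NormalDist, fmean, stdev

def limiting_pvalue_left(sample, theta0):
    """z-based limiting p-value (4) for H0: theta <= theta0 vs HA: theta > theta0."""
    n = len(sample)
    ybar = fmean(sample)   # sample mean
    s = stdev(sample)      # sample standard deviation (n - 1 denominator)
    return NormalDist().cdf(sqrt(n) * (theta0 - ybar) / s)

# Hypothetical data with sample mean 0.2.
y = [0.2, -0.1, 0.4, 0.3, 0.0, 0.5, 0.1, 0.2]

p_low = limiting_pvalue_left(y, theta0=0.0)    # null lies below the sample mean
p_high = limiting_pvalue_left(y, theta0=0.5)   # null covers the sample mean

print(p_low, p_high)
```

When θ0 sits below the sample mean the value falls under 0.5, and it rises above 0.5 once θ0 exceeds the sample mean, which is the monotone behaviour expected from (4).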

###### Example 2

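Consider testing the population mean $\theta$ in the left one-sided test (3) as in Example 1. Here, we concentrate on distributions with finite variances: the null collects all such distributions whose mean satisfies $\theta \le \theta_0$, versus those whose mean satisfies $\theta > \theta_0$. Then, given sample data $y_n$, (4) is still an LP by the central limit theorem.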

Although it does not provide a specific way to construct p-values, Definition 1 provides two fundamental requirements that ensure the argument of “proof by contradiction” in [L] can strictly go through mathematically. In particular, (i) or (i’) is a basic requirement for a p-value, since controlling the size of a test is of primary concern for designing the test; while (ii) is a minimal requirement for p-values on the power of the test, and one could seek a testing procedure with an appropriate mapping for achieving better power or other purposes. Through mappings over the sample space and the parameter space, Definition 1 can cover almost all p-values available in the statistics literature. It enables us to justify any candidate p-value mapping, and guarantees the desired features of using the defined p-values. More importantly, it allows us to broaden the concept of p-value to a mapping measuring the strength of evidence coming from the observations in support of the null space $\Theta_0$.

Note that in Definition 1, if $\Theta_0$ is not a closed set, $\Theta_0$ in the required conditions may be replaced by its closure (the set that contains all the boundary limit points). Since this situation is very rare in real applications, we shall assume $\Theta_0$ is closed throughout this article. In addition, it is not difficult to see that our proposed p-values tend to zero under the corresponding alternative hypotheses, since the p-value mappings are not properly centered and will vanish outside of $\Theta_0$ with large $n$. We shall avoid repeating this observation and focus on condition (i) or (i’) in justifications.

## 3 p-Values Based on Confidence Distribution

Our proposed performance-based definition is rigorous, but it does not provide a specific way to construct p-values. Whenever applicable, we can use the textbook approaches to compute p-values. In Sections 3 and 4, we propose an alternative approach that uses confidence distribution (CD) to formulate and calculate p-values. The benefits of this CD-based construction include:

• for a wide range of hypothesis testing problems, it satisfies Definition 1;

• through CD supports, it affords an interpretation of the p-value as the strength of evidence in support of the null.

In the following, we first review the concept of CD and then propose CD-based notions of p-value for univariate hypothesis testing problems. Multivariate cases will be discussed in Section 4.

### 3.1 The Concept of CD Supports

#### 3.1.1 A brief review of CD and its connection to p-value

From the estimation point of view, a CD is a “distribution estimator” of the parameter of interest in frequentist inference. CDs are to provide “simple and interpretable summaries of what can reasonably be learned from data (and an assumed model)” (Cox, 2013). A formal definition of CD (cf., e.g., Xie & Singh (2013); Schweder & Hjort (2016)) is as follows:

###### Definition 2

A sample-dependent function $H_n(\cdot) = H_n(X_n, \cdot)$ on the parameter space $\Theta$ is called a confidence distribution (CD) for a parameter $\theta$, if it follows two requirements: (i) for each given sample set $X_n$, $H_n(\cdot)$ is a continuous cumulative distribution function on $\Theta$; (ii) the function can provide confidence intervals (regions) of all levels for $\theta$.

Here, (i) emphasizes the distribution-estimator nature of CD, while (ii) is imposed to ensure that the statistical inferences derived from the CD have desired frequentist properties linked to confidence intervals (CI). When $\theta$ is univariate, (ii) indicates that at the true parameter value $\theta$, $H_n(\theta)$, as a function of the sample set $X_n$, follows Uniform[0,1].

###### Example 3

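Consider the settings in Example 1. For simplicity, assume $\sigma^2 = 1$. Immediately, we have a point estimate of $\theta$ as $\bar{y}_n$, an interval estimate (95% CI) as $(\bar{y}_n - 1.96/\sqrt{n},\ \bar{y}_n + 1.96/\sqrt{n})$ and a sample-dependent distribution function on $\Theta$ as $N(\bar{y}_n, 1/n)$, of which the c.d.f. is $H_n(\theta) = \Phi(\sqrt{n}\,(\theta - \bar{y}_n))$.

Here, $H_n(\cdot)$ is a CD of $\theta$. Notice that $H_n(\cdot)$ can provide CIs of all levels. For $\alpha \in (0,1)$, a $100(1-\alpha)\%$ CI of $\theta$ is $(H_n^{-1}(\alpha/2),\ H_n^{-1}(1-\alpha/2))$. Also, the mean (median) of $H_n(\cdot)$ is the point estimator $\bar{y}_n$, and the tail mass $H_n(\theta_0)$ is a p-value for testing (3).

Therefore, as a typical CD of $\theta$, $H_n(\cdot)$ provides meaningful information for making inferences about $\theta$. Note that $H_n(\cdot)$ also matches the Bayesian posterior of $\theta$ with a flat prior.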
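The three uses of a CD in this example (point estimate, confidence interval, one-sided p-value) can be sketched in a few lines. This is an illustrative sketch assuming the normal model with unit variance; the helper name `normal_mean_cd` and the data are hypothetical, and `statistics.NormalDist` merely stands in for the distribution $N(\bar{y}_n, 1/n)$.

```python
from math import sqrt
from statistics import NormalDist, fmean

def normal_mean_cd(sample):
    """CD for theta under N(theta, 1): the distribution N(ybar, 1/n)."""
    return NormalDist(mu=fmean(sample), sigma=1 / sqrt(len(sample)))

# Hypothetical data; the variance is assumed to be 1 rather than estimated.
y = [0.3, -0.2, 0.5, 0.1, 0.4, 0.0, 0.2, 0.3]
H = normal_mean_cd(y)

point_estimate = H.median                              # equals the sample mean
alpha = 0.05
ci = (H.inv_cdf(alpha / 2), H.inv_cdf(1 - alpha / 2))  # 95% CI from CD quantiles
p_left = H.cdf(0.0)   # tail mass H_n(theta0): p-value for H0: theta <= 0

print(point_estimate, ci, p_left)
```

All three summaries are read off the same distribution function, which is the sense in which the CD acts as a distribution estimator.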

If requirement (ii) is true only asymptotically and the continuity requirement on $H_n(\cdot)$ is dropped, the function is called an asymptotic CD (aCD) (Xie & Singh, 2013).

###### Example 4 (Example 1 continued)

An aCD of $\theta$ can be obtained based on normal approximation, $H_n(\theta) = \Phi(\sqrt{n}\,(\theta - \bar{y}_n)/s_n)$, which at $\theta = \theta_0$ matches the form in (4) exactly.

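Although CD is a purely frequentist concept, it links to both Bayesian and fiducial inference concepts. “Any approach, regardless of being frequentist, fiducial or Bayesian, can potentially be unified under the concept of CDs, as long as it can be used to build confidence intervals (regions) of all levels, exactly or asymptotically” (Xie & Singh, 2013). Some examples of CDs include: bootstrap distributions (Efron, 1982), p-value functions (Fraser, 1991), Bayesian posteriors, normalized likelihood functions, etc.

Particularly, to illustrate the connection between CD and p-value function, consider common situations where there exists a pivot $U$ with continuous c.d.f. $G_U$, independent of $X_n$ and $\theta$. Suppose $U$ is monotonically increasing with respect to $\hat\theta$ and has the form $U = (\hat\theta - \theta)/SE(\hat\theta)$, where $\hat\theta$ is an arbitrary estimator and $SE(\hat\theta)$ is the standard error. For the left one-sided test (3), we construct a mapping based on (1):

$$\mathrm{pval}_1(x_n) = \sup_{\theta \in (-\infty, \theta_0]} P_U(U \ge u) = P_U\left(U \ge \frac{\hat\theta - \theta_0}{SE(\hat\theta)}\right) = 1 - G_U\left(\frac{\hat\theta - \theta_0}{SE(\hat\theta)}\right), \tag{5}$$

where $u$ denotes the observed value of $U$, and correspondingly, $\hat\theta$ is the sample-dependent estimate. Given $\alpha$, the event $\{\mathrm{pval}_1(X_n) \le \alpha\}$ can be written as

$$\left\{G_U\left(\frac{\hat\theta - \theta_0}{SE(\hat\theta)}\right) \ge 1 - \alpha\right\} = \left\{\frac{\hat\theta - \theta_0}{SE(\hat\theta)} \ge q_{1-\alpha}\right\} = \left\{\frac{\hat\theta - \theta}{SE(\hat\theta)} \ge q_{1-\alpha} + \frac{\theta_0 - \theta}{SE(\hat\theta)}\right\},$$

where $q_{1-\alpha}$ is the $(1-\alpha)$ quantile of $G_U$. Clearly, (5) satisfies both (i) and (ii) in Definition 1. Note that when $n \to \infty$, $SE(\hat\theta)$ tends to zero for any reasonable estimator $\hat\theta$. Immediately, we have the following observations: given $x_n$, let $\theta_0$ vary in $\Theta$; then $\mathrm{pval}_1(x_n)$ is a c.d.f. on $\Theta$. Let $\theta_0$ be the true value of $\theta$ and $X_n$ be random; then $\mathrm{pval}_1(X_n)$, as a function of $X_n$, follows Uniform[0,1]. Therefore, $H_n(\theta_0) = 1 - G_U\left((\hat\theta - \theta_0)/SE(\hat\theta)\right)$, viewed as a function of $\theta_0$, is a typical CD of $\theta$.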

#### 3.1.2 Direct support under CD

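Let $H_n(\cdot)$ denote a CD of $\theta$, and $h_n(\cdot)$ be the corresponding density function. For a subset $\Theta_0$ of the parameter space, we consider a measure of our “confidence” that the subset covers the true value of $\theta$.

###### Definition 3

Let $\Theta_0 \subseteq \Theta$. The direct support (or evidence) of $\Theta_0$ under CD $H_n(\cdot)$ is

$$S^D_n(\Theta_0) = \int_{\theta \in \Theta_0} dH_n(\theta). \tag{6}$$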

The direct support, also called “strong support” (Singh et al., 2007), is a typical “measure of support” (cf., e.g., Schervish (1996)). The motivation here is to look at the CD of $\theta$ as a plausible distribution to which $\theta$ belongs, conditioning on the given data. Intuitively, the higher the support of $\Theta_0$ is, the more likely an estimate of $\theta$ falls inside $\Theta_0$, and thus the more plausible it is that $\theta \in \Theta_0$. As a special case, if a Bayesian posterior is used as a CD, (6) is equivalent to the posterior probability of $\Theta_0$.
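The integral in (6) reduces to differences of the CD's c.d.f. when the subset is a union of disjoint intervals. The sketch below illustrates this under the assumption of a normal CD; the function name `direct_support`, the chosen CD and the intervals are hypothetical.

```python
from math import inf
from statistics import NormalDist

def direct_support(cd, intervals):
    """Direct support (6): CD mass of a union of disjoint intervals [(lo, hi), ...]."""
    return sum(cd.cdf(hi) - cd.cdf(lo) for lo, hi in intervals)

# Hypothetical CD, e.g. N(ybar, s^2 / n) with ybar = 0.2 and s / sqrt(n) = 0.1.
cd = NormalDist(mu=0.2, sigma=0.1)

one_sided = direct_support(cd, [(-inf, 0.0)])                # null (-inf, 0]
two_pieces = direct_support(cd, [(-inf, 0.0), (0.3, inf)])   # union of two rays
narrow = direct_support(cd, [(0.19, 0.21)])                  # narrow interval around ybar
singleton = direct_support(cd, [(0.2, 0.2)])                 # a single point

print(one_sided, two_pieces, narrow, singleton)
```

For the one-sided null the result is simply the CD evaluated at the boundary. A narrow interval receives little mass even when it is centred at the estimate, and a single point receives none, because the CD is continuous.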

Based on the previous discussions, (5) is the textbook p-value for the one-sided test with $\Theta_0 = (-\infty, \theta_0]$. In the meanwhile, (5) provides a p-value mapping, leading to the fact that the direct support $S^D_n((-\infty, \theta_0]) = H_n(\theta_0)$ is a p-value. Clearly, the argument is still valid for $\Theta_0 = [\theta_0, \infty)$. Then, the connection between direct support and one-sided p-value can be summarized as follows:

###### Proposition 2

If $\Theta_0$ is of the type $(-\infty, \theta_0]$ or $[\theta_0, \infty)$, the textbook p-value typically agrees with the direct support $S^D_n(\Theta_0)$.

The following lemma illustrates the properties of the direct support, which apply not only to the one-sided tests, but also to a wider set of problems, namely a union of intervals. Here, we assume the following regularity condition: as $n \to \infty$, for any finite $\theta_1$, $\theta_2$ and positive $\epsilon$, $\delta$,

$$\sup_{\theta \in [\theta_1, \theta_2]} P_\theta\left(\max\{H_n(\theta - \epsilon),\ 1 - H_n(\theta + \epsilon)\} > \delta\right) \to 0.$$

The proof of Lemma 1 is given in Appendix B.

###### Lemma 1

(a) Let be of the type or . Then, for any , .

(b) Let where are disjointed of the type , or . Here, . If the regularity condition holds, then , as , for any .

In the above cases, the p-value can be calculated and interpreted as the direct support of the null space, which establishes that the p-value is used to measure the strength of evidence “supporting” the null. This is a factual argument in comparison with the widespread but indirect statement that the p-value measures evidence “against” the null (cf., e.g., Mayo & Cox (2006)). In the meanwhile, (6) provides a CD-based measure of the degree/strength of the support. To encompass the “measure of support” properties and the p-value’s evidence-based interpretation, our approach gives meaning to a large p-value, e.g. a p-value of 0.8 has more support than 0.5. This is very similar to the Bayesian posterior probability. It is well known from the Bayesian perspective that, if we choose non-informative priors for location parameters: 1) Bayesian credible intervals match the corresponding confidence intervals, guaranteeing the frequentist coverage; 2) the posterior probabilities of the null hypothesis typically agree with p-values for the one-sided tests. Thus, this “coincidence” between CD and Bayesian inferences is a clarification rather than a misinterpretation.

###### Remark 1

Although $S^D_n(\Theta_0)$ calculates and interprets the corresponding p-value in a wide range of problems, its usefulness is limited. Since the CD density is generally continuous, when $\Theta_0$ is narrow, $S^D_n(\Theta_0)$ would almost always be small unless $n$ is sufficiently large (cf. Figures 2 and 4 in the simulation study); more extremely, when $\Theta_0$ degenerates to a singleton, $S^D_n(\Theta_0)$ would be simply zero regardless of where $\Theta_0$ lies. In such cases, because the width of $\Theta_0$ has a non-negligible effect on the value of $S^D_n(\Theta_0)$, an alternative way of evaluating the evidence for $\Theta_0$ is desired.

#### 3.1.3 Indirect support under CD

To avoid the undesired influence of the width of $\Theta_0$ in measuring its support, we propose another measure of the strength of evidence, called indirect support, as follows.

###### Definition 4

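The indirect support (or evidence) of a subset $\Theta_0$ under CD $H_n(\cdot)$ is

$$S^{IND}_n(\Theta_0) = \inf_{\theta_0 \in \Theta_0} 2\min\{H_n(\theta_0),\ 1 - H_n(\theta_0)\}. \tag{7}$$

Clearly, when $\Theta_0$ is a singleton, say $\{\theta_0\}$, we have

$$S^{IND}_n(\theta_0) \equiv S^{IND}_n(\{\theta_0\}) = 2\min\{H_n(\theta_0),\ 1 - H_n(\theta_0)\}. \tag{8}$$

The motivation here is to examine how plausible it is to assume $\theta = \theta_0$. To facilitate such an examination on some $\theta_0$, for which $S^D_n$ does not work, we build up some room to consider the opposites of $\theta_0$. Denote $\Theta_{lo} = (-\infty, \theta_0)$ and $\Theta_{up} = (\theta_0, \infty)$, respectively. Then, we have

$$S^{IND}_n(\theta_0) = 2\min\{1 - S^D_n(\Theta_{up}),\ 1 - S^D_n(\Theta_{lo})\} = 2\left[1 - \max\{S^D_n(\Theta_{up}),\ S^D_n(\Theta_{lo})\}\right].$$

As the proverb says, “the enemy of my enemy is my friend”. In the form of indirect support, we first consider the direct support of the “enemies” of $\theta_0$, $S^D_n(\Theta_{lo})$ and $S^D_n(\Theta_{up})$. Next, by $\max\{S^D_n(\Theta_{up}), S^D_n(\Theta_{lo})\}$, we take the stronger side (“the tougher enemy”) as a measure of evidence “against” $\theta_0$. Then, “the enemy of enemy”, $1 - \max\{S^D_n(\Theta_{up}), S^D_n(\Theta_{lo})\}$, can be used to measure the indirect evidence for $\theta_0$ in one direction (side). Finally, to adjust for two directions (sides), we multiply the value above by 2. Intuitively, when $\theta_0$ is extreme in either direction, (8) will be small, indicating that $\theta_0$ is implausible.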
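A short numerical sketch of (8) follows. It assumes a normal CD centred at an estimate with a given standard error, and compares the indirect support of a point with the familiar two-sided z-test p-value; the function names and numbers are hypothetical.

```python
from statistics import NormalDist

def indirect_support_point(cd, theta0):
    """Indirect support (8): 2 * min{H_n(theta0), 1 - H_n(theta0)}."""
    h = cd.cdf(theta0)
    return 2 * min(h, 1 - h)

def enemy_form(cd, theta0):
    """Same quantity written via the direct supports of the two 'enemies' of theta0."""
    support_lo = cd.cdf(theta0)       # direct support of (-inf, theta0)
    support_up = 1 - cd.cdf(theta0)   # direct support of (theta0, inf)
    return 2 * (1 - max(support_lo, support_up))

# Hypothetical estimate and standard error; CD taken as N(ybar, se^2).
ybar, se, theta0 = 0.2, 0.1, 0.0
cd = NormalDist(mu=ybar, sigma=se)

s_ind = indirect_support_point(cd, theta0)

# Textbook two-sided z-test p-value for H0: theta = theta0.
z = (ybar - theta0) / se
two_sided_z = 2 * (1 - NormalDist().cdf(abs(z)))

print(s_ind, enemy_form(cd, theta0), two_sided_z)
```

In this normal setting the three numbers coincide up to floating-point error, and the indirect support reaches its maximum of one when the point equals the CD median.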

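The following proposition implies that the above interpretation through the indirect support can be applied to two-sided p-values.

###### Proposition 3

If $\Theta_0$ is a singleton $\{\theta_0\}$, the textbook p-value typically agrees with $S^{IND}_n(\theta_0)$.

The justification of this proposition is given in Lemma 2. Briefly speaking, if $\theta_0$ is the true value, we have the key facts that $H_n(\theta_0) \sim$ Uniform[0,1] and $1 - H_n(\theta_0) \sim$ Uniform[0,1]. Then, $2\min\{H_n(\theta_0),\ 1 - H_n(\theta_0)\} \sim$ Uniform[0,1].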

###### Lemma 2

Let be a singleton , , where .

More generally, when $\Theta_0$ is a subset, we evaluate all the points $\theta_0 \in \Theta_0$ by $S^{IND}_n(\theta_0)$ and report the smallest value of support. A large value of $S^{IND}_n(\Theta_0)$ indicates that no extreme (inconsistent) value is contained in $\Theta_0$, implying that $\Theta_0$ is plausible; while a small value means that $\Theta_0$ contains some extreme (inconsistent) values, and $\Theta_0$ is plausible only if its direct support is large. When $\Theta_0$ is a one-tailed interval, $S^{IND}_n(\Theta_0)$ is simply zero. Therefore, the indirect support can be treated as a useful and necessary complement of the direct support. It is then intuitive that we may use a combination of the direct and the indirect support to measure the strength of evidence. Later on, we will propose to combine these two types of support to form a unified notion of p-values for both one- and two-sided tests.

#### 3.1.4 Full support under CD

Based on the previous discussions, direct and indirect supports can provide evidence-based interpretations of one- and two-sided p-values, respectively. However, the one- and two-sided p-values are treated very differently in terms of both calculation and interpretation. We propose a combined measure of evidence, which can fill this gap.

###### Definition 5

Let $\Theta_0 \subseteq \Theta$. The full support (or evidence) of $\Theta_0$ under CD $H_n(\cdot)$ is

$$S^+_n(\Theta_0) = S^D_n(\Theta_0) + S^{IND}_n(\Theta_0). \tag{9}$$

Here, $S^+_n(\Theta_0)$ has two parts: measures of the direct and indirect evidence in support of $\Theta_0$. The former is the measurement of $\Theta_0$ estimated by the distribution $H_n(\cdot)$, while the latter is an adjustment based on indirect evidence, through “the enemy of enemy”, for $\Theta_0$. Altogether, p-values are computed based on a combination of the direct and indirect parts.

Consider the conventional one-sided and two-sided hypothesis tests. First, for one-sided tests with $\Theta_0 = (-\infty, \theta_0]$ or $[\theta_0, \infty)$, (9) is $S^+_n(\Theta_0) = S^D_n(\Theta_0)$, i.e., the direct support. Second, for two-sided tests with $\Theta_0 = \{\theta_0\}$, $S^+_n(\Theta_0) = S^{IND}_n(\theta_0)$, i.e., the indirect support. Therefore, for both one- and two-sided tests, (9) matches the textbook p-value. Furthermore, the definition of (9) is very general and can accommodate a wide range of testing problems.

The validity of $S^+_n(\Theta_0)$ when $\Theta_0$ is an interval – Consider the situations where $\Theta_0$ belongs to the following set of intervals,

$$\mathcal{A} = \{[c, d] : c, d \in \bar{\mathbb{R}} \text{ and } c \le d\}, \tag{10}$$

where $\bar{\mathbb{R}} = \mathbb{R} \cup \{-\infty, +\infty\}$ is the extended real number system. It is clear that the null spaces in one- and two-sided tests are special cases.

To justify the validity, we first introduce a lemma on combining a -value mapping with a “degenerated” mapping. The proof is given in Appendix C.

###### Lemma 3

Suppose that, is a -value (or ) of the statement . Let be a mapping satisfying that

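$$P_\theta\{q(X_n,\Theta_0) \le \alpha\} \to 1, \quad \text{as } n \to \infty, \ \text{for all } \theta \in \Theta \setminus \Theta_0, \ \text{for any } \alpha \in (0,1).$$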

Then, is another $p$-value (or ) of .

Note that unless is degenerated, , as , for . In such cases, is a $p$-value (or ), because of the nice properties of . Based on Lemmas 1, 2 and 3, we summarize the resulting features as follows.

###### Theorem 1

Consider the mapping defined in (9). (a) Let be of the type or . Then .

(b) Let be a singleton , .

(c) Let . If the regularity condition holds,

(d) Let where are disjoint of the type , or . Here, . If the regularity condition holds, then , as , for any .

###### Example 5 (continued)

Consider $H_0: \theta \in [a,b]$ vs. $H_A: \theta \notin [a,b]$. A $p$-value is

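$$S_n^{+}(\Theta_0) = S_n^{D}([a,b]) + 2\min\{S_n^{D}((-\infty,a)),\, S_n^{D}((b,\infty))\} = \begin{cases} \Phi\big(\sqrt{n}(\bar{y}_n - a)/s_n\big) + \Phi\big(\sqrt{n}(\bar{y}_n - b)/s_n\big) & \text{if } \bar{y}_n < (a+b)/2; \\ \Phi\big(\sqrt{n}(a - \bar{y}_n)/s_n\big) + \Phi\big(\sqrt{n}(b - \bar{y}_n)/s_n\big) & \text{if } \bar{y}_n \ge (a+b)/2. \end{cases}$$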

This result is an “n-sample” version of the $p$-value given by Schervish (1996), which is derived from the corresponding uniformly most powerful unbiased test (cf. Lehmann (1986)).
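The interval formula in Example 5 can be checked numerically. The following is a minimal sketch, assuming the normal-approximation CD $H_n(\theta) = \Phi(\sqrt{n}(\theta - \bar{y}_n)/s_n)$ implied by the closed form above; the function names `cd` and `full_support` and the numeric inputs are illustrative and do not come from the paper.

```python
from math import inf, sqrt
from statistics import NormalDist

_PHI = NormalDist().cdf  # standard normal CDF


def cd(theta, ybar, s, n):
    """Assumed CD: H_n(theta) = Phi(sqrt(n) * (theta - ybar) / s)."""
    if theta == inf:
        return 1.0
    if theta == -inf:
        return 0.0
    return _PHI(sqrt(n) * (theta - ybar) / s)


def full_support(c, d, ybar, s, n):
    """Full support of the interval [c, d], following (9)."""
    lower = cd(c, ybar, s, n)            # CD mass of (-inf, c)
    upper = 1.0 - cd(d, ybar, s, n)      # CD mass of (d, inf)
    direct = 1.0 - lower - upper         # CD mass of [c, d]
    indirect = 2.0 * min(lower, upper)   # zero for a one-tailed interval
    return direct + indirect


# Hypothetical summary statistics: ybar = 0.9, s = 1, n = 25.
print(full_support(0.0, 1.0, ybar=0.9, s=1.0, n=25))   # interval null [0, 1]
print(full_support(-inf, 0.0, ybar=0.3, s=1.0, n=25))  # one-sided null
print(full_support(0.0, 0.0, ybar=0.3, s=1.0, n=25))   # singleton null
```

Under this assumed CD, the same function returns the one-sided $p$-value when one endpoint is infinite and the two-sided $p$-value when $c = d$, which is the unification that (9) is meant to provide.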

### 3.2 A Unified Notion of p-Value for Univariate θ

Based on the concepts of CD supports, we are ready to provide a unified notion of $p$-value as follows.

###### Definition 6

Let $\Theta_0 \subseteq \Theta$ and $\Theta_0 = \bigcup_{i=1}^{k} \Theta_{0i}$, where the $\Theta_{0i}$ are disjoint.

$$p(X_n,\Theta_0) = \max_{1 \le i \le k} S_n^{+}(\Theta_{0i}), \tag{11}$$

where $S_n^{+}(\cdot)$ is the full support defined in (9) under the CD.

Note that, when $k = 1$, $p(X_n,\Theta_0) = S_n^{+}(\Theta_0)$. In such cases, the validity of $p(X_n,\Theta_0)$ as a $p$-value mapping can be shown based on the validity of $S_n^{+}(\Theta_0)$.

The validity of $p(X_n,\Theta_0)$ when $\Theta_0$ is a union of intervals. In practice, there is an increasing demand for non-standard types of hypothesis testing, where the null spaces are not restricted to (10). For instance, in a bio-equivalence problem, the parameter of interest is a measurement for assessing bio-equivalence of two formulations of the same drug or two drugs, e.g., or , where and are the population means of bioavailability measures of the two formulations/drugs. Let and be some known bio-equivalence limits (e.g., ); the following testing problem is often considered: . More generally, the intersection-union test (Berger, 1982) has the following form:

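$$H_0: \theta \in \bigcup_{i=1}^{K} \Theta_{0i} \quad \text{versus} \quad H_A: \theta \in \bigcap_{i=1}^{K} \{\Theta \setminus \Theta_{0i}\}, \tag{12}$$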

where the $\Theta_{0i}$'s are disjoint intervals. The validity of the $p$-value mapping can be shown by the following theorem.

###### Theorem 2

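Suppose that, for $i = 1, \ldots, K$, $p_i$ is the corresponding (limiting) $p$-value of the statement $\theta \in \Theta_{0i}$. Then, $\max_{1 \le i \le K} p_i$ is the (limiting) $p$-value of the statement $\theta \in \bigcup_{i=1}^{K} \Theta_{0i}$.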

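To measure the evidence for the null space $\Theta_0 = \bigcup_{i} \Theta_{0i}$, we turn to the hypothesis testing problems corresponding to the null spaces $\Theta_{0i}$, $i = 1, \ldots, K$. For each $i$, $S_n^{+}(\Theta_{0i})$ provides the full support and a $p$-value. Then, $p(X_n,\Theta_0)$ measures the largest evidence among the $S_n^{+}(\Theta_{0i})$'s. For any $\alpha \in (0,1)$, $S_n^{+}(\Theta_{0i}) \le \alpha$ for all $i$ implies $p(X_n,\Theta_0) \le \alpha$. On the one hand, the proper design of each $S_n^{+}(\Theta_{0i})$ guarantees the size of the test of $\theta \in \Theta_{0i}$, so the size of the test of $\theta \in \Theta_0$ is guaranteed; on the other hand, in order to reject $\theta \in \Theta_0$, we need to reject every $\theta \in \Theta_{0i}$. In sum, the evidence for $\Theta_0$ is small only if the evidence for every $\Theta_{0i}$ is small, and the evidence for $\Theta_0$ is large if the evidence for some $\Theta_{0i}$ is large. The idea of handling a bio-equivalence test by “two one-sided tests” (Schuirmann, 1981) can be considered as a special case. A simple and clear way of calculating the $p$-value for a bio-equivalence test is given in the following example.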

###### Example 6 (continued)

Consider a bio-equivalence test vs. , where , are known. We can obtain

In addition, based on Theorem 2, where (), allows us to provide $p$-values more broadly than the regular intersection-union test (12). For instance, the results still hold when some or even all of the $\Theta_{0i}$'s are singletons.
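The mapping (11) can be sketched in a few lines for a null space made of disjoint intervals. This is an illustration only: it assumes the same normal-approximation CD as in Example 5, and the names `full_support` and `p_value_union`, as well as the limits $\pm 1$ and the summary statistics, are hypothetical.

```python
from math import inf, sqrt
from statistics import NormalDist


def full_support(c, d, ybar, s, n):
    """Full support of [c, d] under the assumed CD Phi(sqrt(n) * (t - ybar) / s)."""
    phi = NormalDist().cdf
    lower = 0.0 if c == -inf else phi(sqrt(n) * (c - ybar) / s)       # mass below c
    upper = 0.0 if d == inf else 1.0 - phi(sqrt(n) * (d - ybar) / s)  # mass above d
    direct = 1.0 - lower - upper
    indirect = 2.0 * min(lower, upper)
    return direct + indirect


def p_value_union(intervals, ybar, s, n):
    """Mapping (11): the largest full support among the disjoint pieces of the null."""
    return max(full_support(c, d, ybar, s, n) for c, d in intervals)


# Hypothetical bio-equivalence null: theta <= -1 or theta >= 1.
p_be = p_value_union([(-inf, -1.0), (1.0, inf)], ybar=0.2, s=1.0, n=25)
print(p_be)

# The same function accepts singleton pieces, e.g. the null {0} union {2}.
p_pts = p_value_union([(0.0, 0.0), (2.0, 2.0)], ybar=0.3, s=1.0, n=25)
print(p_pts)
```

With two one-tailed pieces, the result is the larger of two one-sided $p$-values, which is how the “two one-sided tests” idea appears as a special case.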

### 3.3 Guidelines for Constructing p-Value Mappings

Up to this point, a unified and comprehensive notion of $p$-value is provided. It is important to notice that $p$-value construction is not unique and modification might be available in case-by-case scenarios (e.g., see the discussions in Remark 1). Whatever the way of $p$-value construction, the bottom line is that the defined mapping satisfies the two requirements in the performance-based Definition 1; i.e., i) for all $\theta \in \Theta_0$, $p(X_n,\Theta_0)$ is stochastically equal to or larger than Uniform(0,1) (at least asymptotically); ii) for all $\theta \in \Theta \setminus \Theta_0$, $p(X_n,\Theta_0)$ goes to 0, as $n \to \infty$. Correspondingly, almost all modifications in the field of hypothesis testing, not restricted to $p$-value approaches, concern two key points: a) guaranteeing the size of the test, especially when the sample size is limited; b) achieving better power.

As to the comparison of two $p$-value mappings under the same scenario, we emphasize that there is a trade-off between controlling the size of the test and pursuing better power. For each individual case or application, this trade-off is best determined by domain experts. For example, consider the case with , since , for the same asymptotic size, testing by will reject more, and therefore have better power. However, for small samples, may not guarantee the size and may be overly aggressive compared to . The limitations of have been discussed in Remark 1. If controlling the size is of primary concern, we may consider as a modification of . First, provides a comprehensive and unified notion of $p$-value (for any ). Second, when is narrow and is not large, is preferred in terms of controlling the size of the test, since (cf. Figures 1 and 2 in the simulation study).

###### Remark 2

In the context of the intersection-union test (12), there exists another recommendation of the $p$-value mapping as (cf. Singh et al. (2007)). When all $\Theta_{0i}$'s are intervals, the validity is obvious based on Theorem 3. Since , testing by the latter will have better power. However, it can be overly aggressive in controlling the Type I error, especially when one or more $\Theta_{0i}$ is a singleton or a small interval. Moreover, when all $\Theta_{0i}$'s are singletons, is not applicable and should be used.

###### Remark 3

To enhance power, we may consider the following $p$-value mapping construction. Let and , where are disjoint. Write the piecewise $p$-value , and the corresponding ordered $p$-values , . Consider

$$p^{*}(X_n,\Theta_0) = p_{(1)} - p_{(2)}, \tag{13}$$

where $p_{(1)}$ is the largest one among the piecewise $p$-values and $p_{(2)}$ is the second largest one.
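As a small illustration of (13), the sketch below takes a list of piecewise $p$-values that are assumed to have been computed already and returns the difference between the largest and the second largest. The function name `p_star` and the input values are hypothetical.

```python
def p_star(piecewise_p_values):
    """Mapping (13): largest piecewise p-value minus the second largest."""
    if len(piecewise_p_values) < 2:
        raise ValueError("need at least two piecewise p-values")
    largest, second = sorted(piecewise_p_values, reverse=True)[:2]
    return largest - second


# Hypothetical piecewise p-values for three disjoint pieces of the null space.
print(p_star([0.30, 0.05, 0.12]))
```

Because the second largest value is subtracted, the result is never larger than the maximum used in (11).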

###### Example 7 (continued)

Consider a bio-equivalence test vs. , where , are known. We can obtain

###### Remark 4

The aforementioned CD approaches of constructing $p$-value mappings in the intersection-union test for a single parameter can also accommodate more complex settings such as some multi-parameter cases (as seen in the example in Section 4 of Berger (1982)). More specifically, consider and , can still be applied.

Although the CD is used as a general tool to formulate and interpret $p$-values, the CD itself does not rely on the null or alternative hypothesis. Therefore, the “supports” of multiple (mutually exclusive) sets under the CD can be obtained identically as in the univariate case. Once the CD is obtained, the proposed CD-based mappings can derive $p$-values for various choices of the null space $\Theta_0$. Potentially, this can answer the common complaint about the classical testing approach (as articulated in Marden (2000)) that “model selection is difficult”.

## 4 CD-based Notions of p-Value for Multivariate Parameters

Our CD approach can also be used for multi-dimensional problems. In this section, we extend our CD-based $p$-value mappings to multivariate hypothesis testing. With the help of the bootstrap method and data depth, we can even skip the specification of the CD, and build the $p$-value mappings from the bootstrap estimates directly. For the direct support, we simply consider the fraction of bootstrap samples that lie in the null space. For the indirect support, we apply the concept of data depth to determine the fraction of possible values in the parameter space that are more outlying (less consistent) than the null space. In the following, we first give a brief description of a well-known notion of data depth, Liu’s Simplicial Depth (Liu, 1990).
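The bootstrap version of the direct support described above can be sketched as follows. This is a minimal illustration under stated assumptions: the parameter is taken to be a mean vector estimated by the sample mean, the null space is supplied as a user-defined predicate, and the names `bootstrap_direct_support` and `in_null` are hypothetical. The depth-based indirect support is not covered here.

```python
import random


def bootstrap_direct_support(data, in_null, n_boot=2000, seed=0):
    """Fraction of bootstrap estimates (here, sample means) that fall in the null space.

    data    : list of equal-length tuples (one tuple per observation)
    in_null : function taking an estimated mean vector and returning True/False
    """
    rng = random.Random(seed)
    n = len(data)
    dim = len(data[0])
    hits = 0
    for _ in range(n_boot):
        # Resample n observations with replacement.
        sample = [data[rng.randrange(n)] for _ in range(n)]
        mean = tuple(sum(row[j] for row in sample) / n for j in range(dim))
        if in_null(mean):
            hits += 1
    return hits / n_boot


# Hypothetical bivariate data and null space {theta : theta_1 <= 0 and theta_2 <= 0}.
gen = random.Random(1)
data = [(gen.gauss(0.2, 1.0), gen.gauss(-0.1, 1.0)) for _ in range(50)]
print(bootstrap_direct_support(data, lambda m: m[0] <= 0.0 and m[1] <= 0.0))
```

The number of bootstrap replicates and the seed are arbitrary choices; a larger `n_boot` reduces the Monte Carlo error of the reported fraction.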

Given $Z_1, \ldots, Z_n$ from the distribution $\Phi$ in $\mathbb{R}^k$, the Simplicial Depth of a given point $w$ with respect to $\Phi$ and the data cloud $\{Z_1, \ldots, Z_n\}$ is

$$D(\Phi; w) = P_\Phi\{w \in S(Z_1, \cdots, Z_{k+1})\},$$

where $S(Z_1, \cdots, Z_{k+1})$ is the closed simplex whose vertices $Z_1, \cdots, Z_{k+1}$ are $k+1$ random observations from $\Phi$. The sample version of $D(\Phi; w)$ is