1 Introduction
Despite being common practice, statistical hypothesis testing presents challenges in interpretation. For instance, some understand that a hypothesis test can either accept or reject the null hypothesis, $H_0$. However, in this paradigm the probability of accepting $H_0$ can be high even when $H_0$ is false. Therefore, it is possible to obtain the undesirable result of accepting $H_0$ even when this hypothesis is unlikely.
In order to deal with this problem, others propose that a hypothesis test should either reject $H_0$ or fail to reject $H_0$ (Casella and Berger [3, p. 374] and DeGroot and Schervish [5, p. 545]). Such a position can also lead to challenges in interpretation, since the practitioner often wishes to be able to assert $H_0$ [13]. For example, in regression analysis nonsignificant predictors are often considered to not affect the response variable and are removed from the model. More generally, scientists often wish to assert a theory [20, 21].
Neyman [17, p. 14] briefly introduces an alternative to the above paradigms of hypothesis testing. In this setting, a hypothesis test can have three outcomes: reject $H_0$, accept $H_0$, or remain in doubt about $H_0$ (the agnostic decision). This third decision allows the hypothesis test to commit a less severe error (remain in doubt) whenever the data do not provide strong evidence either in favor of or against the null hypothesis. This approach, which was called agnostic hypothesis testing, was further developed in Berg [1], Esteves et al. [6], and Stern et al. [22]. This framework allows the acceptance of $H_0$ while simultaneously controlling the type I and II errors through the agnostic decision. As a result, it is possible to control the probability that $H_0$ is accepted when $H_0$ is false.
Although agnostic decisions have been used in classification problems with great success [12, 9, 10, 18], the agnostic hypothesis testing framework has only started to be explored. Here, we generalize to arbitrary hypotheses the setting in Berg [1], which applies only to hypotheses of the form: , for
. This generalization allows the translation of standard concepts, such as level, size, power, p-value, unbiased tests, and uniformly most powerful tests, into the framework of agnostic hypothesis testing. Within this framework, we create new versions of standard statistical techniques, such as t-tests, regression analysis, and analysis of variance, which simultaneously control type I and type II errors.
Section 2 formally defines agnostic tests and concepts that are used for controlling their error, such as level, size and power. Sections 2.1 and 2.2 use these definitions to generalize the framework in Berg [1]; they derive agnostic tests that are uniformly most powerful and uniformly most powerful among unbiased tests. Since it can be hard to obtain the above tests in complex models, Section 3 derives a general approach for controlling the error of agnostic tests that is based on p-values. Section 4 advances results that were obtained in Esteves et al. [6] and Stern et al. [22] and shows that agnostic tests can control type I and II errors while retaining logical coherence. Section 5 discusses how to control the type I and II errors while obtaining consistent agnostic tests. All proofs are presented in the supplementary material.
2 The power of agnostic tests
We consider a setting in which the hypotheses that are tested are propositions about a parameter, $\theta$, that assumes values in the parameter space, $\Theta$. Specifically, the null hypotheses, $H_0$, are of the form $H_0: \theta \in \Theta_0$, where $\Theta_0 \subset \Theta$. The alternative hypotheses, $H_1$, are of the form $H_1: \theta \in \Theta_0^c$. In order to test $H_0$, we use data, $X$, which assumes values on the sample space, $\mathcal{X}$. Also, $P_{\theta_0}$ denotes the probability measure over $\mathcal{X}$ when $\theta = \theta_0$.
$H_0$ is tested through an agnostic test. An agnostic test is a function that, for each observable data point, determines whether $H_0$ should be rejected, accepted or remain undecided. Let $\mathcal{D} = \{0, 1/2, 1\}$ denote the set of possible outcomes of the test: accept (0), reject (1), and remain agnostic (1/2).
An agnostic test is a function, $\phi: \mathcal{X} \rightarrow \mathcal{D}$.
An agnostic test, $\phi$, is a standard test if Im$[\phi] \subseteq \{0, 1\}$.
An agnostic test can have three types of errors. The type I and type II errors of agnostic tests are defined in the same way as those of standard tests. That is, a type I error occurs when the test rejects $H_0$ and $H_0$ is true. Similarly, a type II error occurs when the test accepts $H_0$ and $H_0$ is false. A type III error occurs whenever the test remains agnostic. An agnostic test can be designed to control the errors of type I and II.
An agnostic test, $\phi$, has level $(\alpha, \beta)$ if the test's probabilities of committing errors of type I and II are controlled by, respectively, $\alpha$ and $\beta$. That is,
$$\sup_{\theta \in \Theta_0} P_\theta(\phi = 1) \leq \alpha \quad \text{and} \quad \sup_{\theta \in \Theta_0^c} P_\theta(\phi = 0) \leq \beta.$$
Similarly, has size if the probabilities of committing errors of type I and II are upper bounded by and . That is, and .
Agnostic tests can be compared by means of their power. The power function of a test is the probability that it doesn’t commit an error. That is, the probability that it accepts when is true or rejects when is false.
The power function of an agnostic test, , is denoted by .
Let and be agnostic tests. We say that is uniformly more powerful than for and write if, for every , .
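The definitions of error probabilities and power can be checked numerically. The sketch below is only an illustration and not part of the original development: it estimates by Monte Carlo the probability of each decision of an agnostic test and reads the power off those estimates. The encoding of the decisions as 0 (accept), 0.5 (agnostic) and 1 (reject), the function names, and the toy test for a single observation from N(theta, 1) with the null hypothesis theta <= 0 are all assumptions made for the example.

```python
import random


def decision_probabilities(test, sampler, n_sim=20000, seed=0):
    """Monte Carlo estimate of P(test = d) for d in {0, 0.5, 1}.

    `test` maps one data point to a decision and `sampler(rng)` draws one
    data point under a fixed parameter value (both are hypothetical helpers).
    """
    rng = random.Random(seed)
    counts = {0: 0, 0.5: 0, 1: 0}
    for _ in range(n_sim):
        counts[test(sampler(rng))] += 1
    return {decision: count / n_sim for decision, count in counts.items()}


def estimated_power(test, sampler, null_is_true, n_sim=20000, seed=0):
    """Power is the probability of the correct decision: accepting when the
    null hypothesis is true, rejecting when it is false."""
    probs = decision_probabilities(test, sampler, n_sim, seed)
    return probs[0] if null_is_true else probs[1]


def toy_test(x, cutoff=1.645):
    """Toy agnostic test (an assumption): X ~ N(theta, 1), null theta <= 0."""
    if x > cutoff:
        return 1      # reject
    if x < -cutoff:
        return 0      # accept
    return 0.5        # remain agnostic


# Decision probabilities at the border point theta = 0.
at_border = decision_probabilities(toy_test, lambda rng: rng.gauss(0.0, 1.0))
# Power at an alternative far from the null, theta = 3.
far_alternative = estimated_power(
    toy_test, lambda rng: rng.gauss(3.0, 1.0), null_is_true=False
)
```

With this cutoff, the estimated probabilities of rejecting and of accepting at the border point should both be close to 0.05, so the toy test has approximately level (0.05, 0.05), while most of the probability mass at the border goes to the agnostic decision.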
2.1 Uniformly most powerful tests
An level agnostic test, , is uniformly most powerful (UMP) if, for every other size agnostic test, , .
In the following, section 2.1 presents general conditions under which we can find UMP agnostic tests. These conditions are the same as the ones that are typically used in the standard frequentist framework [3][p.391].
Assumption

For every , is absolutely continuous with respect to the Lebesgue measure, , and .

There exists a sufficient statistic for , , and the likelihood is monotone over .
Section 2.1 and Section 2.1 present the agnostic tests that are UMP under Section 2.1.
Let be a statistic and . The agnostic test, , is
Let , be such that , and be such that . Under Section 2.1,

If , then is a UMP size agnostic test.

If and are such that (and thus is not well defined), then let . For every size agnostic test, , there exists such that .
Section 2.1 generalizes several previous results in the literature. For example, if and , then the likelihood is monotone over . In this setting, Berg [1] shows that, if , then is the UMP agnostic test. Also, one can emulate the standard frequentist framework by not controlling the type II error, that is, by considering size tests. In this case, is the set of size UMP tests in the standard frequentist framework [3][p.391].
Similarly to this case in which , the second condition in Section 2.1 occurs whenever the control over and is sufficiently weak so that there exist standard tests of size and there is no need of using the agnostic decision. In this case, the tests in cannot be uniformly more powerful than one another because of a tradeoff in the power in each region of . If , and , then the comparison of the critical regions of and reveals that the power of is higher over and the power of is higher over . That is, the choice between the elements in depends on the desired balance between the power over and over .
In the following, Section 2.1 presents an application of Section 2.1.
[Agnostic z-test] Let be an i.i.d. sample with , where and is known. Let and be the sample mean. Note that the conditions in Section 2.1 are satisfied. Furthermore, if , then by taking and , one obtains that , and . Therefore, it follows from Section 2.1 that is a UMP level agnostic test.
Figure 1 illustrates the probability of each decision of this test as well as its power function when , and .
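A one-sided agnostic z-test of this kind can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not a transcription of the constants in the example: it assumes the null hypothesis mu <= mu0, a known standard deviation `sigma`, and cutoffs chosen so that, at the border point mu0, the sample mean exceeds the rejection cutoff with probability `alpha` and falls below the acceptance cutoff with probability `beta`. The function name and the 0 / 0.5 / 1 encoding of the decisions are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, fmean


def agnostic_z_test(sample, mu0, sigma, alpha, beta):
    """Agnostic z-test of the null mu <= mu0 with known sigma.

    Returns 1 (reject), 0 (accept) or 0.5 (remain agnostic).
    Assumes 0 < alpha, 0 < beta and alpha + beta <= 1, so that the
    acceptance cutoff does not exceed the rejection cutoff.
    """
    if not (alpha > 0 and beta > 0 and alpha + beta <= 1):
        raise ValueError("need alpha > 0, beta > 0 and alpha + beta <= 1")
    standard_error = sigma / sqrt(len(sample))
    z = NormalDist().inv_cdf  # standard normal quantile function
    # At mu = mu0: P(mean > reject_cutoff) = alpha, P(mean < accept_cutoff) = beta.
    reject_cutoff = mu0 + z(1 - alpha) * standard_error
    accept_cutoff = mu0 - z(1 - beta) * standard_error
    mean = fmean(sample)
    if mean > reject_cutoff:
        return 1
    if mean < accept_cutoff:
        return 0
    return 0.5


# Illustrative calls with made-up data.
undecided = agnostic_z_test([0.1, -0.2, 0.3, 0.0], mu0=0.0, sigma=1.0, alpha=0.05, beta=0.05)
rejected = agnostic_z_test([2.0, 2.0, 2.0, 2.0], mu0=0.0, sigma=1.0, alpha=0.05, beta=0.05)
accepted = agnostic_z_test([-2.0, -2.0, -2.0, -2.0], mu0=0.0, sigma=1.0, alpha=0.05, beta=0.05)
```

Because the sampling distribution of the mean shifts monotonically with mu, both error probabilities are largest at the border point mu0, which is why the cutoffs are calibrated there.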
2.2 Unbiased uniformly most powerful
Besides the case studied in Section 2.1, there often do not exist UMP tests. For example, they might not exist when the model has nuisance parameters. This often occurs because it is possible for a test to sacrifice power in a region of in order to obtain a high power in another region. However, such sacrifices might yield undesirable tests. These tests are characterized in the following passage.
An example of an undesirable test is a test that uses no data. For example, if and , then is a test that uses no data and that attains level . Furthermore, for every , and also for every , . A generalization of this idea is to consider that a desirable test, , should dominate trivial tests of the same level, that is, for every , and for every , . Such tests are usually called unbiased.
An agnostic test, , is unbiased if
Note that, if is unbiased, then .
Once only unbiased tests are considered, it is often possible to find a uniformly most powerful test. In the following, Sections 2.2 and 2.2 present general conditions under which there exist tests that are uniformly most powerful among the unbiased tests. These conditions are the same as the ones that are typically used in the standard frequentist framework [11][p.151].
An level test is said to be uniformly most powerful among unbiased tests (UMPU) if, for every unbiased size test, , .
Notation
Let . The th element of is denoted by . This notation is useful because is used to denote an element of and not the th element of .
Assumption

For every , is absolutely continuous with respect to the Lebesgue measure, , and .

and is in the exponential family, that is, there exists such that .

Let . There exists such that is increasing in and and are independent when .
Let , be such that , , and be such that and . Under Section 2.2, is a UMPU level test.
Section 2.2 uses section 2.2 in order to derive UMPU unilateral tests. Under the stronger conditions in section 2.2 it is also possible to derive UMPU bilateral tests, as presented in section 2.2.
Assumption
Besides the conditions in Section 2.2, also include that
Let be such that .
Let , be such that , , and for each , let and be such that
Let . Under Section 2.2 is a UMPU level test.
[Agnostic t-test] Let be an i.i.d. sample with , where and . Let and also . Let . It follows from Lehmann and Romano [11][p.153] that satisfies the conditions in Sections 2.2 and 2.2 for testing and . Therefore, if , then it follows from Sections 2.2 and 2.2 that and are the UMPU tests for and . Moreover, by defining , it follows from Lehmann and Romano [11][p.155] that and are such that
where is the quantile of a Student's t-distribution with degrees of freedom. Figure 2 illustrates the probability of each decision for and when , , and . The power of both tests at is . Indeed, it follows from section 2.2 that the power of a size test at the border points of cannot be higher than .
[Agnostic linear regression] Consider a linear regression setting, that is, , where , , is a design matrix of rank d and is the vector with coefficients. For a fixed and , let and . Let . By taking , the least squares estimator for , it follows from Shao [19][p.416] that satisfies the conditions in sections 2.2 and 2.2. Therefore, the UMPU tests, and , are such that
where denotes the quantile of Student's t-distribution with degrees of freedom.
3 General agnostic tests of a given level
Oftentimes, a UMPU agnostic test does not exist or is difficult to derive. In such a situation, one might be willing to use an level test that is not uniformly most powerful. A wide class of such tests can be obtained through the p-value of standard hypothesis tests. The definition of p-value is revisited below.
A nested family of standard tests for , , is such that

For every , is a standard test.

The function , is bijective.

If and , then .
Let . The collection of generalized likelihood ratio tests, , is a nested family of standard tests for .
Let denote a nested family of standard tests for . The p-value of against , is such that .
Intuitively, if is rejected whenever the p-value is smaller than , then the type I error is controlled by . Similarly, one might expect that if is accepted whenever the p-value is larger than , then the type II error is controlled by . Section 3 provides conditions under which this reasoning is valid.
Let be a nested family of standard tests for such that, for every , is an unbiased test. Assume that is a connected space and that, for every , is a continuous function over . Let . Then, the test , i.e.,
is a level test for .
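The p-value rule can be illustrated with a short sketch. It assumes, as a reading of the intuition given before the theorem, that the null hypothesis is rejected when the p-value is at most `alpha`, accepted when it is larger than `1 - beta`, and left undecided otherwise. The function names, the 0 / 0.5 / 1 encoding of the decisions, and the choice of a two-sided z-test with known `sigma` as the underlying unbiased standard test are assumptions made only for this example.

```python
from math import sqrt
from statistics import NormalDist, fmean


def agnostic_from_pvalue(p_value, alpha, beta):
    """Turn the p-value of a standard test into an agnostic decision.

    Assumed rule: reject (1) if p <= alpha, accept (0) if p > 1 - beta,
    remain agnostic (0.5) otherwise. Requires alpha <= 1 - beta.
    """
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    if alpha > 1.0 - beta:
        raise ValueError("need alpha <= 1 - beta")
    if p_value <= alpha:
        return 1
    if p_value > 1.0 - beta:
        return 0
    return 0.5


def two_sided_z_pvalue(sample, mu0, sigma):
    """p-value of the two-sided z-test of mu = mu0 with known sigma."""
    z_stat = (fmean(sample) - mu0) / (sigma / sqrt(len(sample)))
    return 2.0 * (1.0 - NormalDist().cdf(abs(z_stat)))


# Illustrative call with made-up data.
p = two_sided_z_pvalue([0.4, 0.1, -0.3, 0.6], mu0=0.0, sigma=1.0)
decision = agnostic_from_pvalue(p, alpha=0.05, beta=0.05)
```

With small `alpha` and `beta`, acceptance requires a p-value very close to one, so most samples that do not reject the null hypothesis lead to the agnostic decision rather than to acceptance.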
[General Linear Hypothesis in Regression Analysis] Consider the linear regression setting (fig. 2) and the general linear hypothesis
where is a matrix and . A particular case of this problem is the ANOVA test [16]. There exists no UMPU test for [7]. However, the F-statistic
is such that, for every , is unbiased for [14]. Furthermore, it can be shown that , where denotes the cumulative distribution function of a Snedecor's F-distribution random variable with degrees of freedom. Since all conditions in section 3 are satisfied, is a level test.
[Permutation Test] Let and be i.i.d. samples from continuous distributions, and . Also, consider that and . Let be a p-value based on a permutation test such that, if is such that, for every , , then . It follows from Lehmann and Romano [11, Lemma 5.9.1] that is unbiased for . Also, under the topology induced by the total variation metric, is connected and is continuous over . Conclude from section 3 that is a level agnostic test.
4 Connections to region estimation
There exist several known equivalences between standard tests and region estimators [2, p.241]. For example, every region estimator is equivalent to a collection of bilateral standard tests. Also, standard tests for more general hypotheses can be obtained as the indicator that the hypothesis intersects a region estimator. These connections are useful for providing a method of obtaining and interpreting standard hypothesis tests.
The following subsections show that similar results hold for the agnostic tests that were obtained previously. Section 4.1 presents a general method for obtaining agnostic tests from confidence regions. Furthermore, it shows how this method relates to logical coherence and to the unilateral tests in section 2. Section 4.2 presents an equivalence between nested region estimators and collections of bilateral agnostic tests.
4.1 Agnostic tests based on a region estimator
An agnostic test can have other desirable properties besides controlling both the type I and type II errors. For instance, Esteves et al. [6] and Stern et al. [22] show that agnostic tests can be made logically consistent. That is, it is possible to test several hypotheses using agnostic hypothesis tests in such a way that it is impossible to obtain logical contradictions between their conclusions. This property generally cannot be obtained using standard tests [8]. Logically consistent agnostic tests are connected to region estimators, as summarized below.
A region estimator is a function .
[Agnostic test based on a region estimator] Let be a region estimator and . The agnostic test based on for testing , is such that
Figure 3 illustrates this procedure.
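A test based on a region estimator can be sketched for the simplest case of a real parameter. The code assumes the usual construction: the null hypothesis is accepted when the estimated region lies inside the null set, rejected when the region does not intersect the null set, and left undecided when the region overlaps both the null set and its complement. Representing the region and the null set as closed intervals `(low, high)`, the function name, and the 0 / 0.5 / 1 encoding of the decisions are assumptions made for this example.

```python
from math import inf


def region_based_test(region, null_set):
    """Agnostic test based on a region estimate (closed intervals assumed).

    region   -- (low, high) interval produced by a region estimator
    null_set -- (low, high) interval describing the null hypothesis
    Returns 0 (accept), 1 (reject) or 0.5 (remain agnostic).
    """
    region_low, region_high = region
    null_low, null_high = null_set
    if null_low <= region_low and region_high <= null_high:
        return 0      # region contained in the null set
    if region_high < null_low or null_high < region_low:
        return 1      # region disjoint from the null set
    return 0.5        # region overlaps the null set and its complement


observed_region = (0.2, 0.8)  # made-up interval estimate

accept_narrow = region_based_test(observed_region, (0.0, 1.0))
accept_wide = region_based_test(observed_region, (0.0, 2.0))
undecided = region_based_test(observed_region, (-inf, 0.5))
rejected = region_based_test(observed_region, (1.0, inf))
```

Since every decision is read from the same region, accepting a hypothesis forces the acceptance of any hypothesis implied by it (here, the interval from 0 to 1 is contained in the interval from 0 to 2), which is the kind of logical consistency discussed in this section.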
A collection of tests, is based on a region estimator if there exists a region estimator, , such that, for every , is based on .
[Esteves et al. [6]] Let be a collection of agnostic tests such that is a field over and, for every , . is logically consistent if and only if it is based on a region estimator.
It follows from fig. 3 that the collection of tests based on a region estimator is logically consistent. Section 4.1 shows that, if this region estimator has confidence , then the tests based on it also control both the type I and II errors by .
If is a region estimator for with confidence and is an agnostic test for based on , then is a size test.
Furthermore, the unilateral tests that were developed in Sections 3 and 2 are based on confidence regions. In order to present such regions, section 4.1 uses sections 4.1 and 4.1.
Assumption
Let . is a collection of agnostic tests such that

If and , then

If and , then .
Assumption
Let . is a collection of agnostic tests such that for every such that ,
Section 4.1 requires that a collection of unilateral tests satisfy a weak form of logical coherence. That is, if and the collection of tests accepts that , then it accepts that . Similarly, if and the collection of tests rejects that , then it also rejects that . Section 4.1 requires that, for every test in the collection, the probability of the nodecision alternative in the border point of is at least . Section 4.1 shows that a collection of unilateral tests that satisfy sections 4.1 and 4.1 is based on a confidence region of confidence .
For each, , let . If satisfies section 4.1, then there exists a region estimator, , such that, for every , is based on . Furthermore, if section 4.1 holds, then is a confidence region for with confidence .
It is possible to use sections 4.1 and 4.1 in order to extend a collection of unilateral tests to a larger collection of tests. If the collection of unilateral tests satisfies sections 4.1 and 4.1, then it follows from section 4.1 that these tests are based on a region estimator, , with confidence . Therefore, it follows from section 4.1 that, for every of the type , the test for based on has size . Furthermore, it follows from fig. 3 that the collection of these tests is logically coherent. Section 4.1 summarizes these conclusions.
For each, , let . Also, assume that satisfies sections 4.1 and 4.1. Let be such as in section 4.1. Consider the collection of agnostic tests , where (recall Section 4.1). Then

this collection is logically coherent,

each test in this collection has size , and

this collection is an extension of the collection
Under weak conditions, the tests that were developed in sections 2.2 and 2.1 satisfy sections 4.1 and 4.1. As a result, they can be used in sections 4.1 and 4.1. These results are presented in sections 4.1 and 4.1 and illustrated in figs. 3 and 3.
Consider the setting of section 2.1, and let . The collection of UMP level tests presented in section 2.1 is based on a region estimator, . Furthermore, if is such that is continuous over , then has confidence for .
[Agnostic z-test] Consider again section 2.1. For each , let . Let and be the collection of UMP level tests in section 2.1. By defining the constants and , note that . It follows that is based on the region estimator , which is a confidence interval for .
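The link between one-sided agnostic z-tests and a single confidence interval can be illustrated numerically. The sketch below is an assumption-laden illustration rather than the exact constants of the example: it assumes a normal sample with known `sigma`, null hypotheses of the form mu <= mu0, and the interval from `mean - z(1 - alpha) * se` to `mean + z(1 - beta) * se`, which covers the true mean with probability 1 - alpha - beta. The function names and the 0 / 0.5 / 1 encoding of the decisions are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def z_region(mean, n, sigma, alpha, beta):
    """Interval estimate whose coverage is 1 - alpha - beta (alpha + beta <= 1)."""
    if not (alpha > 0 and beta > 0 and alpha + beta <= 1):
        raise ValueError("need alpha > 0, beta > 0 and alpha + beta <= 1")
    standard_error = sigma / sqrt(n)
    z = NormalDist().inv_cdf  # standard normal quantile function
    return (mean - z(1 - alpha) * standard_error,
            mean + z(1 - beta) * standard_error)


def one_sided_test_from_region(region, mu0):
    """Agnostic test of the null mu <= mu0 read off the interval estimate."""
    low, high = region
    if low > mu0:
        return 1      # whole interval above mu0: reject
    if high < mu0:
        return 0      # whole interval below mu0: accept
    return 0.5        # mu0 inside the interval: remain agnostic


# One interval answers every one-sided question about mu at once.
region = z_region(mean=0.0, n=4, sigma=1.0, alpha=0.05, beta=0.05)
decisions = {mu0: one_sided_test_from_region(region, mu0) for mu0 in (-1.0, 0.0, 1.0)}
```

Rejecting when the lower endpoint exceeds mu0 is the same as rejecting when the sample mean exceeds `mu0 + z(1 - alpha) * se`, and accepting when the upper endpoint is below mu0 is the same as accepting when the sample mean is below `mu0 - z(1 - beta) * se`, so the whole family of one-sided tests is recovered from the single interval.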
Assumption
For each , let be such as in section 2.2 when . There exists a function, , which is decreasing over and such that is ancillary.
For each , let . Under section 2.2 and , let be the UMPU level test presented in section 2.2. Under section 4.1, the collection is based on a region estimator, , which has confidence for .
[Agnostic t-test] Consider again section 2.2. For each , let . Let and be the collection of UMPU level tests in section 2.2. By defining , and , note that . It follows that is based on the region estimator , which is a confidence interval for .
4.2 Agnostic tests based on nested region estimators
Contrary to the unilateral tests, the bilateral tests in section 2 are not based on region estimators. Indeed, while these bilateral tests can accept a precise hypothesis, this feature cannot be obtained in tests based on region estimators. However, similarly to the case for standard tests, there exists an equivalence between collections of bilateral agnostic tests and pairs of nested region estimators. Indeed, it is possible to obtain from one another a nested pair of and confidence regions and a collection of bilateral size tests. Section 4.2 prepares for this equivalence, which is established in section 4.2.
[Agnostic test based on nested region estimators] Let and be region estimators such that, and . The agnostic test based on and for testing , , is
Figure 4 illustrates when .
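A test based on a nested pair of regions can be sketched for a point null hypothesis. The code assumes the following rule: accept when the hypothesized value lies in the inner region, reject when it lies outside the outer region, and remain agnostic in between. To keep the example self-contained it uses z-intervals with known `sigma` (an assumption; the examples in this section use the t-distribution), with an inner interval of coverage `beta` and an outer interval of coverage `1 - alpha`. All function names and the 0 / 0.5 / 1 encoding of the decisions are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def nested_z_regions(mean, n, sigma, alpha, beta):
    """Inner (coverage beta) and outer (coverage 1 - alpha) z-intervals."""
    if not (alpha > 0 and beta > 0 and beta <= 1 - alpha):
        raise ValueError("need alpha > 0, beta > 0 and beta <= 1 - alpha")
    standard_error = sigma / sqrt(n)
    z = NormalDist().inv_cdf  # standard normal quantile function
    inner_half = z((1 + beta) / 2) * standard_error
    outer_half = z(1 - alpha / 2) * standard_error
    return ((mean - inner_half, mean + inner_half),
            (mean - outer_half, mean + outer_half))


def nested_region_test(inner, outer, theta0):
    """Agnostic test of the point null theta = theta0 from nested intervals."""
    inner_low, inner_high = inner
    outer_low, outer_high = outer
    if not (outer_low <= inner_low and inner_high <= outer_high):
        raise ValueError("inner region must be contained in the outer region")
    if inner_low <= theta0 <= inner_high:
        return 0      # inside the inner region: accept
    if theta0 < outer_low or theta0 > outer_high:
        return 1      # outside the outer region: reject
    return 0.5        # between the two regions: remain agnostic


inner, outer = nested_z_regions(mean=0.0, n=4, sigma=1.0, alpha=0.05, beta=0.05)
accepted = nested_region_test(inner, outer, theta0=0.0)
undecided = nested_region_test(inner, outer, theta0=0.5)
rejected = nested_region_test(inner, outer, theta0=2.0)
```

In this z-interval setting, the probability that the inner interval contains a fixed value is largest when that value is the true mean, where it equals `beta`, so a false point null is accepted with probability at most `beta`; the outer interval misses the true mean with probability `alpha`.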
[Agnostic t-test] Consider section 2.2. For each , let . The UMPU agnostic test is based on the region estimators
For each , let .

If are confidence regions for with confidence and , then for every , is a size test.

Let be a collection of size tests. If for every such that , and , then there exist region estimators, and , such that