I Introduction
In the simple binary hypothesis testing problem, one is given a source sequence $X^n$ and one knows that it is generated in an i.i.d. fashion from one of two known distributions $P_1$ or $P_2$. One is then asked to design a test to make this decision. There is a natural tradeoff between the type-I and type-II error probabilities. This is quantified by the Chernoff-Stein lemma [1] in the Neyman-Pearson setting in which the type-I error probability decays exponentially fast in $n$ with exponent given by $D(P_2\|P_1)$ if the type-II error probability is upper bounded by some fixed $\varepsilon\in(0,1)$. Blahut [2] established the tradeoff between the exponents of the type-I and type-II error probabilities. Strassen [3] derived a refinement of the Chernoff-Stein lemma. This area of study is now commonly known as second-order asymptotics and it quantifies the backoff from $D(P_2\|P_1)$ one incurs at finite sample sizes and non-vanishing type-II error probabilities $\varepsilon\in(0,1)$. In all these analyses, the likelihood ratio test [4] is optimal.
However, in real-world machine learning applications, the generating distributions are not known. For the binary classification framework, one is given two training sequences, one generated from $P_1$ and the other from $P_2$. Using these training sequences, one attempts to classify a test sequence according to whether one believes that it is generated from $P_1$ or $P_2$.
I-A Main Contributions
Instead of algorithms, in this paper, we are concerned with the information-theoretic limits of the binary classification problem. This was first considered by Gutman who proposed a type-based (empirical distribution-based) test [5, Eq. (6)] and proved that this test is asymptotically optimal in the sense that any other test that achieves the same exponential decay for the type-I error probability for all pairs of distributions necessarily has a larger type-II error probability for any fixed pair of distributions. Inspired by Gutman’s [5] and Strassen’s [3] seminal works, and by practical applications where the number of training and test samples is limited (due to the prohibitive cost in obtaining labeled data), we derive refinements to the tradeoff between the type-I and type-II error probabilities for such tests. In particular, we derive the exact second-order asymptotics [3, 6, 7] for binary classification. Our main result asserts that Gutman’s test is second-order optimal. The proofs follow by judiciously modifying and refining Gutman’s arguments in [5] in both the achievability and converse proofs. In the achievability part, we apply a Taylor expansion to a generalized form of the Jensen-Shannon divergence [8] and apply the Berry-Esseen theorem to analyze Gutman’s test. The converse part follows by showing that Gutman’s type-based test is approximately optimal in a certain sense to be made precise in Lemma 7. This study provides intuition for the non-asymptotic fundamental limits and our results have the potential to allow practitioners to gauge the effectiveness of various classification algorithms.
Second, we discuss three consequences of our main result. The first asserts that the largest exponential decay rate of the maximal type-I error probability is a generalized version of the Jensen-Shannon divergence, defined in (3) to follow. This result can be seen as a counterpart of the Chernoff-Stein lemma [1], which is applicable to binary hypothesis testing. Next, we show that our main result can be applied to obtain a second-order asymptotic expansion for the fundamental limits of the two-sample homogeneity testing problem [9, Sec. II-C] and the closeness testing problem [10, 11, 12]. Finally, we consider the dual setting of the main result in which the type-I error probabilities are non-vanishing while the type-II error probabilities decay exponentially fast. In this case, the largest exponential decay rate of the type-II error probabilities for Gutman’s rule is given by a Rényi divergence [13] of a certain order related to the ratio of the lengths of the training and test sequences.
Finally, we generalize our second-order asymptotic result for binary classification to classification of multiple hypotheses with the rejection option. We first consider tests satisfying the following conditions: (i) the error probability under each hypothesis decays exponentially fast with the same exponent for all tuples of distributions and (ii) the rejection probability under each hypothesis is upper bounded by a different constant for a particular tuple. We derive second-order approximations of the largest error exponent for all hypotheses and show that a generalization of Gutman’s test by Unnikrishnan in [14, Theorem 4.1] is second-order optimal. The proofs follow by generalizing those for binary classification and carefully analyzing the rejection probabilities. In addition, similarly to the binary case, we also consider a dual setting, in which under each hypothesis, the error probability is non-vanishing for all tuples of distributions and the rejection probability decays exponentially fast for a particular tuple.
I-B Related Works
The most closely related work is [5], where Gutman showed that his type-based test is asymptotically optimal for the binary classification problem and its extension to classification of multiple hypotheses with rejection for Markov sources. Ziv [15] illustrated the relationship between binary classification and universal data compression. The Bayesian setting of the binary classification problem was studied by Merhav and Ziv [16]. Subsequently, Kelly, Wagner, Tularak and Viswanath [17] considered the binary classification problem with large alphabets. Unnikrishnan [14] generalized the result of Gutman by considering classification for multiple hypotheses where there are multiple test sequences. Finally, Unnikrishnan and Huang [9] approximated the type-I error probability of the binary classification problem using weak convergence analysis.
I-C Organization of the Rest of the Paper
The rest of our paper is organized as follows. In Section II, we set up the notation, formulate the binary classification problem and present existing results by Gutman [5]. In Section III, we discuss the motivation for our setting and present our second-order result for binary classification. We also discuss some consequences of our main result. In Section IV, we generalize our result for binary classification to classification of multiple hypotheses with the rejection option. The proofs of our results are provided in Section V. The proofs of some supporting lemmas are deferred to the appendices.
II Problem Formulation and Existing Results
II-A Notation
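Random variables and their realizations are in upper (e.g., $X$) and lower case (e.g., $x$) respectively. All sets are denoted in calligraphic font (e.g., $\mathcal{X}$). We use $\mathcal{X}^{\mathrm{c}}$ to denote the complement of $\mathcal{X}$. Let $X^n := (X_1,\ldots,X_n)$ be a random vector of length $n$. All logarithms are base $e$. We use $\Phi(\cdot)$ to denote the cumulative distribution function (cdf) of the standard Gaussian and $\Phi^{-1}(\cdot)$ its inverse. Let $\mathrm{Q}(\cdot) := 1-\Phi(\cdot)$ be the corresponding complementary cdf. We use $\mathrm{G}_k(\cdot)$ to denote the complementary cdf of a chi-squared random variable with $k$ degrees of freedom and $\mathrm{G}_k^{-1}(\cdot)$ its inverse. Given any two integers $(a,b)\in\mathbb{N}^2$, we use $[a:b]$ to denote the set of integers $\{a,a+1,\ldots,b\}$ and use $[a]$ to denote $[1:a]$. The set of all probability distributions on a finite set $\mathcal{X}$ is denoted as $\mathcal{P}(\mathcal{X})$. Notation concerning the method of types follows [18]. Given a vector $x^n=(x_1,\ldots,x_n)\in\mathcal{X}^n$, the type or empirical distribution is denoted as $\hat{T}_{x^n}(a) = \frac{1}{n}\sum_{i=1}^{n}\mathbb{1}\{x_i=a\}$, $a\in\mathcal{X}$. The set of types formed from length-$n$ sequences with alphabet $\mathcal{X}$ is denoted as $\mathcal{P}_n(\mathcal{X})$. Given $P\in\mathcal{P}_n(\mathcal{X})$, the set of all sequences of length $n$ with type $P$, the type class, is denoted as $\mathcal{T}_P^n$. The support of the probability mass function $P\in\mathcal{P}(\mathcal{X})$ is denoted as $\mathrm{supp}(P)$.
II-B Problem Formulation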
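The main goal in binary hypothesis testing is to classify a sequence $X^n$ as being independently generated from one of two distinct distributions $(P_1,P_2)\in\mathcal{P}(\mathcal{X})^2$. However, different from classical binary hypothesis testing [19, 2] where the two distributions are known, in binary classification [5], we do not know the two distributions. We instead have two training sequences $Y_1^N$ and $Y_2^N$ generated in an i.i.d. fashion according to $P_1$ and $P_2$ respectively. Therefore, the two hypotheses are

$\mathrm{H}_1$: the test sequence $X^n$ and the first training sequence $Y_1^N$ are generated according to the same distribution;

$\mathrm{H}_2$: the test sequence $X^n$ and the second training sequence $Y_2^N$ are generated according to the same distribution.
We assume that $N=\lceil\alpha n\rceil$ for some $\alpha\in\mathbb{R}_+$. (Footnote 1: In the following, we will often write $N=\alpha n$ for brevity, ignoring the integer constraints on $N$ and $n$.) The task in the binary classification problem is to design a decision rule (test) $\phi_n:\mathcal{X}^{2N}\times\mathcal{X}^n\to\{\mathrm{H}_1,\mathrm{H}_2\}$. Note that a decision rule partitions the sample space $\mathcal{X}^{2N}\times\mathcal{X}^n$ into two disjoint regions: $\mathcal{A}(\phi_n)$, where any triple $(Y_1^N,Y_2^N,X^n)\in\mathcal{A}(\phi_n)$ favors hypothesis $\mathrm{H}_1$, and $\mathcal{A}^{\mathrm{c}}(\phi_n)$, where any triple $(Y_1^N,Y_2^N,X^n)\in\mathcal{A}^{\mathrm{c}}(\phi_n)$ favors hypothesis $\mathrm{H}_2$.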
II-C Existing Results and Definitions
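The goal of binary classification is to design a classification rule based on the training sequences. This rule is then used on the test sequence to decide whether $\mathrm{H}_1$ or $\mathrm{H}_2$ is true. We revisit the study of the fundamental limits of the problem here. Towards this goal, Gutman [5] proposed a decision rule using marginal types of $Y_1^N$, $Y_2^N$ and $X^n$. To present Gutman’s test, we need the following generalization of the Jensen-Shannon divergence [8]. Given any two distributions $(P_1,P_2)\in\mathcal{P}(\mathcal{X})^2$ and any number $\alpha\in\mathbb{R}_+$, let the generalized Jensen-Shannon divergence be
$$\mathrm{GJS}(P_1,P_2,\alpha) := \alpha D\Big(P_1\,\Big\|\,\frac{\alpha P_1+P_2}{1+\alpha}\Big) + D\Big(P_2\,\Big\|\,\frac{\alpha P_1+P_2}{1+\alpha}\Big). \tag{3}$$
Given a threshold $\lambda\in\mathbb{R}_+$ and any triple $(Y_1^N,Y_2^N,X^n)$, Gutman’s decision rule is as follows:
$$\phi_n^{\mathrm{Gut}}(Y_1^N,Y_2^N,X^n) := \begin{cases}\mathrm{H}_1 & \text{if } \mathrm{GJS}(\hat{T}_{Y_1^N},\hat{T}_{X^n},\alpha)\le\lambda,\\ \mathrm{H}_2 & \text{if } \mathrm{GJS}(\hat{T}_{Y_1^N},\hat{T}_{X^n},\alpha)>\lambda.\end{cases} \tag{4}$$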
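The decision rule can be sketched numerically as follows. This is a minimal illustration, not the authors' implementation: it assumes the generalized Jensen-Shannon divergence has the form $\alpha D(P_1\|M)+D(P_2\|M)$ with $M=(\alpha P_1+P_2)/(1+\alpha)$, that the rule compares the divergence between the type of the first training sequence and the type of the test sequence against the threshold, and that symbols are integers in `0..alphabet_size-1`. The function names (`kl`, `gjs`, `empirical_type`, `gutman_test`) are hypothetical.

```python
import numpy as np

def kl(p, q):
    """Relative entropy D(p||q) in nats; terms with p(x) = 0 contribute 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def gjs(p1, p2, alpha):
    """Assumed form of the generalized Jensen-Shannon divergence (alpha > 0)."""
    p1, p2 = np.asarray(p1, dtype=float), np.asarray(p2, dtype=float)
    mix = (alpha * p1 + p2) / (1.0 + alpha)
    return alpha * kl(p1, mix) + kl(p2, mix)

def empirical_type(seq, alphabet_size):
    """Empirical distribution (type) of a sequence of integer symbols."""
    return np.bincount(seq, minlength=alphabet_size) / len(seq)

def gutman_test(y1, x, alphabet_size, lam):
    """Return 1 (declare H_1) if GJS(type(y1), type(x), N/n) <= lam, else 2."""
    alpha = len(y1) / len(x)
    stat = gjs(empirical_type(y1, alphabet_size),
               empirical_type(x, alphabet_size), alpha)
    return 1 if stat <= lam else 2
```

Note that, in this sketch, the second training sequence does not enter the statistic at all; only its existence distinguishes the setting from two-sample testing.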
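To state Gutman’s main result, we define the following “exponent” function
$$F(P_1,P_2,\alpha,\lambda) := \min_{(Q_1,Q_2)\in\mathcal{P}(\mathcal{X})^2:\,\mathrm{GJS}(Q_1,Q_2,\alpha)\le\lambda} \alpha D(Q_1\|P_1)+D(Q_2\|P_2). \tag{5}$$
Note that $F(P_1,P_2,\alpha,\lambda)=0$ for $\lambda\ge\mathrm{GJS}(P_1,P_2,\alpha)$ and that $\lambda\mapsto F(P_1,P_2,\alpha,\lambda)$ is continuous (a consequence of [20, Lemma 12] in which $y\mapsto\min_{x\in\mathcal{K}}f(x,y)$ is continuous if $f$ is continuous and $\mathcal{K}$ is compact).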
Gutman [5, Lemma 2 and Theorem 1] showed that the rule in (4) is asymptotically optimal (error exponent-wise) if the type-I error probability vanishes exponentially fast over all pairs of distributions.
Theorem 1.
Gutman’s decision rule satisfies the following two properties:

Asymptotic/Exponential performance: For any pair of distributions ,
(6) (7)
We remark that using Sanov’s theorem [21, Chapter 11], one can easily show that, for any pairs of distributions and any , Gutman’s decision rule in (4) satisfies (6) as well as
(10) 
Note that Theorem 1 is analogous to Blahut’s work [2] in which the tradeoff of the error exponents for the binary hypothesis testing problem was thoroughly analyzed.
III Binary Classification
III-A Definitions and Motivation
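In this paper, motivated by practical applications where the lengths of source sequences are finite (obtaining labeled training samples is prohibitively expensive), we are interested in approximating the non-asymptotic fundamental limits in terms of the tradeoff between type-I and type-II error probabilities of optimal tests. In particular, out of all tests whose type-I error probabilities decay exponentially fast for all pairs of distributions and whose type-II error probability is upper bounded by a constant $\varepsilon\in(0,1)$ for a particular pair of distributions, what is the largest decay rate of the sequence of the type-I error probabilities? In other words, writing $\beta_1(\phi_n|\cdot)$ and $\beta_2(\phi_n|\cdot)$ for the type-I and type-II error probabilities of a test $\phi_n$, we are interested in the following fundamental limit
$$\lambda^*(n,\alpha,\varepsilon|P_1,P_2) := \sup\big\{\lambda\in\mathbb{R}_+ : \exists\,\phi_n \text{ s.t. } \forall\,(\tilde{P}_1,\tilde{P}_2)\in\mathcal{P}(\mathcal{X})^2,\ \beta_1(\phi_n|\tilde{P}_1,\tilde{P}_2)\le\exp(-n\lambda),\ \beta_2(\phi_n|P_1,P_2)\le\varepsilon\big\}. \tag{11}$$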
From Theorem 1 (see also [5, Theorem 3]), we obtain that
$$\liminf_{n\to\infty}\lambda^*(n,\alpha,\varepsilon|P_1,P_2)\ge\mathrm{GJS}(P_1,P_2,\alpha). \tag{12}$$
As a corollary of our result in Theorem 2, we find that the result in (12) is in fact tight and the limit exists. In this paper, we refine the above asymptotic statement and, in particular, provide second-order approximations to $\lambda^*(n,\alpha,\varepsilon|P_1,P_2)$.
To conclude this section, we explain why we consider $\lambda^*(n,\alpha,\varepsilon|P_1,P_2)$ instead of characterizing a seemingly more natural quantity, namely, the largest decay rate of the type-I error probability when the type-II error probability is upper bounded by a constant $\varepsilon\in(0,1)$ for a particular pair of distributions $(P_1,P_2)$, i.e.,
(13) 
In the binary classification problem, when we design a test $\phi_n$, we do not know the pair of distributions $(P_1,P_2)$ from which the training sequences are generated. Thus, unlike the simple hypothesis testing problem [3, 22], we cannot design a test tailored to a particular pair of distributions. Instead, we are interested in designing universal tests which perform well for all pairs of distributions in terms of the type-I (resp. type-II) error probability and, at the same time, constrain the type-II (resp. type-I) error probability with respect to a particular pair of distributions $(P_1,P_2)$.
III-B Main Result
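We need the following definitions before presenting our main result. Given any $\alpha\in\mathbb{R}_+$ and any pair of distributions $(P_1,P_2)\in\mathcal{P}(\mathcal{X})^2$, define the following two information densities: for $i\in\{1,2\}$ and $x\in\mathcal{X}$,
$$\imath_i(x|P_1,P_2,\alpha) := \log\frac{(1+\alpha)P_i(x)}{\alpha P_1(x)+P_2(x)}. \tag{14}$$
Furthermore, given any pair of distributions $(P_1,P_2)\in\mathcal{P}(\mathcal{X})^2$, define the following dispersion function (a linear combination of the variances of the information densities)
$$V(P_1,P_2,\alpha) := \alpha\,\mathrm{Var}_{P_1}\big[\imath_1(X|P_1,P_2,\alpha)\big]+\mathrm{Var}_{P_2}\big[\imath_2(X|P_1,P_2,\alpha)\big]. \tag{15}$$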
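The two information densities and the dispersion can be evaluated directly for finite alphabets. The following is a minimal sketch under stated assumptions: the densities are taken to be $\log\big((1+\alpha)P_i(x)/(\alpha P_1(x)+P_2(x))\big)$, the dispersion is $\alpha$ times the variance of the first density under $P_1$ plus the variance of the second under $P_2$, and both distributions have full support so that no logarithm of zero occurs. The function names are hypothetical.

```python
import numpy as np

def info_densities(p1, p2, alpha):
    """Assumed information densities, evaluated at every symbol.

    Requires p1 and p2 to have full support on the alphabet.
    """
    p1, p2 = np.asarray(p1, dtype=float), np.asarray(p2, dtype=float)
    mix = (alpha * p1 + p2) / (1.0 + alpha)
    return np.log(p1 / mix), np.log(p2 / mix)

def variance(p, f):
    """Variance of f(X) when X is distributed according to p."""
    p, f = np.asarray(p, dtype=float), np.asarray(f, dtype=float)
    mean = np.sum(p * f)
    return float(np.sum(p * (f - mean) ** 2))

def dispersion(p1, p2, alpha):
    """alpha * Var_{P1}[i_1(X)] + Var_{P2}[i_2(X)] (assumed form)."""
    i1, i2 = info_densities(p1, p2, alpha)
    return alpha * variance(p1, i1) + variance(p2, i2)
```

Under the same assumptions, the weighted means $\alpha\,\mathbb{E}_{P_1}[\imath_1]+\mathbb{E}_{P_2}[\imath_2]$ recover the generalized Jensen-Shannon divergence, which is why these densities are the natural objects for a central-limit analysis.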
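Theorem 2.
For any $\alpha\in\mathbb{R}_+$, any $\varepsilon\in(0,1)$ and any pair of distributions $(P_1,P_2)\in\mathcal{P}(\mathcal{X})^2$, we have
$$\lambda^*(n,\alpha,\varepsilon|P_1,P_2) = \mathrm{GJS}(P_1,P_2,\alpha)+\sqrt{\frac{V(P_1,P_2,\alpha)}{n}}\,\Phi^{-1}(\varepsilon)+O\Big(\frac{\log n}{n}\Big). \tag{16}$$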
Theorem 2 is proved in Section V-A. In (16), $\mathrm{GJS}(P_1,P_2,\alpha)$ and $\sqrt{V(P_1,P_2,\alpha)/n}\,\Phi^{-1}(\varepsilon)$ are respectively known as the first- and second-order terms in the asymptotic expansion of $\lambda^*(n,\alpha,\varepsilon|P_1,P_2)$. Since $\varepsilon<1/2$ in most applications, $\Phi^{-1}(\varepsilon)<0$ and so the second-order term represents a backoff from the exponent $\mathrm{GJS}(P_1,P_2,\alpha)$ at finite sample sizes $n$. As shown by Polyanskiy, Poor and Verdú [6] (also see [23]), in the channel coding context, these two terms usually constitute a reasonable approximation to the non-asymptotic fundamental limit at moderate $n$. This will also be corroborated numerically for the current problem in Section III-C. Several other remarks are in order.
First, we remark that since the achievability part is based on Gutman’s test, this test in (4) is second-order optimal. This means that it achieves the optimal second-order term in the asymptotic expansion of $\lambda^*(n,\alpha,\varepsilon|P_1,P_2)$.
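Second, as a corollary of our result, we obtain that for any $\varepsilon\in(0,1)$,
$$\lim_{n\to\infty}\lambda^*(n,\alpha,\varepsilon|P_1,P_2)=\mathrm{GJS}(P_1,P_2,\alpha). \tag{17}$$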
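In other words, a strong converse for $\lambda^*(n,\alpha,\varepsilon|P_1,P_2)$ holds. This result can be understood as the counterpart of the Chernoff-Stein lemma [1] for the binary classification problem (with strong converse). In the following, we comment on the influence of the ratio $\alpha$ of the number of training and test samples in terms of the dominant term in $\lambda^*(n,\alpha,\varepsilon|P_1,P_2)$. Note that the generalized Jensen-Shannon divergence admits the following properties:

$\mathrm{GJS}(P_1,P_2,\alpha)$ is increasing in $\alpha$;

$\mathrm{GJS}(P_1,P_2,0)=0$ and $\lim_{\alpha\to\infty}\mathrm{GJS}(P_1,P_2,\alpha)=D(P_2\|P_1)$.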
Thus, we conclude that the longer the training sequences (relative to the test sequence), the better the performance in terms of the exponential decay rate of the type-I error probabilities for all pairs of distributions. In the extreme case in which $\alpha\to 0$, i.e., the training sequence is arbitrarily short compared to the test sequence, we conclude that the type-I error probability cannot decay exponentially fast. However, in the other extreme in which $\alpha\to\infty$, we conclude that the type-I error probabilities for all pairs of distributions decay exponentially fast with the dominant (first-order) term being $D(P_2\|P_1)$. This implies that we can achieve the optimal decay rate determined by the Chernoff-Stein lemma [1] for binary hypothesis testing. Intuitively, this occurs since when $\alpha\to\infty$, we can estimate the true pair of distributions with arbitrarily high accuracy (using the large number of training samples). In fact, we can say even more. Based on the formula in (15), we deduce that $\lim_{\alpha\to\infty}V(P_1,P_2,\alpha)=\mathrm{Var}_{P_2}\big[\log\frac{P_2(X)}{P_1(X)}\big]$, the relative entropy variance, so we recover Strassen’s seminal result [3, Theorem 1.1] concerning the second-order asymptotics of binary hypothesis testing.
Finally, we remark that the binary classification problem is closely related to the so-called two-sample homogeneity testing problem [9, Sec. II-C] and the closeness testing problem [10, 11, 12] where, given two i.i.d. generated sequences $Y^N$ and $X^n$, one aims to determine whether the two sequences are generated according to the same distribution or not. Thus, in this problem, we have the following two hypotheses:

$\mathrm{H}_1$: the two sequences $Y^N$ and $X^n$ are generated according to the same distribution;

$\mathrm{H}_2$: the two sequences $Y^N$ and $X^n$ are generated according to different distributions.
The task in such a problem is to design a test . Given any and any , the false-alarm and miss detection probabilities for such a problem are
(18)  
(19) 
where in , the random variables and are both distributed i.i.d. according to and in , and are distributed i.i.d. according to and respectively. Paralleling our setting for the binary classification problem, we can study the following fundamental limit of the two sample hypothesis testing problem:
(20) 
Corollary 3.
For any , any and any , we have
(21) 
Since the proof is similar to that of Theorem 2, we omit it. Corollary 3 implies that Gutman’s test is second-order optimal for the two-sample homogeneity testing problem. We remark that for the binary classification problem without rejection (i.e., we are not allowed to declare that neither $\mathrm{H}_1$ nor $\mathrm{H}_2$ is true), the problem is essentially the same as the two-sample hypothesis testing problem except that we have one more training sequence. However, as shown in Theorem 2, the second training sequence is not useful for obtaining a second-order optimal result. This asymmetry in the binary classification problem is circumvented if one also considers a rejection option, as will be demonstrated in Section IV.
Figure 1: (a) Type-II error probability. (b) Logarithm of the maximal type-I error probability.
III-C Numerical Simulation for Theorem 2
In this subsection, we present a numerical example to illustrate the performance of Gutman’s test in (4) and the accuracy of our theoretical results. We consider binary sources with alphabet . Throughout this subsection, we set .
In Figure 1(a), we plot the type-II error probability for a particular pair of distributions where and . The threshold is chosen to be the second-order asymptotic expansion
$$\lambda = \mathrm{GJS}(P_1,P_2,\alpha)+\sqrt{\frac{V(P_1,P_2,\alpha)}{n}}\,\Phi^{-1}(\varepsilon) \tag{22}$$
with target error probability being set to . Each point in Figure 1(a) is obtained by estimating the average error probability in the following manner. For each length of the test sequence , we estimate the type-II error probability of a single Gutman’s test in (4) using independent experiments. From Figure 1(a), we observe that the simulated error probability for Gutman’s test is close to the target error probability of as the length of the test sequence increases. We believe that there is a slight bias in the results as we have not taken the third-order term, which scales as $O(\log n/n)$, into account in the threshold in (22).
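The experiment described above can be reproduced in outline with a short Monte Carlo routine. This is a sketch under assumptions rather than the authors' simulation code: the threshold is taken to be the first- plus second-order terms, the test statistic is the generalized Jensen-Shannon divergence between the types of the first training sequence and the test sequence, and the distributions have full support. All function names and any parameter values passed to them are hypothetical; the values used in the paper's figure are not reproduced here.

```python
import numpy as np
from statistics import NormalDist

def kl(p, q):
    """Relative entropy D(p||q) in nats; terms with p(x) = 0 contribute 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def gjs(p1, p2, alpha):
    """Assumed form of the generalized Jensen-Shannon divergence."""
    p1, p2 = np.asarray(p1, dtype=float), np.asarray(p2, dtype=float)
    mix = (alpha * p1 + p2) / (1.0 + alpha)
    return alpha * kl(p1, mix) + kl(p2, mix)

def dispersion(p1, p2, alpha):
    """Assumed dispersion; requires full-support p1 and p2."""
    p1, p2 = np.asarray(p1, dtype=float), np.asarray(p2, dtype=float)
    mix = (alpha * p1 + p2) / (1.0 + alpha)
    i1, i2 = np.log(p1 / mix), np.log(p2 / mix)
    var1 = np.sum(p1 * i1 ** 2) - np.sum(p1 * i1) ** 2
    var2 = np.sum(p2 * i2 ** 2) - np.sum(p2 * i2) ** 2
    return float(alpha * var1 + var2)

def second_order_threshold(p1, p2, alpha, n, eps):
    """First-order term plus second-order term (no third-order correction)."""
    return gjs(p1, p2, alpha) + np.sqrt(dispersion(p1, p2, alpha) / n) \
        * NormalDist().inv_cdf(eps)

def simulate_type2_error(p1, p2, alpha, n, eps, trials, seed=0):
    """Fraction of trials in which the test declares H_1 although X^n ~ p2."""
    rng = np.random.default_rng(seed)
    big_n = int(np.ceil(alpha * n))
    lam = second_order_threshold(p1, p2, alpha, n, eps)
    errors = 0
    for _ in range(trials):
        # Drawing multinomial counts is equivalent to drawing the types directly.
        type_y1 = rng.multinomial(big_n, p1) / big_n
        type_x = rng.multinomial(n, p2) / n
        if gjs(type_y1, type_x, big_n / n) <= lam:
            errors += 1
    return errors / trials
```

Sampling the types through multinomial counts avoids generating the sequences themselves, which keeps the cost per trial independent of the sequence length.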
In Figure 1(b), we plot the natural logarithm of the theoretical upper bound and the maximal empirical type-I error probability over all pairs of distributions . We set the fixed pair of distributions to be and and choose . We ensured that the threshold in (22) is small enough so that even if is large, the type-I error event occurs sufficiently many times and thus the numerical results are statistically significant. From Figure 1(b), we observe that the simulated probability lies below the theoretical one as expected. The gap can be explained by the fact that the method of types analysis is typically loose non-asymptotically due to a large polynomial factor. A more refined analysis based on strong large deviations [24, Theorem 3.7.2] would yield better estimates on exponentially decaying probabilities but we do not pursue this here. However, we do note that as becomes large, the slopes of the simulated and theoretical curves become increasingly close to each other (simulated slope at is ; theoretical slope at is ), showing that on the exponential scale, our estimate of the maximal type-I error probability is relatively tight.
III-D Analysis of Gutman’s Test in a Dual Setting
In addition to analyzing , one might also be interested in decision rules whose type-I error probabilities for all pairs of distributions are non-vanishing and whose type-II error probabilities for a particular pair of distributions decay exponentially fast. To be specific, for any decision rule , we consider the following non-asymptotic fundamental limit:
(23) 
This can be considered as a dual to the problem studied in Sections III-A to III-C. We characterize the asymptotic behavior of when .
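To do so, we recall that the Rényi divergence of order $\gamma\in\mathbb{R}_+\setminus\{1\}$ [13] is defined as
$$D_\gamma(P_1\|P_2) := \frac{1}{\gamma-1}\log\sum_{x\in\mathcal{X}}P_1^{\gamma}(x)P_2^{1-\gamma}(x). \tag{24}$$
Note that $\lim_{\gamma\to 1}D_\gamma(P_1\|P_2)=D(P_1\|P_2)$, the usual relative entropy.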
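For finite alphabets, the Rényi divergence is straightforward to evaluate. The sketch below uses the standard definition $\frac{1}{\gamma-1}\log\sum_x P^\gamma(x)Q^{1-\gamma}(x)$ for orders in $(0,1)$ and falls back to the relative entropy at order one. The helper `order_from_ratio`, which maps the training-to-test length ratio $\alpha$ to the order $\alpha/(1+\alpha)$, is an assumption made for illustration; it is suggested by minimizing $\alpha D(Q\|P_1)+D(Q\|P_2)$ over $Q$, and is not a formula quoted from this section.

```python
import numpy as np

def renyi_divergence(p, q, order):
    """Renyi divergence D_order(p||q) in nats, for order in (0, 1]."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # symbols with p(x) = 0 contribute nothing for these orders
    if np.isclose(order, 1.0):
        # Limiting case: the usual relative entropy D(p||q).
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))
    total = np.sum(p[mask] ** order * q[mask] ** (1.0 - order))
    return float(np.log(total) / (order - 1.0))

def order_from_ratio(alpha):
    """Hypothetical helper: order alpha / (1 + alpha) for a length ratio alpha."""
    return alpha / (1.0 + alpha)
```

With this mapping, small ratios give orders near zero and large ratios give orders near one, which matches the qualitative behaviour of the exponent discussed in this subsection.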
Proposition 4.
For any , any and any pair of distributions ,
(25) 
First, the performance of Gutman’s test in (4) under this dual setting is dictated by , which is different from in Theorem 2. Intuitively, this is because, for the type-I error probabilities to be upper bounded by a non-vanishing constant for all pairs of distributions, one needs to choose (implied by the weak convergence analysis in [9]). Consequently, the type-II exponent then satisfies
(26) 
Second, as $\alpha\to 0$, the exponent tends to zero and thus the type-II error probability does not decay exponentially fast. However, when $\alpha\to\infty$, the exponent tends to $D(P_1\|P_2)$ and thus we can achieve the optimal exponential decay rate of the type-II error probability as if $P_1$ and $P_2$ were known (implied by the Chernoff-Stein lemma [1]).
IV Classification of Multiple Hypotheses with the Rejection Option
In this section, we generalize our second-order asymptotic result for binary classification in Theorem 2 to classification of multiple hypotheses with rejection [5, Theorem 2].
IV-A Problem Formulation
Given $M$ training sequences $\{Y_i^N\}_{i\in[M]}$ generated i.i.d. according to distinct distributions $\{P_i\}_{i\in[M]}$, in classification of multiple hypotheses with rejection, one is asked to determine whether a test sequence $X^n$ is generated i.i.d. according to a distribution in $\{P_i\}_{i\in[M]}$ or some other distribution. In other words, there are $M+1$ hypotheses:

$\mathrm{H}_j$ for each $j\in[M]$: the test sequence $X^n$ and the $j$-th training sequence $Y_j^N$ are generated according to the same distribution;

$\mathrm{H}_{\mathrm{r}}$: the test sequence $X^n$ is generated according to a distribution different from those from which the training sequences are generated.
In the following, for simplicity, we use to denote , to denote and to denote . Recall that for brevity. The main task in classification of multiple hypotheses with rejection is thus to design a test . Note that any such test partitions the sample space into disjoint regions: acceptance regions where favors hypothesis and a rejection region where favors hypothesis .
Given any test and any tuple of distributions , we have the following error probabilities and rejection probabilities: for each ,
(27)  
(28) 
where similarly to (1) and (2), for , we define where is distributed i.i.d. according to for all . We term the probabilities in (27) and (28) as type error and rejection probabilities respectively for each .
Similarly to Section III, we are interested in the following question. For all tests satisfying (i) for each $j\in[M]$, the type-$j$ error probability decays exponentially fast with the exponent being at least $\lambda$ for all tuples of distributions and (ii) for each $j\in[M]$, the type-$j$ rejection probability is upper bounded by a constant $\varepsilon_j\in(0,1)$ for a particular tuple of distributions, what is the largest achievable exponent $\lambda$? In other words, given $(\varepsilon_1,\ldots,\varepsilon_M)\in(0,1)^M$, we are interested in the following fundamental limit:
(29) 
IV-B Main Result
For brevity, let . Given any , for each , let
(30) 
Consider any such that the minimizer for in (30) is unique for each and denote the unique minimizer for as . For simplicity, we use to denote when the dependence on is clear.
From Gutman’s result in [5, Theorems 2 and 3], we conclude that
(31) 
In this section, we refine the above asymptotic statement, and in particular, derive the second-order approximations to the fundamental limit .
Given any tuple of distributions and any vector , let
(32)  
(33) 
Theorem 5.
For any , any and any tuple of distributions satisfying that the minimizer for is unique for each , we have
(34) 
where (34) holds for any .
First, in the achievability proof, we make use of a test proposed by Unnikrishnan [14, Theorem 4.1] and show that it is second-order optimal for classification of multiple hypotheses with rejection.
Second, we remark that it is not straightforward to obtain the results in Theorem 5 by using the same set of techniques used to prove Theorem 2. The converse proof of Theorem 5 is a generalization of that for Theorem 2. However, the achievability proof is more involved. As can be gleaned from our proof in Section V-C, the test by Unnikrishnan (see (107)) outputs rejection if the second smallest value of is smaller than a threshold . The main difficulty lies in identifying the index of the second smallest value in . Note that for each realization of , such an index can potentially be different. However, we show that for any tuple of distributions satisfying the condition in Theorem 5, if the training sequences are generated in an i.i.d. fashion according to , with probability tending to one, the index of the second smallest value in under hypothesis is given by . Equipped with this important observation, we establish our achievability proof by proceeding similarly to that of Theorem 2.
Finally, we remark that one might also consider tests which provide inhomogeneous performance guarantees under different hypotheses in terms of the error probabilities for all tuples of distributions and, at the same time, constrain the sum of all rejection probabilities to be upper bounded by some . In this direction, the fundamental limit of interest is
(35) 
Characterizing the second-order asymptotics of the set for is challenging. However, when , using proof techniques similar to those for Theorem 5, we can characterize the following second-order region [18, Chapter 6]
(36) 
Indeed, one can consider the following generalization of Gutman’s test [5, Theorem 2]