Information theory was largely developed in the context of communication systems, where information theoretic tools play an important role in characterizing the performance limits of such systems. However, another important area where the information theoretic approach has proved useful is statistical inference, e.g., hypothesis testing. For parametric hypothesis testing problems, information theoretic tools such as joint typicality, the asymptotic equipartition property, and Sanov’s theorem have been developed to characterize the error exponent [1, 2, 3]. Information theory has also been applied to investigate a class of parametric hypothesis testing problems [4, 5], where correlated data samples are observed over multiple terminals and data compression needs to be carried out in a decentralized manner. Additionally, information theory has also been applied to solve nonparametric hypothesis testing problems under the Neyman-Pearson framework [6, 7].
In this paper, we apply information theoretic tools to study the nonparametric hypothesis testing problem, but with a focus on the average error probability instead of the Neyman-Pearson formulation. We address a more general scenario, where each hypothesis corresponds to a cluster of distributions. Such a nonparametric problem has not been thoroughly explored in the literature. We develop two nonparametric tests based respectively on the maximum mean discrepancy (MMD) and the Kolmogorov-Smirnov (KS) distance, and characterize the exponential error decay rate for these tests. Furthermore, in contrast to previous works where the number of hypotheses is assumed to be fixed, we study the regime where the number of hypotheses scales along with the sample size. This is analogous to the information theoretic channel coding problem where the number of messages scales along with the codeword length. Hence, in our study, information theory not only provides a technical tool to analyze the performance, but also provides an asymptotic perspective for understanding nonparametric hypothesis testing problems in the regime where the number of hypotheses is large, i.e., in the large-hypothesis large-sample regime.
More specifically, this paper assumes that there are $M$ hypotheses, each corresponding to a (cluster of) distributions, which are unknown. Sequences of $n$ training data samples generated by each distribution are available. The more general case with the training data sequences having different lengths is discussed in Section III-D. Suppose a length-$n$ test data stream is observed, whose samples are generated by one of the distributions. The goal is to determine the cluster that contains the distribution that generated the observed test sequence. We are interested in the large-hypothesis regime, in which $M = 2^{nR}$, i.e., the number of hypotheses scales exponentially in the number of samples with a constant rate $R$. The analogy to the channel coding problem is now apparent: whereas the exponent $R$ represents the transmission rate, i.e., the transmitted bits per channel use, in the channel coding problem, here $R$ represents the number of hypothesis bits that can be distinguished per observation sample. Correspondingly, we refer to $R$ as the discrimination rate, and the largest such value is referred to as the discrimination capacity. The notion of discrimination capacity provides the fundamental performance limit for the hypothesis testing problem in the large-hypothesis and large-sample regime.
I-A Main Contributions
This paper makes the following major contributions.
We provide an asymptotic viewpoint to understand the nonparametric hypothesis testing problem in the regime where the number of hypotheses scales exponentially in the sample size. Based on its connection to the channel coding problem, we introduce the notions of the discrimination rate and the discrimination capacity as the performance metrics in such an asymptotic regime.
We develop two nonparametric approaches to solve the hypothesis testing problem that are based respectively on the maximum mean discrepancy (MMD) and the Kolmogorov-Smirnov (KS) distance. For both tests, we derive the corresponding error exponents and the discrimination rates. Our results show that as long as the number of hypotheses does not scale too fast, i.e., the scaling (discrimination) exponent is less than a certain threshold, the derived tests are exponentially consistent, i.e., the error probability converges to zero exponentially fast.
We also derive an upper bound on the discrimination capacity, which serves as an upper limit beyond which exponential consistency cannot be achieved by any nonparametric composite hypothesis testing rule.
I-B Related Work
$M$-ary hypothesis testing: For parametric hypothesis testing problems, information theoretic tools have been developed to characterize the error exponent [1, 2, 3, 8], and to study a class of distributed parametric hypothesis testing problems [4, 5, 9]. For sequential multi-hypothesis testing, information theoretic bounds on the sample size subject to constraints on the error probabilities have also been developed. A generalization of the classical hypothesis testing problem has been studied in which a Bayesian decision maker is designed to enhance its information about the correct hypothesis. Information theory has also been applied to study nonparametric hypothesis testing problems with the primary focus being on the Neyman-Pearson formulation [6, 7]. An information-theoretic approach to a nonparametric hypothesis test with a Bayesian formulation has also been presented: by factorizing dependent variables into mutually independent subsets, it has been shown that the likelihood ratio can be written as the sum of two sets of Kullback-Leibler divergence (KLD) terms, which is then used to quantify the loss in hypothesis separability. Our study differs from previous studies on nonparametric hypothesis testing problems in that we focus on the asymptotic regime where the number of hypotheses scales with the sample size.
The problem we study here can also be viewed as a supervised learning problem studied in the machine learning literature. However, the problem formulated here differs from the traditional supervised learning problem, where sample points corresponding to the same label are simply treated as individual samples, and their underlying statistical structure is not exploited in the design of classification rules. For example, the support vector machine (SVM) is one of the most important classification algorithms for supervised learning, where the distance between samples is measured either by the Euclidean distance or by a kernel-based distance. Such distances do not exploit the underlying statistical distributions of the data samples. A robust form of the SVM incorporates the probabilistic uncertainty into the maximization of the margin. Our formulation exploits the underlying probabilistic structure of the data samples, and is hence also robust to missing data, system noise, etc.
A formulation of the supervised learning problem that is similar to ours has been studied previously. The proposed approach, named the support measure machine (SMM), exploits the kernel mean embedding to estimate the distance between probability distributions. In fact, the comparison between an SMM and an SVM also reflects the differences between our formulation and the traditional supervised learning problem. However, that study focused only on the regime with a finite and fixed number of classes, and did not characterize the decay exponent of the error probability, whereas our focus is mainly on the asymptotic regime with an infinite number of classes, and on the scaling behavior of the number of classes under which an asymptotically small error probability can be guaranteed. Nevertheless, the kernel-based approaches developed in that work as well as in various other papers [16, 17, 18] provide important techniques that we exploit in our study.
Information theory in learning:
Quite a few recent studies have applied various notions from information theory to supervised learning problems. A minimax approach for supervised learning, where the goal is to minimize the worst-case expected loss over a certain set of probability distributions, has been developed; the designed classification rules are expected to be robust over datasets generated by any probability distribution in the set. A classification problem where the observation is obtained via a linear mapping of a vector input has also been studied, and the notion of classification capacity was proposed, which is similar to the discrimination capacity we propose. However, those results are derived under the Gaussian model, whereas our formulation does not assume any specific distributions and is hence much more general. Furthermore, a parametric setting is implicitly assumed there, whereas our focus is on the nonparametric problem. A connection between hypothesis testing problems and channel coding has also been established; compared to that line of work, this paper focuses primarily on the asymptotic case where the number of classes can scale. Finally, a supervised learning problem where the joint distribution of the data sample and its label is assumed to be known up to an unknown parameter has been studied: a classifier was proposed, the corresponding performance was analyzed, and the connection of the problem to rate-distortion theory was explored. There are several key differences between that work and our study: there is no notion of discrimination rate therein, the performance is not defined in terms of the asymptotic classification error probability, and our study does not assume any joint distribution of the data sample and its label.
II Problem Formulation
In this section, we first describe our composite nonparametric hypothesis model, and then connect it to the channel coding problem, which motivates several information theory related definitions that we will use to characterize system performance. For ease of readability, we also give preliminaries on the parametric hypothesis testing problem.
II-A Supervised Learning as Nonparametric Hypothesis Testing
Consider the following nonparametric hypothesis testing problem with composite distributions. Suppose there are $M$ hypotheses, where the $l$-th hypothesis corresponds to a set of distributions $\mathcal{P}_l$, for $l = 1, \dots, M$. For a given distance measure $d(p, q)$ between two probability distributions $p$ and $q$, we define
$$d_l = \sup_{p, q \in \mathcal{P}_l} d(p, q), \qquad d_{l, l'} = \inf_{p \in \mathcal{P}_l,\, q \in \mathcal{P}_{l'}} d(p, q).$$
Hence, $d_l$ represents the diameter of the $l$-th distribution set and $d_{l, l'}$ represents the inter-set distance between the $l$-th and the $l'$-th sets.
We assume that
$$\max_{1 \le l \le M} d_l < \min_{l \ne l'} d_{l, l'}. \tag{2}$$
That is, the intra-set distance (diameter) is always smaller than the inter-set distance for the composite hypothesis testing problem. The actual values of $d_l$ and $d_{l,l'}$ depend on the distance metric used and are in general different. Furthermore, we require that the condition in (2) hold in the limit of asymptotically large $M$, i.e., with the limit taken over the sequences of distribution clusters. We study the case where none of the distributions in the sets $\mathcal{P}_l$, for $l = 1, \dots, M$, are known. Instead, for each $l$, we assume that each distribution $p_{l,i} \in \mathcal{P}_l$, where $i$ is the index of the distribution within the set, generates one training sequence $X_{l,i}$ consisting of $n$ independently and identically distributed (i.i.d.) scalar training samples. We use $X_l$ to denote all training sequences generated by the distributions in $\mathcal{P}_l$. We assume that a test sequence $Y$ of $n$ i.i.d. scalar samples is generated by one of the distributions in one of the sets $\mathcal{P}_l$, for $l = 1, \dots, M$. The goal is to determine the hypothesis that the test sequence belongs to, i.e., which set contains the distribution that generated $Y$. Note that since $M$ can scale with the number of samples (as we describe in the sequel), assumption (2) should hold in the asymptotic regime as $n \to \infty$.
A practical example of the considered problem involves nonparametric detection of micro-Doppler modulated radar returns, such as those which occur in a ground moving target indicator (GMTI) radar. The micro-Doppler motion of a particular target generates a specific sideband structure, which varies within a distributional radius (the intra-set diameter) as the fundamental frequency of the target’s micro-motion changes. The difference in the fundamental sideband structure of the micro-Doppler modulations between different target types implies a distributional difference (the inter-set distance). This type of classification problem is clearly composite (based on an unknown fundamental modulation frequency), and a parametric realization is in many cases impractical, as the specific physics of the movement can be very difficult to model in closed form.
Let $\delta$ denote a test based on the given data. Then, the error probability of $\delta$ is defined as
$$P_e(\delta) = \sum_{l=1}^{M} \pi_l\, \mathbb{P}\big(\delta(Y, X_1, \dots, X_M) \ne l \mid H_l\big),$$
where $\pi_l$ is the a priori probability that $Y$ is drawn from the $l$-th set of distributions.
For the above $M$-ary hypothesis testing problem, we are interested in the regime in which the number of hypotheses scales with the number of samples. In particular, we assume $M = 2^{nR}$, where the parameter $R$ captures how fast $M$ scales with $n$. We refer to $R$ as the discrimination rate.
We say that a discrimination rate $R$ is achievable if there exists a classification rule for the multi-hypothesis testing problem such that the probability of error converges to zero as the number of observation samples converges to infinity.
For a given composite hypothesis testing problem, we define the largest achievable discrimination rate to be the discrimination capacity, and denote it by $C$.
II-B Connection to the Channel Coding Problem
Next, we discuss the connection between the asymptotic regime of the hypothesis testing problem and the channel coding problem studied in communications, which in fact motivated our definition of the discrimination rate and the discrimination capacity.
In the channel coding problem (see Figure 1(a)), assume there are $M$ messages to be transmitted with equal probability. An encoder maps each message $m$ one-to-one onto a length-$n$ codeword $X^n(m)$, which is transmitted over the channel. The channel maps each input symbol to an output symbol in a discrete memoryless fashion with transition probability $p(y \mid x)$ for each channel use, and the corresponding output sequence is given by $Y^n$. A decoder then estimates the original message as $\hat{m}$ based on the output sequence. Essentially, in the channel coding problem, there are a total of $M$ possible conditional distributions $p(Y^n \mid X^n(m))$, for $m \in \{1, \dots, M\}$, and the decoder determines which distribution most probably generated the observed channel output $Y^n$.
The decoding process of the channel coding problem described above is a hypothesis testing problem. Inspired by the channel coding problem, our total number of hypotheses corresponds to the total number of messages in channel coding, and the discrimination rate we define corresponds to the communication rate in channel coding, which represents the transmitted message bits per coded symbol. By analogy, the discrimination rate can be interpreted as the number of class-bits that can be distinguished per observation sample. Similarly, the discrimination capacity corresponds to the capacity in channel coding, and serves as the fundamental testing limit in hypothesis testing problems. Note that in channel coding, the transmitter can choose to shape the distributions of transmitted symbols. Here, the hypothesis testing problem corresponds to the case where the distributions remain unshaped.
Essentially, Shannon’s channel coding theorem guarantees error-free transmission of an exponentially increasing number of messages provided that the transmission rate $R$ is less than the channel capacity $C$. In other words, Shannon’s theorem implies that codewords can be designed such that an exponentially increasing number of conditional probability distributions can be distinguished given the channel output. Here, for the hypothesis testing problem, channel coding motivates us to investigate the following questions:
Which tests distinguish an exponentially increasing number of hypotheses with asymptotically small error probability based on observation samples?
What are the corresponding discrimination rates?
II-C Preliminaries on Parametric Hypothesis Testing
The aforementioned questions can be answered for the parametric hypothesis testing problem in the asymptotic regime based on existing studies, e.g., . We explain this in detail for single distributions below as preliminary material before we delve into the main focus of this paper on the nonparametric composite hypothesis testing problem.
Consider the parametric hypothesis testing problem, where there are $M$ known distinct distributions $p_1, \dots, p_M$, corresponding respectively to the $M$ hypotheses. Given a test sequence $Y$ consisting of $n$ i.i.d. samples generated from one of these distributions, the goal is to determine which hypothesis is true, i.e., which distribution generated the test sequence.
We apply the likelihood test given by:
$$\hat{l} = \operatorname*{arg\,min}_{1 \le l \le M} D(\hat{p}_Y \,\|\, p_l),$$
where $D(\cdot \| \cdot)$ is the KLD between two distributions, and $\hat{p}_Y$ is the empirical distribution of the test sequence. That is, the testing rule labels the test data as hypothesis $l$ if the empirical distribution of the test data is closest to $p_l$ in terms of KLD.
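As an illustration of this decision rule, the following sketch implements the minimum-KLD test for distributions over a finite alphabet. All function names here are hypothetical, not from the paper; recall that for i.i.d. samples, minimizing the KLD from the empirical distribution is equivalent to maximizing the likelihood.

```python
import numpy as np

def empirical_pmf(y, alphabet_size):
    """Empirical distribution (type) of a discrete sample sequence y."""
    counts = np.bincount(y, minlength=alphabet_size)
    return counts / len(y)

def kl_divergence(p, q, eps=1e-12):
    """KL divergence D(p || q) in bits; eps guards against log(0)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(x) = 0 contribute nothing
    return float(np.sum(p[mask] * np.log2(p[mask] / (q[mask] + eps))))

def likelihood_test(y, distributions):
    """Return the index of the hypothesis whose distribution is closest
    (in KLD) to the empirical distribution of the test sequence y."""
    alphabet_size = len(distributions[0])
    p_hat = empirical_pmf(y, alphabet_size)
    kls = [kl_divergence(p_hat, q) for q in distributions]
    return int(np.argmin(kls))
```

For example, a test sequence of all zeros is classified under the hypothesis that puts most mass on symbol 0.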
We next analyze the average error probability of the above testing rule. By the union bound over pairwise error events,
$$P_e \le \frac{1}{M} \sum_{l=1}^{M} \sum_{l' \ne l} \mathbb{P}(E_{l,l'}) \le M\, 2^{-n \min_{l \ne l'} C(p_l, p_{l'})},$$
where $C(p_l, p_{l'})$ denotes the Chernoff distance between $p_l$ and $p_{l'}$, and $E_{l,l'}$ denotes the event that, given $H_l$, the KLD between $\hat{p}_Y$ and $p_l$ is greater than the KLD between $\hat{p}_Y$ and $p_{l'}$, i.e., for $l' \ne l$,
$$E_{l,l'} = \big\{ D(\hat{p}_Y \| p_{l'}) \le D(\hat{p}_Y \| p_l) \big\}.$$
Note that for simplicity, the default base for $\log$ in this paper is 2. Thus, recalling that $M = 2^{nR}$, if $R < \min_{l \ne l'} C(p_l, p_{l'})$, then the error probability is asymptotically small as $n$ goes to infinity, which proves the following proposition.
For the parametric multiple hypothesis testing problem, the discrimination rate $R$ is achievable if
$$R < \min_{l \ne l'} C(p_l, p_{l'}).$$
Hence, for the discrimination rate to be positive, we require that the smallest pairwise Chernoff information be bounded away from zero for asymptotically large $M$, i.e., with the limit taken over the sequences of distribution clusters.
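The Chernoff information generally has no closed form, but for discrete distributions it can be approximated numerically as $\max_{\lambda \in (0,1)} -\log_2 \sum_x p(x)^\lambda q(x)^{1-\lambda}$. The sketch below (a hypothetical helper, using a simple grid search over $\lambda$) illustrates the computation:

```python
import numpy as np

def chernoff_information(p, q, grid=1000):
    """Chernoff information C(p, q) in bits between two discrete
    distributions, via a grid search over lambda in (0, 1)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    lambdas = np.linspace(1e-6, 1 - 1e-6, grid)
    # For each lambda, compute log2 of sum_x p(x)^lambda * q(x)^(1-lambda)
    vals = [np.log2(np.sum(p**lam * q**(1 - lam))) for lam in lambdas]
    # Maximizing -log2(...) over lambda equals -min of the log-sums
    return float(-min(vals))
```

Note that the Chernoff information is symmetric in its arguments and equals zero if and only if the two distributions coincide.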
III Main Results
In this section, we obtain performance bounds for the nonparametric hypothesis testing problem under two different distance measures: the MMD and the KS distance.
III-A MMD-Based Test
We construct a nonparametric hypothesis test based on the MMD distance between two distributions $p$ and $q$, defined as follows:
$$\mathrm{MMD}[p, q] = \| \mu_p - \mu_q \|_{\mathcal{H}},$$
where $\mu_p$ maps a distribution $p$ into an element of a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ associated with a kernel $k(\cdot, \cdot)$ as
$$\mu_p(\cdot) = \mathbb{E}_{x \sim p}[k(x, \cdot)].$$
An unbiased estimator of $\mathrm{MMD}^2[p, q]$, based on $n$ samples $X = (x_1, \dots, x_n)$ generated by distribution $p$ and $m$ samples $Y = (y_1, \dots, y_m)$ generated by distribution $q$, is given by:
$$\mathrm{MMD}_u^2[X, Y] = \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{i' \ne i} k(x_i, x_{i'}) + \frac{1}{m(m-1)} \sum_{j=1}^{m} \sum_{j' \ne j} k(y_j, y_{j'}) - \frac{2}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} k(x_i, y_j).$$
Note that , and the dimension .
We employ the MMD to measure the distance between the test sequence and the training sequences, and declare the hypothesis of the test sequence to be the same as that of the training sequence that has the smallest MMD to the test sequence. The constructed MMD-based nonparametric composite hypothesis test is given by
$$\hat{l} = \operatorname*{arg\,min}_{1 \le l \le M} \min_{i} \mathrm{MMD}_u^2[X_{l,i}, Y].$$
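A minimal sketch of the unbiased estimator and the resulting decision rule follows. The Gaussian kernel and all helper names are illustrative choices, not the paper's implementation:

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel matrix between 1-D sample vectors x and y."""
    d = x[:, None] - y[None, :]
    return np.exp(-d**2 / (2 * sigma**2))

def mmd2_unbiased(x, y, sigma=1.0):
    """Unbiased estimator of the squared MMD between the distributions
    that generated the sample vectors x and y."""
    n, m = len(x), len(y)
    kxx = gaussian_kernel(x, x, sigma)
    kyy = gaussian_kernel(y, y, sigma)
    kxy = gaussian_kernel(x, y, sigma)
    # Exclude diagonal (i == i') terms for unbiasedness
    term_x = (kxx.sum() - np.trace(kxx)) / (n * (n - 1))
    term_y = (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
    return term_x + term_y - 2 * kxy.mean()

def mmd_test(y, training, sigma=1.0):
    """Declare the hypothesis whose training sequences achieve the
    smallest MMD to the test sequence y.  `training` is a list of
    clusters; training[l] holds the training sequences of hypothesis l."""
    scores = [min(mmd2_unbiased(x, y, sigma) for x in cluster)
              for cluster in training]
    return int(np.argmin(scores))
```

For well-separated Gaussian classes, the rule recovers the generating class with high probability even for moderate sample sizes.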
The following theorem characterizes the average probability of error performance of the proposed MMD-based test under composite distributions.
See Appendix -A. ∎
Next, we study a special case where each hypothesis is associated with a single distribution, i.e., the $l$-th hypothesis is associated with only one distribution $p_l$, for $l = 1, \dots, M$. Then, we have the following corollary.
Suppose the MMD-based test is applied to the nonparametric hypothesis testing problem under assumption (2), and each hypothesis is associated with a single distribution, where the kernel satisfies $0 \le k(x, y) \le K$ for all $x, y$. Then, the average probability of error under equally probable hypotheses is upper bounded as
Thus, the achievable discrimination rate is
Note that, for the discrimination rate to be positive, we require the smallest pairwise MMD between the distributions to be bounded away from zero for asymptotically large $M$, where the limit is taken over the sequences of distribution clusters.
By Theorem 1, we set and . Therefore, we can bound the probability of error as the number of classes scales according to
Then, it is straightforward to obtain the achievable discrimination rate for the MMD test as
III-B Kolmogorov-Smirnov Test
In this section, we construct a nonparametric hypothesis test based on the KS distance, defined as follows. Suppose $X = (x_1, \dots, x_n)$, where the $n$ i.i.d. samples are generated by the distribution $p$. Then the empirical CDF of $X$ is given by
$$\hat{F}_X(t) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{x_i \le t\},$$
where $\mathbb{1}\{\cdot\}$ is the indicator function. The KS distance between $X$ and $Y$, having respectively been generated by $p$ and $q$, is defined as
$$D_{\mathrm{KS}}(X, Y) = \sup_{t} \big| \hat{F}_X(t) - \hat{F}_Y(t) \big|.$$
We construct the following KS-based nonparametric composite hypothesis test:
$$\hat{l} = \operatorname*{arg\,min}_{1 \le l \le M} \min_{i} D_{\mathrm{KS}}(X_{l,i}, Y).$$
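The KS-based rule can be sketched as follows. Since both empirical CDFs are right-continuous step functions that only change at sample points, the supremum is attained at one of the pooled sample points; helper names are hypothetical:

```python
import numpy as np

def ks_distance(x, y):
    """KS distance between the empirical CDFs of sample vectors x and y."""
    x, y = np.sort(x), np.sort(y)
    # Evaluate both empirical CDFs at every pooled sample point
    grid = np.concatenate([x, y])
    fx = np.searchsorted(x, grid, side='right') / len(x)
    fy = np.searchsorted(y, grid, side='right') / len(y)
    return float(np.max(np.abs(fx - fy)))

def ks_test_rule(y, training):
    """Declare the hypothesis whose training sequences achieve the
    smallest KS distance to the test sequence y.  `training` is a list
    of clusters; training[l] holds the sequences of hypothesis l."""
    scores = [min(ks_distance(x, y) for x in cluster)
              for cluster in training]
    return int(np.argmin(scores))
```

The distance is zero exactly when the two empirical CDFs coincide, and the rule selects the class whose training data is closest in this metric.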
The following theorem characterizes the performance of the proposed KS-based test.
See Appendix -B. ∎
We next consider the case where each hypothesis is associated with a single distribution. Then, we have the following corollary.
Suppose the KS-based test is applied to the nonparametric hypothesis testing problem under assumption (2), and each hypothesis is associated with a single distribution. Then, the average probability of error under equally probable hypotheses is upper bounded as
Thus, the achievable discrimination rate is
Hence, for the discrimination rate to be positive, we require the smallest pairwise KS distance between the distributions to be bounded away from zero for asymptotically large $M$, where the limit is taken over the sequences of distribution clusters.
By Theorem 2, we set and , and have
Then, it is straightforward to obtain the following achievable discrimination rate for the KS test as
III-C Upper Bound on the Discrimination Capacity
In this section, we provide an upper bound on the discrimination capacity for the composite hypothesis testing problem. Let $Z$ be a random index representing the actual hypothesis that occurs. We assume that $Z$ is uniformly distributed over the $M$ hypotheses, and that $\tilde{Z}$ has the same distribution as $Z$ but is independent of it. Then, Lemma 2.10 in  directly yields the following upper bound on the discrimination capacity $C$.
The discrimination capacity is upper bounded as
where $D(\cdot \| \cdot)$ is the KLD between two distributions.
Note that the above limit is taken over the sequences of distribution clusters.
In Appendix -C, we provide an alternative but simpler proof based on Fano’s inequality for the above upper bound, which is closely related to the proposed concept of discrimination capacity.
III-D Training Sequences of Unequal Length
In this subsection, we discuss the impact of different numbers of training samples in different classes on the probability of error and the discrimination rate. Here, we still assume that there are $n$ test samples. To keep the problem formulation meaningful, we assume that the number of classes increases exponentially with $n$ at a rate $R$, i.e., $M = 2^{nR}$. To avoid notational confusion, we use the non-composite case, i.e., with each class corresponding to one distribution, to illustrate the idea. Suppose that each class, i.e., each distribution, generates $n_l$ training samples, for $l = 1, \dots, M$, where $n_l$ represents the number of samples in the $l$-th class (as a function of $n$). Let $n_{\min} = \min_{l} n_l$. In particular, for the MMD-based test, the probability of error can be bounded as
For the KS-based test, the probability of error can be bounded as
It can be seen that here the ratio $n_{\min}/n$ plays an important role in determining the error exponent asymptotically. For example, for the MMD-based test, if the ratio converges to zero for large $n$, i.e., the shortest training length scales at an order-level slower than the test length, then there is no guarantee of exponential error decay, and the discrimination rate equals zero. On the other hand, if $n_{\min}$ scales linearly with $n$, i.e., $n_{\min} = cn$ for some constant $c > 0$, then a positive discrimination rate is achievable, with the achievable rate depending on $c$. A sketch of the proof of (26) and (27) can be found in Appendix -D.
IV Numerical Results
In this section, we present numerical results to compare the performance of the proposed tests. In the experiments, the number of classes is set to five, and the error probability versus the number of samples is plotted for the proposed algorithms. For the MMD-based test, we use the standard Gaussian kernel given by $k(x, y) = \exp\!\big(-\frac{(x - y)^2}{2\sigma^2}\big)$.
In the first experiment, all the hypotheses correspond to Gaussian distributions with the same variance but different mean values. A training sequence is drawn from each distribution, and a test sequence is randomly generated from one of the five distributions. The sample size of each sequence is varied over a range, and a large number of Monte Carlo runs are conducted. The simulation results are given in Figure 2. It can be seen that all the tests give better performance as the sample size increases. We can also see that the MMD-based test slightly outperforms the KS-based test. We also provide results for the parametric likelihood test as a lower bound on the probability of error for performance comparison. It can be seen that the performance of the two nonparametric tests is close to that of the parametric likelihood test even with a moderate number of samples.
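A minimal Monte Carlo harness in the spirit of this experiment can be sketched as follows, using SciPy's two-sample KS statistic for the KS-based test. The specific means, sample sizes, and trial counts below are illustrative assumptions, not the paper's settings:

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_classify(y, train_seqs):
    """Pick the class whose training sequence minimizes the two-sample
    KS statistic against the test sequence y."""
    stats = [ks_2samp(x, y).statistic for x in train_seqs]
    return int(np.argmin(stats))

def error_rate(n, means, trials=200, seed=0):
    """Monte Carlo estimate of the average error probability of the
    KS-based test for unit-variance Gaussian classes with given means."""
    rng = np.random.default_rng(seed)
    errors = 0
    for _ in range(trials):
        # Fresh training sequences and a random true class per trial
        train = [rng.normal(m, 1, n) for m in means]
        true = rng.integers(len(means))
        y = rng.normal(means[true], 1, n)
        errors += (ks_classify(y, train) != true)
    return errors / trials
```

Sweeping `n` over a grid of sample sizes and plotting `error_rate` against `n` reproduces the qualitative shape of the error curves discussed here.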
In the second experiment, all the hypotheses correspond to Gaussian distributions with the same mean but different variance values. The simulation results are given in Figure 3. In this experiment, the MMD-based test yields the worst performance, which suggests that this method is not suitable when the distributions overlap substantially with each other. The two experiments together also suggest that none of the three tests performs best universally over all distributions. Although there is a gap between the performance of the MMD and KS tests and that of the parametric likelihood test, we observe that the error decay rates of these tests are still close.
To show the tightness of the bounds derived in the paper, we provide a table (see Table I) of error decay exponents (and thus the discrimination rates) for the different algorithms.
Estimates of the error decay exponents of the KS and MMD based tests are presented in Table I for the multi-hypothesis testing problem considered in the first experiment. Note that the theoretical lower bounds in the table correspond to the achievable discrimination rates of the methods asymptotically. Fano’s bound (FB in the table) is estimated by using data-dependent partition estimators of the Kullback-Leibler divergence. The parametric upper bound is based on the maximum likelihood test, which can serve as an upper bound on the error decay exponent (and hence intuitively on the discrimination capacity). It can be seen from the table that both the KS and MMD tests do achieve an exponential error decay and have positive discrimination rates, as we show in our theorems. Clearly, the empirical values of the bounds for both tests are better than the corresponding theoretical values. More importantly, both of the empirical lower bounds are close to the likelihood upper bound, demonstrating that the actual performance of the two tests is satisfactory. We also note that Fano’s upper bound is not very close to the lower bound.
To better illustrate the bounds in Table I, we provide experimental results with different numbers of hypotheses in Figure 4. We use an experimental setting similar to that of the first experiment, where the Gaussian distributions have the same variance and different mean values. The parametric maximum likelihood test serves as an upper bound on the error decay exponent for all of the cases. As in the five-hypothesis case, the KS and MMD nonparametric tests achieve an exponential error decay, and hence positive discrimination rates, for the larger numbers of hypotheses as well.
We now conduct experiments with composite distributions. First, we again use five hypotheses with Gaussian distributions of equal variance and different mean values. For each hypothesis, we perturb the mean value, so that within each hypothesis there are three different distributions with slightly shifted means. The results are presented in Figure 5. As expected, the performance improves as the sample size increases. The two tests perform almost identically, with the MMD-based test slightly outperforming the KS-based test for small sample sizes.
We then vary the variances of the Gaussian distributions in a similar way, as in the second experiment; in particular, the three variances within the same class are slightly perturbed versions of the class variance. In Figure 6, we observe performance improvement as the sample size increases. Different from the results in the second experiment, the MMD-based test outperforms the KS-based test in this composite setting.
V Conclusion
This paper developed nonparametric composite hypothesis tests for arbitrary distributions, based on the maximum mean discrepancy (MMD) and the Kolmogorov-Smirnov (KS) distance. We introduced the information theoretic notion of the discrimination capacity, defined for the regime where the number of hypotheses scales along with the sample size. We also characterized the corresponding error exponents and discrimination rates, i.e., lower bounds on the discrimination capacity. Our framework can be extended to unsupervised learning problems, for which similar performance limits can be investigated.
-A Proof of Theorem 1
The proof uses the following inequality.
To apply McDiarmid’s inequality, we first define the following quantity
where consists of data samples.
Given $H_l$, i.e., the test sequence is generated by a distribution in $\mathcal{P}_l$, it can be shown that
We next define the same as except that the -th component is removed. We also define as another sequence generated by the same underlying distribution for . Then, affects via the following three cases.
Case 1: is in the sequence . In this case, affects through the following terms
Case 2: is in the sequence . In this case, affects through the following terms
Case 3: is in the sequence . In this case, affects through the following terms
Thus, since the kernel is bounded, i.e., $0 \le k(x, y) \le K$ for any $x, y$, considering the above three cases, the variation in the value of the quantity defined above when a single sample varies is bounded. Then, we can obtain the following bound
Assuming the test sequence is generated from the true underlying set of distributions, we now apply Lemma 1 and obtain the following bound on the probability of error between two classes
Therefore, we can bound the probability of error as
Thus, the achievable discrimination rate is
-B Proof of Theorem 2
We first introduce two lemmas to help establish the theorem.
Suppose $X$ consists of $n$ i.i.d. samples generated by a distribution $p$ with CDF $F$, and $\hat{F}_X$ is the corresponding empirical c.d.f. Then
$$\mathbb{P}\Big( \sup_{t} \big| \hat{F}_X(t) - F(t) \big| > \epsilon \Big) \le 2 e^{-2n\epsilon^2}.$$
Suppose two distribution clusters and satisfy (2). Assume that for , satisfying . Then for any satisfying ,
By the triangle inequality and the property of supremum, we have
where . Then
where . By the continuity of the exponential function, we have
Without loss of generality, assume that the probability that $Y$ is generated from any particular distribution is the same for all clusters and all distributions within a cluster. By Lemma 3 and the union bound, the probability of error is bounded by
Thus, the achievable discrimination rate is
-C Proof of Remark 1
By Fano’s inequality , we obtain
Since $Z$ is uniformly distributed over all the hypotheses, we have that
Let $P_Z$, $P_Y$, and $P_{Z,Y}$ represent the marginal and joint distributions of $Z$ and $Y$. Recall that we denote the likelihood function of $Y$ under hypothesis $l$ by $p_l$. The mutual information between $Z$ and $Y$ can be expressed in terms of the likelihood functions as