Finite Sample-Size Regime of Testing Against Independence with Communication Constraints

10/28/2019, by Sebastian Espinosa et al.

The central problem of Hypothesis Testing (HT) consists in determining the error exponent of the optimal Type II error for a fixed (or decreasing with the sample size) Type I error restriction. This work studies error exponent limits in distributed HT subject to partial communication constraints. We derive general conditions on the Type I error restriction under which the error exponent of the optimal Type II error admits a closed-form characterization for the specific case of testing against independence. Building on concentration inequalities and rate-distortion theory, we first derive the performance limit in terms of the error exponent for a family of decreasing Type I error probabilities. Then, we investigate the non-asymptotic (or finite sample-size) regime, for which novel upper and lower bounds on the optimal Type II error probability are derived. These results shed light on the velocity at which the error exponents, i.e., the asymptotic limits, are approached as the number of samples grows.


I Introduction

Motivated by emerging applications in sensor networks, the signal processing community has been involved in numerous research initiatives to study decision and inference problems in the context of partial or noisy data that has been corrupted by different types of degradations. These degradations come from the imperfect nature of the sensors, from the communication restrictions between the sensors and the decision-making process in a distributed-remote setting, or from the presence of external sources of perturbation corrupting the data [1]. One emerging domain in this data-corrupted context is signal processing with unlabeled or unordered data [2, 3, 4, 5, 6, 7, 8] and, more classically, binary decision making in distributed systems where the data available for the decision has first undergone lossy source coding [9, 10, 11]. In both scenarios, the derivation of performance limits and of algorithms that achieve those limits has been a relevant topic.

The focus of this paper is on the second family of problems, i.e., optimal binary decisions from compressed data, where it is relevant to understand the effects of lossy compression on the performance of the inference task. In particular, we revisit a scenario with a partial rate constraint that was first introduced by Ahlswede & Csiszár in [9]. This problem consists of a test against independence where the observations (sample measurements) come from two modalities (e.g., sensors) in a distributed fashion, as shown in Figure 1. In particular, one of the modalities has to be transmitted from the sensor to the detector over an error-free communication channel subject to a rate constraint in bits per sample. The main problem is to study optimal coding-inference strategies in order to characterize the optimal performance that can be achieved with a finite number of samples, i.e., a non-asymptotic analysis, as well as the optimal error exponent of the task when the number of samples tends to infinity. A relevant technical objective is to derive tight performance bounds for this task and to assess the effect of the rate-limited communication on the performance of the test.

The present paper extends the seminal works in [9] and [12]. In particular, [9] derived a closed-form expression for the error exponent of the Type II error given a fixed restriction $\epsilon$ on the Type I error [9, Ths. 2 and 3]. Importantly, the results capture the effect of the communication rate on the error exponent, which is shown to be asymptotically independent of $\epsilon$.¹ Complementarily, Han et al. [10] determined a lower bound for the error exponent when the Type I restriction, seen as a sequence, vanishes at an exponential rate.

¹The general case of distributed HT was first considered in [12], and partial results on optimality were reported (see [13, 14, 10] and references therein).

I-A Contribution

Building on these previous works, we first study a family of problems in which the Type I restriction goes to zero with the sample size, and assess the impact of this more stringent set of restrictions on the error exponent of the Type II error of an optimal coding-inference scheme. Building on concentration inequalities and rate-distortion theory, our first main result, presented in Theorem 1, gives new conditions on the convergence rate of the Type I error under which the error exponent limit is obtained in closed form. In particular, for a family of sub-exponentially decreasing Type I error restrictions, we show that the error exponent matches the expression presented in [9, Theorem 3]. Surprisingly, this result is consistent with the similar matching condition obtained for the communication-free problem [14]. Furthermore, this implies that, even under the coding rate restriction, the vanishing Type I error constraint does not affect the error exponent of the Type II error.

From a practical standpoint, it is relevant not only to analyze error exponent expressions, i.e., those obtained when the number of samples tends to infinity, but also to study finite-length performance bounds. In this sense, the second main contribution of this paper is a non-asymptotic analysis of the Type II error of an optimal coding-inference scheme. Theorem 2 offers upper and lower bounds on the Type II error probability as a function of the number of observations, the distributions involved in the problem and the restriction on the Type I error. These new finite-length bounds shed light on the velocity at which the error exponent is achieved as the number of samples tends to infinity and, consequently, on how well the asymptotic performance limit matches the true performance at finite sample size.

I-B Related works

Blahut [15], Hoeffding [16] and Han [13] studied the classical Hypothesis Testing (HT) problem when the Type I error restriction is of an exponentially decreasing type. Nakagawa et al. [14] extended this asymptotic limit to any decreasing sequence of Type I restrictions. These results are important but focus on the classical scenario of i.i.d. sequences of observations. Notice that this structure is no longer valid for the communication setting introduced in [9] due to the presence of data compression. In terms of non-asymptotic analysis, Strassen [17] derived a concrete non-asymptotic result for the optimal Type II error under a constant Type I error restriction assuming a communication-free setup. It is worth emphasizing that a discrepancy between the optimal finite-length performance and the asymptotic performance, in terms of the error exponent, is observed at the scaling $O(1/\sqrt{n})$. In the same (communication-free) framework, Sason [18] borrows ideas from moderate deviation analysis [19] to obtain an interesting upper bound on the Bayesian error probability by bounding the Type I and Type II errors in such a way that both decay to zero sub-exponentially with $n$. More recently, Watanabe [11] provided error exponents and non-asymptotic bounds for the case in which messages are sent to a centralized decoder with (asymptotically) zero rate in bits per sample, which differs from the setting of this paper in the sense that we consider the more realistic use of a fixed rate to transmit information from the sensor to the detector. Extensions to interactive HT with zero rate have also been reported in [20]. However, in these two cases the authors assume a particular family of distributions, so an extension of these approaches to the more general fixed-rate context is not feasible.
The rest of the paper is organized as follows: Section II introduces the problem of testing against independence with communication constraints and also revisits classical results from the unconstrained case. Section III presents the main theoretical results for the asymptotic and non-asymptotic regimes. Numerical analysis and discussions are relegated to Section IV. Finally, Section V concludes the paper. The proofs of the main results are presented in Section VI.

I-C Notations and conventions

Boldface letters $\mathbf{x}$ and upper-case letters $X^n = (X_1, \ldots, X_n)$ are used to denote vectors and random vectors of length $n$, respectively. Let $X$, $Y$ and $Z$ be three random variables with probability measure $P_{XYZ}$. If $P_{XYZ}(x, y, z) = P_X(x) \cdot P_{Y|X}(y|x) \cdot P_{Z|Y}(z|y)$ for each $(x, y, z)$, then they form a Markov chain, which is denoted by $X \to Y \to Z$. Let $a_n = o(b_n)$ indicate that $\lim_{n \to \infty} a_n / b_n = 0$, and $a_n = \omega(b_n)$ indicates that $b_n = o(a_n)$. We say that $a_n = \mathcal{O}(b_n)$ if, for some $n_0$ sufficiently large, there exists a constant $C > 0$ such that $a_n \leq C \cdot b_n$ for all $n \geq n_0$.

Fig. 1: Illustration of the coding-decision problem with a one-sided communication constraint. The encoder compresses one of the modalities ($X^n$) at rate $R$, and the detector acts on the one-sided compressed measurements together with $Y^n$.

II Preliminaries

We restrict our attention to the case of a finite alphabet product space $\mathcal{X} \times \mathcal{Y}$, and let $\mathcal{P}(\mathcal{X} \times \mathcal{Y})$ denote the family of probability measures on $\mathcal{X} \times \mathcal{Y}$. A joint random vector $(X, Y)$ with values in $\mathcal{X} \times \mathcal{Y}$ is equipped with a joint probability $P_{XY}$, where $P_X$ and $P_Y$ denote the marginal distributions of $X$ and $Y$, respectively. $X^n$ and $Y^n$ denote the finite block vectors with (i.i.d.) product distribution $P^n_{XY}$ (the $n$-fold distribution). Let us consider the $n$-length bivariate hypothesis test against independence given by

$$\mathcal{H}_0: (X^n, Y^n) \sim P^n_{XY} \quad \text{versus} \quad \mathcal{H}_1: (X^n, Y^n) \sim P^n_X \times P^n_Y, \tag{1}$$

where $P^n_X$ and $P^n_Y$ denote the product probabilities induced by the marginals of $P_{XY}$. To make the problem discriminable, it is assumed that [21]:

$$D(P_{XY} \| P_X \times P_Y) = I(X; Y) > 0. \tag{2}$$
Let us present the one-sided communication constraint setting introduced in [9]. We define a pair of encoding and decision rules of length $n$ and rate $R$ (in bits per sample) by:

$$f_n: \mathcal{X}^n \to \{1, \ldots, 2^{nR}\}, \qquad \phi_n: \{1, \ldots, 2^{nR}\} \times \mathcal{Y}^n \to \{0, 1\}. \tag{3}$$

Here $f_n$ represents a fixed-rate (lossy) encoder of $X^n$ and

$\phi_n$ represents the decision rule (or classifier) acting on the one-sided compressed data

$(f_n(X^n), Y^n)$. For any pair $(f_n, \phi_n)$ of length $n$ and rate $R$, we can introduce its Type I and Type II errors [22], [23]:

$$\alpha_n(f_n, \phi_n) \equiv P^n_{XY}\left(\phi_n(f_n(X^n), Y^n) = 1\right), \tag{4}$$
$$\beta_n(f_n, \phi_n) \equiv \left(P^n_X \times P^n_Y\right)\left(\phi_n(f_n(X^n), Y^n) = 0\right), \tag{5}$$

where the decision value $1$ is associated with $\mathcal{H}_1$. For any sequence of non-negative values $(e_n)_{n \geq 1}$ such that $e_n \in (0, 1)$, we are interested in the family of optimal (encoder-decision) rules that solve:

$$\beta_n(e_n, R) \equiv \min_{(f_n, \phi_n)\,:\;\alpha_n(f_n, \phi_n) \leq e_n} \beta_n(f_n, \phi_n), \tag{6}$$

where the minimum is over the encoding and decision pairs of the form presented in (3). Then $(\beta_n(e_n, R))_{n \geq 1}$ represents the optimal Type II errors given a sequence of fixed Type I restrictions.
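As an illustration of the operational quantities in (4)-(6), the following sketch estimates the Type I and Type II errors of one particular, deliberately naive, encoder-decision pair by Monte Carlo. The one-bit-per-sample encoder and the match-rate threshold test are hypothetical choices used only for illustration; they are not the optimal pair attaining (6):

```python
import numpy as np

rng = np.random.default_rng(0)
P_XY = np.array([[0.4, 0.1], [0.1, 0.4]])   # same hypothetical pmf as above
P_X, P_Y = P_XY.sum(axis=1), P_XY.sum(axis=0)
n, trials = 50, 20_000

def sample_joint(joint, size):
    flat = rng.choice(joint.size, p=joint.ravel(), size=size)
    return np.unravel_index(flat, joint.shape)      # (x-samples, y-samples)

# Illustrative encoder f_n: transmit each binary x_i itself (rate R = 1 bit
# per sample); decision rule phi_n: declare H1 if the empirical match rate
# between the received symbols and y is below a threshold.
def decide_H1(x, y, thr=0.65):
    return np.mean(x == y) < thr

alpha = beta = 0
for _ in range(trials):
    x0, y0 = sample_joint(P_XY, n)                  # data under H0 (dependent)
    alpha += decide_H1(x0, y0)                      # Type I: H0 data -> H1
    x1 = rng.choice(2, p=P_X, size=n)               # data under H1:
    y1 = rng.choice(2, p=P_Y, size=n)               # independent marginals
    beta += not decide_H1(x1, y1)                   # Type II: H1 data -> H0
print(f"alpha_n ~ {alpha/trials:.4f}, beta_n ~ {beta/trials:.4f}")
```

With these (hypothetical) numbers, both empirical errors are small but clearly nonzero, which is the trade-off that the optimization in (6) controls.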

II-A Unconstrained results

It is worth revisiting the non-distributed case, where $f_n$ is the identity mapping; the solution of (6) is then denoted by $\beta_n(e_n)$. In addition, when $e_n = \epsilon$ for all $n$, the celebrated Stein's Lemma implies [24, 21]:

Lemma 1 (Stein's Lemma).

For any $\epsilon \in (0, 1)$,
$$\lim_{n \to \infty} -\frac{1}{n} \log \beta_n(\epsilon) = D(P_{XY} \| P_X \times P_Y) = I(X; Y).$$

The mutual information $I(X; Y)$ determines the error exponent of the Type II error, which turns out to be independent of $\epsilon$. Indeed, $I(X; Y)$ can be interpreted as the rate of information (per sample) available to discriminate $P_{XY}$ from $P_X \times P_Y$ in HT [23, 21]. For the case of an exponentially decreasing Type I restriction, the following holds:

Lemma 2.

[14, Nakagawa] Let us assume that $e_n = e^{-nr}$ for some $r > 0$. Then the error exponent of $\beta_n(e_n)$ admits a single-letter characterization in terms of an exponentially tilted version of $P_{XY}$, whose tilting parameter is the solution of a condition determined by $r$ (see [14] for the precise statement).

Corollary 1.

From the proof of Lemma 2 [14, Sect. IX], if $(e_n)$ is sub-exponential, i.e., $e_n = e^{-o(n)}$, then $\lim_{n \to \infty} -\frac{1}{n} \log \beta_n(e_n) = I(X; Y)$.

Therefore, the error exponent obtained with a fixed $\epsilon$ in Lemma 1 is preserved for the family of more stringent decision problems in Eq. (6) as long as $e_n$ goes to zero at a sub-exponential rate.
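The convergence promised by Lemma 1 can be checked numerically. For the symmetric joint pmf used in the sketches above, the number of index matches between $X^n$ and $Y^n$ is a sufficient statistic, so the optimal Neyman-Pearson test reduces to thresholding a binomial count; for this symmetric example the resulting Bernoulli divergence coincides with $I(X;Y)$. A sketch under these assumptions (randomization at the threshold is ignored, which only perturbs the level slightly; scipy is assumed available):

```python
import numpy as np
from scipy.stats import binom

# Under H0 a pair matches with prob. 0.8; under H1 with prob. 0.5.
I_XY = 0.8 * np.log(0.8 / 0.5) + 0.2 * np.log(0.2 / 0.5)   # Stein exponent, nats
eps = 0.05
for n in (100, 400, 1600):
    k_thr = binom.ppf(eps, n, 0.8)       # accept H0 iff matches k > k_thr,
                                         # so Type I error is approx. <= eps
    beta = binom.sf(k_thr, n, 0.5)       # optimal Type II error under H1
    print(f"n={n:5d}: -log(beta)/n = {-np.log(beta)/n:.4f} -> I(X;Y) = {I_XY:.4f}")
```

The empirical exponent approaches $I(X;Y)$ from below as $n$ grows, previewing the finite-length discrepancy studied in Section III-A.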

II-B HT with one-sided compression

Returning to the original decision-compression task in (6), Ahlswede and Csiszár [9] determined the error exponent of this problem in closed form (as a function of $P_{XY}$ and $R$) when $e_n = \epsilon$ for all $n$:

Lemma 3.

[9, Theorem 3] For any $\epsilon \in (0, 1)$ and any rate $R > 0$, it follows that²

$$\lim_{n \to \infty} -\frac{1}{n} \log \beta_n(\epsilon, R) = \max_{U :\, U \to X \to Y,\ I(U;X) \leq R} I(U; Y) \equiv \theta(R). \tag{7}$$

²This result provides an interesting connection with the problem of noisy lossy (fixed-rate) source coding using the log-loss (or cross-entropy) as the distortion metric [25]. The performance limit on the right-hand side (RHS) of (7) coincides precisely with the distortion-rate function of the information bottleneck problem [26].
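A simple way to lower bound the limit in (7) numerically is to restrict the auxiliary variable $U$ to deterministic binary quantizers of $X$, for which $I(U;X) = H(U)$; since larger (stochastic) auxiliary variables can do better, the exhaustive search below is only a lower bound on $\theta(R)$. The joint pmf and the rate budget are hypothetical:

```python
import numpy as np
from itertools import product

def mi(P):  # mutual information of a joint pmf matrix, in nats
    Px, Py = P.sum(1, keepdims=True), P.sum(0, keepdims=True)
    Q, m = Px @ Py, P > 0
    return float(np.sum(P[m] * np.log(P[m] / Q[m])))

rng = np.random.default_rng(1)
P_XY = rng.dirichlet(np.ones(16)).reshape(4, 4)   # hypothetical 4x4 joint pmf
R = 0.7                                           # rate budget, nats per sample

best = 0.0
for g in product(range(2), repeat=4):             # deterministic U = g(X)
    T = np.zeros((2, 4)); T[list(g), range(4)] = 1.0  # channel x -> u
    P_UX = T * P_XY.sum(axis=1)                   # joint pmf of (U, X)
    P_UY = T @ P_XY                               # joint pmf of (U, Y)
    if mi(P_UX) <= R:                             # rate constraint I(U;X) <= R
        best = max(best, mi(P_UY))                # objective I(U;Y)
print(f"deterministic-quantizer lower bound on theta(R): {best:.4f} nats")
```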

In the regime of decreasing Type I error introduced in (6), several questions can be formulated: Does a fundamental limit (error exponent) for the Type II error exist? If so, does it have a (single-letter) characterization as a function of $P_{XY}$ and $R$? Does this limit change depending on the rate of convergence to zero of the Type I error restriction? Han [12] offered a partial answer to these questions, providing a lower bound on the error exponent (of the Type II error) for exponentially decreasing Type I error restrictions:

Lemma 4.

[10, Han] Let us assume that $e_n = e^{-nr}$ for some $r > 0$; then:

(8)

where the optimization domain in (8) denotes all test (quantizer) channels from $\mathcal{X}$ to a finite auxiliary alphabet.

III Main Results

The first main result of this section complements Lemma 4 by considering a sub-exponential regime for the rate of convergence to zero of the Type I error in the problem presented in (6). Importantly, Theorem 1 provides conditions under which the performance limit obtained in Lemma 3 is preserved.

Theorem 1.

Let us assume that $(e_n)_{n \geq 1}$ is $o(1)$ and that $e_n$ is $\omega(e^{-n^{\gamma}})$ for some $\gamma \in (0, 1)$; then

$$\lim_{n \to \infty} -\frac{1}{n} \log \beta_n(e_n, R) = \theta(R). \tag{9}$$

This result establishes a large regime for the velocity at which $e_n$ goes to zero over which the error exponent of the problem is invariant and matches the expression for the simpler problem addressed in Lemma 3. It is important to emphasize that the problem in Lemma 3 is less restrictive than the regime where $e_n$ is $o(1)$ and, from that perspective, this result is non-trivial and informative. In fact, as Han mentioned in [10], there was no guarantee that this performance limit would remain invariant when moving to monotonically vanishing behaviours of $e_n$. Finally, this result can be considered a counterpart of what is known in the unconstrained case, revisited in Corollary 1.

The proof of Theorem 1 is presented in Section VI-A and is divided in two parts. The direct part (i.e., the constructive argument) is based on the construction of an encoder-decision pair that guarantees that the error exponent of the optimal Type II error is at least $\theta(R)$. The second part of the argument (i.e., the infeasibility argument) proves that there is no pair of encoder and decision rule satisfying the operational restrictions of the problem whose error exponent is greater than $\theta(R)$. More specifically, the encoder offers a finite-rate (lossy) description of $X^n$ to the decision maker. This restriction introduces a technical challenge in the sense that the encoder breaks the i.i.d. structure of the observations. Therefore, standard arguments constructed over typical sequences [21] and the weak law of large numbers [22] cannot be adopted directly. In contrast, the proof techniques proposed in this work for both the achievability and infeasibility parts (proof of Theorem 1) are based on a refined use of concentration inequalities [27]. In particular, following the ideas presented in [9], the achievability argument is divided in two steps. The first step consists of reducing the problem to an i.i.d. structure over blocks induced by the encoder, which concentrates (in probability) around an error exponent limit that is different from $\theta(R)$. Importantly, the discrepancy between the concentration limit obtained from our approach (i.e., the finite-block strategy) and $\theta(R)$ can be resolved analytically by connecting our problem with a noisy rate-distortion problem, where the discrepancy between its fundamental limit and a finite-length version of this object is well understood [28]. The second step consists of optimizing our approach by giving concrete conditions on the terms appearing in the discrepancy to ensure convergence to zero.
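The block-coding device used in the achievability argument, i.e., lifting a length-$l$ encoder to length $n = k \cdot l$ by applying it independently to each sub-block, can be sketched as follows. The majority-vote $\tilde{f}_l$ below is a placeholder for illustration only; the actual proof uses a near-optimal length-$l$ encoder:

```python
from typing import Callable, List, Sequence

def blockwise_encoder(f_l: Callable[[Sequence[int]], int],
                      l: int) -> Callable[[Sequence[int]], List[int]]:
    """Lift a length-l encoder f_l to length n = k*l by applying it to
    each of the k consecutive sub-blocks (the construction behind f_{n,l})."""
    def f_n(x: Sequence[int]) -> List[int]:
        assert len(x) % l == 0, "sketch assumes n is a multiple of l"
        return [f_l(x[i:i + l]) for i in range(0, len(x), l)]
    return f_n

# Toy usage with a hypothetical 1-bit-per-block encoder (majority vote):
f_l = lambda block: int(sum(block) > len(block) // 2)
f_n = blockwise_encoder(f_l, l=4)
print(f_n([0, 1, 1, 1,  0, 0, 1, 0]))   # -> [1, 0]
```

The point of this construction is that the $k$ outputs are i.i.d., which restores the product structure that the unconstrained arguments rely on.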

III-A Finite-length analysis

In order to complement Theorem 1, it is practically relevant to have a result about the finite-length regime of this task, as a function of $n$, the sequence $(e_n)$ and the underlying distributions, which are the elements that define the problem. We are interested in (upper and lower) bounding the discrepancy between $-\frac{1}{n} \log \beta_n(e_n, R)$ and $\theta(R)$ and, from this analysis, determining how quickly this discrepancy converges to zero as $n$ tends to infinity. The problem is challenging and requires the adoption and optimization of some of the arguments involved in the proof of Theorem 1. For this analysis, it turns out to be important to consider specific regimes for $(e_n)$.

Theorem 2.

Assume that condition (2) holds and that $e_n$ is $o(1)$. Then,

  • If $e_n = \Theta(1/\log n)$ (logarithmic), it follows:

    (10)
    (11)
  • If $e_n = \Theta(n^{-p})$ (polynomial) with $p$ below a critical exponent, then

    (12)
    (13)
  • If $e_n = \Theta(n^{-p})$ (polynomial) with $p$ above this critical exponent, then

    (14)
    (15)
  • If $e_n = \Theta(e^{-n^{\gamma}})$ (superpolynomial) with $\gamma \in (0, 1)$, then

    (16)
    (17)

The proof is presented in Section VI-B.
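To fix intuition about the regimes in Theorem 2, the snippet below evaluates one illustrative instance of each family of Type I restrictions; the specific constants and exponents are hypothetical choices for display, not the parameter ranges of the theorem:

```python
import numpy as np

# Illustrative instances of the three Type I regimes in Theorem 2
# (constants and exponents are hypothetical, chosen only for display).
regimes = {
    "logarithmic":     lambda n: 1.0 / np.log(n),
    "polynomial":      lambda n: n ** -2.0,          # e_n = n^(-p), p = 2
    "superpolynomial": lambda n: np.exp(-n ** 0.5),  # e_n = exp(-n^g), g = 1/2
}
for name, e in regimes.items():
    vals = ", ".join(f"{e(n):.2e}" for n in (25, 100, 400))
    print(f"{name:>15}: e_n at n = 25, 100, 400 -> {vals}")
```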

III-A1 Analysis and interpretation of Theorem 2

a) The results establish non-asymptotic bounds for the Type II error when we impose concrete scenarios for the monotonic behavior of $(e_n)$. We explore three main regimes for $(e_n)$: logarithmic, polynomial and super-polynomial. Each of these cases has its corresponding lower and upper bounds, which depend specifically on the scenario considered for $(e_n)$.

b) The proof of Theorem 2 involves an optimization of the upper and lower bounds presented in the proof of Theorem 1. Specifically, we refine the analysis introduced in Eqs. (40), (42) and (50) by finding optimal values for the block length and the decision threshold for a given $n$. These choices give us non-asymptotic lower and upper bounds on $\beta_n(e_n, R)$ for each scenario.

c) On the upper bounds of the discrepancy (Eq. (11), Eq. (13), Eq. (15) and Eq. (17)), obtained from the impossibility argument (converse part): as $e_n$ goes to zero faster (from case to case), the velocity at which the bound tends to zero increases, from the slowest rate (logarithmic case) to the fastest (superpolynomial case). Therefore, imposing a more restrictive $(e_n)$ has an effect on the discrepancy between the fundamental limit and the optimal Type II error obtained from this upper bound analysis.

d) On the lower bounds of the discrepancy (Eq. (10), Eq. (12), Eq. (14) and Eq. (16)), obtained from the direct argument (achievability part): as $e_n$ goes to zero faster (from case to case), the obtained bound, in the super-polynomial case, decreases the velocity at which the discrepancy in error exponent tends to zero. For the other cases (logarithmic and polynomial), the velocity is not affected, but the constants change to less favorable magnitudes. This is because, by relaxing the velocity of $(e_n)$, the problem is less restrictive, and the result then favors the possibility of obtaining a better (smaller) Type II error than the one predicted by the performance limit, which is $e^{-n\,\theta(R)}$ to first order in the exponent.

e) Finally, it is worth noting that if we consider the relaxed restriction $e_n = \epsilon$ for all $n$, the achievability part of our argument still applies and offers an upper bound on the discrepancy that converges to zero with $n$. This rate of convergence is slower than the $O(1/\sqrt{n})$ known for the unconstrained problem presented in [17]. In fact, when $X^n$ is fully observed at the detector (see Lemmas 1 and 2), Strassen [17] showed that the discrepancy goes to zero as $O(1/\sqrt{n})$. Details are presented in Lemma 7 in Appendix B. We conjecture that this slower rate can be attributed to the non-trivial role of the communication constraint in our problem, which breaks the i.i.d. structure of the observations in a way that prevents the use of the tools adopted to derive the unconstrained result in Lemma 7. It is therefore a topic of further research to determine whether the upper bound on the discrepancy can be improved or whether it is possible to show, via a converse argument, that this rate is indeed optimal.

IV Interpretation and Numerical Analysis

In this section, we interpret Theorem 2 and use it as a bound on the worst-case performance that one could attain with an optimal decision scheme operating with a finite number of observations. Concretely, the lower and upper bounds in Theorem 2 translate into an interval of feasibility for the optimal Type II error probability, and the length of that interval is an indicator of the precision of the result.

The results in Theorem 2 can be presented as two bounds:

$$\theta(R) - \delta_n \;\leq\; -\frac{1}{n} \log \beta_n(e_n, R) \;\leq\; \theta(R) + \gamma_n, \tag{18}$$

where $\beta_n(e_n, R)$ represents the optimal Type II error consistent with the Type I error restriction (given by $(e_n)$ in the statement of Theorem 2), $\theta(R)$ is the performance limit in Theorem 1, $(\delta_n)$ is a positive sequence that goes to zero with $n$ ($\delta_n = o(1)$) representing the penalization (in error exponent) for the use of a finite sample size, and $(\gamma_n)$ is a positive sequence that goes to zero representing a discrepancy with the limit that can be seen as a possible gain in error exponent.

Then we have a feasibility range for $\beta_n(e_n, R)$ given by the interval $[e^{-n(\theta(R) + \gamma_n)}, e^{-n(\theta(R) - \delta_n)}]$. This range contains the nominal value $e^{-n\theta(R)}$, which is the probability consistent with the error exponent limit in Theorem 1 extrapolated to the finite-length regime. If we consider $e^{-n\theta(R)}$ as our reference (nominal) value, we can study two feasible regions: the pessimistic interval, where the error probability is greater than the nominal value, and the optimistic interval, where the opposite occurs. The length of each of the two regions is an indicator of the precision of the result (the worst-case discrepancy with respect to the nominal value) in the two scenarios. For the pessimistic region, the length of the interval is $e^{-n(\theta(R) - \delta_n)} - e^{-n\theta(R)}$, which goes to zero exponentially fast with $n$. From the fact that $\delta_n$ is $o(1)$ (see the statement of Theorem 2), this length goes to zero faster than $e^{-n(\theta(R) - \epsilon)}$ for any $\epsilon > 0$ and, consequently, the precision has an exponential rate of convergence that is asymptotically given by the nominal exponent $\theta(R)$. On the optimistic region, we have error probabilities smaller (better) than the nominal value extrapolated from Theorem 1. The length of this interval is $e^{-n\theta(R)} - e^{-n(\theta(R) + \gamma_n)}$, which is at most $e^{-n\theta(R)}$. The length of the pessimistic interval therefore dominates the analysis and, consequently, the precision of the joint case (i.e., the worst-case discrepancy with respect to the nominal value over the whole range) goes to zero as $e^{-n(\theta(R) - \delta_n)}$, which is equivalent, to first order in the exponent, to the worst-case Type II error probability predicted by this result.
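The following sketch reproduces this interval logic for hypothetical values of $\theta(R)$, $\delta_n$ and $\gamma_n$ (none of which are taken from Theorem 2); it also shows how the pessimistic end of the interval can exceed one at small $n$, mirroring the non-informative small-blocklength entries in the tables of Section IV-A:

```python
import numpy as np

theta = 0.19                               # nats; hypothetical limit theta(R)
delta = lambda n: np.sqrt(np.log(n) / n)   # hypothetical finite-length penalty
gamma = lambda n: 1.0 / n                  # hypothetical optimistic-side gap

for n in (25, 50, 100, 200):
    lo = np.exp(-n * (theta + gamma(n)))   # optimistic end of the interval
    hi = np.exp(-n * (theta - delta(n)))   # pessimistic (worst-case) end
    nominal = np.exp(-n * theta)
    print(f"n={n:3d}: beta in [{lo:.2e}, {hi:.2e}], "
          f"nominal {nominal:.2e}, length ~ {hi - lo:.2e}")
```

With these illustrative sequences, the pessimistic end exceeds one for $n = 25$ and then collapses exponentially, which is exactly the qualitative behavior reported in the tables below.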

Importantly, the overall quality of the result is governed by $\theta(R)$ and affected to a smaller degree by how fast $\delta_n$ goes to zero. Note that $\gamma_n$ plays no role from this perspective. We discussed in the previous section that $\delta_n$ goes to zero faster when we relax the problem, passing from a given scenario for $(e_n)$ to one where this sequence goes to zero at a smaller velocity. This implies that the precision of Theorem 2 improves when simplifying the problem from one restriction to a relaxed restriction on the Type I error. This reinforces the point made in the previous section: the velocity at which $(e_n)$ goes to zero does not affect the limit (Theorem 1), but it does affect the finite-length result through $\delta_n$.

IV-A Numerical examples

Here we illustrate numerically how precise the prediction of the nominal value $e^{-n\theta(R)}$ is by evaluating the length of the feasibility interval in some scenarios. In particular, we compute the lower and upper limits for $\beta_n(e_n, R)$ from Theorem 2, summarized in expression (18), expressed by:

(19)
(20)

where $\Delta_n$ denotes the length of the resulting feasibility interval. We evaluate these bounds for three choices of $(e_n)$, corresponding to the polynomial, superpolynomial and logarithmic cases, respectively. We compute Eqs. (19) and (20) using a joint probability mass function of $(X, Y)$ such that the mutual information between the two variables ($X$ and $Y$) is 10 nats (a high mutual information scenario). An important part of this computation is obtaining $\theta(R)$, whose solution involves an optimization with respect to the encoder and the rate [26]. For the computation of $\theta(R)$ we use the algorithm presented in [29], which is a generalization of the Blahut-Arimoto algorithm [30].³

³Under some mild conditions given in [29], there are guarantees that this optimization converges to $\theta(R)$.
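The algorithm of [29] is not reproduced here; as a stand-in, the following sketch runs the classical self-consistent information bottleneck iteration (a Blahut-Arimoto-style fixed point) for a fixed Lagrange multiplier $\beta$ and a hypothetical joint pmf, returning one point $(I(U;X), I(U;Y))$ of the trade-off curve. Sweeping $\beta$ traces an approximation of $R \mapsto \theta(R)$:

```python
import numpy as np

def mutual_info(P):  # mutual information of a joint pmf matrix, in nats
    Px, Py = P.sum(1, keepdims=True), P.sum(0, keepdims=True)
    Q, m = Px @ Py, P > 0
    return float(np.sum(P[m] * np.log(P[m] / Q[m])))

def ib_point(P_XY, beta, nU=8, iters=500, seed=0):
    """Self-consistent information bottleneck iteration for a fixed
    multiplier beta; returns one trade-off point (I(U;X), I(U;Y))."""
    rng = np.random.default_rng(seed)
    nX, _ = P_XY.shape
    P_X = P_XY.sum(axis=1)
    P_Y_given_X = P_XY / P_X[:, None]
    q = rng.dirichlet(np.ones(nU), size=nX)        # q[x, u] = p(u|x), random init
    for _ in range(iters):
        p_U = q.T @ P_X                            # marginal p(u)
        P_UY = (q * P_X[:, None]).T @ P_Y_given_X  # joint p(u, y)
        p_Y_given_U = P_UY / np.maximum(P_UY.sum(1, keepdims=True), 1e-300)
        # KL(p(y|x) || p(y|u)) for every (x, u) pair
        logratio = np.log(np.maximum(P_Y_given_X[:, None, :], 1e-300)) \
                 - np.log(np.maximum(p_Y_given_U[None, :, :], 1e-300))
        Dxu = np.sum(P_Y_given_X[:, None, :] * logratio, axis=2)
        q = np.maximum(p_U[None, :] * np.exp(-beta * Dxu), 1e-300)  # IB update
        q /= q.sum(axis=1, keepdims=True)
    P_UX = (q * P_X[:, None]).T
    P_UY = (q * P_X[:, None]).T @ P_Y_given_X
    return mutual_info(P_UX), mutual_info(P_UY)

rng = np.random.default_rng(3)
P_XY = rng.dirichlet(np.ones(16)).reshape(4, 4)    # hypothetical joint pmf
for beta in (1.0, 2.0, 5.0, 20.0):                 # sweep the multiplier
    IUX, IUY = ib_point(P_XY, beta)
    print(f"beta={beta:5.1f}: I(U;X)={IUX:.4f}, I(U;Y)={IUY:.4f} (nats)")
```

Each printed pair is one point of the curve $R = I(U;X) \mapsto I(U;Y)$; its upper envelope over initializations and multipliers approximates $\theta(R)$ in (7).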

Blocklength n      25        50        75        100       125       150       175       200
Δ_n (param. 1)     1.88e-17  1.68e-45  7.99e-78  1.02e-112 2.06e-149 1.77e-187 1.21e-226 1.03e-266
Δ_n (param. 2)     9.61e-10  3.12e-32  1.17e-59  4.75e-90  1.60e-122 1.41e-156 6.71e-192 2.86e-228
Δ_n (param. 3)     8.15e+03  5.62e-10  3.47e-29  5.02e-52  2.05e-77  9.69e-105 1.28e-133 8.85e-164
TABLE I: Length Δ_n of the feasibility interval across different values of the blocklength n, polynomial case; each row corresponds to a different parameter choice for (e_n).

Blocklength n      25        50        75        100       125       150       175       200
Δ_n (param. 1)     4.25e-11  3.09e-41  3.16e-76  5.33e-114 1.13e-153 9.23e-195 5.74e-237 4.37e-280
Δ_n (param. 2)     1.18e-04  1.25e-28  9.70e-58  7.68e-90  5.28e-124 1.02e-159 1.17e-196 1.32e-234
Δ_n (param. 3)     1.81e+11  710.52    1.02e-11  8.15e-29  3.32e-48  2.42e-69  7.01e-92  1.38e-115
TABLE II: Length Δ_n of the feasibility interval across different values of the blocklength n, superpolynomial case; each row corresponds to a different parameter choice for (e_n).

Blocklength n      25        50        75        100       125       150       175       200
Δ_n                1.66e-12  7.27e-39  1.58e-70  3.02e-105 4.59e-142 1.86e-180 4.32e-220 9.57e-261
TABLE III: Length Δ_n of the feasibility interval across different values of the blocklength n, logarithmic case.

Tables I, II and III show the values of $\Delta_n$. The length of the interval, for different values of $n$ and different parameters of $(e_n)$, is an indicator of how precisely we can predict the value of $\beta_n(e_n, R)$ relative to the nominal value $e^{-n\theta(R)}$. We verify that $\Delta_n$ goes to zero exponentially fast beyond the short blocklength regime (from roughly 75 samples, as seen, for example, in Table III), which implies that the nominal value predicted by Theorem 1, i.e., $e^{-n\theta(R)}$, is a very precise approximation of the optimal Type II error in the finite-length regime. Comparing these values, we see that the precision of the result, measured by $\Delta_n$, is affected by the velocity at which the Type I error sequence tends to zero. For faster convergence of $(e_n)$, the gap between the bounds is considerable, which means that the results presented in Eqs. (19) and (20) are not informative for a very small number of observations. This issue can be attributed to a worst-case constant that affects the precision of the bounds for a small number of samples. Nevertheless, this gap is not critical because, after a reasonable number of observations, the precision of the bounds decays to zero with an exponential behaviour.

V Summary and Final Remarks

This paper explored the problem of testing against independence with one-sided communication constraints. More specifically, a scenario of two memoryless sources is considered where one of the modalities is transmitted to the decision maker over a rate-limited channel. In this context, we explored a general family of optimal tests (in the sense of Neyman-Pearson) in which restrictions on the Type I error are imposed, and we are interested in the velocity at which the Type II error vanishes with the sample size. From a theoretical perspective, we obtained the performance limits for a rich family of problems with decreasing sequences of Type I error probabilities. Our main result (Theorem 1) stipulates that the error exponent of the Type II error tends to a fundamental limit in the spirit of the classical Stein's Lemma. This result is expressed in closed form, as a function of the operational coding rate imposed on one of the information sources. Interestingly, the results show that, for a large family of Type I error restrictions (vanishing with the number of samples), the error exponent is independent of the vanishing restriction and equivalent to the result obtained in the more classical setting where the Type I restriction is constant (in $n$) and greater than zero (Lemma 3).

The finite sample-size regime was also investigated. Our second main result (Theorem 2) addresses the problem of characterizing the non-asymptotic Type II error probability. Using results from rate-distortion theory and concentration inequalities, we obtained upper and lower bounds on this error as a function of $n$ (the number of samples), the sequence that models the restriction on the Type I error, and the probabilities involved. Interestingly, we observe that the non-asymptotic bounds offer an interval of feasibility for the optimal Type II error, which provides a very precise description. A closed-form expression for the worst-case Type II error was derived, in which a discrepancy with respect to the asymptotic error exponent limit was identified. This discrepancy (overhead) can be attributed to the use of a finite number of samples in the decision. Furthermore, this penalization tends to zero at a velocity that is a function of $(e_n)$ and, consequently, we observed an effect of the Type I error restriction that is not present in Theorem 1.

We have shown that the worst-case finite-length error is arbitrarily close (as $n$ grows) to the nominal value $e^{-n\theta(R)}$ predicted by the asymptotic result, where $\theta(R)$ is the limit obtained from Theorem 1, and that the precision of the result, measured by the length of the interval of feasibility, goes to zero exponentially fast. Numerical analysis in some concrete scenarios confirms the predicted quality of the non-asymptotic results presented in Theorem 2.

VI Proofs

VI-A Proof of Theorem 1

The proof is divided in two parts: a lower bound and an upper bound result. We begin with the following bound, which extends the result presented by Ahlswede & Csiszár in [9, Theorem 3].

Theorem 3.

Let us assume that $e_n > 0$ for all $n$ and that $(e_n)$ is $o(1)$; then

$$\liminf_{n \to \infty} -\frac{1}{n} \log \beta_n(e_n, R) \geq \theta(R). \tag{21}$$
Proof.

For an arbitrary encoder $f_n$ of rate $R$, let us consider the corresponding optimal decision regions, given by the Neyman-Pearson Lemma, on the one-sided quantized space, expressed by

(22)

The set in (22) is parametrized in terms of the encoder, the threshold and the sample length. Let us denote by $\phi_n$ the induced test (or decision rule) built from this acceptance region. Then the Type I error probability for the pair $(f_n, \phi_n)$ is given by

(23)

By construction of the pair $(f_n, \phi_n)$, an upper bound for the Type II error is obtained by

(24)

Then, for any finite $n$ and rate $R$, finding an achievable Type II error exponent from this construction (and the bound in Eq. (24)) reduces to solving the following problem:

(25)

Note that $f_n$ breaks the i.i.d. structure of the problem, so determining the optimum in (25) is not a simple task. We will derive a lower bound for it using a finite block analysis. For this, let us fix a block length $l$ and consider an encoder of length $l$, i.e., $\tilde{f}_l: \mathcal{X}^l \to \{1, \ldots, 2^{lR}\}$. The idea is to decompose $X^n$ into segments of finite length $l$ in order to exploit the induced block-i.i.d. structure as $n$ tends to infinity. More precisely, we construct an encoder, denoted by $\tilde{f}_{n,l}$, by applying the function $\tilde{f}_l$ $k$ times, once to every sub-block of length $l$, assuming for the moment that $n = k \cdot l$, i.e.,

$$\tilde{f}_{n,l}(x_1, \ldots, x_l, x_{l+1}, \ldots, x_{2l}, \ldots, x_{l(k-1)+1}, \ldots, x_{kl}) \equiv \big(\tilde{f}_l(x_1, \ldots, x_l),\, \tilde{f}_l(x_{l+1}, \ldots, x_{2l}),\, \ldots,\, \tilde{f}_l(x_{l(k-1)+1}, \ldots, x_{kl})\big).$$

In the use of the set in (22), it will be convenient to parametrize the decision region with respect to the reference value obtained from the limiting expression in the context of our problem; a deviation threshold is introduced for this purpose. Using the $k$-block product structure, the Type I error of the pair $(\tilde{f}_{n,l}, \phi_n)$ can be expressed through the following deviation event:

(26)

where the deviation set contains the elements satisfying

(27)

where the quantity in expression (27) denotes the empirical divergence. We will make use of a concentration inequality to bound the probability of the deviation event in (27). To this end, let us introduce the following notation:

(28)

where it follows, for any pair of arguments, that:

(29)

From the bounded difference inequality [31, Theorem 2.2], we have that

(30)
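For intuition on the role of the bounded difference inequality in (30), the following sketch compares the McDiarmid bound with the empirical deviation probability of an empirical mean of bounded i.i.d. terms, an illustrative stand-in for the empirical divergence (all constants are hypothetical):

```python
import numpy as np

# Monte Carlo sanity check of the bounded difference (McDiarmid) inequality
# for an empirical average of k bounded terms: each coordinate change moves
# the average by at most c_i = 1/k, giving the bound exp(-2 k t^2).
rng = np.random.default_rng(7)
k, t, trials = 200, 0.1, 500_000

f = rng.binomial(k, 0.3, size=trials) / k    # empirical mean of k Bernoulli(0.3)
emp = np.mean(f - 0.3 >= t)                  # empirical deviation probability
bound = np.exp(-2 * k * t ** 2)              # McDiarmid upper bound
print(f"empirical P(f - Ef >= t) = {emp:.2e} <= bound {bound:.2e}")
```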

Finally, from (25), a lower bound for the achievable exponent can be obtained from (30) by choosing the threshold (denoted in (31)) as the solution of the following condition:

(31)

Consequently, we have that