I Introduction
†This work is supported in part by the European Research Council (ERC) through Starting Grant BEACON (agreement #677854).
In their seminal paper [1], Ahlswede and Csiszár studied a distributed binary hypothesis testing (HT) problem for the joint probability distribution of two correlated discrete memoryless sources. In their setting, one of the sources, denoted by $V$, is observed directly at the detector, which performs the test, while the other, denoted by $U$, needs to be communicated to the detector from a remote node, referred to as the observer, over a noiseless channel with a transmission rate constraint. Given that $n$ independently drawn samples are available at the respective nodes, the two hypotheses are represented using the following null and alternate hypotheses:
The objective is to study the trade-off between the transmission rate and the type I and type II error probabilities in HT. This problem has been extensively studied thereafter [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]. Several interesting variants of the basic problem have also been considered, which include extensions to multi-terminal settings [15, 16, 17, 18, 19], HT under security or privacy constraints [20, 21, 22, 23], HT with lossy compression, HT in interactive settings [25, 26, 27], and HT with successive refinement, to name a few.
In this work, we revisit the setting shown in Fig. 1, which has been considered previously in . Here, the communication from the observer to the detector takes place over a discrete memoryless channel (DMC). Denoting the transition probability matrix of the DMC by $P_{Y|X}$, the channel output $Y^n$ given the input $X^n$ is generated according to the probability law $P_{Y^n|X^n}(y^n|x^n)=\prod_{i=1}^{n} P_{Y|X}(y_i|x_i)$. The observer encodes its observations according to the stochastic map111In , we allow bandwidth mismatch, i.e., the encoder map is given by $f_{(n,m)}:\mathcal{U}^n \rightarrow \mathcal{P}(\mathcal{X}^m)$, where $n$ and $m$ are positive integers satisfying $m \leq \tau n$ for some fixed $\tau > 0$. Here, we consider the special case $m=n$ ($\tau=1$) for simplicity of notation. However, our results extend to any $\tau$ straightforwardly. $f_n:\mathcal{U}^n \rightarrow \mathcal{P}(\mathcal{X}^n)$, where $\mathcal{P}(\mathcal{X}^n)$ denotes the set of all probability distributions over $\mathcal{X}^n$. The detector outputs the decision according to the stochastic map $g_n:\mathcal{V}^n \times \mathcal{Y}^n \rightarrow \mathcal{P}(\{H_0,H_1\})$, where $\mathcal{P}(\{H_0,H_1\})$ denotes the set of all probability distributions over the support $\{H_0,H_1\}$. Denoting the true hypothesis as the random variable (r.v.) $H$, the type I and type II error probabilities for a given encoder-decoder pair $(f_n,g_n)$ are given by
respectively. In  and , the goal is to obtain a computable characterization of the optimal type II error exponent (henceforth referred to as the error-exponent), i.e., the maximum asymptotic value of the exponent of the type II error probability, under a fixed non-zero constraint $\epsilon$ on the type I error probability. We next define the trade-off studied in this paper more precisely.
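To make these definitions concrete, the following toy sketch computes the two error probabilities exactly for a single observation with hypothetical binary alphabets, a deterministic acceptance region, and a testing-against-independence style alternate hypothesis; all numerical values are illustrative and not from this paper.

```python
from itertools import product

# Hypothetical joint pmf P (null hypothesis) over binary (u, v) pairs, and
# an alternate Q taken as the product of P's marginals (all values illustrative).
P = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
Q = {(u, v): (P[(u, 0)] + P[(u, 1)]) * (P[(0, v)] + P[(1, v)])
     for u, v in product((0, 1), repeat=2)}

def error_probs(accept_H0):
    """Type I / type II error probabilities of the deterministic rule
    'declare H0 iff the observed pair lies in accept_H0'."""
    alpha = sum(pr for uv, pr in P.items() if uv not in accept_H0)  # reject H0 under H0
    beta = sum(pr for uv, pr in Q.items() if uv in accept_H0)       # accept H0 under H1
    return alpha, beta

alpha, beta = error_probs({(0, 0), (1, 1)})
```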
An error-exponent is said to be $\epsilon$-achievable if there exists a sequence of encoding functions and decision rules such that
For $\epsilon \in (0,1)$, let
It is well known that, since the quantity of interest is the type II error-exponent, the decision rule can be restricted to be a deterministic map without any loss of generality (see [22, Lemma 3]). The decision rule can then be represented as $\mathbb{1}\left((v^n,y^n)\in\mathcal{A}_n\right)$ for some acceptance region $\mathcal{A}_n \subseteq \mathcal{V}^n \times \mathcal{Y}^n$, where $\mathbb{1}(\cdot)$ denotes the indicator function.
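One standard way to see the sufficiency of deterministic rules (a sketch of the idea, not necessarily the argument of [22, Lemma 3]): rounding a randomized rule at threshold 1/2 at most doubles both error probabilities by Markov's inequality, and hence leaves both error exponents unchanged. A toy numerical check with hypothetical distributions:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 5                                   # alphabet size (illustrative)
P = rng.dirichlet(np.ones(m))           # null distribution (hypothetical)
Q = rng.dirichlet(np.ones(m))           # alternate distribution (hypothetical)
r = rng.uniform(size=m)                 # randomized rule: prob. of declaring H1

alpha_rand = float(P @ r)               # type I error of the randomized rule
beta_rand = float(Q @ (1 - r))          # type II error of the randomized rule

det = (r >= 0.5).astype(float)          # deterministic rounding at threshold 1/2
alpha_det = float(P @ det)              # Markov: at most 2 * alpha_rand
beta_det = float(Q @ (1 - det))         # pointwise: at most 2 * beta_rand
```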
It is shown in  that the error-exponent has an exact single-letter characterization for the special case known as testing against independence (TAI), in which $Q_{UV}$ factors as a product of the marginals of $P_{UV}$, i.e., $Q_{UV}=P_U P_V$. To state the result, let $C$ denote the capacity of the channel $P_{Y|X}$, and let
It is proved in [11, Proposition 7] that
In this paper, we show the strong converse for the above result, namely, that
This result completes the characterization of the error-exponent in terms of the channel capacity for all values of $\epsilon \in (0,1)$, and extends the strong converse result proved in [1, Proposition 2] for the special case of rate-limited noiseless channels. However, it is to be noted that while the strong converse proved in [1] holds for all hypothesis tests given in (1), our result is limited to TAI.
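Since the TAI characterization is expressed through the capacity of the channel, it may be useful to recall that the capacity of a DMC can be computed numerically, e.g., via the standard Blahut-Arimoto iteration. The sketch below uses an illustrative binary symmetric channel (not a channel from this paper); note that its transition matrix has all non-zero entries.

```python
import numpy as np

def capacity(W, iters=500):
    """Blahut-Arimoto iteration for the capacity (in nats) of a DMC with
    transition matrix W, where W[x, y] = P(Y = y | X = x) and rows sum to 1."""
    p = np.full(W.shape[0], 1.0 / W.shape[0])       # input pmf, uniform init
    for _ in range(iters):
        q = p @ W                                   # induced output pmf
        # d[x] = D( W(.|x) || q ), the divergence of each row from the output pmf
        d = np.sum(np.where(W > 0, W * np.log(W / q), 0.0), axis=1)
        p = p * np.exp(d)                           # multiplicative update
        p /= p.sum()
    q = p @ W
    d = np.sum(np.where(W > 0, W * np.log(W / q), 0.0), axis=1)
    return float(p @ d)

# Illustrative binary symmetric channel with crossover probability 0.1;
# its capacity is log 2 - h_b(0.1) nats.
W = np.array([[0.9, 0.1], [0.1, 0.9]])
C = capacity(W)
```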
Before delving into the proof, we briefly describe the technique and tools used in  to prove the strong converse, and highlight the challenges in extending their proof to the noisy channel setting. The key tools used to prove [1, Proposition 2] are the so-called blowing-up lemma  and a covering lemma . However, it can be seen from the proof therein that the application of the covering lemma relies crucially on the fact that the channel from the encoder to the detector is noiseless (i.e., deterministic). Thus, it is not possible to directly follow their technique in our noisy channel setting and arrive at the strong converse result. Instead, we use a change of measure technique introduced in , in conjunction with the blowing-up lemma, to arrive at our desired result.
The change of measure technique by itself does not appear sufficient for proving a strong converse in our setting. The reason is that a critical requirement for the technique to work is to find a (decoding) set of non-vanishing probability under the null hypothesis such that, for a given encoder-decoder pair satisfying the type I error probability constraint, the detector decides $H_0$ with probability one (or tending to one with $n$) for each sequence in this set. Note that in the noiseless channel case, a set satisfying the above conditions can be obtained by simply taking
as is done in  for a deterministic channel. However, this is no longer possible when the channel is noisy. To tackle this issue, we first obtain a set of sufficiently large probability under the null hypothesis such that, for each sequence in it, the detector decides $H_0$ with a probability bounded away from zero. The blowing-up lemma then guarantees that it is possible to obtain a modified decision region such that, uniformly for each such sequence, $H_0$ is decided with an overwhelmingly large probability. This enables us to prove the strong converse in our setting via the technique in .
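The amplification provided by the blowing-up lemma can be illustrated numerically in the i.i.d. Bernoulli case: a threshold (monotone) set of tiny probability acquires probability close to one once blown up in Hamming distance. The block length, threshold, and radius below are illustrative choices, not quantities from the proof.

```python
from math import comb

n, p = 200, 0.5

def tail_prob(t):
    """P(weight(Z^n) >= t) for Z_1, ..., Z_n i.i.d. Bernoulli(p), computed exactly."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(t, n + 1))

# A = {z^n : weight(z^n) >= 140}: a set of probability below 1e-6.
prob_A = tail_prob(140)

# For this monotone set, the Hamming blow-up of radius l is exactly
# {z^n : weight(z^n) >= 140 - l}: flipping at most l zeros to ones moves
# any such sequence into A.
l = 60
prob_blowup = tail_prob(140 - l)
```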
Let $Z_1, \ldots, Z_n$ be independent r.v.'s taking values in a finite set $\mathcal{Z}$. Then, for any set $\mathcal{A} \subseteq \mathcal{Z}^n$ with $P_{Z^n}(\mathcal{A}) > 0$,
Lemma 3, stated below, provides a characterization in terms of hyper-planes of the error exponent-capacity region.
II Main result
The main result of the paper is stated next. We will assume that the channel transition matrix $P_{Y|X}$ has all non-zero entries, i.e.,
Let $(f_n, g_n)$ denote an encoder-decoder pair that satisfies (4b).
Constructing reliable decision regions:
Note that the acceptance region can be written in the form
Then, it follows from (4b) that for sufficiently large ,
where is a function (that will be optimized later) such that . It follows from Lemma 2 that
for every , since
Also, for any , using (9) we can write that
where (26) follows since, for each and ,
and (27) is due to the inequality .
Let the new decision rule be given by , where
Note that it follows from (27) that
Change of measure via construction of a truncated distribution:
We now use the change of measure technique in  by considering the new decision rule (with its acceptance region for $H_0$) to prove the strong converse. For that purpose, define a new truncated distribution
Bounding type II error-exponent via the weak converse:
From (24) and (30), note that the type I error probability for the hypothesis test between distributions and (under the null and alternate hypotheses, respectively), channel input , and decision rule tends to zero asymptotically as $n \rightarrow \infty$. Then, by the weak converse for HT based on the data processing inequality for KL divergence (see , ), it follows that
Next, note that for such that , we have
Similarly, for all , we have
) that the Markov chain holds under , and that . From this, it follows via the data processing inequality that
Thus, we have for any that
Equation (39) follows from (37) and the fact that (which in turn holds due to the Markov chain under distribution ).
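The data processing inequality for KL divergence used in the steps above can be checked in a toy instance: quantizing two distributions through a binary decision rule cannot increase their divergence, so the divergence is lower-bounded by the binary divergence of the induced error probabilities. All distributions and the decision region below are hypothetical.

```python
import numpy as np

def kl(p, q):
    """KL divergence D(p || q) in nats for pmfs on a common finite alphabet."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(np.where(p > 0, p * np.log(p / q), 0.0)))

def d_bin(a, b):
    """Binary KL divergence d(a || b) between Bernoulli(a) and Bernoulli(b)."""
    return kl([a, 1 - a], [b, 1 - b])

# Hypothetical distributions on a 3-letter alphabet and the decision rule
# "declare H0 iff the outcome lies in A" (all values illustrative).
P = np.array([0.5, 0.3, 0.2])   # null
Q = np.array([0.2, 0.3, 0.5])   # alternate
A = np.array([True, True, False])

alpha = 1.0 - P[A].sum()        # type I error probability
beta = Q[A].sum()               # type II error probability

# Data processing: passing P and Q through the binary decision rule cannot
# increase KL divergence, so D(P||Q) >= d(1 - alpha || beta).
gap = kl(P, Q) - d_bin(1 - alpha, beta)
```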
Single-letterization and application of Lemma 3:
We will show in Appendix A that the above bound single-letterizes, i.e.,
By the Fenchel-Eggleston-Caratheodory theorem, the alphabet of the auxiliary random variable in the maximization in (43) can be restricted to be finite (with cardinality a function of the source and channel alphabet sizes). Thus, the supremum in (43) is actually a maximum. Assuming (42) holds, we can write from (40) that
For a given , , let achieve the maximum in (43). Then, we can write for that
Then, we can write that
Denoting the total variation distance between two distributions $P$ and $Q$ by