Distributed Detection with Empirically Observed Statistics

We consider a binary distributed detection problem in which the distributions of the sensor observations are unknown and only empirically observed statistics are available to the fusion center. The source (test) sequences are transmitted through different channels to the fusion center, which also observes noisy versions of labelled training sequences generated independently from the two underlying distributions. The fusion center decides which distribution the source sequence is sampled from based on the observed statistics, i.e., the noisy training data. We derive the optimal type-II error exponent given that the type-I error decays exponentially fast. We further maximize the type-II error exponent over the proportions of channels for both source and training sequences and conclude that as the ratio of the lengths of training to test sequences tends to infinity, using only one channel is optimal. Finally, we relate our results to the distributed detection problem studied by Tsitsiklis.

I Introduction

In the traditional distributed detection problem, the underlying data-generating distributions are available at the fusion center and one is tasked to design a test based on the observations as well as the known distributions. In practical applications, however, the fusion center has no knowledge of the underlying distributions and may only be given compressed, quantized or noisy versions of labelled training sequences. This leads to new challenges in designing an optimal decision test.

Motivated by these practical issues and inspired by [1, 2], we consider the distributed detection problem as shown, for the binary case, in Figure 1 in which the distributions of sensor observations are unknown. We term this problem distributed detection with empirically observed statistics. We assume that the sensor observations are transmitted to the fusion center via different channels, which can also be regarded as compressors. Labelled training sequences generated from the different underlying distributions are pre-processed and then provided to the fusion center. Our aim is to derive the fundamental performance limits of the classification problem as well as to potentially arrive at the same conclusions as Tsitsiklis did in [2], i.e., conclude that a small number of distinct channels or local decision rules suffices to attain the optimal error exponent.

I-A Main Contributions

In this paper, our main contributions are as follows.

Firstly, for the binary distributed detection problem, we derive the asymptotically optimal type-II error exponent when the type-I error exponent is lower bounded by a positive constant. In the achievability proof, we introduce a generalized version of Gutman’s test in [1] and prove that the so-designed test is asymptotically optimal.

Secondly, again restricting ourselves to the binary case, we discuss the optimal proportions of different channels that serve as pre-processors of the training and source sequences. Let $\alpha$, a constant, denote the ratio between the length of the training sequence and the length of the source sequence. When $\alpha\to\infty$, we provide a closed-form expression for the type-II error exponent and prove that using only one identical channel for both training and source sequences is asymptotically optimal. This mirrors Tsitsiklis’ result [2]. On the other hand, if $\alpha$ is sufficiently small, the type-II error exponent is identically equal to zero. When $\alpha$ does not take extreme values, by calculating the derived exponent numerically, we conjecture that using one channel for the training sequence and another (possibly the same one) for the source sequence is optimal.

Thirdly, we relate our results to the classical distributed detection problem in Tsitsiklis’ paper [2]. When $\alpha\to\infty$, the true distributions can be estimated to arbitrary accuracy and we naturally recover the results in [2] for both the Neyman-Pearson and Bayesian settings.

Finally, we extend our analyses to an $m$-ary distributed detection problem with the rejection option. We derive the asymptotically optimal type-$j$ rejection exponent for each $j\in[m]$ under the condition that all (undetected) error exponents are lower bounded by a positive constant $\lambda$. In the achievability proof, we introduce a generalized version of Unnikrishnan’s test [3] by identifying an appropriate test statistic. We prove that the so-designed test is asymptotically optimal.

I-B Related Works

The distributed detection literature is vast and so it would be futile to review all existing works. This paper, however, is mainly inspired by [1] and [2]. In [1], Gutman proposed an asymptotically optimal type-based test for the binary classification problem. In [2], Tsitsiklis showed that a small number of distinct local decision rules suffices for $m$-ary hypothesis testing in standard Bayesian and Neyman-Pearson distributed detection settings. Ziv [4] proposed a discriminant function related to universal data compression in the binary classification problem with empirically observed statistics. Chamberland and Veeravalli [5] considered classical distributed detection in a sensor network with a multiple access channel, capacity constraint and additive noise. Liu and Sayeed [6] extended type-based distributed detection to wireless networks. Chen and Wang [7] studied the anonymous heterogeneous distributed detection problem and quantified the price of anonymity. Tay, Tsitsiklis and Win studied tree-based variations of the distributed detection problem in the Bayesian [8] and Neyman-Pearson [9] settings. The same authors also studied Bayesian distributed detection in a tandem sensor network [10]. The aforementioned works assume that the distributions are known.

Nguyen, Wainwright and Jordan [11] proposed a kernel-based algorithm for the nonparametric distributed detection problem with communication constraints. Similarly, Sun and Tay [12] also studied nonparametric distributed detection networks using kernel methods and in the presence of privacy constraints. While the problem settings in [11] and [12] involve training samples, the questions posed there are algorithmic in nature and hence, different. In particular, they do not involve fundamental limits in the spirit of this paper.

I-C Paper Outline

The rest of this paper is organized as follows. In Section II, we formulate the distributed detection problem with empirically observed statistics. We also present the optimal type-II error exponent, analyze the optimal proportion of channels, and recover analogues of the results in [2] for both the Neyman-Pearson and Bayesian settings. In Section III, we extend our results to the case in which there are $m$ hypotheses and the rejection option is present. We conclude our discussion and present avenues for future work in Section IV. The proofs of our results are provided in the appendices.

I-D Notation

Random variables and their realizations are denoted in upper case (e.g., $X$) and lower case (e.g., $x$) respectively. All sets are denoted in calligraphic font (e.g., $\mathcal{X}$). We use $\mathcal{X}^{\mathrm{c}}$ to denote the complement of $\mathcal{X}$. Let $X^n=(X_1,\ldots,X_n)$ be a random vector of length $n$. All logarithms are base $\mathrm{e}$. Given any two integers $a\le b$, we use $[a:b]$ to denote the set of integers $\{a,a+1,\ldots,b\}$ and use $[a]$ to denote $[1:a]$. The set of all probability distributions on a finite set $\mathcal{X}$ is denoted as $\mathcal{P}(\mathcal{X})$ and the set of all conditional probability distributions from $\mathcal{X}$ to $\mathcal{Y}$ is denoted as $\mathcal{P}(\mathcal{Y}|\mathcal{X})$. Given $P\in\mathcal{P}(\mathcal{X})$ and $V\in\mathcal{P}(\mathcal{Y}|\mathcal{X})$, we use $PV$ to denote the marginal distribution on $\mathcal{Y}$ induced by $P$ and $V$. We denote the support of $P$ as $\mathrm{supp}(P)$. Given a vector $x^n\in\mathcal{X}^n$, the type or empirical distribution [13] is denoted as $T_{x^n}$, where $T_{x^n}(a)=\frac{1}{n}\sum_{i\in[n]}\mathbf{1}\{x_i=a\}$. We interchangeably use $\mathcal{T}_{x^n}$ and $\mathcal{T}_{T_{x^n}}$ to denote the type class of $x^n$. Let $\mathcal{P}_n(\mathcal{X})$ denote the set of types with denominator $n$. For two positive sequences $\{a_n\}$ and $\{b_n\}$, we write $a_n\doteq b_n$ if $\lim_{n\to\infty}\frac{1}{n}\log\frac{a_n}{b_n}=0$. The notations $\dot{\le}$ and $\dot{\ge}$ are defined similarly. For a given vector $v$, we let $\mathrm{supp}(v)$ denote the support set of $v$.
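As a concrete illustration of the type notation (a minimal sketch; the function name is our choice and not from the paper), the empirical distribution of a sequence over a finite alphabet can be computed as follows:

```python
from collections import Counter

def empirical_type(seq, alphabet):
    """Type (empirical distribution) T_{x^n} of a sequence over a finite alphabet:
    T_{x^n}(a) = (1/n) * number of positions i with x_i = a."""
    counts = Counter(seq)
    n = len(seq)
    return [counts[a] / n for a in alphabet]

# Example: the type of "aabac" over the alphabet {a, b, c}.
t = empirical_type("aabac", "abc")  # [0.6, 0.2, 0.2]
```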

II Binary Distributed Detection with Training Samples

In this section, we formulate the problem in which there are two hypotheses and instead of distributions, only training samples are available.

II-A Problem Formulation

We assume that there are $K$ fixed compressors or channels (these are called local decision rules in [2]), where for each $k\in[K]$, the $k$-th channel is $W_k\in\mathcal{P}([L]|\mathcal{X})$. This channel has input alphabet $\mathcal{X}$ and output alphabet $[L]$. For notational simplicity, we assume that $\mathcal{X}$ is finite, but our results go through for uncountably infinite $\mathcal{X}$ as well. We let $W=\{W_k\}_{k\in[K]}$ be a fixed set of channels. Furthermore, let $h:[n]\to[K]$ and $g:[\alpha n]\to[K]$ be functions that map the index of the test/training sample to the channel index.

The system model is as follows (see Figure 1). There are $n$ sensors and a source/test sequence $X^n$ generated i.i.d. according to some unknown distribution defined on $\mathcal{X}$. For each $i\in[n]$, the $i$-th sensor observes $X_i$ and maps it to $Z_i$ using the channel $W_{h(i)}$. The $Z_i$'s from all local sensors are transmitted to a fusion center. In addition to the $Z_i$'s, the fusion center observes two noisy versions of training sequences $Y_1^N$ and $Y_2^N$, which are generated i.i.d. according to some unknown but fixed distributions $P_1$ and $P_2$. The fusion center observes the noisy sequences $\tilde{Y}_1^N$ and $\tilde{Y}_2^N$, where $\tilde{Y}_{1,i}$ is the output of $Y_{1,i}$ through $W_{g(i)}$ for all $i\in[N]$, and $\tilde{Y}_{2,i}$ is the output of $Y_{2,i}$ through $V$, another channel, not necessarily in $W$. With $Z^n$, $\tilde{Y}_1^N$ and $\tilde{Y}_2^N$, the fusion center uses a decision rule $\gamma$ to discriminate between the following two hypotheses:

• $H_1$: the source sequence $X^n$ and the training sequence $Y_1^N$ are generated according to the same distribution;

• $H_2$: the source sequence $X^n$ and the training sequence $Y_2^N$ are generated according to the same distribution.

In our setting, we assume that the fusion center only uses one channel $V$ to pre-process the second training sequence $Y_2^N$. The reason for this is that the optimal test we consider in (9) to follow depends only on $Z^n$ and $\tilde{Y}_1^N$. Nonetheless, $V$ needs to satisfy a technical assumption (see Assumption 1).

We assume $N=\alpha n$ for some $\alpha\in(0,\infty)$. (We ignore the integer constraints of $\alpha n$ and write $N=\alpha n$.) For each $j\in[K]$, we use $a_j^{(n)}$ and $b_j^{(n)}$ to denote the proportions of $Z^n$ and $\tilde{Y}_1^N$ in which the channel $W_j$ is used to process the source and training sequences respectively, i.e.,

$a_j^{(n)} := \frac{\sum_{i\in[n]}\mathbf{1}\{h(i)=j\}}{n}, \qquad b_j^{(n)} := \frac{\sum_{i\in[\alpha n]}\mathbf{1}\{g(i)=j\}}{\alpha n}$.  (1)

An example is given in Figure 2. Furthermore, we let $a^{(n)}:=(a_1^{(n)},\ldots,a_K^{(n)})$ and $b^{(n)}:=(b_1^{(n)},\ldots,b_K^{(n)})$. We assume that the following limits exist:

$a_j := \lim_{n\to\infty} a_j^{(n)}, \qquad b_j := \lim_{n\to\infty} b_j^{(n)}, \qquad \forall\, j\in[K]$.  (2)

To avoid clutter in subsequent mathematical expressions, we abuse notation and drop the superscript $(n)$ in $a^{(n)}$ and $b^{(n)}$ in all non-asymptotic expressions, with the understanding that $a$ (resp. $b$) appearing in a non-asymptotic expression should be interpreted as $a^{(n)}$ (resp. $b^{(n)}$).
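The proportions in (1) admit a direct computation. The sketch below (function name ours, not from the paper) tabulates how often each of the $K$ channels is used by a given index mapping:

```python
def channel_proportions(index_map, length, K):
    """Fraction of the first `length` positions routed to each of K channels,
    i.e., a_j = (1/n) * sum_{i} 1{h(i) = j}, cf. eq. (1)."""
    counts = [0] * K
    for i in range(length):
        counts[index_map(i)] += 1
    return [c / length for c in counts]

# Example: h alternates between channels 0 and 1 on a length-6 test sequence.
a = channel_proportions(lambda i: i % 2, 6, 2)  # [0.5, 0.5]
```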

Given any decision rule $\gamma$ at the fusion center and any pair of distributions $(P_1,P_2)$ according to which the training sequences are generated, the performance metrics we consider are the type-I and type-II error probabilities

$\beta_\nu(\gamma,P_1,P_2|a,b,V,W) := \mathbb{P}_\nu\{\gamma(Z^n,\tilde{Y}_1^N,\tilde{Y}_2^N)\ne H_\nu\}$,  (3)

where for $\nu\in\{1,2\}$, we use $\mathbb{P}_\nu$ to denote the joint distribution of $(Z^n,\tilde{Y}_1^N,\tilde{Y}_2^N)$ under hypothesis $H_\nu$. In the remainder of this paper, we use $\beta_\nu(\gamma,P_1,P_2)$ to denote $\beta_\nu(\gamma,P_1,P_2|a,b,V,W)$ if there is no risk of confusion.

Inspired by [1], in this paper, we are interested in the maximal type-II error exponent with respect to a pair of target distributions $(P_1,P_2)$ for any decision rule at the fusion center whose type-I error probability decays exponentially fast with a certain fixed exponential rate for all pairs of distributions, i.e., given any $\lambda>0$, the optimal non-asymptotic type-II error exponent is

$E^*(n,\alpha,P_1,P_2,\lambda|a,b,V,W) := \sup\big\{E\in\mathbb{R}_+ : \exists\,\gamma \text{ s.t. } \beta_2(\gamma,P_1,P_2)\le\exp(-nE) \text{ and } \beta_1(\gamma,\tilde{P}_1,\tilde{P}_2)\le\exp(-n\lambda),\ \forall\,(\tilde{P}_1,\tilde{P}_2)\in\mathcal{P}(\mathcal{X})^2\big\}$.  (4)

II-B Assumption and Definitions

In this subsection, we state an assumption on the channel $V$ and several definitions needed to present our results succinctly.

In most practical distributed detection systems, the local decision rule at each sensor is a deterministic compressor or quantizer. However, under certain conditions, randomized local decision rules can be used to provide privacy [14, 15, 16] or to satisfy power constraints [17, Sec. IV]. We generally allow $V$ and the $W_k$'s to be random, with the following restriction on $V$. Let $\mathcal{V}$ be the set of stochastic matrices (channels) with $|\mathcal{X}|$ rows and $L$ columns whose rows contain a permutation of the rows of $\mathrm{I}$, the identity matrix.

Assumption 1.

$V\in\mathcal{V}$.

The set $\mathcal{V}$ includes all deterministic mappings and a subset of stochastic mappings, as long as for each $y\in[L]$, there exists an $x\in\mathcal{X}$ that maps directly to it, as illustrated in Figure 3(a). Note that Tsitsiklis [2] considers only deterministic local decision rules, i.e., deterministic channels as in Figure 3(b). The definition is extended in the obvious way if $|\mathcal{X}|>L$ (i.e., for all $y\in[L]$, there exists $x\in\mathcal{X}$ such that $V(y|x)=1$). There is no restriction on the channels $W_k$'s that are used to pre-process the source sequence $X^n$ and the first training sequence $Y_1^N$. This assumption on $V$ is used in the converse proof of Theorem 1.

Given any pair of distributions $(\tilde{Q},Q)\in\mathcal{P}(\mathcal{X})^2$ and any $\alpha\in(0,\infty)$, the generalized Jensen-Shannon divergence [18, Eqn. (3)] is defined as

$\mathrm{GJS}(\tilde{Q},Q,\alpha) := D\Big(Q\,\Big\|\,\frac{Q+\alpha\tilde{Q}}{1+\alpha}\Big) + \alpha D\Big(\tilde{Q}\,\Big\|\,\frac{Q+\alpha\tilde{Q}}{1+\alpha}\Big)$.  (5)
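For concreteness, the divergence in (5) can be evaluated numerically from probability vectors. The following is a minimal sketch (function names are our choices, not the paper's notation):

```python
import math

def kl(p, q):
    """KL divergence D(p || q) in nats; assumes supp(p) is contained in supp(q)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def gjs(q_tilde, q, alpha):
    """Generalized Jensen-Shannon divergence GJS(q_tilde, q, alpha) of eq. (5)."""
    mix = [(qi + alpha * ti) / (1 + alpha) for qi, ti in zip(q, q_tilde)]
    return kl(q, mix) + alpha * kl(q_tilde, mix)
```

Note that $\mathrm{GJS}(\tilde{Q},Q,\alpha)=0$ when $\tilde{Q}=Q$, and for $\alpha=1$ the quantity is twice the usual Jensen-Shannon divergence.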

Let $Q=\{Q_k\}_{k\in[K]}$ and $\tilde{Q}=\{\tilde{Q}_k\}_{k\in[K]}$ be two collections of distributions. Given any $\alpha\in(0,\infty)$, any $(a,b)$, any $W=\{W_k\}_{k\in[K]}$ and any pair $(P,\tilde{P})\in\mathcal{P}(\mathcal{X})^2$, define the following linear combination of divergences

$\mathrm{LD}(Q,\tilde{Q},P,\tilde{P}|\alpha,a,b,W) := \sum_{k\in[K]}\big(a_k D(Q_k\|PW_k) + \alpha b_k D(\tilde{Q}_k\|\tilde{P}W_k)\big)$,  (6)

and given any $\lambda>0$, define the following set

$\mathcal{Q}_\lambda(\alpha,a,b,W) := \big\{(Q,\tilde{Q})\in\mathcal{P}([L])^{2K} : \min_{\tilde{P}\in\mathcal{P}(\mathcal{X})}\mathrm{LD}(Q,\tilde{Q},\tilde{P},\tilde{P}|\alpha,a,b,W)\le\lambda\big\}$.  (7)

II-C Main Results

The following theorem is our main result and presents a single-letter expression for the optimal type-II exponent.

Theorem 1.

Given any $\lambda>0$, any pair of target distributions $(P_1,P_2)\in\mathcal{P}(\mathcal{X})^2$, and any $\alpha\in(0,\infty)$,

$\lim_{n\to\infty} E^*(n,\alpha,P_1,P_2,\lambda|a,b,V,W) = \min_{(Q,\tilde{Q})\in\mathcal{Q}_\lambda(\alpha,a,b,W)} \mathrm{LD}(Q,\tilde{Q},P_2,P_1|\alpha,a,b,W)$.  (8)

The proof of Theorem 1 is given in Appendix -A. Several remarks are in order.

Firstly, in the achievability proof of Theorem 1, we make use of the following test at the fusion center

$\gamma(Z^n,\tilde{Y}_1^N,\tilde{Y}_2^N) = \begin{cases} H_1 & \text{if } \min_{\tilde{P}\in\mathcal{P}(\mathcal{X})} \mathrm{LD}\big(\{T_{Z^{na_k}}\}_{k\in[K]},\{T_{\tilde{Y}_1^{Nb_k}}\}_{k\in[K]},\tilde{P},\tilde{P}\,\big|\,\alpha,a,b,W\big)\le\lambda,\\ H_2 & \text{otherwise,}\end{cases}$  (9)

where for each $k\in[K]$, we use $Z^{na_k}$ to denote the collection of $Z_i$ where $i$ satisfies $h(i)=k$, and similarly for $\tilde{Y}_1^{Nb_k}$. We show that this rule is asymptotically optimal.

Secondly, Theorem 1 shows that the optimal type-II error exponent is independent of the channel $V$ that is used to pre-process the second training sequence $Y_2^N$. This is not unnatural in view of the form of the test in (9). Indeed, this test depends only on $Z^n$ and $\tilde{Y}_1^N$ (and not on $\tilde{Y}_2^N$).

Thirdly, the test in (9) is a generalization of Gutman’s test in [1]. To see this, we note that if we let $K=1$, $a_1=b_1=1$ and consider the deterministic identity channel $W_1=\mathrm{I}$, the test in (9) reduces to Gutman’s test using the data $(Z^n,\tilde{Y}_1^N)$ and the exponent in Theorem 1 reduces to the type-II exponent for binary classification [1, Thm. 3], i.e.,

$\gamma(Z^n,\tilde{Y}_1^N,\tilde{Y}_2^N) = \begin{cases} H_1 & \text{if } \mathrm{GJS}(T_{\tilde{Y}_1^N},T_{Z^n},\alpha)\le\lambda,\\ H_2 & \text{otherwise,}\end{cases}$  (10)

and

$\lim_{n\to\infty} E^*(n,\alpha,P_1,P_2,\lambda|a,b,V,W) = \min_{(Q,\tilde{Q})\in\mathcal{P}(\mathcal{X})^2:\,\mathrm{GJS}(\tilde{Q},Q,\alpha)\le\lambda} D(Q\|P_2)+\alpha D(\tilde{Q}\|P_1)$.  (11)
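The reduced test (10) is easy to simulate. The following self-contained sketch (helper names are ours) declares $H_1$ exactly when the GJS statistic between the types of the training and test sequences falls below the threshold $\lambda$:

```python
import math
from collections import Counter

def empirical_type(seq, alphabet):
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in alphabet]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def gjs(q_tilde, q, alpha):
    mix = [(qi + alpha * ti) / (1 + alpha) for qi, ti in zip(q, q_tilde)]
    return kl(q, mix) + alpha * kl(q_tilde, mix)

def gutman_test(z, y1, alphabet, lam):
    """Gutman's test (10): declare H1 iff GJS(T_{y1}, T_z, alpha) <= lam,
    where alpha = len(y1) / len(z) is the training-to-test length ratio."""
    alpha = len(y1) / len(z)
    stat = gjs(empirical_type(y1, alphabet), empirical_type(z, alphabet), alpha)
    return "H1" if stat <= lam else "H2"
```

For identical empirical types the statistic is zero and the test accepts $H_1$; for sequences with disjoint supports the statistic is large and the test declares $H_2$.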

II-D Further Discussions on $(a,b)$

For brevity, define

$f_\alpha(a,b,\lambda) := \min_{(Q,\tilde{Q})\in\mathcal{Q}_\lambda(\alpha,a,b,W)} \mathrm{LD}(Q,\tilde{Q},P_2,P_1|\alpha,a,b,W)$.  (12)

Since the type-II error exponent depends on $(a,b)$, inspired by the result in [2] which states that one local decision rule is optimal for binary hypothesis testing (in the Neyman-Pearson and Bayesian settings), we can further optimize the type-II error exponent with respect to the design of the proportion of channels (encoded in $a$ and $b$) and thus study

$f^*_\alpha(\lambda) := \max_{a,b} f_\alpha(a,b,\lambda)$  (13)

and the corresponding optimizers $a^*$ and $b^*$ for different values of $\alpha$. For this purpose, given any vector $v\in\mathbb{R}_+^K$ and any distribution $\tilde{P}\in\mathcal{P}(\mathcal{X})$, define

$\mathcal{P}(\tilde{P}|v,W) := \{P\in\mathcal{P}(\mathcal{X}) : \forall\,k\in[K],\ v_k\|PW_k-\tilde{P}W_k\|_\infty=0\}$.  (14)

Note that $\tilde{P}\in\mathcal{P}(\tilde{P}|v,W)$, and if $P\in\mathcal{P}(\tilde{P}|v,W)$, then $PW_k=\tilde{P}W_k$ for all $k$ such that $v_k>0$.

Furthermore, given any pair $(a,b)$, any $Q=\{Q_k\}_{k\in[K]}$ and any $P_1\in\mathcal{P}(\mathcal{X})$, let

$\kappa(Q,P_1|a,b,W) := \min_{\tilde{P}\in\mathcal{P}(P_1|b,W)} \sum_{k\in[K]} a_k D(Q_k\|\tilde{P}W_k)$.  (15)
Lemma 2.

The function $f_\alpha$ satisfies

$\lim_{\alpha\to\infty} f_\alpha(a,b,\lambda) = f_\infty(a,b,\lambda) := \min_{Q\in\mathcal{P}([L])^K:\,\kappa(Q,P_1|a,b,W)\le\lambda}\ \sum_{k\in[K]} a_k D(Q_k\|P_2W_k)$.  (16)

The proof of Lemma 2 is provided in Appendix -B.

Let $e_k$ be the $k$-th standard basis vector in $\mathbb{R}^K$, i.e., the vector that equals $1$ in the $k$-th location and $0$ in all other locations. We say that $a$ (or $b$) is deterministic if there exists a $k\in[K]$ such that $a=e_k$ (resp. $b=e_k$).

Corollary 3.

Given any $\lambda>0$, we have

$f^*_\infty(\lambda) := \max_{a,b} f_\infty(a,b,\lambda) = \max_{k\in[K]}\ \min_{Q\in\mathcal{P}([L]):\,D(Q\|P_1W_k)\le\lambda} D(Q\|P_2W_k)$,  (17)

and thus the maximizers for $f_\infty$ satisfy that $(a^*,b^*)$ are both deterministic and $a^*=b^*$.

The proof of Corollary 3 is provided in Appendix -C. Corollary 3 says that when the length of the training sequence is much longer than that of the test sequence, it is optimal to use a single local decision rule or channel to pre-process the training data and source sequence; this is analogous to [2, Theorem 1].

Given any $(a,b)$ and any $\alpha\in(0,\infty)$, let

$G_\alpha(a,b) := \min_{\tilde{P}\in\mathcal{P}(\mathcal{X})}\ \sum_{k\in[K]}\big(a_k D(P_2W_k\|\tilde{P}W_k) + \alpha b_k D(P_1W_k\|\tilde{P}W_k)\big)$.  (18)

Given any $\lambda>0$, let $\alpha^*(a,b,\lambda)$ be the solution (in $\alpha$) to the following equation

$\lambda = G_\alpha(a,b)$.  (19)

Since $G_\alpha(a,b)$ is an increasing function of $\alpha$, for any $\lambda>0$, the solution $\alpha^*(a,b,\lambda)$ to (19) is unique whenever it exists.

Lemma 4.

Given any $(a,b)$ and any $\lambda>0$, if $\alpha\le\alpha^*(a,b,\lambda)$, then

$f_\alpha(a,b,\lambda)=0$.  (20)

The proof of Lemma 4 is straightforward: when $\alpha\le\alpha^*(a,b,\lambda)$, we have $G_\alpha(a,b)\le\lambda$, so the pair $(Q,\tilde{Q})$ with $Q_k=P_2W_k$ and $\tilde{Q}_k=P_1W_k$ for all $k\in[K]$ lies in $\mathcal{Q}_\lambda(\alpha,a,b,W)$, and thus $f_\alpha(a,b,\lambda)\le\mathrm{LD}(Q,\tilde{Q},P_2,P_1|\alpha,a,b,W)=0$. The intuition is that when $\alpha$ is small enough, for any source sequence, the decision rule in (9) always declares $H_1$, which means that $\beta_2=1$, so the corresponding exponent is identically $0$.

We verified Lemma 4 numerically by plotting $f_\alpha(a,b,\lambda)$ as a function of $\alpha$ for certain values of $(a,b,\lambda)$ in Figure 4.

In the following, we present numerical results to illustrate the properties of $f_\alpha(a,b,\lambda)$ when $\alpha$ does not take extreme values. By calculating $f_\alpha(a,b,\lambda)$ for various $(a,b)$ and $\alpha$, we find that when $\alpha$ is moderate (i.e., $\alpha$ neither tends to $\infty$ nor is small enough for Lemma 4 to apply), the maximal value of $f_\alpha(a,b,\lambda)$ always lies at a corner point of the feasible set of $(a,b)$. This is shown numerically in Figure 5 for some choices of $\alpha$ and $\lambda$. We also illustrate this for the case of deterministic $(a,b)$ in Figure 6. Additional numerical results are shown in Appendix -D. Inspired by these numerical results, we present the following conjecture:

Conjecture 5.

For all $\alpha\in(0,\infty)$, the vectors $a^*$ and $b^*$ that maximize $f_\alpha(a,b,\lambda)$ are deterministic.

II-E Connections to Results in Distributed Detection

We discuss the connections between Theorem 1 and [2], which concerns distributed detection when the underlying distributions are known. The direct parts of the following results are corollaries of Theorem 1, Lemma 2 and Corollary 3, obtained by letting $\alpha\to\infty$ and by setting $\lambda=\lambda^*$ in (26), respectively. The (strong) converse parts follow from [2]. Since the justifications are straightforward, we omit them for the sake of brevity. Throughout this subsection, to emphasize the dependence of the error probabilities on $(a,b)$, we use $\beta_\nu(\gamma,P_1,P_2|a,b)$ to denote the type-$\nu$ error probability with respect to distributions $(P_1,P_2)$ when test $\gamma$ is used at the fusion center.

We first consider the Neyman-Pearson setting [19, Sec. 11.8]. Given any $\varepsilon\in(0,1)$, let $\Gamma_\varepsilon(a,b)$ be the set of tests $\gamma$ satisfying, for all $(\tilde{P}_1,\tilde{P}_2)\in\mathcal{P}(\mathcal{X})^2$,

$\beta_1(\gamma,\tilde{P}_1,\tilde{P}_2|a,b)\le\varepsilon$.  (21)

Let the optimal type-II error probability subject to (21) be

$\beta_2^*(P_1,P_2) := \min_{(a,b)\in\mathcal{P}_n([K])\times\mathcal{P}_{\alpha n}([K])}\ \min_{\gamma\in\Gamma_\varepsilon(a,b)} \beta_2(\gamma,P_1,P_2|a,b)$.  (22)

Note that $\beta_2^*(P_1,P_2)$ depends on $n$, $\alpha$ and $\varepsilon$, but this dependence is suppressed for the sake of brevity.

Corollary 6.

Given any $\varepsilon\in(0,1)$ and any $(P_1,P_2)\in\mathcal{P}(\mathcal{X})^2$,

$\lim_{\alpha\to\infty}\lim_{n\to\infty}\frac{1}{n}\log\frac{1}{\beta_2^*(P_1,P_2)} = \max_{k\in[K]} D(P_1W_k\|P_2W_k)$.  (23)
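Corollary 6 suggests a simple design rule when training data are abundant: use the single channel maximizing the divergence on the right-hand side of (23). A minimal sketch (function names ours; channels represented as row-stochastic matrices with entries $W[x][y]=W_k(y|x)$):

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def push_through(p, W):
    """Output distribution PW of input distribution p through channel W,
    where W[x][y] = W(y|x)."""
    return [sum(p[x] * W[x][y] for x in range(len(p))) for y in range(len(W[0]))]

def best_channel(p1, p2, channels):
    """Index k maximizing D(P1 W_k || P2 W_k), the exponent in (23)."""
    return max(range(len(channels)),
               key=lambda k: kl(push_through(p1, channels[k]),
                                push_through(p2, channels[k])))
```

For instance, a completely noisy channel maps both hypotheses to the same output distribution (zero divergence), so any less noisy channel is preferred.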

We also consider the Bayesian setting. Assume the prior probabilities for $H_1$ and $H_2$ are $\pi_1$ and $\pi_2$ respectively. Clearly, $\pi_1+\pi_2=1$. Given any $\gamma$ and any $(a,b)$, let the Bayesian error probability be

$P_{\mathrm{e}}(\gamma,P_1,P_2|a,b) := \pi_1\beta_1(\gamma,P_1,P_2|a,b)+\pi_2\beta_2(\gamma,P_1,P_2|a,b)$.  (24)

Furthermore, let the maximum Chernoff information between $P_1W_k$ and $P_2W_k$ over $k\in[K]$ be

$\lambda^* = \max_{k\in[K]}\ \max_{\rho\in[0,1]} \log\frac{1}{\sum_z (P_2W_k)^\rho(z)\,(P_1W_k)^{1-\rho}(z)}$,  (25)

and let $\Gamma_{\mathrm{Bayes}}(a,b)$ be the set of tests at the fusion center satisfying, for all $(\tilde{P}_1,\tilde{P}_2)\in\mathcal{P}(\mathcal{X})^2$,

$\beta_1(\gamma,\tilde{P}_1,\tilde{P}_2|a,b)\le\exp(-n\lambda^*)$.  (26)

Finally, let the optimal Bayesian error probability be

$P_{\mathrm{e}}^*(P_1,P_2) := \min_{(a,b)\in\mathcal{P}_n([K])\times\mathcal{P}_{\alpha n}([K])}\ \min_{\gamma\in\Gamma_{\mathrm{Bayes}}(a,b)} P_{\mathrm{e}}(\gamma,P_1,P_2|a,b)$.  (27)

Again, $P_{\mathrm{e}}^*(P_1,P_2)$ depends on both $n$ and $\alpha$, but this dependence is suppressed.

Corollary 7.

Given any $(P_1,P_2)\in\mathcal{P}(\mathcal{X})^2$ and any priors $(\pi_1,\pi_2)$,

$\lim_{\alpha\to\infty}\lim_{n\to\infty}\frac{1}{n}\log\frac{1}{P_{\mathrm{e}}^*(P_1,P_2)} = \lambda^*$.  (28)

Under the Bayesian setting, the exponents of the type-I and type-II error probabilities are equal [19, Thm. 11.9.1].

Note that Corollaries 6 and 7 are analogous to the results on distributed detection [2] for the binary case under the Neyman-Pearson and Bayesian settings respectively, where the true distributions are known. The intuition is that when the lengths of the training sequences are much longer than that of the source sequence (i.e., $\alpha\to\infty$), we can estimate the true distributions to arbitrary precision, i.e., as accurately as desired.
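The inner maximization over $\rho$ in (25) generally has no closed form; a grid search gives a serviceable per-channel approximation (a sketch under our own naming, with an arbitrary grid resolution):

```python
import math

def chernoff_information(p, q, grid=1000):
    """Approximate max_{rho in [0,1]} -log sum_z p(z)^rho * q(z)^(1-rho)
    by a grid search over rho; this is the per-channel quantity inside (25)."""
    best = 0.0
    for i in range(grid + 1):
        rho = i / grid
        s = sum((pz ** rho) * (qz ** (1.0 - rho))
                for pz, qz in zip(p, q) if pz > 0 and qz > 0)
        best = max(best, -math.log(s))
    return best
```

The quantity is zero when the two distributions coincide and strictly positive otherwise; $\lambda^*$ in (25) is then the maximum of this quantity over the $K$ channel-output pairs $(P_1W_k, P_2W_k)$.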

III $m$-ary Distributed Detection with the Rejection Option and Training Samples

In this section, we generalize the binary distributed detection problem to the scenario in which we desire to discriminate among $m$ hypotheses with a rejection option. Our main contribution here is the identification of an appropriate test statistic and a test that achieves the optimal rejection exponent for a fixed lower bound on all error exponents.

III-A Problem Formulation

In the $m$-ary distributed detection problem, there are $m$ training sequences $Y_1^N,\ldots,Y_m^N$, each generated i.i.d. according to an unknown distribution $P_j$ for $j\in[m]$. There are $n$ sensors. Each sensor observes a source symbol and compresses/processes it into a noisy version just as in the binary distributed detection problem. Given the noisy training sequences $\tilde{Y}_1^N,\ldots,\tilde{Y}_m^N$ and the compressed source sequence $Z^n$, in which $\tilde{Y}_{j,i}\in[L]$ for all $j\in[m]$ and $i\in[N]$, the fusion center uses a decision rule $\gamma$ to discriminate among the following $m+1$ hypotheses:

• $H_j$, $j\in[m]$: the source sequence and the $j$-th training sequence are generated according to the same distribution;

• $H_{\mathrm{r}}$: the source sequence is generated according to a distribution different from those from which the training sequences are generated, and hence we reject all $H_j$, $j\in[m]$.

Thus, the decision rule $\gamma$ partitions the sample space into $m+1$ disjoint regions: $m$ acceptance regions $\Lambda_j(\gamma)$, $j\in[m]$, where $\Lambda_j(\gamma)$ favors hypothesis $H_j$, i.e.,

$\Lambda_j(\gamma) := \{(z^n,\tilde{y}_1^N,\ldots,\tilde{y}_m^N)\in[L]^{mN+n} : \gamma(z^n,\tilde{y}_1^N,\ldots,\tilde{y}_m^N)=H_j\}$,  (29)

and one rejection region $\Lambda_{\mathrm{r}}(\gamma)$ which favors hypothesis $H_{\mathrm{r}}$. Note that here we assume that all training sequences are processed with channels in $W$ using the same index mapping function $g$. That is, all the first components $\tilde{Y}_{1,1},\ldots,\tilde{Y}_{m,1}$ are passed through the same channel, which is one element of $W$. The same is true for all the other components.

For conciseness, we set $\tilde{Y}^N:=(\tilde{Y}_1^N,\ldots,\tilde{Y}_m^N)$ and use $\tilde{y}^N$ similarly. Furthermore, we set $P:=(P_1,\ldots,P_m)$ and use $\tilde{P}$ similarly. Recall the definition of $\alpha$ and the assumption that $N=\alpha n$. Given any decision rule $\gamma$ at the fusion center and any tuple of distributions $P\in\mathcal{P}(\mathcal{X})^m$, the performance metrics we consider are the error probabilities and the rejection probabilities for each $j\in[m]$, i.e.,

$\beta_j(\gamma,P|a,b,W) := \mathbb{P}_j\{\gamma(Z^n,\tilde{Y}^N)\notin\{H_j,H_{\mathrm{r}}\}\}$,  (30)
$\beta_{\mathrm{r},j}(\gamma,P|a,b,W) := \mathbb{P}_j\{\gamma(Z^n,\tilde{Y}^N)=H_{\mathrm{r}}\}$.  (31)

We use $\beta_j(\gamma,P)$ and $\beta_{\mathrm{r},j}(\gamma,P)$ in place of $\beta_j(\gamma,P|a,b,W)$ and $\beta_{\mathrm{r},j}(\gamma,P|a,b,W)$ respectively if there is no risk of confusion. For this setting, we are interested in tests that can simultaneously ensure exponential decay of the error probabilities under any hypothesis for any tuple of distributions and exponential decay of the rejection probabilities under each hypothesis for a particular tuple of distributions. To be concrete, given any tuple of distributions $P\in\mathcal{P}(\mathcal{X})^m$ and any $\lambda>0$, we are interested in the following optimal exponent of the rejection probability under hypothesis $H_j$:

$E^*_j(n,\alpha,P,\lambda|a,b,W) := \sup\big\{E_j\in\mathbb{R} : \exists\,\gamma \text{ s.t. } \forall\,j\in[m],\ \beta_j(\gamma,\tilde{P})\le\exp(-n\lambda),\ \forall\,\tilde{P}\in\mathcal{P}(\mathcal{X})^m,$