# On Identifying a Massive Number of Distributions

We consider the problem of finding the underlying probability distributions of a set of observed sequences, under the constraint that each sequence is generated i.i.d. by a distinct distribution. The number of distributions, and hence the number of observed sequences, is allowed to grow with the observation blocklength $n$. Asymptotically matching upper and lower bounds on the probability of error are derived.


## I Introduction

Hypothesis testing is a classical problem in statistics where one is given a random observation vector and seeks to identify, from a given set of distributions, the one that generated it. Pioneering work in classical hypothesis testing includes the proof of the optimality of likelihood ratio tests under certain criteria in the Neyman-Pearson Theorem [1], the derivation of error exponents of different error types and their trade-offs for binary and M-ary hypothesis testing in [2] and [3], and the analysis of sequential hypothesis testing in [4].

The classical identification problem, which includes hypothesis testing as a special case, consists of a finite number of distinct sources, each generating a sequence of i.i.d. samples. The problem is to find the underlying distribution of each sample sequence, given the constraint that each sequence is generated by a distinct distribution. With this constraint, the number of hypotheses is exponential in the number of distributions. If one neglects the fact that the sequences are generated by distinct distributions, the problem boils down to multiple M-ary hypothesis testing problems. This approach is suboptimal, as it fails to exploit some of the (possibly useful) constraints.

In [5], the authors study the Logarithmically Asymptotically Optimal (LAO) testing of the identification problem for a finite number of distributions. In particular, they study the identification of only two different objects in detail and find the reliability matrix, which consists of the error exponents of all error types. Their optimality criterion is to find the largest error exponent for a set of error types, given the values of the error exponents of the other error types. The same problem with a different optimality criterion was also studied in [6], where multiple, finite, sequences were matched to the source distributions. More specifically, the authors of [6] proposed a test for a generalized Neyman-Pearson-like optimality criterion that minimizes the rejection probability given that all other error probabilities decay exponentially with a pre-specified slope.

Here, we assume that $A$ sequences of length $n$ are generated i.i.d. according to $A$ distinct distributions; in particular, the random vectors $X_i^n \overset{\text{i.i.d.}}{\sim} P_{\sigma_i}$, $i \in [1:A]$, for some unknown permutation $\sigma$ of the $A$ distributions. The goal is to reliably identify the permutation $\sigma$ with vanishing error probability as $n \to \infty$ from an observation of $(X_1^n, \ldots, X_A^n)$. This problem has close ties with de-anonymization of anonymized data [6]. A different motivation is the identification of users using only channel output sequences, without the use of pilot / explicit identification signals [7]. In both scenarios, the problem's difficulty increases with the number of users. In addition, in modeling systems with a massive number of users (such as the Internet of Things), it may be reasonable to assume that the number of users grows with the transmission blocklength [7][8], and that the users' identities must be distinguished from the received data. As a result, it is useful to understand exactly how the number of distributions affects the system performance, in particular when the number of distributions grows with the blocklength. Notice that in this scenario the number of hypotheses is doubly exponential in the blocklength, and the analysis of the optimal decoder becomes much harder than in the classical identification problem with a constant number of distributions.

Contributions. In this paper, we consider the identification problem for the case that the number of distributions grows with the observation blocklength, as motivated by the massive user identification problem in the Internet of Things paradigm. The key novel element in this work consists of analyzing and reducing the complexity of the optimal maximum likelihood decoder, which has a doubly exponential number of hypotheses, using a graph theoretic result. In particular:

1. We find matching upper and lower bounds on the probability of error. This result specifies the relation between the growth rate of the number of distributions and the pairwise distance of the distributions for reliable identification.

2. We show that the probability that more than two distributions are incorrectly identified is dominated by the probability of the event that only two distributions are incorrectly identified.

3. We show that the arithmetic mean of the cycle gains (where we define the cycle gain as the product of the edge weights within the cycle) in a graph can be upper bounded by a function of the sum of the squares of the edge weights. This may be of independent interest.

## II Notation

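Capital letters represent random variables that take on lower case letter values in calligraphic letter alphabets. For a finite alphabet $\mathcal{X}$, we use $\mathcal{P}_{\mathcal{X}}$ to denote the set of all possible distributions on $\mathcal{X}$. A vector of length $n$ is defined by $x^n := [x_1, \ldots, x_n]$. When all elements of the random vector $X^n$ are generated i.i.d. according to distribution $P$, we denote it as $X^n \overset{\text{i.i.d.}}{\sim} P$. We use $S_A$, where $|S_A| = A!$, to denote the set of all possible permutations of a set of $A$ elements. For a permutation $\sigma \in S_A$, $\sigma_i$ denotes the $i$-th element of the permutation. $\lfloor i \rfloor_r$ is used to denote the remainder of $i$ divided by $r$. The indicator function of event $\mathcal{E}$ is denoted by $\mathbb{1}_{\{\mathcal{E}\}}$. We use the notation $a_n \doteq b_n$ when $\lim_{n \to \infty} \frac{1}{n} \log \frac{a_n}{b_n} = 0$.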

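$K_k(a_1, \ldots, a_{n_k})$ is the complete graph with $k$ nodes with edge index $i \in [1:n_k]$ and edge weights $a_i$. We may drop the edge argument and simply write $K_k$ when the edge specification is not needed. A cycle $c$ of length $r$ in $K_k$ may be interchangeably defined by a vector of vertices as $c^{(v)} = [v_1, \ldots, v_r]$ or by a set of edges $c^{(e)} = \{a_1, \ldots, a_r\}$, where $a_i$, $i \in [1:r-1]$, is the edge between $(v_i, v_{i+1})$ and $a_r$ is that between $(v_r, v_1)$. With this notation, $c^{(v)}(i)$ is then used to indicate the $i$-th vertex of the cycle $c$. $C_k^{(r)}$ is used to denote the set of all cycles of length $r$ in the complete graph $K_k$. The cycle gain, denoted by $G(c)$, for cycle $c$ is the product of the edge weights within the cycle $c$, i.e., $G(c) = \prod_{a_i \in c^{(e)}} a_i$.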

## III Problem formulation

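Let $P := \{P_1, \ldots, P_A\}$ consist of $A$ distinct distributions and also let $\Sigma$ be uniformly distributed over $S_A$, the set of permutations of $A$ elements. In addition, assume that we have $A$ independent random vectors $X_1^n, \ldots, X_A^n$ of length $n$ each. For $\sigma$, a realization of $\Sigma$, assign the distribution $P_{\sigma_i}$ to the random vector $X_i^n$, $i \in [1:A]$. After observing a sample $x^{nA} = [x_1^n, \ldots, x_A^n]$ of the random vector $X^{nA} = [X_1^n, \ldots, X_A^n]$, we would like to identify $P_{\sigma_i}$ for all $i \in [1:A]$. More specifically, we are interested in finding a permutation $\hat{\sigma}$, as a function of the observation, to indicate that $X_i^n \overset{\text{i.i.d.}}{\sim} P_{\hat{\sigma}_i}$, $i \in [1:A]$. Let $\hat{\Sigma} := \hat{\sigma}(X^{nA})$.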

The average probability of error for the set of distributions is given by

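$$
\begin{aligned}
P_e^{(n)} &= \mathbb{P}\big[\hat{\Sigma} \neq \Sigma\big] \\
&= \frac{1}{A!} \sum_{\sigma \in S_A} \mathbb{P}\Big[\hat{\Sigma} \neq \sigma \,\Big|\, X_i^n \overset{\text{i.i.d.}}{\sim} P_{\sigma_i},\ \forall i \in [1:A]\Big] \\
&= \mathbb{P}\Big[\hat{\Sigma} \neq [1:A] \,\Big|\, H_{(1,\ldots,A)}\Big],
\end{aligned}
\tag{1}
$$

where $H_{(1,\ldots,A)} := \big\{X_i^n \overset{\text{i.i.d.}}{\sim} P_i,\ \forall i \in [1:A]\big\}$.

We say that a set of distributions is identifiable if $\lim_{n \to \infty} P_e^{(n)} = 0$.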

###### Theorem 1.

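A sequence of distributions $P = \{P_1, \ldots, P_{A_n}\}$ is identifiable iff

$$
\lim_{n \to \infty} \sum_{1 \leq i < j \leq A_n} e^{-2nB(P_i, P_j)} = 0,
$$

where $B(P_i, P_j)$ is the Bhattacharyya distance between the distributions $P_i$ and $P_j$.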
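The criterion in Theorem 1 is easy to evaluate numerically for a given finite family of distributions. The sketch below is illustrative only: it assumes the standard definition $B(P, Q) = -\log \sum_x \sqrt{P(x)Q(x)}$ of the Bhattacharyya distance, represents each distribution as a list of probabilities over a common finite alphabet, and uses the hypothetical helper names `bhattacharyya` and `identifiability_sum`.

```python
import math
from itertools import combinations


def bhattacharyya(p, q):
    """Bhattacharyya distance between two pmfs on the same finite alphabet.

    Assumes the standard definition B(P, Q) = -log sum_x sqrt(P(x) Q(x)).
    """
    return -math.log(sum(math.sqrt(a * b) for a, b in zip(p, q)))


def identifiability_sum(dists, n):
    """Sum over all pairs i < j of exp(-2 n B(P_i, P_j)), as in Theorem 1."""
    return sum(
        math.exp(-2 * n * bhattacharyya(p, q))
        for p, q in combinations(dists, 2)
    )


if __name__ == "__main__":
    # Three example pmfs on a binary alphabet (illustrative values only).
    dists = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]
    for n in (1, 10, 100):
        print(n, identifiability_sum(dists, n))
```

For a fixed family of distinct distributions the sum decays exponentially in $n$; the regime of interest in the paper is the one where the number of terms also grows with $n$.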

###### Proof.

As is evident from Theorem 1, for the case that $A_n$ is a constant or the case that , the sequence of distributions in $P$ is always identifiable and the probability of error in the identification problem decays to zero as the blocklength goes to infinity. The interesting aspect of Theorem 1 is in fact the regime in which $A_n$ increases exponentially with the blocklength.

To prove Theorem 1, we provide upper and lower bounds on the probability of error in the following subsections.

### III-A Upper bound on the probability of error

We use the optimal Maximum Likelihood (ML) decoder which minimizes the average probability of error, given by

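$$
\hat{\sigma}(x_1^n, \ldots, x_{A_n}^n) := \arg\max_{\sigma \in S_{A_n}} \sum_{i=1}^{A_n} \log\big(P_{\sigma_i}(x_i^n)\big), \tag{2}
$$

where $P_{\sigma_i}(x_i^n) = \prod_{t=1}^{n} P_{\sigma_i}(x_{i,t})$. The average probability of error associated with the ML decoder can also be written as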

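$$
\begin{aligned}
P_e^{(n)} &= \mathbb{P}\Big[\hat{\Sigma} \neq [1:A_n] \,\Big|\, H_{(1,\ldots,A_n)}\Big] \\
&= \mathbb{P}\Bigg[\bigcup_{\hat{\sigma} \neq [1:A_n]} \hat{\Sigma} = \hat{\sigma} \,\Bigg|\, H_{(1,\ldots,A_n)}\Bigg] \\
&= \mathbb{P}\Bigg[\bigcup_{r=2}^{A_n} \ \bigcup_{\hat{\sigma}:\, \left\{\sum_{i=1}^{A_n} \mathbb{1}_{\{\hat{\sigma}_i \neq i\}} = r\right\}} \hat{\Sigma} = \hat{\sigma} \,\Bigg|\, H_{(1,\ldots,A_n)}\Bigg] \qquad (3) \\
&= \mathbb{P}\Bigg[\bigcup_{r=2}^{A_n} \ \bigcup_{\hat{\sigma}:\, \left\{\sum_{i=1}^{A_n} \mathbb{1}_{\{\hat{\sigma}_i \neq i\}} = r\right\}} \ \sum_{i=1}^{A_n} \log \frac{P_{\hat{\sigma}_i}}{P_i}(X_i^n) \geq 0 \,\Bigg|\, H_{(1,\ldots,A_n)}\Bigg] \qquad (4)
\end{aligned}
$$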

where $\frac{P_{\hat{\sigma}_i}}{P_i}(X_i^n) := \frac{P_{\hat{\sigma}_i}(X_i^n)}{P_i(X_i^n)}$ and where (3) is due to the requirement that each sequence is distributed according to a distinct distribution, and hence the number of incorrect distributions ranges over $[2:A_n]$. Equation (4) is a consequence of the ML decoder defined in (2). In order to avoid considering the same set of error events multiple times, we incorporate a graph theoretic interpretation of $\big\{\sum_{i=1}^{A_n} \mathbb{1}_{\{\hat{\sigma}_i \neq i\}} = r\big\}$ in (4). Consider the two sequences $[i_1, \ldots, i_r]$ and $[\hat{\sigma}_{i_1}, \ldots, \hat{\sigma}_{i_r}]$ for which we have

$$
\Bigg\{\sum_{i=1}^{A_n} \mathbb{1}_{\{\hat{\sigma}_i \neq i\}} = \sum_{j=1}^{r} \mathbb{1}_{\{\hat{\sigma}_{i_j} \neq i_j\}} = r\Bigg\}.
$$

These two sequences in (4) indicate the event that we have (incorrectly) identified $P_{\hat{\sigma}_{i_j}}$ instead of the (true) distribution $P_{i_j}$, $j \in [1:r]$. For a complete graph $K_{A_n}$, the set of edges between $(i_j, \hat{\sigma}_{i_j})$, $j \in [1:r]$, in $K_{A_n}$ would produce a single cycle of length $r$ or a set of disjoint cycles with total length $r$. However, we should note that in the latter case, where the sequence of edges constructs a set of (let us say of size $L$) disjoint cycles (each with some length $r_l < r$ for $l \in [1:L]$ such that $\sum_{l=1}^{L} r_l = r$), those cycles and their corresponding sequences are already taken into account in the (union of the) set of error events.
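To make the decoder in (2) and the error count $r$ concrete, the following is a minimal brute-force sketch for a small number of distributions. It is not the authors' implementation: the function names (`log_likelihood`, `ml_permutation`, `num_errors`) are hypothetical, distributions are assumed to be dictionaries mapping symbols to strictly positive probabilities, and indices are 0-based. Exhaustive search over all permutations is only feasible for very small $A_n$.

```python
import itertools
import math


def log_likelihood(dist, seq):
    """Log-probability of an i.i.d. sequence under a pmf given as a dict.

    Assumes every observed symbol has strictly positive probability.
    """
    return sum(math.log(dist[x]) for x in seq)


def ml_permutation(dists, seqs):
    """Brute-force version of the decoder in (2).

    Returns sigma (0-based) maximizing sum_i log P_{sigma_i}(x_i^n)
    over all permutations of the distribution indices.
    """
    num = len(dists)
    # ll[i][j] = log P_j(x_i^n)
    ll = [[log_likelihood(d, s) for d in dists] for s in seqs]
    best = max(
        itertools.permutations(range(num)),
        key=lambda sigma: sum(ll[i][sigma[i]] for i in range(num)),
    )
    return list(best)


def num_errors(sigma_hat, sigma):
    """Number r of positions where the decoded permutation is wrong."""
    return sum(1 for a, b in zip(sigma_hat, sigma) if a != b)


if __name__ == "__main__":
    dists = [{0: 0.9, 1: 0.1}, {0: 0.5, 1: 0.5}, {0: 0.1, 1: 0.9}]
    # Hand-picked sequences that look typical for P_0, P_1, P_2 respectively.
    seqs = [[0] * 9 + [1], [0, 1] * 5, [1] * 9 + [0]]
    print(ml_permutation(dists, seqs))
```

Because the constraint forces a permutation, a decoded $\hat{\sigma}$ can never be wrong in exactly one position, which is why $r$ starts at 2 in (3).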

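As an example, assume $A_n = 4$ and consider the error event

$$
\log \frac{P_2}{P_1}(X_1^n) + \log \frac{P_1}{P_2}(X_2^n) + \log \frac{P_4}{P_3}(X_3^n) + \log \frac{P_3}{P_4}(X_4^n) \geq 0,
$$

which corresponds to the (error) event of choosing $\hat{\sigma} = [2, 1, 4, 3]$ over $\sigma = [1, 2, 3, 4]$ with $r = 4$ errors. In the graph representation, this gives two cycles of length $2$ each, which correspond to

$$
\Big\{\log \frac{P_2}{P_1}(X_1^n) + \log \frac{P_1}{P_2}(X_2^n) \geq 0\Big\} \ \cap \ \Big\{\log \frac{P_4}{P_3}(X_3^n) + \log \frac{P_3}{P_4}(X_4^n) \geq 0\Big\},
$$

and are already accounted for in the events with $r = 2$.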
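The decomposition used in this example can be sketched in a few lines. The helper below, with the hypothetical name `error_cycles`, takes a decoded permutation in the paper's 1-based notation and returns the cycles formed by the incorrectly identified indices; correctly identified indices (fixed points) are skipped.

```python
def error_cycles(sigma_hat):
    """Split the wrongly decoded indices of a permutation into disjoint cycles.

    sigma_hat is 1-based: sigma_hat[i - 1] is the distribution index chosen
    for sequence i, and the true permutation is assumed to be [1, ..., A].
    """
    seen = set()
    cycles = []
    for start in range(1, len(sigma_hat) + 1):
        if start in seen or sigma_hat[start - 1] == start:
            continue  # already visited, or correctly identified
        cycle = []
        v = start
        while v not in seen:
            seen.add(v)
            cycle.append(v)
            v = sigma_hat[v - 1]
        cycles.append(cycle)
    return cycles


if __name__ == "__main__":
    print(error_cycles([2, 1, 4, 3]))  # two cycles of length 2
    print(error_cycles([2, 3, 4, 1]))  # one cycle of length 4
```

Only permutations whose error indices form a single cycle contribute new events to the union bound; the others are intersections of events that are already counted for smaller $r$.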

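As a result, in order to avoid double counting, in calculating the value of (4) for each $r$ we should only consider the sets of sequences which produce a single cycle of length $r$. Hence, we can upper bound the probability of error in (4) as (where we drop the conditioning for ease of notation)

$$
\begin{aligned}
P_e^{(n)} &\leq \sum_{r=2}^{A_n} \sum_{c \in C_{A_n}^{(r)}} \mathbb{P}\Bigg[\sum_{i=1}^{r} \log \frac{P_{c^{(v)}(\lfloor i+1 \rfloor_r)}}{P_{c^{(v)}(i)}}\big(X^n_{c^{(v)}(i)}\big) \geq 0\Bigg] \\
&\leq \sum_{r=2}^{A_n} \sum_{c \in C_{A_n}^{(r)}} e^{-n \sum_{i=1}^{r} B\left(P_{c^{(v)}(i)},\, P_{c^{(v)}(\lfloor i+1 \rfloor_r)}\right)} \qquad (5) \\
&= \sum_{r=2}^{A_n} \sum_{c \in C_{A_n}^{(r)}} G(c), \qquad (6)
\end{aligned}
$$

where $r$ enumerates the number of incorrect matchings and where $c^{(v)}(i)$ is the $i$-th vertex in the cycle $c$. The inequality in (5) is by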

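$$
\begin{aligned}
\mathbb{P}\Bigg[\sum_{i=1}^{r} \log \frac{P_{c^{(v)}(\lfloor i+1 \rfloor_r)}}{P_{c^{(v)}(i)}}\big(X^n_{c^{(v)}(i)}\big) \geq 0\Bigg]
&\leq \exp\Bigg\{ n \inf_{t} \log \mathbb{E}\Bigg[\prod_{i=1}^{r} \Bigg(\frac{P_{c^{(v)}(\lfloor i+1 \rfloor_r)}}{P_{c^{(v)}(i)}}\big(X^n_{c^{(v)}(i)}\big)\Bigg)^{t}\Bigg]\Bigg\} \\
&\leq \exp\Bigg\{ n \sum_{i=1}^{r} \log \mathbb{E}\Bigg[\Bigg(\frac{P_{c^{(v)}(\lfloor i+1 \rfloor_r)}}{P_{c^{(v)}(i)}}\big(X^n_{c^{(v)}(i)}\big)\Bigg)^{1/2}\Bigg]\Bigg\} \qquad (7) \\
&= \exp\Bigg\{ -n \sum_{i=1}^{r} B\Big(P_{c^{(v)}(i)},\, P_{c^{(v)}(\lfloor i+1 \rfloor_r)}\Big)\Bigg\}.
\end{aligned}
$$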
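The last equality in (7) can be unpacked as follows. This is a sketch under two assumptions that this excerpt does not state explicitly: the Bhattacharyya distance is taken in its standard form $B(P, Q) = -\log \sum_x \sqrt{P(x) Q(x)}$, and the expectation is read per symbol, which is what the prefactor $n$ accounts for. For a single factor with $X \sim P_i$,

$$
\mathbb{E}_{X \sim P_i}\Bigg[\Bigg(\frac{P_j(X)}{P_i(X)}\Bigg)^{1/2}\Bigg]
= \sum_{x} P_i(x) \sqrt{\frac{P_j(x)}{P_i(x)}}
= \sum_{x} \sqrt{P_i(x) P_j(x)}
= e^{-B(P_i, P_j)}.
$$

Since the sequences indexed by different cycle vertices are independent, the expectation of the product factorizes, and summing the logarithms of the $r$ factors gives the exponent $-n \sum_{i=1}^{r} B(\cdot, \cdot)$.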

In (6), we have also defined $e^{-nB(P_i, P_j)}$ to be the edge weight between vertices $(i, j)$ in the complete graph $K_{A_n}$. Hence $G(c)$ is the gain of cycle $c$.

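The reason we used $t = 1/2$ in (7), instead of finding the exact optimizing $t$, is that $t = 1/2$ is the optimal choice for $r = 2$ and, as we will see later, the rest of the error events are dominated by the set of only $2$ incorrect distributions. This can be seen as follows for $r = 2$:

$$
\begin{aligned}
\mathbb{P}\Big[\log \frac{P_1}{P_2}(X_2^n) + \log \frac{P_2}{P_1}(X_1^n) \geq 0\Big]
&= \sum_{\substack{\hat{P}_1, \hat{P}_2:\ \sum_{x \in \mathcal{X}} \hat{P}_1(x) \log \frac{P_2(x)}{P_1(x)} \\ +\, \hat{P}_2(x) \log \frac{P_1(x)}{P_2(x)} \geq 0}} \exp\Big\{-nD(\hat{P}_1 \| P_1) - nD(\hat{P}_2 \| P_2)\Big\} \\
&\doteq e^{-nD(\tilde{P} \| P_1) - nD(\tilde{P} \| P_2)} = e^{-2nB(P_1, P_2)}, \qquad (8)
\end{aligned}
$$

where $\tilde{P}$ in the first equality in (8) can be shown, by using the Lagrangian method, to be equal to $\tilde{P}(x) = \frac{\sqrt{P_1(x) P_2(x)}}{\sum_{x'} \sqrt{P_1(x') P_2(x')}}$, and subsequently the second equality in (8) follows.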
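The identity behind the last step of (8), $D(\tilde{P} \| P_1) + D(\tilde{P} \| P_2) = 2B(P_1, P_2)$ for the normalized geometric mean $\tilde{P}$, can be checked numerically. The sketch below assumes the standard definitions of the KL divergence and the Bhattacharyya distance with natural logarithms; the function names are hypothetical and the two example distributions are arbitrary.

```python
import math


def kl(p, q):
    """KL divergence D(p || q) in nats; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def bhattacharyya(p, q):
    """Bhattacharyya distance B(p, q) = -log sum_x sqrt(p(x) q(x))."""
    return -math.log(sum(math.sqrt(a * b) for a, b in zip(p, q)))


def geometric_mean_dist(p, q):
    """Normalized geometric mean of p and q (the tilted distribution)."""
    w = [math.sqrt(a * b) for a, b in zip(p, q)]
    z = sum(w)
    return [x / z for x in w]


if __name__ == "__main__":
    p1 = [0.7, 0.2, 0.1]
    p2 = [0.2, 0.3, 0.5]
    t = geometric_mean_dist(p1, p2)
    print(kl(t, p1) + kl(t, p2), 2 * bhattacharyya(p1, p2))
```

The two printed numbers agree up to floating-point error, which matches the exponent $2nB(P_1, P_2)$ obtained from the Chernoff bound with $t = 1/2$.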

In order to further simplify the expression in (6), we use the following graph theoretic Lemma, the proof of which is given in the Appendix.

###### Lemma 1.

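In a complete graph $K_k(a_1, \ldots, a_{n_k})$ and for the set of cycles of length $r$, $C_k^{(r)} = \{c_1, \ldots, c_{N_{r,k}}\}$, we have

$$
\frac{1}{N_{r,k}} \big(G(c_1) + \ldots + G(c_{N_{r,k}})\big) \leq \Bigg(\frac{a_1^2 + \ldots + a_{n_k}^2}{n_k}\Bigg)^{\frac{r}{2}},
$$

where $N_{r,k}$ and $n_k$ are the number of cycles of length $r$ and the number of edges in the complete graph $K_k$, respectively.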
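Lemma 1 can be sanity-checked numerically on a small complete graph. The following sketch is only an illustration on random nonnegative weights, not a proof: it enumerates all cycles of length $r \geq 3$ in $K_k$ by brute force, and the names `cycles`, `cycle_gain` and `lemma1_sides` are hypothetical.

```python
import itertools
import random


def cycles(k, r):
    """Yield each undirected cycle of length r >= 3 in K_k exactly once."""
    for verts in itertools.combinations(range(k), r):
        first, rest = verts[0], verts[1:]
        for perm in itertools.permutations(rest):
            # Fix the starting vertex and keep one of the two orientations.
            if perm[0] < perm[-1]:
                yield (first,) + perm


def cycle_gain(cycle, w):
    """Product of the edge weights along the cycle (including the wrap-around edge)."""
    gain = 1.0
    for u, v in zip(cycle, cycle[1:] + cycle[:1]):
        gain *= w[min(u, v), max(u, v)]
    return gain


def lemma1_sides(k, r, w):
    """Return (mean cycle gain, (mean squared edge weight)^(r/2))."""
    gains = [cycle_gain(c, w) for c in cycles(k, r)]
    lhs = sum(gains) / len(gains)
    rhs = (sum(a * a for a in w.values()) / len(w)) ** (r / 2)
    return lhs, rhs


if __name__ == "__main__":
    rng = random.Random(0)
    k = 5
    w = {(i, j): rng.random() for i in range(k) for j in range(i + 1, k)}
    for r in (3, 4, 5):
        print(r, lemma1_sides(k, r, w))
```

In each printed pair the first number should not exceed the second, in line with the Lemma.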

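By Lemma 1 and (6) we can write

$$
\begin{aligned}
P_e^{(n)} &\leq \sum_{r=2}^{A_n} \sum_{c \in C_{A_n}^{(r)}} G(c) \\
&\leq \sum_{r=2}^{A_n} \frac{N_{r,A_n}}{(n_{A_n})^{\frac{r}{2}}} \big(a_1^2 + \ldots + a_{n_{A_n}}^2\big)^{r/2} \\
&\leq \sum_{r=2}^{A_n} 4^r \Bigg(\sum_{1 \leq i < j \leq A_n} e^{-2nB(P_i, P_j)}\Bigg)^{r/2} \qquad (9) \\
&\leq \ldots \qquad (10)
\end{aligned}
$$

where (9) is by Fact 1 (see Appendix) and

$$
\frac{N_{r,A_n}}{(n_{A_n})^{r/2}} = \frac{\binom{A_n}{r} \frac{(r-1)!}{2}}{\binom{A_n}{2}^{r/2}} \leq 4^r.
$$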

The upper bound on the probability of error in (10) goes to zero if

$$
\lim_{n \to \infty} \sum_{1 \leq i < j \leq A_n} e^{-2nB(P_i, P_j)} = 0.
$$

As a result of Lemma 1, it can be seen from (9) that the sum of the probabilities that $r \geq 3$ distributions are incorrectly identified is dominated by the probability that only $r = 2$ distributions are incorrectly identified. This shows that the most probable error events are indeed those with two wrong distributions.

### III-B Lower bound on the probability of error

For our converse, we use the optimal ML decoder and, as a lower bound to the probability of error in (4), we only consider the set of error events with only two incorrect distributions, i.e., the set of events with $r = 2$. In this case we have

$$
P_e^{(n)} \geq \mathbb{P}\Bigg[\bigcup_{1 \leq i < j \leq A_n} \xi_{i,j}\Bigg] \geq \ldots \qquad (11)
$$

where (11) is by [9] and where

$$
\xi_{i,j} := \Big\{\log \frac{P_i}{P_j}(X_j^n) + \log \frac{P_j}{P_i}(X_i^n) \geq 0\Big\}. \tag{12}
$$

We upper bound the denominator of (11) by

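$$
\begin{aligned}
\mathbb{P}[\xi_{i,j}, \xi_{i,k}] &= \mathbb{P}\Big[\Big\{\log \frac{P_i}{P_j}(X_j^n) + \log \frac{P_j}{P_i}(X_i^n) \geq 0\Big\} \cap \Big\{\log \frac{P_i}{P_k}(X_k^n) + \log \frac{P_k}{P_i}(X_i^n) \geq 0\Big\}\Big] \\
&\leq \mathbb{P}\Big[\log \frac{P_i}{P_j}(X_j^n) + \log \frac{P_j}{P_i}(X_i^n) + \log \frac{P_i}{P_k}(X_k^n) + \log \frac{P_k}{P_i}(X_i^n) \geq 0\Big] \\
&\leq \exp\Bigg\{ n \inf_{t} \log \mathbb{E}\Bigg[\Bigg(\frac{P_i}{P_j}(X_j^n) \cdot \frac{P_j}{P_i}(X_i^n) \cdot \frac{P_i}{P_k}(X_k^n) \cdot \frac{P_k}{P_i}(X_i^n)\Bigg)^{t}\Bigg]\Bigg\} \\
&\leq \exp\Bigg\{ n \log \mathbb{E}\Bigg[\Bigg(\frac{P_i}{P_j}(X_j^n) \cdot \frac{P_j}{P_i}(X_i^n) \cdot \frac{P_i}{P_k}(X_k^n) \cdot \frac{P_k}{P_i}(X_i^n)\Bigg)^{\frac{1}{2}}\Bigg]\Bigg\} \\
&= \exp\big\{-nB(P_i, P_j) - nB(P_j, P_k) - nB(P_i, P_k)\big\}. \qquad (13)
\end{aligned}
$$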

An upper bound for $\mathbb{P}[\xi_{i,j}, \xi_{k,l}]$ can be derived accordingly. By substituting (8) and (13) in (11) we have

 P(n)e≥ ≥(∑i,je−2nB(Pi,Pj))28(∑1≤i

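where (14) is by Lemma 1. As can be seen from (15), if $\lim_{n \to \infty} \sum_{1 \leq i < j \leq A_n} e^{-2nB(P_i, P_j)} \neq 0$, the probability of error is bounded away from zero. As a result, we must have $\lim_{n \to \infty} \sum_{1 \leq i < j \leq A_n} e^{-2nB(P_i, P_j)} = 0$, which also matches our upper bound on the probability of error in (10). ∎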

## IV Conclusion

In this paper, we generalized the identification problem to the case that the number of distributions grows with the blocklength $n$. We found matching upper and lower bounds on the probability of identification error. This result characterizes the relation between the number of distributions and the pairwise distance of the distributions for reliable identification.

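## Appendix: Proof of Lemma 1

We first consider the case that $r$ is an even number and then prove

$$
r (n_k)^{\frac{r}{2}-1} \big(G(c_1) + \ldots + G(c_{N_{r,k}})\big) \leq \frac{N_{r,k}\, r}{n_k} \big(a_1^2 + \ldots + a_{n_k}^2\big)^{\frac{r}{2}}. \tag{16}
$$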

We may drop the subscripts and use $N$ and $n$ in the following for notational ease. Our goal is to expand the right hand side (RHS) of (16) such that all elements have coefficient $1$. Then, we parse these elements into different groups (details will be provided later) such that, using the AM-GM inequality (i.e., $\frac{x_1 + \ldots + x_m}{m} \geq (x_1 \cdots x_m)^{\frac{1}{m}}$ for nonnegative $x_1, \ldots, x_m$) on each group, we get one of the terms on the left hand side (LHS) of (16). Before stating the rigorous proof, we provide an example of this strategy for the graph $K_4$ with $4$ vertices shown in Fig. 1. In this example, we consider the Lemma for cycles of length $r = 4$ (for which we have $N_{4,4} = 3$).

We may expand the RHS in (16) as

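$$
\begin{aligned}
2\big(a_1^2 + \ldots + a_6^2\big)^2 &= \Theta_1 + \Theta_2 + \Theta_3, \\
\Theta_1 &= \big\{a_1^4 + a_2^4 + a_3^4 + a_4^4 + a_1^2a_3^2 + a_1^2a_3^2 + a_2^2a_4^2 + a_2^2a_4^2 \\
&\quad + a_1^2a_2^2 + a_1^2a_2^2 + a_1^2a_2^2 + a_1^2a_2^2 + a_1^2a_4^2 + a_1^2a_4^2 + a_1^2a_4^2 + a_1^2a_4^2 \\
&\quad + a_2^2a_3^2 + a_2^2a_3^2 + a_2^2a_3^2 + a_2^2a_3^2 + a_3^2a_4^2 + a_3^2a_4^2 + a_3^2a_4^2 + a_3^2a_4^2\big\}, \\
\Theta_2 &= \big\{a_1^4 + a_6^4 + a_3^4 + a_5^4 + a_5^2a_6^2 + a_5^2a_6^2 + a_1^2a_3^2 + a_1^2a_3^2 \\
&\quad + a_1^2a_6^2 + a_1^2a_6^2 + a_1^2a_6^2 + a_1^2a_6^2 + a_1^2a_5^2 + a_1^2a_5^2 + a_1^2a_5^2 + a_1^2a_5^2 \\
&\quad + a_3^2a_6^2 + a_3^2a_6^2 + a_3^2a_6^2 + a_3^2a_6^2 + a_3^2a_5^2 + a_3^2a_5^2 + a_3^2a_5^2 + a_3^2a_5^2\big\}, \\
\Theta_3 &= \big\{a_4^4 + a_5^4 + a_2^4 + a_6^4 + a_5^2a_6^2 + a_5^2a_6^2 + a_2^2a_4^2 + a_2^2a_4^2 \\
&\quad + a_4^2a_5^2 + a_4^2a_5^2 + a_4^2a_5^2 + a_4^2a_5^2 + a_4^2a_6^2 + a_4^2a_6^2 + a_4^2a_6^2 + a_4^2a_6^2 \\
&\quad + a_2^2a_5^2 + a_2^2a_5^2 + a_2^2a_5^2 + a_2^2a_5^2 + a_2^2a_6^2 + a_2^2a_6^2 + a_2^2a_6^2 + a_2^2a_6^2\big\}.
\end{aligned}
$$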

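It can be easily seen that if we use the AM-GM inequality on $\Theta_1$, $\Theta_2$ and $\Theta_3$, we get lower bounds equal to $24\, a_1 a_2 a_3 a_4$, $24\, a_1 a_6 a_3 a_5$ and $24\, a_4 a_5 a_2 a_6$, respectively, where $24 = r\, n^{\frac{r}{2}-1}$ with $r = 4$ and $n = 6$, and hence (16) holds in this example.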
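The bookkeeping in this example can be verified mechanically. The sketch below rebuilds the three groups from the cycles' edge lists, read off the leading fourth-power terms of $\Theta_1$, $\Theta_2$, $\Theta_3$, and checks on random nonnegative weights that the groups add up to $2(a_1^2 + \ldots + a_6^2)^2$ and that each group is at least $24$ times the corresponding cycle gain. The names `CYCLES`, `group_terms` and `evaluate` are hypothetical, and the pattern used to rebuild each group (each edge's fourth power once, opposite edge pairs twice, adjacent edge pairs four times) is inferred from the listed terms rather than stated in the text.

```python
import math
import random

# Edge indices of the three 4-cycles of K_4, in cyclic order.
CYCLES = [(1, 2, 3, 4), (1, 6, 3, 5), (4, 5, 2, 6)]


def group_terms(cycle):
    """Monomials a_i^2 a_j^2 (as index pairs) assigned to one cycle's group."""
    e = cycle
    terms = [(x, x) for x in e]                  # a_x^4, once per edge
    terms += 2 * [(e[0], e[2]), (e[1], e[3])]    # opposite edges, twice each
    for i in range(4):
        terms += 4 * [(e[i], e[(i + 1) % 4])]    # adjacent edges, four times each
    return terms


def evaluate(terms, a):
    """Sum of a_i^2 * a_j^2 over the listed monomials."""
    return sum(a[i] ** 2 * a[j] ** 2 for i, j in terms)


if __name__ == "__main__":
    rng = random.Random(1)
    a = {i: rng.random() for i in range(1, 7)}
    total = sum(evaluate(group_terms(c), a) for c in CYCLES)
    print(total, 2 * sum(v * v for v in a.values()) ** 2)
    for c in CYCLES:
        print(evaluate(group_terms(c), a), 24 * math.prod(a[i] for i in c))
```

Each group has $24$ monomials whose product is $(a_{e_1} a_{e_2} a_{e_3} a_{e_4})^{24}$, so the AM-GM inequality gives exactly $24\, G(c)$ as the lower bound.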

We proceed to prove Lemma 1 for arbitrary $k$ and (even) $r$. We propose the following scheme to group the elements on the RHS of (16), and then we prove that this grouping indeed leads to the claimed inequality in the Lemma.

Grouping scheme: For each cycle , we need a group of elements, , from the RHS of (16). In this regard, we consider all possible subsets of the edges of cycle with elements (e.g. ). For each one of these subsets, we find the respective elements from the RHS of (16) that is the multiplication of the elements in that subset. For example, for the subset , we consider the elements like for all possible from the RHS of (16). However, note that we do not assign all such elements to cycle only. If there are cycles of length that all contain , we should assign of the elements like to cycle (so that we can assign the same amount of elements to other cycles with similar edges).

We state some facts, which can be easily verified:

Fact 1. In a complete graph $K_k$, there are $N_{r,k} = \binom{k}{r} \frac{(r-1)!}{2}$ cycles of length $r$.

Fact 2. By expanding the RHS of (16) such that all elements have coefficient $1$, we end up with $N r\, n^{\frac{r}{2}-1}$ elements.

Fact 3. Expanding the RHS of (16) such that all elements have coefficient $1$, and finding their product yields

$$
(a_1 \times \ldots \times a_n)^{\frac{N r}{n}\, r\, n^{\frac{r}{2}-1}}.
$$

Fact 4. In the above grouping scheme, each element on the RHS of (16) is summed in exactly one group. Hence, by symmetry and Fact 2, each group is the sum of $r\, n^{\frac{r}{2}-1}$ elements.

Now, consider any two cycles $c_i, c_j$. Assume that using the above grouping scheme, we get the groups of elements $\Theta_i, \Theta_j$ (where by Fact 4 each one is the sum of $r\, n^{\frac{r}{2}-1}$ elements). If we apply the AM-GM inequality on each one of the two groups, we get