# Algorithms and Fundamental Limits for Unlabeled Detection using Types

Emerging applications of sensor networks for detection sometimes suggest that classical problems ought to be revisited under new assumptions. This is the case of binary hypothesis testing with independent, but not necessarily identically distributed, observations under the two hypotheses, a formalism so orthodox that it is used as an opening example in many detection classes. However, let us insert a new element, and address an issue with potential impact on strategies for dealing with "big data" applications: what would happen if the structure were streamlined such that data flowed freely throughout the system without provenance? How much information (for detection) is contained in the sample values, and how much in their labels? How should decision-making proceed in this case? The theoretical contribution of this work is to answer these questions by establishing the fundamental limits, in terms of error exponents, of the aforementioned binary hypothesis test with unlabeled observations drawn from a finite alphabet. Then, we focus on practical algorithms. A low-complexity detector, called ULR, solves the detection problem without attempting to estimate the labels. A modified version of the auction algorithm is then considered, and two new greedy algorithms with O(n^2) worst-case complexity are presented, where n is the number of observations. The detection operating characteristics of these detectors are investigated by computer experiments.


## I Introduction and Motivations

Mostly motivated by emerging applications of sensor networks, recent years have seen the birth of a field that can be referred to as signal processing with unlabeled data. This terminology refers to the bulk of classical algorithms and methods of signal processing, revisited under the new paradigm of a central unit that must process a vector of data received from certain peripheral units, but must do so, or chooses to do so, without access to the data labels, namely without knowing the original position of each datum inside the vector. The meaning given here to "labeling" is that of provenance, and it is not to be confused with the labeling obtained by data classification, as is typical, for instance, of machine learning applications. Note that in this work we are interested in the first case, in which the processing must proceed without labels by necessity; the case in which labeling is avoided as a matter of elegance is usually referred to as the random finite set (RFS) idea, and good entry points are [2, 3].

As a notional example, suppose that under the null hypothesis the observations of two sensors are independent and identically distributed (iid) unit-normal, and that under the alternative their means are shifted by two different amounts, one per sensor. The central decision-maker receives the two observations as an unordered set, and is specifically told that it should make no assumption about which observation came from which sensor. When the two shifts are well separated and each observation lies close to one of them, intuition suggests the natural pairing of observations with sensors, and hence that there is a fairly decent fit with the alternative hypothesis. How much decision-making performance has been lost by label-agnostic decision-making with respect to label-aware in this case? That is, how much information is in knowing who said what, as opposed simply to knowing what was said? And what about the case in which the two mean shifts are close to each other? Clearly the quality of the match is lower, but equally clearly the impact of making a labeling error is far lower.
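To make the intuition tangible, here is a minimal numerical sketch; the mean shifts and observed values below are illustrative placeholders chosen for this sketch, not the values used in the paper's example:

```python
import math

def normal_pdf(x, mu):
    # Unit-variance Gaussian density
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)

# Hypothetical mean shifts of the two sensors under the alternative
mu = [1.0, 3.0]
# Unordered pair of observations received by the fusion center
obs = [0.9, 2.8]

# Label-agnostic likelihood under the alternative: average over the
# two possible labelings, since no assumption on provenance is allowed
h1_agnostic = 0.5 * (normal_pdf(obs[0], mu[0]) * normal_pdf(obs[1], mu[1])
                     + normal_pdf(obs[0], mu[1]) * normal_pdf(obs[1], mu[0]))

# Likelihood under the null (iid unit-normal); labels are irrelevant here
h0 = normal_pdf(obs[0], 0.0) * normal_pdf(obs[1], 0.0)
```

Here the labeling that matches each observation to the closest mean shift contributes almost all of the averaged likelihood, which is the intuition invoked above.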

### I-A Related Work

Modern networks are vulnerable to malicious attacks. For instance, the civilian global positioning system (GPS) is particularly exposed to spoofing attacks [4], which can impair wireless ad-hoc communication systems [5] or alter the timing information in smart grids [6, 7]. As a consequence, the timestamp information of the system may be altered to the point that the data arriving at a central decision unit can be considered unlabeled.

Even in the absence of an attack, modern sensor networks and other networked inference/communication systems are similarly vulnerable, especially when faced with big-data applications. Indeed, one challenge for these systems is the possible presence at the fusion center of partially unordered data, the so-called out-of-sequence measurements (OOSM) issue. A prominent example is represented by distributed tracking systems, where the data received at the fusion center are partially unordered [8]. Similarly, networked control systems with packetized messages can be subject to various timing errors due to uncontrollable packet delays [9]. In [10], the lack of a precise timestamp on the data is considered in connection with the usage of the automatic identification system (AIS) in real-world maritime surveillance problems. The common denominator of all these examples is that data must be processed with partial or no information about their relative time/space ordering, which is often related to their provenance from a peripheral unit of the network.

A systematic study of the lack-of-provenance issue, nowadays referred to as the unlabeled data paradigm, has been prompted by [11, 12]. The authors of [11, 12] consider a signal recovery problem from a set of unlabeled linear projections. They also compare their unlabeled sensing formulation with the setting of compressed sensing (see, e.g., [13, 14]), and highlight connections with a classical problem in robotics known as simultaneous localization and mapping (SLAM) [15]. Very recent studies with a similar data-reconstruction focus can be found in [16, 17, 18, 19, 20].

In contrast to data reconstruction, our focus is on inference from unlabeled data, which has been addressed in the last few years by [21, 22, 23]. In particular, we elaborate on a model similar to that addressed in [23], under the assumption that data are drawn from a finite alphabet. The motivation is that modern applications of large wireless sensor networks frequently impose severe constraints on the delivered messages, due to limited sensor resources, e.g., energy, bandwidth, etc. In these applications, including the identities of the reporting sensors in the delivered messages might constitute an excessive burden [24] and, for the same reasons, the delivered data are usually constrained to belong to a finite alphabet of small cardinality.

### I-B Contribution

To illustrate our contribution, consider the already mentioned works [11, 12]. There, the authors find a fundamental limit for data reconstruction: if only unlabeled linear projections are observed, perfect recovery of the data vector is possible provided that the number of such projections is at least twice the vector dimension; conversely, if this number is smaller, there is no way to recover the original vector from its projections. Doubling the size is the fundamental limit for data reconstruction. Note, in passing, that the factor 2 is reminiscent of a fundamental result in compressed sensing theory, see [25, 13]. One goal of this paper is to develop a similar fundamental limit for binary detection, instead of reconstruction, from unlabeled data. To be concrete, suppose that the divergence between the data distributions under the two hypotheses is taken as a proxy of the asymptotic (n → ∞) theoretical optimal detection performance when one observes the vector X^n. We pose the question: what is the optimal theoretical detection performance in situations where only an unordered version of X^n is observed, namely, when we know the values of the entries of X^n but not their ordering? How much information for detection is contained in the entry labels, and hence is lost, and how much in the entry values, and hence is retained by the unlabeled version of X^n? The notional example presented above suggests that even the unlabeled version of X^n carries some information for detection, but not much more than this naïve notion is known. We fill this gap for a class of detection problems that will be formalized in (9).

After answering these questions, we take a step further. Characterizing the ultimate detection performance does not say much about the possibility of solving the unlabeled detection problem with practical detectors. This motivates us to investigate whether there exist detection algorithms with affordable computational complexity and acceptable performance for finite values of n. First, we show that the unlabeled detection problem with discrete data can be recast in the form of a classical assignment problem, for which optimal algorithms are known but can be highly inefficient for our problem. Then, we develop two new algorithms which require lower computational complexity. Computer simulations are presented to assess the detection performance and the computational burden of these detectors.

The remainder of this paper is organized as follows. The next section introduces the classical setup of detection with labeled data. Section III formalizes the unlabeled detection problem and presents the main theoretical results. Practical algorithms for unlabeled detection are considered in Sec. IV, while the results of computer experiments are presented in Sec. V. Section VI concludes the paper. Some technical material is postponed to Appendices A-D.

## II Classical Detection with Labeled Data

Let X^n = (X_1, …, X_n) be a vector whose entries are random variables defined over a common finite alphabet X, and let x^n be the corresponding realization. We focus on the asymptotic scenario n → ∞, and it is therefore appropriate to add a superscript n to specify the size of the vectors. Also, let P(X) denote the set of all probability mass functions (PMFs) on X. As usual, X^n also denotes the n-th extension of the alphabet X, namely, the set of concatenations of n letters from X, and P(X^n) denotes the set of PMFs over X^n.

The binary hypothesis test we consider is as follows. Under hypothesis H_1 the joint probability of the vector X^n is the product of n possibly non-identical marginal PMFs p_i ∈ P(X), i = 1, …, n. Likewise, under H_0 the joint probability is the product of n possibly non-identical marginal PMFs q_i ∈ P(X). This means that the data are independent but not necessarily identically distributed under both hypotheses. Formally, we have

$$X^n \sim r_{1:n}(x^n)=\prod_{i=1}^{n} r_i(x_i), \qquad \begin{cases} H_1: \; r_i(x_i)=p_i(x_i),\\ H_0: \; r_i(x_i)=q_i(x_i), \end{cases} \tag{1}$$

for i = 1, …, n. It is assumed throughout that p_i(x) > 0 and q_i(x) > 0, for all i and all x ∈ X. This simplifies some results and excludes the singular cases in which the test can be solved without error for finite n.

The Kullback–Leibler divergence from q to p is defined as [26] D(q‖p) = Σ_{x∈X} q(x) log[q(x)/p(x)], and the assumption of strictly positive PMFs implies that D(q‖p) exists and is finite for all the PMFs involved. All logarithms are to the same fixed base.
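As a quick numerical illustration of the divergence just defined (natural logarithms are used here for convenience; a different base only rescales all exponents by a constant):

```python
import math

def kl_divergence(q, p):
    # D(q || p) = sum_x q(x) * log(q(x) / p(x)); both PMFs are assumed
    # strictly positive, as in the text, so every term is finite
    return sum(qx * math.log(qx / px) for qx, px in zip(q, p))

# Two hypothetical PMFs on a three-letter alphabet
q = [0.5, 0.3, 0.2]
p = [0.1, 0.3, 0.6]

d_qp = kl_divergence(q, p)  # strictly positive, since q != p
d_pq = kl_divergence(p, q)  # generally different: D is not symmetric
```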

The error probabilities of test (1) are

$$P_0(X^n \notin \mathcal{A}_n), \tag{2}$$

$$P_1(X^n \in \mathcal{A}_n), \tag{3}$$

where A_n ⊆ X^n is the decision region in favor of H_0, and P_h denotes the probability operator under H_h, h = 0, 1.

For two sequences of distributions q_{1:∞} and p_{1:∞}, let us define the divergence rate¹

$$\bar{D}(q_{1:\infty}\,\|\,p_{1:\infty}) \;\triangleq\; \lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} D(q_i\|p_i). \tag{4}$$

¹ We often simplify the notation by omitting the argument: we simply write p_i, q_i for p_i(x), q_i(x), and similarly elsewhere.

We assume that the divergence rates encountered in this paper exist, are finite, and are continuous and convex functions of their arguments. This is a very mild requirement that rules out pathological choices of the sequences p_{1:∞} and q_{1:∞}, which are of no practical interest. Let us now introduce the error exponent function, and then state two classical results about the asymptotic error exponents of the hypothesis test.
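For a concrete instance of the divergence rate (4), consider a hypothetical sequence of distributions alternating between two fixed PMFs; the Cesàro average then converges to the mean of the two per-sample divergences:

```python
import math

def kl(q, p):
    # Kullback-Leibler divergence between two strictly positive PMFs
    return sum(qx * math.log(qx / px) for qx, px in zip(q, p))

# Hypothetical alternating construction: odd-indexed samples follow
# (qa, pa), even-indexed ones (qb, pb)
qa, pa = [0.7, 0.3], [0.4, 0.6]
qb, pb = [0.5, 0.5], [0.8, 0.2]

def partial_divergence_rate(n):
    # (1/n) * sum_{i=1}^n D(q_i || p_i) for the alternating sequence
    total = sum(kl(qa, pa) if i % 2 == 1 else kl(qb, pb)
                for i in range(1, n + 1))
    return total / n

# The limit in (4) for this construction is the balanced average
rate_limit = 0.5 * (kl(qa, pa) + kl(qb, pb))
```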

Definition (Error Exponent for Labeled Data): For α > 0, let us define

$$\Omega_{\mathrm{lab}}(\alpha) \;\triangleq \inf_{\omega_{1:\infty}\in\mathcal{P}(X^\infty):\; \bar{D}(\omega_{1:\infty}\|q_{1:\infty})<\alpha} \bar{D}(\omega_{1:\infty}\,\|\,p_{1:\infty}). \tag{5}$$

It is useful to bear in mind that Ω_lab(α) depends on the sequences p_{1:∞} and q_{1:∞}. When needed, we use the more precise notation Ω_lab(α; p_{1:∞}, q_{1:∞}).

Proposition 1 (Labeled Detection [27]): Consider the hypothesis test (1), and let α > 0.

• Let {A_n} be any sequence of acceptance regions for H_0. Then:

$$\liminf_{n\to\infty} -\frac{1}{n}\log P_0(X^n\notin\mathcal{A}_n) \ge \alpha \;\Longrightarrow\; \limsup_{n\to\infty} -\frac{1}{n}\log P_1(X^n\in\mathcal{A}_n) \le \Omega_{\mathrm{lab}}(\alpha). \tag{6}$$

• There exists a sequence {A*_n} of acceptance regions for H_0 such that

$$\liminf_{n\to\infty} -\frac{1}{n}\log P_0(X^n\notin\mathcal{A}^*_n) \ge \alpha, \tag{7a}$$

$$\lim_{n\to\infty} -\frac{1}{n}\log P_1(X^n\in\mathcal{A}^*_n) = \Omega_{\mathrm{lab}}(\alpha). \tag{7b}$$

Proof: This is a standard result and the proof is sketched in Appendix A.

The first part of the proposition states that, whatever the sequence of decision regions, if the type I error goes to zero exponentially at a rate not smaller than α, then the type II error goes to zero exponentially at a rate not larger than Ω_lab(α). The second part states that these limits are tight, in the sense that there exists a sequence of decision regions for which the best rate for the type II error is achieved.

For two sequences a_n and b_n, the symbol a_n ≐ b_n means equality to the first order in the exponent, namely lim_{n→∞} (1/n) log(a_n/b_n) = 0. We can summarize the content of Proposition 1 by saying that for problem (1) it is possible to find tests such that the type I error is ≐ e^{−nα} and the type II error is ≐ e^{−nΩ_lab(α)}, but no stronger pair of asymptotic expressions can be simultaneously verified. Note that, as α → 0, Ω_lab(α) approaches the divergence rate \bar D(q_{1:∞}‖p_{1:∞}). The following standard result emphasizes the operational meaning of this divergence rate.

Proposition 2 (Chernoff–Stein's Lemma [26]): Let

$$P^{*}_{n,\theta} \;=\; \min_{\mathcal{A}_n\subseteq X^n:\; P_0(X^n\notin\mathcal{A}_n)\le\theta} P_1(X^n\in\mathcal{A}_n),$$

where θ ∈ (0, 1). Then

$$\lim_{n\to\infty} -\frac{1}{n}\log P^{*}_{n,\theta} \;=\; \bar{D}(q_{1:\infty}\,\|\,p_{1:\infty}). \tag{8}$$

Proof: See Appendix A for a sketch of the proof.

In words: for a type I error probability constrained by an arbitrary θ ∈ (0, 1), the type II error exponent can be made equal to \bar D(q_{1:∞}‖p_{1:∞}), but not larger.

## III Detection with Unlabeled Data

Consider now the case of unlabeled data. Suppose that, instead of (1), we are faced with a binary hypothesis test in which we observe the unlabeled vector X^n_u = M(π) X^n, where M(π) is a permutation matrix, indexed by an unknown π. Namely, let us consider the following test:

$$X^n_u = M(\pi)\,X^n \quad \text{with} \quad X^n \sim r_{1:n}(x^n)=\prod_{i=1}^{n} r_i(x_i), \qquad \begin{cases} H_1: \; r_i(x_i)=p_i(x_i),\\ H_0: \; r_i(x_i)=q_i(x_i), \end{cases} \tag{9}$$

for i = 1, …, n, where the permutation matrix applied to the data is unknown.

We know that the observations are drawn from the PMFs {p_i} under H_1 and from the PMFs {q_i} under H_0, but we cannot make the association between observations and PMFs. In other words, under H_1, for each entry of X^n_u we do not know which among the PMFs p_1, …, p_n it has been drawn from, and the same is true under H_0, with the p_i replaced by the q_i.

Given a constraint on the type I error, what is the best asymptotic performance in terms of the exponent rate for the type II error, when one has access only to the unlabeled vector X^n_u? Does there exist an equivalent of Proposition 1 for unlabeled data? The answers are based on the following obvious but important lemma. Note that I(E) denotes the indicator of the event E.

Lemma (Unlabeled Vectors and Types): For independent random variables X_1, …, X_n drawn from a common finite alphabet X, knowledge of the unlabeled version X^n_u of the vector X^n is equivalent, for the detection purposes at hand, to knowledge of the type (or empirical PMF) of X^n, which is

$$t_{X^n}(x) \;\triangleq\; \frac{1}{n}\sum_{i=1}^{n} I(X_i = x), \qquad x \in X. \tag{10}$$

Thus, a detection problem where the observation is the unlabeled vector X^n_u is equivalent to a detection problem in which one observes the type t_{X^n} ∈ P_n(X), where P_n(X) denotes the class of n-types.
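The type in (10) is immediate to compute, and its invariance to permutations of the samples is exactly the content of the lemma; a minimal sketch:

```python
from collections import Counter

def type_of(xs, alphabet):
    # Empirical PMF (type) of the sample vector xs over the given alphabet
    counts = Counter(xs)
    n = len(xs)
    return {a: counts[a] / n for a in alphabet}

t = type_of([1, 3, 2, 1, 3], alphabet=[1, 2, 3])
# Any reordering of the samples yields the same type, so the unlabeled
# vector and the type carry the same information for detection
```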

Detection with unlabeled data can also be regarded in the framework of invariance theory [28, Chap. 6]. Under each hypothesis we have a class of possible distributions, because we only know that one of the n! permutation matrices has been applied to the unobserved X^n, but we do not know which. For this composite hypothesis test, we can consider the class of invariant tests under the group of permutations of the data, which are the tests that depend on the data only through the type vector t_{X^n}, see [28, Th. 6.2.1]. A UMP (uniformly most powerful) invariant test can be found as shown in [28, Th. 6.3.1], which reduces the composite problem to a simple hypothesis test. In the forthcoming Theorem 2 we use a different test, which is easier to analyze asymptotically.

Central to our development is the function defined, under both hypotheses H_h, h = 0, 1, as follows. Consider the reduced alphabet X' in which an arbitrarily selected entry, say x_0, is excluded. Recall from (9) that r_i denotes the distribution of the i-th observation. We let

$$\varphi_{H_h}(\lambda; r_i) \;\triangleq\; \log \sum_{x\in X} r_i(x)\, e^{\lambda(x)}, \tag{11}$$

where the vector λ has entries λ(x), x ∈ X', plus the dummy entry λ(x_0) = 0. Clearly, φ_{H_h}(0; r_i) = 0 for all i and h. In Appendix B it is shown that φ_{H_h}(λ; r_i) is strictly convex and twice continuously differentiable throughout its domain. It is also shown that its gradient is a mapping from the λ-space to a set of vectors with strictly positive entries which, with the addition of the entry corresponding to x_0, become probability distributions having strictly positive entries. Henceforth, we assume that the vector λ is enlarged by the addition of the dummy entry λ(x_0) = 0 and, likewise, the gradient vector is enlarged by the addition of the entry corresponding to x_0. This way, a point of the domain or range of the gradient mapping is specified by |X| coordinates. Using this formalism, Appendix B also shows that the gradient of (11) evaluated at the origin is r_i (namely, equal to p_i under H_1 and to q_i under H_0).
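The stated properties can be checked numerically. The sketch below evaluates the function in (11) for a hypothetical three-letter distribution and verifies by finite differences that the gradient at the origin recovers the distribution itself; treating the first coordinate as the dummy entry is an assumption of this sketch:

```python
import math

def phi(lam, r):
    # phi(lambda; r) = log sum_x r(x) * exp(lambda(x)); lam and r run over
    # the full alphabet, with the dummy coordinate lam[0] held at zero
    return math.log(sum(rx * math.exp(lx) for rx, lx in zip(r, lam)))

r = [0.2, 0.5, 0.3]  # hypothetical strictly positive PMF

def grad_phi_at_origin(r, h=1e-6):
    # Central finite differences in the free coordinates x = 1, 2, ...
    g = []
    for k in range(1, len(r)):
        lp = [0.0] * len(r); lp[k] = h
        lm = [0.0] * len(r); lm[k] = -h
        g.append((phi(lp, r) - phi(lm, r)) / (2 * h))
    return g

grad0 = grad_phi_at_origin(r)  # numerically close to r[1:]
```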

Let us introduce the arithmetic average of over the index :

 ψHh(λ;r1:∞)Δ=limn→∞1nn∑i=1φHh(λ;ri). (12)

The assumption of the theorems to be presented shortly is that the aforementioned properties of φ_{H_h}, shown in Appendix B, carry over to ψ_{H_h} after taking the arithmetic average. This is formalized in Assumption A below, which is certainly verified in situations of interest. One important example in which Assumption A is easily verified is when the infinite sequence of probability distributions r_{1:∞} contains only a finite number of different elements, in which case the arithmetic average in (12) reduces to a finite sum. Note also that strict convexity of ψ_{H_h} always follows from the analogous property of φ_{H_h}, because infinite positively-weighted sums of strictly convex functions preserve strict convexity [29]. Let \bar r denote the arithmetic average of the distributions in force.

Assumption A. For h = 0, 1, the function ψ_{H_h}(λ; r_{1:∞}) is finite, strictly convex and twice continuously differentiable, and vanishes at the origin. Its gradient defines a mapping from the λ-space to the set of strictly positive probability distributions, and at the origin it equals the arithmetic average \bar r of the distributions in force.

The Legendre transform of ψ_{H_h}(λ; r_{1:∞}) is [30]:

$$\Psi_{H_h}(\omega; r_{1:\infty}) \;=\; \sup_{\lambda} \Bigl\{ \sum_{x\in X'} \lambda(x)\,\omega(x) \;-\; \psi_{H_h}(\lambda; r_{1:\infty}) \Bigr\}, \tag{13}$$

where ω ∈ P(X). In the next definition we use the notation Ψ_{H_h}(ω) as an abbreviation for Ψ_{H_h}(ω; r_{1:∞}).

Definition (Error Exponent for Unlabeled Data): For α > 0, let:

$$\Omega(\alpha) \;\triangleq \inf_{\omega\in\mathcal{P}(X):\; \Psi_{H_0}(\omega)<\alpha} \Psi_{H_1}(\omega). \tag{14}$$

Theorem 1 (Properties of Ω(α)): The error exponent for unlabeled detection Ω(α) is continuous and convex in α, takes its maximum value at the origin, is strictly decreasing over an initial interval of α values, and is identically zero thereafter. In addition, for all α,

$$\Omega(\alpha;p_{1:\infty},q_{1:\infty}) \begin{cases} \le \Omega_{\mathrm{lab}}(\alpha;p_{1:\infty},q_{1:\infty}),\\ \ge \Omega(\alpha;\bar p,\, q_{1:\infty}),\\ \ge \Omega(\alpha;p_{1:\infty},\,\bar q),\\ \ge \Omega(\alpha;\bar p,\,\bar q). \end{cases} \tag{15}$$

When p_{1:∞} is a constant sequence, its average \bar p coincides with the common element of the sequence; the same holds for q_{1:∞}, and the quantities in (15) simplify accordingly.

Proof: The proof is given in Appendix B.

Our main theoretical result is contained in the following theorem, which provides the operational meaning of Ω(α) and extends Proposition 1 to unlabeled detection.

Theorem 2 (Unlabeled Detection): Consider the hypothesis test with unlabeled data formalized in (9). Suppose that Assumption A is verified, and let α > 0.

a) For any closed acceptance region E ⊆ P(X) for H_0:

$$\liminf_{n\to\infty} -\frac{1}{n}\log P_0(t_{X^n}\notin\mathcal{E}) \ge \alpha \;\Longrightarrow\; \limsup_{n\to\infty} -\frac{1}{n}\log P_1(t_{X^n}\in\mathcal{E}) \le \Omega(\alpha). \tag{16}$$

b) Setting E* = {ω ∈ P(X): Ψ_{H_0}(ω) ≤ α}, we get

$$\liminf_{n\to\infty} -\frac{1}{n}\log P_0(t_{X^n}\notin\mathcal{E}^*) \ge \alpha, \tag{17a}$$

$$\lim_{n\to\infty} -\frac{1}{n}\log P_1(t_{X^n}\in\mathcal{E}^*) = \Omega(\alpha). \tag{17b}$$

Proof: The proof is given in Appendix C.

Note that the asymptotically optimal region E* of part b) does not require knowledge of the sequence p_{1:∞}. The interpretation of Theorem 2 is similar to that of Proposition 1: with unlabeled data it is possible to find tests such that the type I error is ≐ e^{−nα} and the type II error is ≐ e^{−nΩ(α)}, but no stronger pair of asymptotic expressions can be simultaneously achieved. Figure 1 depicts the typical behavior of the error exponent Ω(α).

Note by Theorem 1 that Ω(α) is upper bounded by the error exponent for labeled data, and lower bounded by the exponents obtained when the data under either (or both) hypotheses are drawn iid according to the average distributions \bar p or \bar q. The upper and lower bounds in (15) coincide when the data are iid under both hypotheses, as it must be.

As an example of application of the theorem, let us consider the binary alphabet case, and suppose that under H_1 half of the observations are drawn from one distribution and half from another; likewise, under H_0 half of the observations are drawn from a third distribution and half from a fourth. In this case the divergence rates appearing in definition (5) reduce to the balanced sum of only two divergences, and the infimum in (5) is computed over a correspondingly reduced set. The error exponents Ω(α) and Ω_lab(α) for this detection problem are depicted in Fig. 2, where different values of the distribution parameters are shown with the colors indicated by the color bar. Note that there exist combinations of the parameters for which Ω(α) and Ω_lab(α) are very close to each other, and there exist combinations for which the information contained in the labels is very relevant and Ω(α) is substantially smaller than Ω_lab(α). The extreme case Ω(α) = Ω_lab(α) is also possible: aside from the obvious iid case, this happens when the log-likelihood ratio is a function of the type of the observed vector, so that the optimal unlabeled detector performs as the optimal labeled one.

## IV Practical Algorithms for Unlabeled Detection

Part b) of Theorem 2 gives an explicit expression of the acceptance region of the optimal test. However, this leaves open many practical questions. First, the optimality of the test shown in Theorem 2 is only asymptotic, and little can be said about its performance for finite, possibly "small", values of n. Second, and more important, no attention has been paid to the computational complexity required to implement the test. Third, in some applications it is desirable to recover the lost labels. These practical aspects are now addressed by considering specific detectors.

A first detector is introduced by making an analogy with the following detection problem with labeled data: testing \bar p against \bar q when the entries of X^n are iid under both hypotheses, with \bar p and \bar q the average distributions introduced above. The optimal decision statistic for this test is the log-likelihood ratio which, for large n, relates to the statistic below in the sense shown in Appendix D. We then propose the following detection statistic for unlabeled data:

$$\sum_{x\in X} t_{x^n}(x)\,\log\frac{\bar p(x)}{\bar q(x)}, \tag{18}$$

which is referred to as the statistic of the unlabeled log-likelihood ratio (ULR) detector. Were the data actually drawn iid from the average distributions, the error exponent of the test would be Ω(α; \bar p, \bar q), which is only a lower bound to the optimal performance of unlabeled detection, as shown by Theorem 2. However, this tells nothing about the rate of convergence to zero of the detection errors, and nothing can be anticipated as to the performance of this detector. Its main advantage is its low computational complexity: with the type vector available, its implementation only requires |X| multiplications and additions, independently of n; the complexity is O(|X|).
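A sketch of the ULR statistic (18); the average distributions below are hypothetical, and the point to notice is that the statistic depends on the data only through their type:

```python
import math
from collections import Counter

def ulr_statistic(xs, p_bar, q_bar):
    # Unlabeled log-likelihood ratio (18): sum over the alphabet of
    # t(x) * log(p_bar(x) / q_bar(x)), with t the type of xs;
    # p_bar, q_bar map each letter to its average probability
    n = len(xs)
    counts = Counter(xs)
    return sum((counts[x] / n) * math.log(p_bar[x] / q_bar[x])
               for x in p_bar)

# Hypothetical average distributions on a binary alphabet
p_bar = {0: 0.3, 1: 0.7}
q_bar = {0: 0.7, 1: 0.3}

s_h1_like = ulr_statistic([1, 1, 1, 0], p_bar, q_bar)  # leans toward H1
s_h0_like = ulr_statistic([0, 0, 0, 1], p_bar, q_bar)  # leans toward H0
```

Since only the type enters the computation, the cost is proportional to the alphabet size once the type is available, in line with the complexity claim above.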

The ULR detector makes no attempt to estimate the labels. When an estimate of the lost labels is required, a different approach must be pursued. To elaborate, let X = {1, 2, …, |X|} be the observation alphabet, which entails no loss of generality, and let us start from the case in which the detector observes the labeled vector x^n, see (1). Let ℓ_s(i) = log[p_i(s)/q_i(s)] be the marginal log-likelihood ratio of the i-th observed sample when x_i = s, for s ∈ X and i = 1, …, n. Organizing these values in |X|-by-n matrix form, we have:

$$\begin{pmatrix} \ell_1(1) & \ell_1(2) & \cdots & \ell_1(n)\\ \ell_2(1) & \ell_2(2) & \cdots & \ell_2(n)\\ \vdots & & & \vdots\\ \ell_{|X|}(1) & \ell_{|X|}(2) & \cdots & \ell_{|X|}(n) \end{pmatrix} \tag{19}$$

The optimal log-likelihood ratio statistic for test (1) is Σ_{i=1}^n ℓ_{x_i}(i), where x_i denotes the value taken by the i-th observation. The statistic involves n entries of matrix (19): precisely, one entry per column and, for each s ∈ X, as many entries of the s-th row as there are observations equal to s. In other words, regarding the above matrix as a trellis (left to right), the optimal log-likelihood statistic for test (1) is obtained by summing the entries belonging to a specific path over the trellis (19), namely the path that selects row x_i at column i.

The point with unlabeled detection is that we do not observe x^n but only its type t_{x^n}, so the optimal path across the trellis is unknown. Note that the "optimal" test (Bayesian, assuming that all permutations are equally likely) is the ratio of two averaged likelihoods: one is the sum of the likelihoods under H_1 over all possible permutations of the labels, divided by the number of permutations, and the other is the analogous average of the likelihoods under H_0. That this Bayesian test is infeasible for any reasonable problem size, even with the simplification of considering the types, is self-evident.

One possible approach to circumvent the lack of precise knowledge of the optimal path is to resort to the generalized likelihood ratio test (GLRT). The GLRT consists of replacing the unknown labeling by its maximum likelihood estimate under each hypothesis, and then constructing the ratio of the resulting likelihoods [31]. The GLRT is not an optimal test, but in many instances it may lead to well-performing tests amenable to simple implementation, and it gives as a by-product an estimate of the permutation under H_1 and under H_0. Thus, after the decision about the hypothesis is made, the pertinent estimate of the labeling is also available.

Returning to the GLRT, to see how it works consider first the log-likelihood for H_1, represented by the analogue of matrix (19) containing only the values log p_i(s). Among all the possible paths across such a trellis, the GLRT selects the one yielding the largest sum among all paths compatible with the observed type t_{x^n}. The compatible paths are those with one entry per column and, for each s ∈ X, n t_{x^n}(s) entries over the s-th row. A convenient way to visualize these paths is to introduce an augmented version of the trellis, where the s-th row is copied n t_{x^n}(s) times, for s = 1, …, |X|. This yields an n-by-n trellis:

$$\bigl(\text{rows of the } H_1 \text{ trellis, the } s\text{-th row repeated } n\,t_{x^n}(s) \text{ times}\bigr) \tag{20}$$

Finding the "GLRT path" across the augmented trellis (20) amounts to selecting one entry in each row and one entry in each column, with the goal of maximizing the sum of the selected entries; let us denote by T_1 this maximum sum. Likewise, for H_0, we consider the trellis similar to (20) with the log p_i(s) replaced by the log q_i(s). The best path over this new trellis must be found (of course, the orderings may, and likely will, be completely different under the two hypotheses), and the sum of the corresponding entries is denoted by T_0. The GLRT statistic is T_1 − T_0, and its computation requires finding two optimal paths, which represent the estimates of the labels under the two hypotheses.

Finding the best path over these trellises does not require exhaustive combinatorial search. Indeed, the search for the GLRT path across a trellis like that in (20) is an instance of the transportation problem, closely related to the assignment problem, for which efficient algorithms have been developed [32]. In the jargon of the assignment problem, each row of (20) represents a "person", each column represents an "object", and the (i, j)-th entry is the benefit for person i if it obtains object j. The problem is to assign one distinct object to each person so as to provide the maximum global benefit.

The Hungarian (a.k.a. Munkres or Kuhn–Munkres) algorithm solves the assignment problem exactly in O(n³) operations [33, 34]. The auction method usually has lower complexity and is amenable to parallel implementation; a nice overview of the auction procedure and its application to data association can be found in [35]. Among the many variants of the auction method, the ε-scaled implementation achieves a solution of the assignment problem ε-close to the actual maximum [36]. The computational complexity of the auction algorithm depends on the data structure, and when the assignment problem involves similar persons (i.e., equal rows, as in our case) it can be highly inefficient [37]. A variation of the auction algorithm specifically tailored to assignment problems with similar persons and similar objects has been proposed in [37, 38]. The auction algorithm used in this paper is ε-scaled and is a special form of that proposed in [37], accounting for the presence of similar persons but not of similar objects. This algorithm is referred to here as "auction-sp".
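To sketch the GLRT path search, the toy implementation below builds the augmented trellis by replicating rows according to the observed type and solves the resulting assignment by brute force over permutations; this brute-force inner step is exactly what the Hungarian or auction-sp solvers replace for realistic n. All distributions here are hypothetical.

```python
import math
from itertools import permutations

def glrt_statistic(type_counts, logp, logq):
    # type_counts: letter -> multiplicity in the observed type;
    # logp[s][i] = log p_i(s) (row s, column i of the H1 trellis),
    # and likewise logq for H0
    def best_path_value(logr):
        # Augmented trellis: replicate each row per its multiplicity,
        # then pick one entry per row and per column maximizing the sum
        rows = [logr[s] for s in sorted(type_counts)
                for _ in range(type_counts[s])]
        n = len(rows)
        return max(sum(rows[i][perm[i]] for i in range(n))
                   for perm in permutations(range(n)))
    return best_path_value(logp) - best_path_value(logq)

# Two observations; sensor 1 favors letter 0 under H1, sensor 2 letter 1
logp = {0: [math.log(0.9), math.log(0.1)],
        1: [math.log(0.1), math.log(0.9)]}
logq = {0: [math.log(0.5), math.log(0.5)],
        1: [math.log(0.5), math.log(0.5)]}

stat = glrt_statistic({0: 1, 1: 1}, logp, logq)  # positive: favors H1
```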

Aside from the auction-sp algorithm, we present two greedy procedures. These detection algorithms are easily described by referring to a simple example. Suppose we have n = 5 observations, and assume that the alphabet is X = {1, 2, 3}. Consider the following trellis, whose (s, i)-th entry is log p_i(s), s = 1, 2, 3, i = 1, …, 5:

$$\begin{pmatrix} \log(1/10) & \log(1/12) & \log(1/6) & \mathbf{\log(1/4)} & \mathbf{\log(1/3)}\\ \log(3/10) & \mathbf{\log(1/3)} & \log(1/3) & \log(1/3) & \log(1/3)\\ \mathbf{\log(3/5)} & \log(7/12) & \mathbf{\log(1/2)} & \log(5/12) & \log(1/3) \end{pmatrix}. \tag{21}$$

Finally, suppose that the sorted version of the vector of (labeled) observations is (1, 1, 2, 3, 3). With unlabeled data the vector itself is not available, and we only observe its type or, what is the same, this sorted version.

### IV-A Detector A

With reference to the above example, the first algorithm we propose processes sequentially the entries of the sorted vector (1, 1, 2, 3, 3) and selects, for each entry, the step (column) of the trellis (21) with the largest value. For instance, consider the first entry, which is 1. By inspection of the trellis (21) we see that the maximum value attained by the first row of the matrix is log(1/3), at the fifth column. Therefore, we assign the state 1 to the fifth step of the trellis. At this point, the fifth column of matrix (21) is blocked and excluded from the analysis, and we move on to the second element, whose value is again 1. We inspect again the first row of matrix (21), ignoring its fifth entry. The maximum value, log(1/4), is attained at the fourth column, and therefore we assign the state 1 to the fourth step of the path. Next, consider the third entry, whose value is 2, and consider the second row of matrix (21), ignoring its fourth and fifth entries. The maximum, log(1/3), is attained at both the second and the third column. In the case of ties, an arbitrary choice is made: we choose the former, and the state 2 is assigned to the second step of the path. We now consider the fourth entry, which is 3, and inspect the third row of matrix (21), ignoring its second, fourth, and fifth entries. The larger between the first and the third entries of the third row of (21) is the former, log(3/5), which implies that the state 3 is assigned to the first step of the path. We are left with the last entry, which is 3, and the only surviving column of matrix (21) is the third: the state 3 is assigned to the third step of the path. We have arrived at the path (3, 2, 3, 1, 1), which is emphasized in bold in (21). The first contribution to the decision statistic for Detector A is the sum of the entries in bold.
By repeating the path search over the trellis similar to (21) but with the p_i replaced by the q_i, we obtain the second contribution, and the decision statistic for Detector A is given by the difference of the two contributions.

The computational cost of Algorithm A can be approximately evaluated by noting that the i-th iteration amounts to computing the maximum of a vector of n − i + 1 reals, and there are n such iterations. If we assume that computing the maximum over m numbers requires a number of elementary operations proportional to m, the computational cost is proportional to n + (n − 1) + ⋯ + 1 = n(n + 1)/2; namely, the computational complexity of Algorithm A is O(n²).
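Detector A's greedy rule is compact enough to sketch directly; the function below is a reimplementation of the walkthrough above, with ties broken toward the leftmost free column as in the text (an assumption of this sketch), and it reproduces the bold path of (21):

```python
import math

def detector_a_path_value(sorted_obs, trellis):
    # trellis[s][j] = log r_j(s); process the sorted observations and
    # greedily assign each one to the free column where its row is
    # largest, blocking that column afterwards -- O(n^2) overall
    n = len(sorted_obs)
    free = list(range(n))
    total = 0.0
    for s in sorted_obs:
        j = max(free, key=lambda c: trellis[s][c])  # ties -> lowest index
        free.remove(j)
        total += trellis[s][j]
    return total

# Trellis (21): rows are states 1, 2, 3; columns are the five steps
trellis = {
    1: [math.log(x) for x in (1/10, 1/12, 1/6, 1/4, 1/3)],
    2: [math.log(x) for x in (3/10, 1/3, 1/3, 1/3, 1/3)],
    3: [math.log(x) for x in (3/5, 7/12, 1/2, 5/12, 1/3)],
}

value_h1 = detector_a_path_value([1, 1, 2, 3, 3], trellis)
```

On trellis (21) the returned value is the sum of the five bold entries, log(3/5) + log(1/3) + log(1/2) + log(1/4) + log(1/3) = log(1/120).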

### IV-B Detector B

Consider again the trellis (21) and the unlabeled vector (1, 1, 2, 3, 3). Algorithm B works as follows. First, regardless of the observed vector, we select the best path on the trellis, in the sense of achieving the largest sum of entries, choosing one entry per column. This yields the path shown below in bold:

$$\begin{pmatrix}
\log(1/10) & \log(1/12) & \log(1/6) & \log(1/4) & \boldsymbol{\log(1/3)} \\
\log(3/10) & \log(1/3) & \log(1/3) & \log(1/3) & \log(1/3) \\
\boldsymbol{\log(3/5)} & \boldsymbol{\log(7/12)} & \boldsymbol{\log(1/2)} & \boldsymbol{\log(5/12)} & \log(1/3)
\end{pmatrix} \tag{22}$$

where in the last column any entry could be chosen, and we arbitrarily select the first. Should the observed unlabeled vector have been (1, 3, 3, 3, 3), the above largest-value path would be compatible with the observations, but this is not the case. Algorithm B therefore proceeds to make the minimum number of modifications to the path in bold in (22), until a path compatible with the observed (1, 1, 2, 3, 3) is obtained. By comparing the path in (22) to the observed vector, we see that two modifications are required: one state with value 3 must become 1, and another must become 2. In symbols, 3 → 1 and 3 → 2. Let us address these modifications sequentially.

Thus, consider the first modification, 3 → 1. The path in (22) has state 3 at the first four steps, and we must choose at which of these steps to change the state from 3 to 1. The most appropriate choice is to make the change at the fourth step, because this modification reduces the total statistic [the sum of the entries emphasized in bold in (22)] by the minimal amount, as seen by considering the four differences log(3/5) − log(1/10) = log 6, log(7/12) − log(1/12) = log 7, log(1/2) − log(1/6) = log 3, and log(5/12) − log(1/4) = log(5/3), which take the minimum value in the last case. Implementing this change of state yields the path (3, 3, 3, 1, 1), and the fourth step in (22), where we have made a path modification, is now blocked: further modifications to it are inhibited.

We are left with one more change, 3 → 2, and the candidate path steps for this modification are steps 1, 2, and 3, whose state is 3. Consider hence the differences log(3/5) − log(3/10) = log 2, log(7/12) − log(1/3) = log(7/4), and log(1/2) − log(1/3) = log(3/2), and note that the last difference, corresponding to the third step of the path, is the smallest. Accordingly, the path is modified to (3, 3, 2, 1, 1), which is the final path on the trellis according to Algorithm B. The first contribution to the decision statistic of Detector B is the sum of the entries selected by this path. Running Algorithm B over the trellis whose entries are the log-probabilities under the other hypothesis gives the second contribution, and the final decision statistic for Detector B is the difference between the two. The Matlab-style code shown below gives the general form of the path search for Algorithm B.
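The Matlab-style routine is not reproduced in this excerpt; as a sketch of the same two-phase search (function names, the pairing of surplus and missing states, and tie-breaking details are our own choices, made to match the worked example), the following Python code first takes the column-wise best path and then applies minimum-loss modifications, locking each modified step.

```python
import math
from collections import Counter

def algorithm_b_path(x, trellis):
    """Two-phase path search of Algorithm B (sketch).

    Phase 1: best unconstrained path, one entry per column (ties -> first row).
    Phase 2: repair the path to match the unlabeled observations x, each time
    changing the step that reduces the running sum the least, then locking it.
    """
    m, n = len(trellis), len(trellis[0])
    path = [max(range(m), key=lambda i: trellis[i][j]) + 1 for j in range(n)]
    locked = [False] * n
    surplus = Counter(path) - Counter(x)   # states the path has in excess
    missing = Counter(x) - Counter(path)   # states the path still needs
    for target in sorted(missing.elements()):
        # candidate steps: not yet locked, currently holding a surplus state
        cand = [j for j in range(n) if not locked[j] and surplus[path[j]] > 0]
        j = min(cand, key=lambda j: trellis[path[j] - 1][j] - trellis[target - 1][j])
        surplus[path[j]] -= 1
        path[j], locked[j] = target, True
    return path, sum(trellis[path[j] - 1][j] for j in range(n))

# The trellis of (22): rows are states 1..3, columns are steps 1..5.
trellis = [
    [math.log(1/10), math.log(1/12), math.log(1/6),  math.log(1/4),  math.log(1/3)],
    [math.log(3/10), math.log(1/3),  math.log(1/3),  math.log(1/3),  math.log(1/3)],
    [math.log(3/5),  math.log(7/12), math.log(1/2),  math.log(5/12), math.log(1/3)],
]
path, value = algorithm_b_path([1, 1, 2, 3, 3], trellis)
# path == [3, 3, 2, 1, 1], the final path of the worked example
```

On the example trellis, phase 1 yields (3, 3, 3, 3, 1) and the two repairs reproduce the final path (3, 3, 2, 1, 1) derived above.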

The computational complexity of Algorithm B can be estimated by observing that the “for” cycle is the part of the routine that essentially determines the computational cost. In this cycle, the minimum over a vector of decreasing size is computed. In the worst case, where all the states of the initial path must be changed, this vector has size n, and the same argument used for Algorithm A leads to the conclusion that the computational complexity of Algorithm B is O(n^2). However, the actual number of modifications required is less (and possibly much less) than n, and depends on the realization of the observations and on the trellis values. This implies that the computational complexity of Algorithm B is only upper bounded by O(n^2), and can be substantially less.

## V Computer Experiments

Let us begin by assuming that the data are iid under one of the two hypotheses, so that the path search must be performed on a single trellis. In the first computer experiment we assume that the iid data are uniformly distributed, namely, each of the m symbols has probability 1/m, while under the other hypothesis the n PMFs over the alphabet of size m, written as columns of an m-by-n matrix, are as follows:

$$\begin{pmatrix}
0 & \frac{1/m}{n-1} & \frac{2(1/m)}{n-1} & \cdots & \frac{1}{m} \\
\kappa & \kappa + \frac{1/m-\kappa}{n-1} & \kappa + \frac{2(1/m-\kappa)}{n-1} & \cdots & \frac{1}{m} \\
2\kappa & 2\kappa + \frac{1/m-2\kappa}{n-1} & 2\kappa + \frac{2(1/m-2\kappa)}{n-1} & \cdots & \frac{1}{m} \\
3\kappa & 3\kappa + \frac{1/m-3\kappa}{n-1} & 3\kappa + \frac{2(1/m-3\kappa)}{n-1} & \cdots & \frac{1}{m} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
(m-1)\kappa & (m-1)\kappa + \frac{1/m-(m-1)\kappa}{n-1} & (m-1)\kappa + \frac{2(1/m-(m-1)\kappa)}{n-1} & \cdots & \frac{1}{m}
\end{pmatrix}, \tag{23}$$

where κ = 2/(m(m−1)). Thus, the first PMF (leftmost column) is (0, κ, 2κ, …, (m−1)κ)^T (note that we have always assumed strictly positive PMFs; for the sake of rigor, one could replace the zero in (23) with a sufficiently small positive value and then normalize the first column to unit sum, which removes the zero and leaves essentially unchanged the arguments and the results that follow), the n-th PMF (rightmost) is uniform, and all other columns of (23) are such that the entries on each row vary linearly from the leftmost to the rightmost value (i.e., increase or decrease linearly). Straightforward calculation shows that the entries of the column-averaged PMF are

$$\begin{pmatrix} \frac{1}{2m} & \frac{m+1}{2m(m-1)} & \frac{m+3}{2m(m-1)} & \frac{m+5}{2m(m-1)} & \cdots & \frac{m+2(m-1)-1}{2m(m-1)} \end{pmatrix}^{T}. \tag{24}$$

Since these values do not depend on n, the averaged PMF is given by (24) for any number of observations.
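As a check on this construction, the following Python fragment (our own verification, not from the paper; the values m = 5 and n = 8 are arbitrary choices for the check) builds the matrix of (23) in exact rational arithmetic, with κ = 2/(m(m−1)) normalizing the leftmost column, and verifies that every column is a valid PMF and that the column average reproduces (24).

```python
from fractions import Fraction

def pmf_matrix(m, n):
    """m-by-n matrix of Eq. (23): row i interpolates linearly from
    i*kappa (leftmost PMF) to 1/m (rightmost, uniform PMF)."""
    kappa = Fraction(2, m * (m - 1))
    return [[i * kappa + Fraction(j, n - 1) * (Fraction(1, m) - i * kappa)
             for j in range(n)] for i in range(m)]

m, n = 5, 8
Q = pmf_matrix(m, n)
col_sums = [sum(Q[i][j] for i in range(m)) for j in range(n)]  # each must be 1
row_avgs = [sum(Q[i]) / n for i in range(m)]                   # entries of (24)
```

The check confirms that the i-th averaged entry equals (m + 2i − 3)/(2m(m−1)) for i = 1, …, m, i.e., exactly the vector in (24), independently of n.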

For this case study, we now investigate the performance of the four detectors presented in Sect. IV: ULR, auction-sp, detector A, and detector B. For the auction-sp algorithm, after trial and error we found a value of the algorithm's parameter that practically achieves the same total benefit as the Hungarian algorithm, and that value is therefore selected in all numerical experiments.

In Figs. 3 and 4 we show the ROC (Receiver Operating Characteristic), namely the type II error versus the type I error (strictly speaking, the “ROC” curve is the complement of the type II error as a function of the type I error), obtained by Monte Carlo simulations. Clearly, the lower the ROC curve, the better the detection performance. In Fig. 3 we fix the number of observations n and consider two values of the alphabet size m. For the first value of m we see that detector B outperforms detector A, detector B performs exactly as auction-sp, and their performance is close to that of the ULR, which gives the best performance. For the sake of comparison, we also report the ROC curve for the “labeled” detector, namely, for the case in which the association between data and generating PMFs is perfectly known (no data permutation takes place). As it must be, the labeled detector achieves much better performance. Next, looking at the second value of m in Fig. 3, we see that the performance of the detectors worsens, while their relative ordering is as in the previous case.