Searching for Anomalies over Composite Hypotheses

11/13/2019 ∙ by Bar Hemo, et al. ∙ 0

The problem of detecting anomalies in multiple processes is considered. We consider a composite hypothesis case, in which the measurements drawn when observing a process follow a common distribution with an unknown parameter (vector), whose value lies in the normal or the abnormal parameter space, depending on the state of the process. The objective is a sequential search strategy that minimizes the expected detection time subject to an error probability constraint. We develop a deterministic search algorithm with the following desired properties. First, when no additional side information on the process states is known, the proposed algorithm is asymptotically optimal in terms of minimizing the detection delay as the error probability approaches zero. Second, when the parameter value under the null hypothesis is known and equal for all normal processes, the proposed algorithm is asymptotically optimal as well, with better detection time determined by the true null state. Third, when the parameter value under the null hypothesis is unknown, but is known to be equal for all normal processes, the proposed algorithm is consistent in terms of achieving error probability that decays to zero with the detection delay. Finally, an explicit upper bound on the error probability under the proposed algorithm is established for the finite sample regime. Extensive experiments on a synthetic dataset and the DARPA intrusion detection dataset are conducted, demonstrating strong performance of the proposed algorithm over existing methods.


I Introduction

We consider the problem of searching for an anomalous process (or a few abnormal processes) among multiple processes. For convenience, we often refer to the processes as cells and to the anomalous process as the target, which can be located in any of the cells. The decision maker is allowed to search for the target over a limited number of cells at a time. We consider the composite hypothesis case, where the observation distribution has an unknown parameter (vector). When observations are taken from a certain cell, continuous random values are measured, drawn from a common distribution. The distribution has an unknown parameter that belongs to the normal or the abnormal parameter space, depending on whether the target is absent or present, respectively. The objective is a sequential search strategy that minimizes the expected detection time subject to an error probability constraint. The anomaly detection problem finds applications in intrusion detection in cyber systems for quickly locating infected nodes by detecting statistical anomalies, spectrum scanning in cognitive radio networks for quickly detecting idle channels, and event detection in sensor networks.

I-A Main Results

Dynamic search algorithms can be broadly classified into two classes: (i) Algorithms that use open-loop selection rules, in which the decision of which cell to search is predetermined and independent of the sequence of observations. The stopping rule, which decides when to stop collecting observations from the current cell and whether to switch to the next cell or stop the test, is however dynamically updated based on past observations. In this class of algorithms, tractable optimal solutions have been obtained under various settings of observation distributions (see, e.g., [2, 3, 4, 5]). (ii) Algorithms that use closed-loop selection rules, in which the decision of which cell to search is based on past observations. The focus is on addressing the full-blown dynamic problem by jointly optimizing both selection and stopping rules in decision making (see, e.g., [6, 7, 8, 9, 10, 11, 12]). In this setting, however, tractable optimal solutions have been obtained only for very special cases of observation distributions ([6, 7]). In this paper we focus on the latter setting.

Since observations are drawn in a one-at-a-time manner, we are facing a sequential detection problem over multiple composite hypotheses. Sequential detection problems involving multiple processes are partially observed Markov decision processes (POMDPs) [7], which have exponential complexity in general. As a result, computing optimal search policies is intractable (except for some special cases of observation distributions as in [6, 7]). When dealing with composite hypotheses, computing optimal policies is intractable even for the single process case. For tractability, a commonly adopted performance measure is asymptotic optimality in terms of minimizing the detection time as the error probability approaches zero (see, for example, classic and recent results in [13, 14, 15, 16, 17, 18, 19, 20, 21, 8, 9, 22, 23, 10]). The focus of this paper is thus on asymptotically optimal strategies with low computational complexity. Our main contributions are threefold, as detailed below.

A general model for composite hypotheses

Dynamic search problems have been investigated under various models of observation distributions in past and recent years. Closed-loop solutions have been obtained under known Wiener processes [6], known symmetric distributions [7], known general distributions [8], known Poisson point processes with unknown parameters [9], and unknown distributions in which the measurements can take values from a finite alphabet [10]. In contrast to these works, in this paper the dynamic search is conducted over a general known distribution model with unknown parameters that lie in disjoint normal and abnormal parameter sets, and the measurements can take continuous values. This distribution model finds applications in, for instance, traffic analysis in computer networks [24] and spectrum scanning in cognitive radio networks [25]. Handling this observation model in the dynamic search setting leads to fundamentally different algorithm design and analysis as compared to existing methods.

Algorithm development

In terms of algorithm development, the proposed algorithm is deterministic and has low-complexity implementations. Specifically, the proposed algorithm consists of exploration and exploitation phases. During exploration, the cells are probed in a round-robin manner for learning the unknown parameters. During exploitation, the most informative observations are collected based on the estimated parameters. We point out that our algorithm uses only bounded exploration time under the setting without side information (Section III-A) and when the null hypothesis is assumed known (Section III-B), which is of particular significance. This is in sharp contrast with the logarithmic order of exploration time commonly seen in active search strategies (see, for example, [26, 10]), or even the linear order of exploration time in [9].

Performance analysis

In terms of theoretical performance analysis, we prove that the proposed algorithm achieves asymptotic optimality when no additional side information on the process states is known, and a single location is probed at a time (as widely assumed in dynamic search studies for purposes of analysis, e.g., [6, 27, 28, 7, 29, 9, 10]). Furthermore, when the parameter value under the null hypothesis is known (as is common in anomaly detection, and as also assumed in [10] for establishing asymptotic optimality), we establish asymptotic optimality as well, with better detection time determined by the true null state. We also consider the case where the parameter value under the null hypothesis is unknown, but is identical for all normal processes. In this case, the proposed algorithm is shown to be consistent in terms of achieving error probability that decays to zero with time. In addition to the asymptotic analysis, an explicit upper bound on the error probability is established under the finite sample regime. Extensive numerical experiments on a synthetic dataset and the DARPA intrusion detection dataset have been conducted to demonstrate the efficiency of the proposed algorithm.

I-B Related work

Optimal solutions for target search or target whereabouts problems have been obtained under some special cases when a single location is probed at a time. Modern application areas of search problems with limited sensing resources include narrowband spectrum scanning [30, 31], event detection by a fusion center that communicates with sensors using narrowband transmission [32, 33], and sensor visual search studied recently by neuroscientists [9]. Results under the sequential setting can be found in [6, 34, 35, 36, 5, 32, 33]. Specifically, optimal policies were derived in [6, 34, 35] for the problem of quickest search over Wiener processes. In [36, 5], optimal search strategies were established under the constraint that switching to a new process is allowed only when the state of the currently probed process is declared. Optimal policies under general distributions and an unconstrained search model remain an open question. In this paper we address this question under the asymptotic regime as the error probability approaches zero. Optimal search strategies with a single location probed at a time and a fixed sample size have been established under binary-valued measurements [27, 28, 29], and under known symmetric distributions of continuous observations [7]. In this paper, however, we focus on the sequential setting and the general composite hypothesis case.

Sequential tests for hypothesis testing problems have attracted much attention since Wald’s pioneering work on sequential analysis [37], owing to their ability to reach a decision much earlier than is possible with fixed-size tests. Wald established the Sequential Probability Ratio Test (SPRT) for binary hypothesis testing of a single process. Under the simple hypothesis case, the SPRT is optimal in terms of minimizing the expected sample size under given type I and type II error probability constraints. Various extensions for M-ary hypothesis testing and testing composite hypotheses were studied in [14, 15, 16, 17, 38] for a single process. In these cases, asymptotically optimal performance can be obtained as the error probability approaches zero. In this paper, we focus on asymptotically optimal strategies with low computational complexity for sequential search of a target over multiple processes. Other models considered the case of searching for targets without constraints on the probing capacity, where all processes are probed at each given time (i.e., the number of probed cells equals the number of processes, which is a special case of the setting considered in this paper) [35, 17, 22, 23].
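As a concrete reference point for the single-process, simple-hypothesis case, Wald’s SPRT can be sketched as follows. This is a minimal illustration and is not the search algorithm developed in this paper; the function name `sprt`, the callable `log_lr`, and the use of Wald’s classical approximate thresholds are assumptions made for the example.

```python
import math

def sprt(observations, log_lr, alpha, beta):
    """Wald's SPRT for a simple H0 vs. a simple H1.

    log_lr(y) must return log f1(y) - log f0(y) for a single observation.
    alpha / beta are the target type I / type II error probabilities.
    The thresholds below are Wald's classical approximations (an assumption
    of this sketch, not a quantity taken from the paper).
    """
    upper = math.log((1 - beta) / alpha)   # declare H1 at or above this value
    lower = math.log(beta / (1 - alpha))   # declare H0 at or below this value
    llr = 0.0
    n = 0
    for n, y in enumerate(observations, start=1):
        llr += log_lr(y)
        if llr >= upper:
            return "H1", n
        if llr <= lower:
            return "H0", n
    return "undecided", n                  # ran out of data before a decision

def gaussian_log_lr(y):
    """Unit-variance Gaussian with mean 0 under H0 and mean 1 under H1."""
    return y - 0.5
```

The test stops as soon as the accumulated log-likelihood ratio leaves the interval between the two thresholds, which is what allows it to decide earlier than a fixed-size test on clear-cut data.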

Since the decision maker can choose which cells to probe, the anomaly detection problem has a connection with the classical sequential experimental design problem first studied by Chernoff [13]. Compared with the classical sequential hypothesis testing pioneered by Wald [37], where the observation model under each hypothesis is predetermined, the sequential design of experiments has a control aspect that allows the decision maker to choose the experiment to be conducted at each time. Chernoff established a randomized strategy, referred to as the Chernoff test, which is asymptotically optimal as the maximum error probability diminishes. Chernoff’s results were proved for a finite number of states of nature, and in [39] Albert extended Chernoff’s results to allow for an infinity of states of nature. More variations and extensions of the problem and the Chernoff test were studied in [40, 18, 21, 19, 20, 8, 41]. In particular, when the distributions under both normal and abnormal states are completely known under the anomaly detection setting considered here, a modification of the randomized Chernoff test applies and achieves asymptotic optimality [18]. In our previous work [8], we have shown that a simpler deterministic algorithm applies and obtains the same asymptotic performance, with better performance in the finite sample regime. A modified algorithm has been developed recently in [30] for spectrum scanning with a time constraint. In this paper, however, we consider the composite hypothesis case, which is not addressed in [18, 8, 30].

In [9], searching over Poisson point processes with unknown rates has been investigated and asymptotic optimality has been established when a single location is probed at a time. The policy in [9] implements a randomized selection rule and also requires dedicating a linear order of time to exploring the states of all processes. In our model, however, we consider general distributions (with disjoint parameter spaces) and show that a deterministic selection rule with bounded exploration time achieves asymptotic optimality. This result also extends a recent asymptotic result obtained in [10] for non-parametric detection when distributions are restricted to a finite observation space (in contrast to the general continuous-valued observations considered here), where asymptotic optimality was shown when the distribution under the null hypothesis is known, a single location is probed at a time, and a logarithmic order of time is used for exploration. In [26], the problem of detecting abnormal processes over densities that have an unknown parameter was considered, where the process states are independent across cells (in contrast to the problem considered in this paper, in which there is a fixed number of abnormal processes). The objective was to minimize a cost function incurred in the system by abnormal processes, which does not capture the objective of minimizing the detection delay considered here.

Another set of related works is concerned with sequential detection over multiple independent processes [42, 31, 2, 3, 4, 43, 44, 26, 45]. In particular, in [2], the problem of identifying the first abnormal sequence among an infinite number of i.i.d. sequences was considered. An optimal cumulative sum (CUSUM) test has been established under this setting. Further studies on this model can be found in [3, 4, 44]. While the objective of finding rare events or a single target considered in [2, 3, 4, 44] is similar to that of this paper, the main difference is that in [2, 3, 4, 44] the search is done over an infinite number of i.i.d. processes, where the state of each process (normal or abnormal) is independent of other processes. This results in open-loop search strategies, which is fundamentally different from the setting in this paper.

Other recent studies include searching for a moving Markovian target [46], and searching for correlation structures of Markov networks [47].

Finally, we point out that our setup is different from the change point detection setup. Our model is suited to cases where a system has already raised an alarm for an event (based on change point detection, for instance), but the location of the event is unknown and needs to be found.

II System Model and Problem Statement

We consider the problem of detecting a target located in one of cells quickly and reliably. An extension to detecting multiple targets is discussed in Sec. III-C. If the target is in cell , we say that hypothesis is true. The a priori probability that is true is denoted by , where . To avoid trivial solutions, it is assumed that for all .

We focus on the composite hypothesis case, where the observation distribution has an unknown parameter (or a vector of unknown parameters). Let be the unknown parameter that specifies the observation distribution of cell . The vector of unknown parameters is denoted by . At each time, only () cells can be observed. When cell is observed at time , an observation is drawn independently from a common density , , where is the parameter space for all cells.

If the target is not located in cell , then ; otherwise, . The overall parameter space is the Cartesian product . Thus, under hypothesis , the true vector of parameters , where

.

Let , be disjoint subsets of , where is an indifference region (see Footnote 1). When , the detector is indifferent regarding the location of the target. Hence, there are no constraints on the error probabilities for all . Shrinking increases the sample size. We also assume that , are open sets. Let be the probability measure under hypothesis and be the operator of expectation with respect to the measure .

Footnote 1: The assumption of an indifference region is widely used in the theory of sequential composite hypothesis testing to derive asymptotically optimal performance. Nevertheless, in some cases this assumption can be removed. For more details, the reader is referred to [15].

We define the stopping rule as the time when the decision maker finalizes the search by declaring the location of the target (see Footnote 2). Let be a decision rule, where if the decision maker declares that is true. Let be a selection rule indicating which cells are chosen to be observed at time . The time series vector of selection rules is denoted by . Let be the vector of observations obtained from cells at time and be the set of all cell selections and observations up to time . A deterministic selection rule at time is a mapping from to . A randomized selection rule is a mapping from to a probability mass function over . An admissible strategy for the anomaly detection problem is given by the tuple .

Footnote 2: We point out that it is assumed that the target exists with probability 1. Our model is suited to cases where a security system has already raised an alarm for an event (based on change point detection, for instance), but the location of the event is unknown and needs to be found.

We adopt a Bayesian approach as in [37, 13, 15, 48] by assigning a cost of for each observation and a loss of for a wrong declaration. Let be the probability of error under strategy , where is the probability of declaring when is true. Let be the average detection delay under . The Bayes risk under strategy when hypothesis is true is given by: Note that represents the ratio of the sampling cost to the cost of wrong detections. The average Bayes risk is given by:

The objective is to find a strategy that minimizes the Bayes risk :

(1)

where the infimum is taken over all randomized and deterministic selection rules.
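To make the trade-off in the Bayesian formulation concrete, the sketch below evaluates an average risk of the form "error probability plus sampling cost times expected delay", averaged over the prior. The function name, the argument names, and the normalization of the loss for a wrong declaration to one are assumptions made for illustration; they are not taken from the paper.

```python
def bayes_risk(priors, error_probs, expected_delays, cost):
    """Average Bayes risk under the assumptions stated in the lead-in.

    priors[m]          : prior probability that the target is in cell m
    error_probs[m]     : probability of a wrong declaration when it is in cell m
    expected_delays[m] : expected number of observations when it is in cell m
    cost               : per-observation cost relative to a unit error loss
    """
    if abs(sum(priors) - 1.0) > 1e-9:
        raise ValueError("priors must sum to one")
    # Per-hypothesis risk = error probability + cost * expected delay.
    return sum(
        p * (err + cost * delay)
        for p, err, delay in zip(priors, error_probs, expected_delays)
    )
```

With a small per-observation cost, a strategy can afford many samples to push the error term down, which is why the asymptotic regime of vanishing cost corresponds to vanishing error probability.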

Definition 1

Let be the solution of (1). We say that strategy is asymptotically optimal if

(2)

We note that if the strategy that attains inf does not exist, the definition of the first order asymptotic optimality would be:

(3)

A shorthand notation will be used to denote .

A dual formulation (i.e., a frequentist approach) of the problem is to minimize the sample complexity subject to an error constraint , i.e.,:

(4)

In Section III we develop an asymptotically optimal Deterministic Search (DS) algorithm for solving (1) and (4).

II-A Notations

We next introduce notation that will be used throughout the paper. Let

(5)

be the maximum likelihood estimate (MLE) of the parameter over the parameter space (i.e., unconstrained MLE) at cell , where is the vector of observations (indicated by times ) collected from cell up to time . Regularity conditions for consistency of the MLE are given in App. VII-A1.
Let:

,

be the MLE for cell to be in normal or abnormal state, respectively.

Let be the indicator function, where if cell is observed at time , and otherwise.
We now propose two optional statistics. Let

(6)

be the sum of Local Generalized Log-Likelihood Ratio (LGLLR) of cell at time used to reject hypothesis (for ) regarding its state. We refer to the statistic as local since it uses the observations from cell solely. In Section III-C we will define a statistic that uses observations from multiple cells, referred to as the Multi-process Generalized Log-Likelihood Ratio (MGLLR). The LGLLR statistic is inspired by the Generalized Likelihood Ratio (GLR) statistic used for sequential tests, first studied by Schwarz [14] for a one-parameter exponential family, who assigned a cost of for each observation and a loss function for wrong decisions. A refinement was studied by Lai [15, 49], who set a time-varying boundary value. Lai showed that for a multivariate exponential family this scheme asymptotically minimizes both the Bayes risk and the expected sample size subject to error constraints as approaches zero [49].

The second statistic that we propose to use is obtained by replacing the parameter for the th observation with the estimator built upon samples . The statistic is given by:

(7)

which we refer to as the sum of Local Adaptive Log-Likelihood Ratio (LALLR). The LALLR statistic is inspired by the Adaptive Likelihood Ratio (ALR) statistic used for sequential tests, first introduced by Robbins and Siegmund [50] to design power-one sequential tests. Pavlov used it to design tests for composite hypothesis testing of the multivariate exponential family that are asymptotically optimal as the error probability approaches zero, in terms of minimizing the expected sample size subject to error constraints [16]. Tartakovsky established asymptotically optimal performance for a more general multivariate family of distributions [17].
The advantage of using the LALLR statistic is that it enables us to upper-bound the error probabilities of the sequential test by using simple threshold settings. Thus, implementing the LALLR is much simpler than implementing the LGLLR. The disadvantage of using the LALLR is that poor early estimates (based on a small number of observations) can never be revised, even after a large number of observations has been collected. A numerical comparison of the performance of the two statistics is presented in Section IV-B.
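The difference between the generalized and the adaptive statistic can be illustrated on a toy model. The sketch below assumes unit-variance Gaussian observations with unknown mean and a normal parameter set given by an interval; under these assumptions the constrained MLE is the sample mean projected onto the interval. The function names, the interval model, and the initial guess `theta_init` used before any sample is available are all hypothetical choices for this example and may differ from the exact definitions in (6) and (7).

```python
import math

def log_density(y, theta):
    """Log-density of a unit-variance Gaussian with mean theta (assumed model)."""
    return -0.5 * math.log(2 * math.pi) - 0.5 * (y - theta) ** 2

def project(theta, interval):
    """Closest point of a closed interval to theta."""
    lo, hi = interval
    return min(max(theta, lo), hi)

def local_statistics(ys, normal_interval, theta_init=0.0):
    """Return (generalized LLR, adaptive LLR) of one cell against normality."""
    n = len(ys)
    mle = sum(ys) / n                           # unconstrained MLE
    mle_normal = project(mle, normal_interval)  # MLE constrained to the normal set
    denom = sum(log_density(y, mle_normal) for y in ys)

    # Generalized: every observation is scored with the final estimate.
    gllr = sum(log_density(y, mle) for y in ys) - denom

    # Adaptive: observation r is scored with the estimate built from the
    # observations that came before it, so early estimates are never revised.
    adaptive_num = 0.0
    running_sum = 0.0
    for r, y in enumerate(ys):
        estimate = running_sum / r if r > 0 else theta_init
        adaptive_num += log_density(y, estimate)
        running_sum += y
    return gllr, adaptive_num - denom
```

In this toy model the generalized statistic reduces to half the sample count times the squared distance between the sample mean and the normal interval, so it is zero whenever the sample mean looks normal.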

Finally,

denotes the Kullback–Leibler (KL) divergence between two distributions, .

III A Low-Complexity Deterministic Search (DS) Algorithm

Sequential detection problems involving multiple processes are POMDPs [7]. As a result, computing optimal search policies is intractable in general. In this section we present the Deterministic Search (DS) algorithm, which has low complexity (linear in the number of processes) and solves the anomaly detection problem asymptotically as the error approaches zero. Both proposed statistics (LGLLR and LALLR) can be used in the implementation of the algorithm.

We start by analyzing the case where no additional side information on the process states is known in Section III-A. Then, in Section III-B, we consider the case in which the parameter value under the null hypothesis is known and equal for all normal processes. In this case we show analytically the gain achieved in the detection time, by utilizing the side information on the normal state. Finally, in Section III-C, we consider the case where the parameter value under the null hypothesis is unknown, but is known to be equal for all normal processes.

III-A Anomaly Detection Without Side Information

We assume that a single cell is probed at a time, as widely assumed in dynamic search problems for purposes of analysis (e.g., [6, 34, 35, 36, 5, 9, 10]). In Section III-C we discuss the implementation under more general settings. We also assume that the parameter space is finite, and we assume a large-scale system where so that for all , . Let be the set of cells whose MLEs lie outside at time with cardinality . Let be the LGLLR or LALLR statistic defined in Section II-A. The DS algorithm has a structure of exploration and exploitation epochs. We start by addressing the Bayesian formulation, and we describe the DS algorithm with respect to time index .

  1. (Exploration phase:) If , then probe the cells one by one in a round-robin manner, i.e., and go to Step 1 again. Otherwise, go to Step 2.

  2. (Exploitation phase:) Update for all , and let be the index of the cell whose MLE lies outside at time (note that this cell is unique at the exploitation phase). Probe cell and go to Step 3.

  3. (Sequential testing:) Update based on the last observation. If stop the test and declare as the location of the target. Otherwise, go to Step 1.

Note that the selection rule constructed by Steps 1 and 2 is deterministic and dynamically updated based on the current value of the MLEs. The proposed DS algorithm is intuitively satisfying. Consider first the simple hypothesis case (where asymptotic optimality was shown in [8]), in which are assumed known. When and , the DS algorithm selects at each time the cell with the largest sum log-likelihood ratio. The intuition behind this selection rule is that and determine, respectively, the rates at which the state of the cell with the target and the states of the cells without the target can be accurately inferred. Since such that for all , , the DS algorithm aims at identifying the cell with the target (which is equivalent to probing the most likely abnormal process, as implemented during the exploitation phase). When handling the composite hypothesis case and is unknown, the selection rule dedicates an exploration phase to estimating the parameter and adjusts the estimated KL divergences dynamically. Since the parameter spaces are disjoint, the exploration phase yields an estimate for the location of the abnormal process (i.e., the cell whose MLE lies outside ). The exploitation phase keeps taking samples until first occurs to ensure a sufficiently accurate decision, i.e., error probability of order as shown in the analysis.
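The explore/exploit structure of Steps 1 to 3 can be sketched for a toy model as follows. The sketch assumes a single probe per time step, unit-variance Gaussian observations, a normal parameter set given by an interval, a generalized log-likelihood ratio as the test statistic, and a stopping threshold equal to the negative logarithm of the sampling cost. All function and argument names, the Gaussian model, and this threshold choice are assumptions made for illustration rather than the paper's exact specification.

```python
import math

def gllr(ys, lo, hi):
    """Generalized LLR against the normal interval [lo, hi] (unit-variance Gaussian)."""
    mle = sum(ys) / len(ys)
    mle_normal = min(max(mle, lo), hi)
    return sum(((y - mle_normal) ** 2 - (y - mle) ** 2) / 2 for y in ys)

def ds_search(sample, num_cells, normal_interval, cost, max_steps=100_000):
    """Hypothetical sketch of the deterministic explore/exploit loop.

    sample(m) returns one observation from cell m.
    Returns (declared cell, number of observations used).
    """
    lo, hi = normal_interval
    threshold = -math.log(cost)           # assumed stopping threshold
    obs = [[] for _ in range(num_cells)]
    next_rr = 0                           # round-robin pointer
    for step in range(1, max_steps + 1):
        mles = [sum(o) / len(o) if o else None for o in obs]
        outside = [m for m, t in enumerate(mles)
                   if t is not None and not (lo <= t <= hi)]
        if any(t is None for t in mles) or len(outside) != 1:
            # Exploration: no unique candidate yet, probe in round-robin order.
            obs[next_rr].append(sample(next_rr))
            next_rr = (next_rr + 1) % num_cells
            continue
        # Exploitation: probe the unique cell whose MLE looks abnormal.
        cell = outside[0]
        obs[cell].append(sample(cell))
        # Sequential testing: stop once the statistic clears the threshold.
        if gllr(obs[cell], lo, hi) >= threshold:
            return cell, step
    raise RuntimeError("no decision within max_steps")
```

With noiseless observations that equal 2.0 in one cell and 0.0 elsewhere, the loop spends one round exploring and then probes only the abnormal-looking cell until the statistic crosses the threshold. With noisy data the exploration phase repeats until exactly one MLE lies outside the normal interval.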

Theorem 1

Assume that the DS algorithm is implemented under the anomaly detection setting described in this section. Let and be the Bayes risks under the DS algorithm and any other policy , respectively. Then, the following statements hold:
1) (Finite sample error bound:) The error probability is upper bounded by for all .
2) (Asymptotic optimality:) The Bayes risk satisfies:

where .
3) (Bounded exploration time:) The total expected time spent during the exploration phase (i.e., Step 1 in the DS algorithm) is .

The proof is given in Appendix VII-C.

We point out that bounded exploration time of the DS algorithm is of particular significance. It is in sharp contrast with the logarithmic order of exploration time commonly seen in active search strategies (see, for example, [26, 10]).

III-B Anomaly Detection under a Known Model of Normality

Here, we assume that the parameter under the null hypothesis is known, and equal for all empty cells, where is an open set that contains . This setting models many anomaly detection situations, in which the distribution of the observations under a normal state is known, while there is uncertainty in the distribution under an abnormal state. To utilize this information, we adjust the LALLR statistic used to reject hypothesis as follows:

(8)

We define similarly.
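A minimal sketch of an adaptive statistic that exploits a known null parameter is given below, again for a unit-variance Gaussian toy model. Each observation is scored with the estimate built from earlier observations in the numerator and with the known null value in the denominator. The function name, the Gaussian model, and the initial guess `theta_init` are assumptions for illustration and may differ from the exact form of (8).

```python
def adjusted_allr(ys, theta_null, theta_init):
    """Adaptive LLR of one cell against a known null mean (assumed Gaussian model)."""
    stat = 0.0
    running_sum = 0.0
    for r, y in enumerate(ys):
        # Estimate uses only the observations that precede the current one.
        estimate = running_sum / r if r > 0 else theta_init
        # log f(y | estimate) - log f(y | theta_null) for unit variance.
        stat += 0.5 * ((y - theta_null) ** 2 - (y - estimate) ** 2)
        running_sum += y
    return stat
```

Because each estimate depends only on past samples, the exponential of this statistic is a nonnegative martingale with unit mean when the observations truly follow the known null. By Ville's inequality, the probability that it ever exceeds a level 1/a is then at most a, which is the kind of simple threshold argument that adaptive statistics allow.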

In the following theorem we establish a finite-sample upper bound on the error probability and prove asymptotic optimality of the algorithm for the Bayesian formulation using the adjusted LALLR statistic, where only a bounded amount of time is spent during the exploration phase. The proof is given in App. VII-A.

Theorem 2

Assume that the DS algorithm is implemented under the anomaly detection setting described in this section, using the adjusted LALLR statistics. Let and be the Bayes risks under the DS algorithm and any other policy , respectively. Then, the following statements hold:
1) (Finite sample error bound:) The error probability is upper bounded by for all .
2) (Asymptotic optimality:) The Bayes risk satisfies:

3) (Bounded exploration time:) The total expected time spent during the exploration phase (i.e., Step 1 in the DS algorithm) is .

We point out that the side information on the true null hypothesis strengthens the algorithm performance. The improvement in the performance is clearly seen by the fact that . Hence, the risk in Theorem 2 is smaller than the risk in Theorem 1. Note also that in this setting we do not restrict to be a singleton set (the parameter still lies in an open set). The side information is utilized when constructing the statistic in (8).

For the frequentist formulation, in Step 3 of the DS algorithm (i.e., the sequential testing step) we define the threshold as , i.e., if we stop the test and declare as the location of the target. We now present Theorem 3, which states that the DS algorithm is first-order asymptotically optimal in the sense of criterion (4). The proof is given in App. VII-B.

Theorem 3

Assume that the DS algorithm is implemented under the anomaly detection setting described in this section, using the adjusted LALLR statistics. We define the class of tests :

.

Let and be the detection times under the DS algorithm and any other policy, respectively. Then, the following statement holds for each :

(9)

and .

III-C Anomaly Detection under Identical Parameter for All Normal Cells

Next, we consider the case where both parameter values under normal and abnormal states and are unknown. However, it is known that the unknown parameter is identical for all normal cells. Therefore, under hypothesis , the true vector of parameters satisfies , where

.

Note that in contrast to Section III-A, where observations from cell do not contribute any information about the parameter value of cell , for , here the additional side information allows us to estimate the true value of consistently using observations from each normal cell. Specifically, let be the set of all the observations collected from the cells whose MLEs lie inside (i.e., ), and inside (i.e., ) at time , respectively. The global MLE of is computed based on the observations from all the cells which are likely to be empty:

whereas the global MLE of is computed based on the observations from all the cells which are likely to contain the target:

Intuitively, as more observations are collected from all cells, only the MLE at the cell that contains the target is likely to lie inside . Next, we define the statistics accordingly. Let:
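The pooling idea behind the global estimates can be sketched as follows for a Gaussian-mean toy model, where the MLE of a pooled sample is its average. Cells are grouped by whether their local MLE falls in the normal set or in the abnormal set, and cells whose local MLE falls in neither are left out. The function name and the representation of the two sets as intervals are assumptions made for this example.

```python
def global_mles(obs_by_cell, normal_interval, abnormal_interval):
    """Pool observations by where each cell's local MLE falls (assumed Gaussian mean).

    Returns (global normal estimate, global abnormal estimate); an entry is
    None when no cell currently falls in the corresponding set.
    """
    def inside(value, interval):
        lo, hi = interval
        return lo <= value <= hi

    normal_pool, abnormal_pool = [], []
    for ys in obs_by_cell:
        if not ys:
            continue                          # cell not observed yet
        local_mle = sum(ys) / len(ys)
        if inside(local_mle, normal_interval):
            normal_pool.extend(ys)            # cell is likely empty
        elif inside(local_mle, abnormal_interval):
            abnormal_pool.extend(ys)          # cell likely contains the target
        # otherwise the local MLE is in neither set and the cell is skipped

    def mean(values):
        return sum(values) / len(values) if values else None

    return mean(normal_pool), mean(abnormal_pool)
```

Pooling lets every normal-looking cell contribute to a single estimate of the shared normal parameter, which is the benefit of the side information described above.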

(10)

be the sum of Multi-process Generalized Log-Likelihood Ratio (MGLLR) of cell at time used to reject hypothesis (for ) regarding its state. The modified adaptive statistic is defined by:

(11)

which we refer to as the sum of Multi-process Adaptive Log-Likelihood Ratio (MALLR) (see Footnote 3).
Let be a sequence of time instants, where has a logarithmic order of time, in which the cells are selected in a round-robin manner during the algorithm. Intuitively speaking, the role of , is to explore all the cells to infer the true value of (which is not observed when testing the target cell) during the algorithm. This allows us to use the estimated values of both and when computing the statistic used in the algorithm. We also define for as the index of the cell with the smallest sum MALLR for at time . The DS algorithm has a structure of exploration and exploitation epochs. Let be the statistic used in the algorithm, which can be the MALLR or the MGLLR statistic. Next, we describe the DS algorithm with respect to time index . We describe the algorithm for the general case where multiple processes can be probed at a time (), and does not necessarily hold.

Footnote 3: Note that the adaptive LLR statistic and the generalized LLR statistic used in sequential composite hypothesis testing of a single process contain a constrained MLE over the alternative parameter space in the denominator (see Section IV-A for more details). Here, we use unconstrained MLEs (which are computed over the entire parameter space ) in both numerator and denominator, depending on the cells from which the observations were taken. Thus, we refer to this statistic as a Multi-process Adaptive/Generalized LLR (MALLR/MGLLR).

  1. (Exploration phase 1:) Exploration phase 1 is similar to the exploration phase described in Section III-B. If , then cells are probed one by one in a round-robin manner. Otherwise, go to Step 2.

  2. (Exploration phase 2:) If , the cells are probed one by one in a round-robin manner. Otherwise, if , go to Step . Otherwise, go to Step .

  3. (Exploitation phase:) Update for all , and let be the index of the cell whose MLE lies outside at time (note that this cell is unique at the exploitation phase). Then, probe cells, which are given by (assume that ; otherwise, all cells are probed):

    (12)

    and go to Step 4.

  4. (Sequential testing:) Update the sum MALLRs based on the last observations. If stop the test and declare as the location of the target. Otherwise, go to Step 1.

The proposed DS algorithm under the general setting is intuitively satisfying. Since both might be unknown, the selection rule dedicates exploration phases to estimating the parameters and adjusts the estimated KL divergences dynamically. Since the parameter spaces are disjoint, the exploration phase yields an estimate of the location of the abnormal process (i.e., the cell whose MLE lies outside ). The exploitation phase keeps taking samples until first occurs, i.e., until a sufficiently accurate decision is ensured. We show in the appendix that this stopping rule achieves error probability of order when the parameters are known under both normal and abnormal states, and that polynomial decay with time is achieved under the general composite hypothesis testing setting, which motivates the design of the stopping rule. In the general setting only consistency can be shown; asymptotic optimality remains open.
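The exploration idea described above (probe cells in a round-robin manner until exactly one MLE lies outside the normal parameter space) can be sketched as follows. This is a minimal illustration, not the DS algorithm itself: the function name, the `sample` and `in_normal_space` callables, the choice of the sample median as the MLE (which is the MLE of a Laplace location parameter), the check after each full round rather than after each probe, and the `max_rounds` cap are all assumptions made for the example.

```python
from statistics import median
from typing import Callable, List, Optional


def explore_round_robin(
    sample: Callable[[int], float],
    num_cells: int,
    in_normal_space: Callable[[float], bool],
    max_rounds: int = 1000,
) -> Optional[int]:
    """Probe cells one by one until exactly one MLE lies outside the normal space.

    Returns the index of that cell, or None if max_rounds is exhausted.
    """
    observations: List[List[float]] = [[] for _ in range(num_cells)]
    for _ in range(max_rounds):
        # One round-robin pass: a single observation from every cell.
        for cell in range(num_cells):
            observations[cell].append(sample(cell))
        # Assumption: the per-cell MLE is the sample median (Laplace location).
        mles = [median(obs) for obs in observations]
        outside = [m for m, theta in enumerate(mles) if not in_normal_space(theta)]
        if len(outside) == 1:
            # A unique candidate for the abnormal cell has been found.
            return outside[0]
    return None
```

For example, with a hypothetical normal parameter space given by the interval (-1, 1), a call such as `explore_round_robin(draw, 4, lambda t: abs(t) < 1.0)` returns the index of the only cell whose estimated location falls outside that interval.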

In the theorem below, we prove the consistency of the DS algorithm using the MALLR statistics. The proof and regularity assumptions are given in Appendix VII-D.

Theorem 4

Assume that the DS algorithm is implemented under the anomaly detection setting described in this section. Assume also that the parameters can take a finite number of values (where the observations are still continuous). Let be the true hypothesis. Then, as , and the error probability decays polynomially with .

It should be noted that the expected detection time is of order . Therefore, Theorem 4 implies that the error probability decays polynomially with the expected detection time. We point out that establishing asymptotic optimality for remains open. In this case, at each time slot the statistic is based on a mix of samples from cells that contain the target and from cells that do not contain the target. As a result, bounding the error probability by while achieving the asymptotically optimal detection time is much more complex.
In Figure 1 we present simulation results, demonstrating strong performance of the DS algorithm under the setting considered in this section. The sum MALLRs use the exact values of when they are known, and the MLEs of when they are unknown. Although theoretical asymptotic optimality remains open when are unknown (and is identical for all normal cells), it can be seen by simulations that the DS algorithm nearly achieves asymptotically optimal performance in this case as well (since it approaches the performance of the DS algorithm when are known).

Fig. 1: The error probability as a function of the average detection delay under the proposed DS algorithm. A case of Laplace distributions with parameters under normal and abnormal states, respectively, with . We averaged over Monte Carlo runs.
Remark 1

Note that in Sections III-A and III-B the exploitation phase collects observations from cell . As a result, a sufficiently accurate MLE for is computed based on observations collected during the exploitation phase, while exploration phase is unnecessary. In the setting considered in this section, however, exploration phase is required to guarantee a sufficiently accurate estimation of the unknown parameter . Specifically, let denote the number of observations that have been collected in exploration phase 2, and let be the smallest integer such that for all . Then, exploring cells such that is met for all is sufficient to ensure consistency of , where

(13)

is the Legendre-Fenchel transformation of

.

Below, we prove the statement (under hypothesis w.l.o.g.):

By the definition of , the event implies:

(14)

for some , where the index refers to measurements taken from cells which are likely to be empty. Since the expected last exit time (say ) from exploration phase 1 is bounded (see Appendix VII-A), applying the Chernoff bound for all and using the i.i.d. property yields:

(15)

Since , is satisfied.
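For reference, the textbook form of the Legendre-Fenchel transform and the Chernoff bound it produces is recalled below. The symbols $X_i$, $\Lambda$, $\Lambda^*$, $a$ and $n$ are generic placeholders rather than the notation of (13)-(15), and the statement assumes i.i.d. random variables with a finite log-moment generating function.

```latex
\Lambda(s) = \log \mathbb{E}\left[ e^{s X_1} \right],
\qquad
\Lambda^*(a) = \sup_{s \in \mathbb{R}} \left\{ s a - \Lambda(s) \right\},
\qquad
\Pr\left( \frac{1}{n} \sum_{i=1}^{n} X_i \ge a \right) \le e^{-n \Lambda^*(a)}
\quad \text{for } a \ge \mathbb{E}[X_1].
```

The bound decays exponentially in the number of samples $n$ whenever $\Lambda^*(a) > 0$, which is what makes a logarithmic number of exploration samples sufficient for a polynomially small estimation error.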

Remark 2

It should be noted that the proposed DS algorithm can be extended to handle multiple (say ) abnormal processes as well. The exploration phase can be implemented in a similar manner until exactly MLEs lie outside . The exploitation phase will prioritize processes which are likely to be abnormal if the conditions on the first line of (12) hold. Otherwise, it will prioritize processes which are likely to be normal if the conditions on the second line of (12) hold. The test terminates once all the abnormal processes are distinguished from the remaining normal processes, i.e., when the highest sum MALLR among the processes which are likely to be abnormal plus the smallest sum MALLR among the processes which are likely to be normal is greater than .

III-D Comparison with Chernoff’s test

In this section, we discuss the differences between our problem and the classical sequential experimental design problem studied by Chernoff, first presented in [13]. While we presented a deterministic search algorithm, Chernoff proposed a test with a randomized selection rule. Specifically, let be a probability mass function over a set of available experiments that the decision maker can choose from, where is the probability of choosing experiment . For a general M-ary sequential design of experiments problem, the action at time  under the Chernoff test is drawn from a distribution that depends on the past actions and observations:

(16)

where is the set of the hypotheses, is the MLE of the true hypothesis at time  based on past actions and observations, and is the observation distribution under hypothesis when action is taken.
Chernoff’s results were proved only for a finite number of states of nature (set of possible parameters). Albert [39] extended Chernoff’s results to allow for an infinity of states of nature. Beyond the differences between the deterministic and randomized selection rules, we now discuss in detail the connection with the model considered by Chernoff and Albert.

(i) Violating the positivity assumption on the KL divergence: The asymptotic optimality of the Chernoff test as shown in [13, 39] requires that under any experiment, any pair of hypotheses is distinguishable (i.e., has positive KL divergence). This assumption does not hold in the anomaly detection settings considered in this paper. For instance, under the experiment of searching the cell, the hypotheses of the target being in the () and the () cells yield the same observation distribution. In [18], the authors relaxed this assumption and developed a modified Chernoff test in order to handle indistinguishable hypotheses under some (but not all) actions. The basic idea of the modified test is to implement an exploration phase with a uniform distribution for a subsequence of time instants that grows logarithmically with time. Although asymptotic optimality was proved under the modified Chernoff test, its exploration time is unbounded, which affects the finite-time performance. In contrast, in this paper we have shown that the DS algorithm achieves asymptotic optimality under both settings in Sections III-A, III-B, using a bounded exploration time.

(ii) Utilizing the side information in the anomaly detection setting: The model in [13, 39] can be embedded into the model in Section III-A (with the extension in [18] as discussed earlier). This embedding does not contain side information on the parameter values under different hypotheses. The analysis in [13, 39] relies on rejecting the alternative hypothesis with respect to the closest alternative. Indeed, the DS algorithm achieves the same asymptotic optimality as in [13], but with a deterministic selection rule and with better finite-time performance, as demonstrated in the simulation results. The asymptotically optimal Bayes risk is given in this case by , which matches the asymptotically optimal performance in [13, 39]. Asymptotic optimality of the Chernoff test is achieved under the model setting in Section III-B by embedding the parameter set under to a singleton, and thus the same asymptotic performance can be achieved, where the asymptotically optimal Bayes risk is given in this case by . Indeed, we have shown that the DS algorithm achieves the same asymptotic optimality as in [13, 39] in this case. However, asymptotic optimality under the Chernoff test remains open in the setting considered in Section III-C, since it cannot be embedded as in Section III-B. The asymptotic analysis in [13, 39] is established with respect to the entire parameter space (as in Section III-A), while the lower bound on the risk must be developed with respect to the true parameter values that satisfy the side information. Nevertheless, intuitively, one can expect to improve the detection performance by estimating the parameter consistently, thereby approaching the performance in Section III-B. We indeed showed that the DS algorithm achieves consistency in this setting.

Despite the differences between the two models, we extended the randomized Chernoff test for the anomaly detection problem over composite hypotheses as follows. We select cells from a uniform distribution at the exploration phase until only a single MLE lies outside . Then, the solution of (16) is executed in the exploitation phase. The randomized test in (16) chooses, at each time, a probability distribution that governs the selection of the experiment to be carried out at this time. This distribution is obtained by solving a maximin problem so that the next observation will best differentiate the current MLE of the true hypothesis from its closest alternative, where the distance is measured by the KL divergence. It can be shown that when applied to the anomaly detection problem, the solution of (16) works as follows. Consider for example the setting in Section III-B (i.e., when the parameter under the null hypothesis is known). When , the Chernoff test selects cell and draws the other cells randomly with equal probability from the remaining cells. When , all cells are drawn randomly with equal probability from cells . The same selection rule is obtained when setting the alternative hypothesis according to the settings in Sections III-A, III-C. We refer to this policy as the modified Chernoff test. We present numerical examples to illustrate the performance of the proposed deterministic policy as compared to the randomized Chernoff test, under the setting considered in Section III-C. It can be seen in Figures 2, 3, that the proposed deterministic DS algorithm significantly outperforms the randomized Chernoff test.
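The randomized exploitation rule described above can be sketched as follows. This is only an illustration under stated assumptions: the condition that decides between the two cases did not survive in the text, so it is passed in as the hypothetical boolean `include_target`; the second case is assumed to draw from the cells other than the estimated target; and the function and parameter names are invented for the example.

```python
import random
from typing import List


def modified_chernoff_select(
    num_cells: int,
    k: int,
    target_estimate: int,
    include_target: bool,
    rng: random.Random,
) -> List[int]:
    """Pick k cells to probe in the exploitation phase of the randomized test."""
    if k >= num_cells:
        # If at least as many probes as cells are available, probe every cell.
        return list(range(num_cells))
    others = [m for m in range(num_cells) if m != target_estimate]
    if include_target:
        # Probe the estimated target, then k - 1 other cells uniformly at random.
        return [target_estimate] + rng.sample(others, k - 1)
    # Assumption: otherwise all k cells are drawn uniformly from the other cells.
    return rng.sample(others, k)
```

Passing an explicit `random.Random` instance keeps the draws reproducible, which is convenient when averaging over Monte Carlo runs.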

Fig. 2: The error probability as a function of the average detection delay under various algorithms: (i) The proposed DS algorithm that uses the MALLR statistics (referred to as the proposed DS algorithm); and (ii) The modified randomized Chernoff test as described in Section III-D. A case of exponential distributions with parameters under normal and abnormal states, respectively, where and . We averaged over Monte Carlo runs.
Fig. 3: The error probability as a function of the average detection delay under various algorithms: (i) The proposed DS algorithm that uses the MALLR statistics (referred to as the proposed DS algorithm); and (ii) The modified randomized Chernoff test as described in Section III-D. A case of exponential distributions with parameters under normal and abnormal states, respectively, where and . We averaged over Monte Carlo runs.

IV Empirical Studies

In this section, we present additional numerical experiments demonstrating the performance of the proposed DS algorithm as compared to existing methods. (Footnote 5: The indifference region in the simulations was set to . We ran Monte-Carlo experiments for generating the simulation results.)

IV-A Comparison between MALLR and LALLR statistics

We first compare the proposed DS algorithm under the settings of Section III-C, using the MALLR statistics defined in (11), against the LALLR statistics defined in (7), which is a popular method for performing sequential composite hypothesis testing, first introduced by Robbins and Siegmund in [51] (variations can be found in [52, 16, 17]). The proposed DS algorithm using the MALLR statistics adopts a variation of the LALLR statistics in the design of the stopping rule for anomaly detection over multiple composite hypotheses. However, since both empty cells and the cell that contains the target can be observed by the decision maker, the unconstrained MLEs of the unknown parameters and can be applied in both numerator and denominator (which we refer to as MALLR). We next simulate the case of searching for a target over processes that follow Laplace distributions with unknown means, where the observations are drawn from distribution . We note that by using the global MLE we expect better performance. The simulation results demonstrate the performance gain that can be obtained in this setting. It can be seen in Figure 4 that implementing the DS algorithm with MALLR statistics as proposed in Section III-C significantly outperforms an algorithm that uses the selection rule of the DS algorithm with the LALLR statistics as proposed in Section III-A. The error exponent is significantly better when using the MALLR statistics in the algorithm design. Thus, the performance gain from using the proposed DS algorithm is expected to further increase as the error decreases.

Fig. 4: The error probability as a function of the average detection delay. Performance comparison between the following algorithms: (i) The proposed DS algorithm that uses the MALLR statistics as described in Section III-C (referred to as the DS selection rule with MALLR); and (ii) The proposed DS algorithm that uses the LALLR statistics as described in Section III-A. A case of Laplace distributions with parameters , , under normal and abnormal states, respectively, with , . We averaged over Monte Carlo runs.

IV-B Comparison between MALLR and MGLLR

In Figure 5, we compare the performance of the two proposed statistics suggested in Section III-C. As discussed earlier, using the MALLR statistics allows us to establish asymptotic optimality theoretically, whereas asymptotic optimality remains open when using the MGLLR. However, in practice, we expect the MGLLR to perform better since it uses all samples when updating the MLE. It can be seen in Figure 5 that the DS algorithm using the MGLLR statistics slightly outperforms the DS algorithm using the MALLR.

Fig. 5: The error probability as a function of the average detection delay. Performance comparison between the following algorithms: (i) The proposed DS algorithm that uses the MALLR statistics as described in Section III-C (referred to as the DS selection rule with MALLR); and (ii) The proposed DS algorithm that uses the MGLLR statistics. A case of Laplace distributions with parameters , , under normal and abnormal states, respectively, with , . We averaged over Monte Carlo runs.
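The difference between an adaptive and a generalized log-likelihood ratio can be illustrated on a single process. The sketch below is not the MALLR in (11) or the MGLLR of Section III-C; it is a simplified single-process version that assumes unit-scale Laplace observations, a known null location `theta0`, the sample median as the location MLE, and a hypothetical initial guess `theta_init` for the adaptive statistic before any sample is available.

```python
import math
from statistics import median
from typing import Sequence


def laplace_logpdf(y: float, theta: float) -> float:
    """Log-density of a unit-scale Laplace distribution with location theta."""
    return -abs(y - theta) - math.log(2.0)


def fixed_llr(y: Sequence[float], theta1: float, theta0: float) -> float:
    """Log-likelihood ratio between two fixed locations theta1 and theta0."""
    return sum(laplace_logpdf(v, theta1) - laplace_logpdf(v, theta0) for v in y)


def adaptive_llr(y: Sequence[float], theta0: float, theta_init: float) -> float:
    """Adaptive LLR: sample n is scored with the MLE of samples 0..n-1 only."""
    total = 0.0
    for n, value in enumerate(y):
        theta_hat = median(y[:n]) if n > 0 else theta_init
        total += laplace_logpdf(value, theta_hat) - laplace_logpdf(value, theta0)
    return total


def generalized_llr(y: Sequence[float], theta0: float) -> float:
    """Generalized LLR: every sample is scored with the MLE of all samples."""
    return fixed_llr(y, median(y), theta0)
```

Because the median maximizes the Laplace likelihood, the generalized statistic is never smaller than the ratio evaluated at any fixed alternative location. The adaptive statistic uses only past samples in each term, which is the property that makes martingale arguments available in the analysis.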

IV-C Network Traffic Analysis

Finally, we demonstrate the performance of the DS algorithm using the MALLR statistics in intrusion detection applications, by detecting statistical deviations in network traffic. We examine anomaly detection in packet size statistics, which has been mostly investigated using open loop strategies for detecting malicious activity. We use the model in [24], which proposed a sample entropy for packet-size modeling and demonstrated strong performance in detecting anomalous data using the GLR statistics in the sequential detection test. Specifically, for a given interval, let be the set of packet size values that have arrived in this interval, and let be the ratio of the number of packets of size to the total number of packets that have arrived in that interval. The sample entropy is thus computed as

. The sample entropy is modeled by a Gaussian distribution, given by:

,

where , and , under normal state, or abnormal state, respectively. We simulated a network with flows of data, in which a single flow is abnormal. We used the DARPA intrusion detection data set [53], which contains 5 million labeled network connections, for generating the normal and abnormal flows. When testing the algorithms, the sample entropy was learned online from the data. We implemented both the proposed DS algorithm and the entropy-based algorithm with the GLR statistics that was proposed in [24]. We set the thresholds so that both algorithms satisfy error probability . It can be seen in Figure 6 that the DS algorithm achieves strong performance and significantly outperforms the entropy-based algorithm with the GLR statistics.
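The sample entropy of packet sizes in one interval can be computed along the following lines. This is a minimal sketch that assumes the standard Shannon form (minus the sum of each proportion times its natural logarithm); the function name and the use of the natural logarithm are assumptions, and the exact definition used in [24] may differ.

```python
import math
from collections import Counter
from typing import Sequence


def sample_entropy(packet_sizes: Sequence[int]) -> float:
    """Shannon entropy (in nats) of the empirical packet-size distribution."""
    if not packet_sizes:
        raise ValueError("at least one packet is required")
    total = len(packet_sizes)
    counts = Counter(packet_sizes)
    # Each proportion is the number of packets of a given size over the total.
    proportions = [count / total for count in counts.values()]
    entropy = -sum(q * math.log(q) for q in proportions)
    # Clamp to zero so that a single packet size yields 0.0 rather than -0.0.
    return max(entropy, 0.0)
```

An interval in which all packets share one size has zero entropy, while an interval split evenly between two sizes has entropy log 2. A sudden shift in this value over consecutive intervals is the kind of deviation the sequential test is meant to detect.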

Fig. 6: The average detection delay as a function of the number of processes using the DARPA intrusion detection dataset. Performance comparison between the following algorithms: (i) The proposed DS algorithm that uses the MALLR statistics as described in Section III-C (referred to as the proposed DS algorithm); and (ii) a policy that applies an open loop selection rule when probing cells and uses the GLR statistics for the packet size modeling in the stopping rule as proposed in [24] (referred to as entropy-based GLR algorithm). We averaged over Monte Carlo runs.

V Conclusion

We considered the problem of searching for anomalies among processes (i.e., cells). The observations follow a common distribution with an unknown parameter, belonging to disjoint parameter spaces depending on whether the target is absent or present. The decision maker is allowed to probe a subset of the cells at a time and the objective is a sequential search strategy that minimizes the expected detection time subject to an error probability constraint. We have developed a deterministic search algorithm to solve the problem that enjoys the following properties. First, when no additional side information on the process states is known, the proposed algorithm was shown to be asymptotically optimal. Second, when the parameter value under the null hypothesis is known and equal for all normal processes, asymptotic optimality was shown as well, with better detection time determined by the true null state. Third, when the parameter value under the null hypothesis is unknown, but is known to be equal for all normal processes, consistency was shown in terms of achieving error probability that decays to zero with the detection delay. Finally, an explicit upper bound on the error probability under the proposed algorithm was established under the finite sample regime. Extensive experiments have demonstrated the efficiency of the algorithm over existing methods.

VI Acknowledgment

We would like to thank the anonymous reviewers for comments that significantly improved the technical results and presentation of this paper.

VII Appendix

For purposes of presentation, we start by proving Theorem 2. Then, we focus on the key steps for extending the results to the other models presented in Section III.

VII-A Proof of Theorem 2

Without loss of generality, we prove the theorem when hypothesis is true. To simplify the presentation, we start by proving the theorem when the parameter space is finite, so that can take a finite number of values (but the measurements can still be continuous). We will then extend the proof to a continuous parameter space under mild regularity conditions. The proof is derived using the adjusted LALLR statistics defined in (8), i.e., .

Step 1: Bounding the error probability:
We first prove the upper bound on the error probability for all . Specifically, we show below that the error probability is upper bounded by:

(17)

Let

for all . Thus,

.

Therefore, we need to show that for proving (17). Note that can be rewritten as follows:

(18)

where

(19)

and are the time indices in which observations are taken from cell . Next, note that is a nonnegative martingale,

(20)

Therefore, applying Lemma 1 in [51] for nonnegative martingales yields:

(21)

Finally, since , we have , which completes Statement of the theorem.
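For readers unfamiliar with this step, a standard maximal inequality for nonnegative martingales (often attributed to Ville) is recalled below. The symbols $Z_n$ and $c$ are generic placeholders rather than the notation of (18)-(21), and the precise statement of Lemma 1 in [51] may differ in form.

```latex
\text{If } \{Z_n\}_{n \ge 1} \text{ is a nonnegative martingale and } c > 0, \text{ then}
\qquad
\Pr\left( \sup_{n \ge 1} Z_n \ge c \right) \le \frac{\mathbb{E}[Z_1]}{c}.
```

When the martingale is a likelihood ratio with unit mean and the threshold $c$ is the reciprocal of the target error level, the right-hand side equals that error level, which is the mechanism behind bounds of the type in (17).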

Next, we define the following major event:

Definition 2

is the smallest integer such that